Graph Kernels
S.V. N. Vishwanathan VISHY@STAT.PURDUE.EDU
Departments of Statistics and Computer Science
Purdue University
250 N University Street, West Lafayette, IN 47907-2066, USA
Nicol N. Schraudolph JMLR@SCHRAUDOLPH.ORG
adaptive tools AG
Canberra ACT 2602, Australia
Risi Kondor RISI@CALTECH.EDU
Center for the Mathematics of Information
California Institute of Technology
1200 E. California Blvd., MC 305-16, Pasadena, CA 91125, USA
Karsten M. Borgwardt KARSTEN.BORGWARDT@TUEBINGEN.MPG.DE
Interdepartmental Bioinformatics Group
Max Planck Institute for Developmental Biology
Figure 1: Left: Structure of E. coli protein fragment APO-BCCP87 (Yao et al., 1997), ID 1a6x
in the Protein Data Bank (Berman et al., 2000). Right: Borgwardt et al.'s (2005) graph
representation for this protein fragment. Nodes represent secondary structure elements,
and edges encode neighborhood along the amino acid chain (solid) resp. in Euclidean 3D
space (dashed).
(Washio and Motoda, 2003), and social networks (Kumar et al., 2006) involves the study of relationships between structured objects. Graphs are natural data structures to model such structures, with nodes representing objects and edges the relations between them. In this context, one often encounters two questions: "How similar are two nodes in a given graph?" and "How similar are two graphs to each other?"
In protein function prediction, for instance, one might want to predict whether a given protein is
an enzyme or not. Computational approaches infer protein function by finding proteins with similar
sequence, structure, or chemical properties. A very successful recent method is to model the protein
as a graph (see Figure 1), and assign similar functions to similar graphs (Borgwardt et al., 2005).
In Section 5.2 we compute graph kernels to measure the similarity between proteins and enzymes
represented in this fashion.
Another application featured in Section 5.2 involves predicting the toxicity of chemical molecules
by comparing their three-dimensional structure. Here the molecular structure is modeled as a graph,
and the challenge is to compute the similarity between molecules of known and unknown toxicity.
Finally, consider the task of finding web pages with related content. Since documents on the
web link to each other, one can model each web page as the node of a graph, and each link as
an edge. Now the problem becomes that of computing similarities between the nodes of a graph.
Taking this one step further, detecting mirrored sets of web pages requires computing the similarity
between the graphs representing them.
Kernel methods (Schölkopf and Smola, 2002) offer a natural framework to study these questions. Roughly speaking, a kernel k(x, x') is a measure of similarity between objects x and x'. It must satisfy two mathematical requirements: it must be symmetric, that is, k(x, x') = k(x', x), and positive semi-definite (p.s.d.). Comparing nodes in a graph involves constructing a kernel between nodes, while comparing graphs involves constructing a kernel between graphs. In both cases, the challenge is to define a kernel that captures the semantics inherent in the graph structure and is reasonably efficient to evaluate.
The idea of constructing kernels on graphs (i.e., between the nodes of a single graph) was first proposed by Kondor and Lafferty (2002), and extended by Smola and Kondor (2003). In contrast, in this paper we focus on kernels between graphs. The first such kernels were proposed by Gärtner et al. (2003) and later extended by Borgwardt et al. (2005). Much at the same time, the idea of marginalized kernels (Tsuda et al., 2002) was extended to graphs by Kashima et al. (2003, 2004), then further refined by Mahé et al. (2004). Another algebraic approach to graph kernels has appeared recently (Kondor and Borgwardt, 2008). A seemingly independent line of research investigates the so-called rational kernels, which are kernels between finite state automata based on the algebra of abstract semirings (Cortes et al., 2002, 2003, 2004).
The aim of this paper is twofold: on the one hand we present theoretical results showing that all the above graph kernels are in fact closely related; on the other hand we present new algorithms for efficiently computing such kernels. We begin by establishing some notation and reviewing pertinent concepts from linear algebra and graph theory.
1.1 Paper Outline
The first part of this paper (Sections 2-5) elaborates and updates a conference publication of Vishwanathan et al. (2007) to present a unifying framework for graph kernels encompassing many known kernels as special cases, and to discuss connections to yet others. After defining some basic concepts in Section 2, we describe the framework in Section 3, prove that it leads to p.s.d. kernels, and discuss the random walk and marginalized graph kernels as special cases. For ease of exposition we will work with real matrices in the main body of the paper and relegate the RKHS extensions to Appendix A. In Section 4 we present four efficient ways to compute random walk graph kernels, namely: 1. via reduction to a Sylvester equation, 2. with a conjugate gradient solver, 3. using fixed-point iterations, and 4. via spectral decompositions. Experiments on a variety of real and synthetic data sets in Section 5 illustrate the computational advantages of our methods, which generally reduce the time complexity of kernel computation from O(n^6) to O(n^3). The experiments of Section 5.3 were previously presented at a bioinformatics symposium (Borgwardt et al., 2007).
The second part of the paper (Sections 6-7) draws further connections to existing kernels on structured objects. In Section 6 we present a simple proof that rational kernels (Cortes et al., 2002, 2003, 2004) are p.s.d., and show that specializing them to graphs yields random walk graph kernels. In Section 7 we discuss the relation between R-convolution kernels (Haussler, 1999) and various graph kernels, all of which can in fact be shown to be instances of R-convolution kernels. Extending the framework through the use of semirings does not always result in a p.s.d. kernel though; a case in point is the optimal assignment kernel of Fröhlich et al. (2006). We establish sufficient conditions for R-convolution kernels in semirings to be p.s.d., and provide a "mostly optimal assignment kernel" that is provably p.s.d. We conclude in Section 8 with an outlook and discussion.
2. Preliminaries
Here we define the basic concepts and notation from linear algebra and graph theory that will be
used in the remainder of the paper.
2.1 Linear Algebra Concepts
We use e_i to denote the i-th standard basis vector (that is, a vector of all zeros with the i-th entry set to one), e to denote a vector with all entries set to one, 0 to denote the vector of all zeros, and I to denote the identity matrix. When it is clear from the context we will not mention the dimensions of these vectors and matrices.
Definition 1 Given real matrices A ∈ R^{n×m} and B ∈ R^{p×q}, the Kronecker product A ⊗ B ∈ R^{np×mq} and column-stacking operator vec(A) ∈ R^{nm} are defined as

    A ⊗ B := [ A_11 B   A_12 B   ...   A_1m B
                 ⋮                        ⋮
               A_n1 B   A_n2 B   ...   A_nm B ],        vec(A) := [ A_{*1}
                                                                       ⋮
                                                                     A_{*m} ],

where A_{*j} denotes the j-th column of A.
The Kronecker product and vec operator are linked by the well-known property (e.g., Bernstein, 2005, Proposition 7.1.9):

    vec(ABC) = (C^⊤ ⊗ A) vec(B).    (1)

Another well-known property of the Kronecker product which we make use of is (Bernstein, 2005, Proposition 7.1.6):

    (A ⊗ B)(C ⊗ D) = AC ⊗ BD.    (2)
Finally, the Hadamard product of two real matrices A, B ∈ R^{n×m}, denoted by A ⊙ B ∈ R^{n×m}, is obtained by element-wise multiplication. It interacts with the Kronecker product via

    (A ⊗ B) ⊙ (C ⊗ D) = (A ⊙ C) ⊗ (B ⊙ D).    (3)
All the above concepts can be extended to a Reproducing Kernel Hilbert Space (RKHS); see Appendix A for details.
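The identities (1) and (2) are easy to verify numerically. The following short NumPy sketch is our own illustration (not part of the original paper); the matrix sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 5))
C = rng.standard_normal((5, 2))

def vec(M):
    # Column-stacking operator: stack the columns of M into one long vector.
    return M.reshape(-1, order="F")

# Property (1): vec(ABC) = (C^T kron A) vec(B)
assert np.allclose(vec(A @ B @ C), np.kron(C.T, A) @ vec(B))

# Property (2): (D kron E)(F kron G) = (DF) kron (EG)
D, E, F, G = (rng.standard_normal((3, 3)) for _ in range(4))
assert np.allclose(np.kron(D, E) @ np.kron(F, G), np.kron(D @ F, E @ G))
```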
2.2 Graph Concepts
A graph G consists of an ordered set of n vertices V = {v_1, v_2, ..., v_n}, and a set of directed edges E ⊆ V × V. A vertex v_i is said to be a neighbor of another vertex v_j if they are connected by an edge, that is, if (v_i, v_j) ∈ E; this is also denoted v_i ∼ v_j. We do not allow self-loops, that is, (v_i, v_i) ∉ E for any i. A walk of length k on G is a sequence of indices i_0, i_1, ..., i_k such that v_{i_{r-1}} ∼ v_{i_r} for all 1 ≤ r ≤ k. A graph is said to be strongly connected if any two pairs of vertices can be connected by a walk. In this paper we will always work with strongly connected graphs. A graph is said to be undirected if (v_i, v_j) ∈ E ⇔ (v_j, v_i) ∈ E.
In much of the following we will be dealing with weighted graphs, which are a slight generalization of the above. In a weighted graph, each edge (v_i, v_j) has an associated weight w_ij > 0 signifying its "strength". If v_i and v_j are not neighbors, then w_ij = 0. In an undirected weighted graph w_ij = w_ji.
When G is unweighted, we define its adjacency matrix as the n × n matrix Ã with Ã_ij = 1 if v_j ∼ v_i, and 0 otherwise. For weighted graphs, Ã_ij = w_ji. While some authors would call these matrices the transpose of the adjacency matrix, for our purposes the present definitions will be more convenient. For undirected graphs Ã is symmetric, and the two definitions coincide. The diagonal entries of Ã are always zero.
The adjacency matrix has a normalized cousin, defined A := Ã D^{-1}, which has the property that each of its columns sums to one, and it can therefore serve as the transition matrix for a stochastic process. Here, D is a diagonal matrix of node degrees, that is, D_ii = d_i = ∑_j Ã_ij. A random walk on G is a process generating sequences of vertices v_{i_1}, v_{i_2}, v_{i_3}, ... according to P(i_{k+1} | i_1, ..., i_k) = A_{i_{k+1}, i_k}, that is, the probability at v_{i_k} of picking v_{i_{k+1}} next is proportional to the weight of the edge (v_{i_k}, v_{i_{k+1}}).
The t-th power of A thus describes t-length walks, that is, (A^t)_ij is the probability of a transition from vertex v_j to vertex v_i via a walk of length t. If p_0 is an initial probability distribution over vertices, then the probability distribution p_t describing the location of our random walker at time t is p_t = A^t p_0. The j-th component of p_t denotes the probability of finishing a t-length walk at vertex v_j.
A random walk need not continue indefinitely; to model this, we associate every node v_{i_k} in the graph with a stopping probability q_{i_k}. Our generalized random walk graph kernels then use the overall probability of stopping after t steps, given by q^⊤ p_t. Like p_0, the vector q of stopping probabilities is a place to embed prior knowledge into the kernel design. Since p_t as a probability distribution sums to one, a uniform vector q (as might be chosen in the absence of prior knowledge) would yield the same overall stopping probability for all p_t, thus leading to a kernel that is invariant with respect to the graph structure it is meant to measure. In this case, the unnormalized adjacency matrix Ã (which simply counts random walks instead of measuring their probability) should be used instead.
Let X be a set of labels which includes the special label ζ. Every edge-labeled graph G is associated with a label matrix X ∈ X^{n×n} in which X_ij is the label of the edge (v_j, v_i) and X_ij = ζ if (v_j, v_i) ∉ E. Let H be the RKHS induced by a p.s.d. kernel κ : X × X → R, and let φ : X → H denote the corresponding feature map, which we assume maps ζ to the zero element of H. We use Φ(X) to denote the feature matrix of G (see Appendix A for details). For ease of exposition we do not consider labels on vertices here, though our results hold for that case as well. Henceforth we use the term "labeled graph" to denote an edge-labeled graph.
Two graphs G = (V, E) and G' = (V', E') are isomorphic if there exists a bijective mapping g : V → V' (the isomorphism function) such that (v_i, v_j) ∈ E iff (g(v_i), g(v_j)) ∈ E'.
3. Random Walk Graph Kernels
Our generalized random walk graph kernels are based on a simple idea: given a pair of graphs,
perform random walks on both, and count the number of matching walks. We show that this simple
concept underlies both random walk and marginalized graph kernels. In order to do this, we first
need to introduce direct product graphs.
3.1 Direct Product Graphs
Given two graphs G(V, E) and G'(V', E'), their direct product G_× is a graph with vertex set

    V_× = {(v_i, v'_r) : v_i ∈ V, v'_r ∈ V'},    (4)

and edge set

    E_× = {((v_i, v'_r), (v_j, v'_s)) : (v_i, v_j) ∈ E ∧ (v'_r, v'_s) ∈ E'}.    (5)
Figure 2: Two graphs (top left & right) and their direct product (bottom). Each node of the direct product graph is labeled with a pair of nodes (4); an edge exists in the direct product if and only if the corresponding nodes are adjacent in both original graphs (5). For instance, nodes 11' and 32' are adjacent because there is an edge between nodes 1 and 3 in the first graph, and between 1' and 2' in the second.

Two nodes in the direct product graph are neighbors if and only if the corresponding vertices in G and G' are both neighbors. If Ã and Ã' are the adjacency matrices of G and G', then the adjacency matrix of the direct product graph G_× is Ã_× = Ã ⊗ Ã'. Similarly, A_× = A ⊗ A'.
Performing a random walk on the direct product graph is equivalent to performing a simultaneous random walk on G and G'. If p and p' denote initial probability distributions over the vertices of G and G', then the corresponding initial probability distribution on the direct product graph is p_× := p ⊗ p'. Likewise, if q and q' are stopping probabilities on G and G', then the stopping probability on the direct product graph is q_× := q ⊗ q'.
Let |V| =: n and |V'| =: n'. If G and G' are edge-labeled, we can associate a weight matrix W_× ∈ R^{nn'×nn'} with G_×, defined as

    W_× = Φ(X) ⊗ Φ(X').    (6)
As a consequence of the definition of Φ(X) and Φ(X'), the entries of W_× are non-zero only if the corresponding edge exists in the direct product graph. For unlabeled graphs one can simply set Φ(X) = Ã (resp. A) and Φ(X') = Ã' (resp. A'), and consequently W_× = Ã_× (resp. A_×).
If the edges of our graphs take on labels from a finite set, without loss of generality {1, 2, ..., d}, we can let H be R^d endowed with the usual inner product. For each edge (v_j, v_i) ∈ E we set φ(X_ij) = e_l / d_i if the edge (v_j, v_i) is labeled l; all other entries of Φ(X) are 0.¹ Thus the weight matrix (6) has a non-zero entry iff an edge exists in the direct product graph and the corresponding edges in G and G' have the same label. Denoting by ˡA the label-filtered (normalized) adjacency matrix, that is, ˡA_ij = A_ij if X_ij = l and 0 otherwise, the weight matrix of the direct product graph can be written as

    W_× = ∑_{l=1}^{d} ˡA ⊗ ˡA'.    (7)
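For edge labels from a finite set, (7) says the weight matrix is a sum of Kronecker products of label-filtered adjacency matrices. A minimal sketch (our own illustration, with hypothetical inputs) is:

```python
import numpy as np

def label_filtered(A, X, label):
    # Keep only those entries of the (normalized) adjacency matrix A whose
    # corresponding edge label in X equals `label`; zero out the rest.
    return np.where(X == label, A, 0.0)

def weight_matrix(A, X, A2, X2, labels):
    # W_x = sum over labels l of (lA) kron (lA'), Equation (7).
    n, n2 = A.shape[0], A2.shape[0]
    W = np.zeros((n * n2, n * n2))
    for l in labels:
        W += np.kron(label_filtered(A, X, l), label_filtered(A2, X2, l))
    return W
```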
In Section 4 we will develop efficient methods to compute kernels defined using the weight matrix of the direct product graph. The applicability and time complexity of a particular method will depend on whether the graphs are unlabeled though possibly edge-weighted (W_× = Ã_× resp. A_×) or labeled as in (7).
As mentioned above, performing a random walk on the direct product graph is equivalent to performing a simultaneous random walk on the graphs G and G'. Therefore, the ((i−1)n'+r, (j−1)n'+s)-th entry of A_×^k represents the probability of simultaneous length-k random walks on G (starting from vertex v_j and ending in vertex v_i) and G' (starting from vertex v'_s and ending in vertex v'_r). The entries of W_× represent similarity between edges: the ((i−1)n'+r, (j−1)n'+s)-th entry of W_×^k represents the similarity between simultaneous length-k random walks on G and G', measured via the kernel function κ. Given initial and stopping probability distributions p_× and q_×, one can compute q_×^⊤ W_×^k p_×, the expected similarity between simultaneous length-k random walks on G and G'.
To define a kernel which computes the similarity between G and G', one natural idea is to simply sum up q_×^⊤ W_×^k p_× for all values of k. However, this sum might not converge, leaving the kernel value undefined. To overcome this problem, we introduce appropriately chosen non-negative coefficients μ(k), and define the kernel between G and G' as

    k(G, G') := ∑_{k=0}^{∞} μ(k) q_×^⊤ W_×^k p_×.    (8)
This definition is very flexible and offers the kernel designer many parameters to adjust in an application-specific manner: appropriately choosing μ(k) allows one to (de-)emphasize walks of different lengths; if initial and stopping probabilities are known for a particular application, then this knowledge can be incorporated into the kernel; and finally, appropriate kernels or similarity measures between edges can be incorporated via the weight matrix W_×.

1. The technical problem that now φ(X_ij) depends on d_i can be addressed by making d_i a feature of all edges (v_j, v_i) ∈ E.

To show that (8) is a valid p.s.d. kernel we need the following technical lemma.

Lemma 2 For all k ∈ N_0, W_×^k p_× = vec[Φ(X')^k p' (Φ(X)^k p)^⊤].
Proof By induction over k. Base case: k = 0. Using (1) we find

    W_×^0 p_× = p_× = (p ⊗ p') vec(1) = vec(p' 1 p^⊤) = vec[Φ(X')^0 p' (Φ(X)^0 p)^⊤].    (9)

Induction from k to k+1: Using the induction assumption W_×^k p_× = vec[Φ(X')^k p' (Φ(X)^k p)^⊤] and Lemma 12 we obtain

    W_×^{k+1} p_× = W_× W_×^k p_× = (Φ(X) ⊗ Φ(X')) vec[Φ(X')^k p' (Φ(X)^k p)^⊤]
                  = vec[Φ(X') Φ(X')^k p' (Φ(X)^k p)^⊤ Φ(X)^⊤]    (10)
                  = vec[Φ(X')^{k+1} p' (Φ(X)^{k+1} p)^⊤].

Base case (9) and induction (10) together imply Lemma 2 for all k ∈ N_0.
Theorem 3 If the coefficients μ(k) are such that (8) converges, then (8) defines a valid p.s.d. kernel.
Proof Using Lemmas 12 and 2 we can write

    q_×^⊤ W_×^k p_× = (q ⊗ q')^⊤ vec[Φ(X')^k p' (Φ(X)^k p)^⊤]
                    = vec[q'^⊤ Φ(X')^k p' (Φ(X)^k p)^⊤ q]
                    = (q^⊤ Φ(X)^k p)^⊤ (q'^⊤ Φ(X')^k p') =: ρ_k(G)^⊤ ρ_k(G').    (11)
Each individual term of (11) equals ρ_k(G)^⊤ ρ_k(G') for some function ρ_k, and is therefore a valid p.s.d. kernel; since the class of p.s.d. kernels is closed under non-negative linear combinations and pointwise limits (Berg et al., 1984), so is (8).

The marginalized graph kernels of Kashima et al. (2003, 2004) can be obtained as a special case of (8). A random walk of length t on G visits vertices v_{i_1}, v_{i_2}, ..., v_{i_{t+1}} and generates a sequence of edge labels h = (h_1, h_2, ..., h_t). Writing P for the transition matrix of G, the probability of such a walk is

    p(h|G) = q_{i_{t+1}} ∏_{j=1}^{t} P_{i_j, i_{j+1}} p_{i_1}.    (12)
Now let φ denote a feature map on edge labels, and define a kernel between label sequences of length t by

    κ(h, h') := ∏_{i=1}^{t} κ(h_i, h'_i) = ∏_{i=1}^{t} ⟨φ(h_i), φ(h'_i)⟩    (13)

if h and h' have the same length t, and zero otherwise. Using (12) and (13) we can define a kernel between graphs via marginalization:

    k(G, G') := ∑_{h, h'} κ(h, h') p(h|G) p(h'|G').    (14)
Kashima et al. (2004, Eq. 1.19) show that (14) can be written as

    k(G, G') = q_×^⊤ (I − T_×)^{-1} p_×,    (15)

where T_× = [vec(P) vec(P')^⊤] ⊙ [Φ(X) ⊗ Φ(X')]. (Here P and P' are the transition matrices of G and G', respectively, and Φ(X), Φ(X') the corresponding feature matrices.)
Although this kernel is differently motivated, it can be obtained as a special case of our framework. Towards this end, assume μ(k) = λ^k for some λ > 0. We can then write

    k(G, G') = ∑_{k=0}^{∞} λ^k q_×^⊤ W_×^k p_× = q_×^⊤ (I − λW_×)^{-1} p_×.    (16)
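For small graphs, (16) can be evaluated directly by forming W_× and solving a single linear system; this is the expensive baseline that the methods of Section 4 improve upon. A minimal sketch (our own illustration; uniform starting and stopping probabilities are assumed):

```python
import numpy as np

def geometric_rw_kernel(W, n, n2, lam):
    # k(G, G') = q_x^T (I - lam * W_x)^{-1} p_x, Equation (16),
    # with uniform p_x = q_x = e / (n n').  lam must be small enough
    # for the geometric series to converge.
    N = n * n2
    p = np.full(N, 1.0 / N)
    q = np.full(N, 1.0 / N)
    x = np.linalg.solve(np.eye(N) - lam * W, p)
    return q @ x
```

Here `W` would be, for instance, `np.kron(A1, A2)` for unlabeled graphs, or the output of the `weight_matrix` sketch above for labeled ones.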
To recover the marginalized graph kernels let λ = 1, and define φ(X_ij) := P_ij φ(X_ij), in which case W_× = T_×, so that (16) reduces to (15).

Gärtner et al. (2003) define a random walk graph kernel directly in terms of the adjacency matrix of the direct product graph:

    k(G, G') = ∑_{i,j=1}^{nn'} ∑_{k=0}^{∞} λ_k [Ã_×^k]_ij.    (17)
To obtain (17) in our framework, set μ(k) := λ_k, assume uniform distributions for the starting and stopping probabilities over the vertices of G and G' (i.e., p_i = q_i = 1/n and p'_i = q'_i = 1/n'), and let Φ(X) := Ã and Φ(X') := Ã'. Consequently, p_× = q_× = e/(nn'), and W_× = Ã_×, the unnormalized adjacency matrix of the direct product graph. This allows us to rewrite (8) to obtain

    k(G, G') = ∑_{k=0}^{∞} μ(k) q_×^⊤ W_×^k p_× = (1 / (n² n'²)) ∑_{i,j=1}^{nn'} ∑_{k=0}^{∞} λ_k [Ã_×^k]_ij,    (18)
which recovers (17) to within a constant factor. Gärtner et al. (2003) also extend their kernels to graphs with labels from a finite set by replacing Ã_× in (17) with a sum of label-filtered (but unnormalized) adjacency matrices, analogous to (7).

Table 1: Worst-case time complexity (in O(·) notation) of our methods for an m×m graph kernel matrix, where n = size of the graphs (number of nodes), d = size of label set resp. dimensionality of feature map, r = effective rank of W_×, and k = number of iterations.

Gärtner et al. (2003) also define an exponential graph kernel

    k(G, G') = ∑_{i,j=1}^{nn'} [e^{λÃ_×}]_ij = e^⊤ e^{λÃ_×} e,    (19)

using the matrix exponential. This is obtained in our framework by setting λ_k := λ^k / k!, so that the right-most sum in (18) becomes the series expansion of e^{λÃ_×}.
The kernels of Gärtner et al. (2003) differ from our definition (8) in that they do not explicitly
model starting or stopping probabilities, and employ unnormalized adjacency matrices instead of
our more general weight matrix (6) which allows for normalization and arbitrary kernels on edges.
4. Efficient Computation
Computing a geometric random walk graph kernel with μ(k) = λ^k amounts to inverting (I − λW_×), an n²×n² matrix if G and G' have n vertices each. Since matrix inversion is essentially cubic in the matrix dimension, direct computation of (16) would require O(n^6) time. Below we develop methods which reduce this to O(n^3); Table 1 summarizes their worst-case time complexity, where the number k of iterations is given by the effective rank of W_× for conjugate gradient, and by (31) for fixed-point iterations. The spectral decomposition approach (Section 4.4) is exceptional here in that it can be linear in m resp. quadratic in n (but not both) if precomputation of the m spectral graph decompositions dominates (resp. is dominated by) the actual kernel computations.
The cost of the iterative algorithms increases by another factor of d for graphs with edge labels from a finite set of d symbols or an edge kernel with d-dimensional feature map; for an arbitrary edge kernel (whose feature map may be infinite-dimensional) this factor becomes n. On labeled graphs our spectral decomposition approach offers no savings, and the Sylvester equation method applies only if the labels come from a finite set of symbols, and then with unknown time complexity. A nearest Kronecker product approximation can be used, however, to approximate the direct product of labeled graphs with a weight matrix that can be handled by any of our methods for unlabeled graphs; the cost of computing this approximation is discussed in Section 4.5.

4.1 Sylvester Equation Methods
Consider the following equation, commonly known as the Sylvester or Lyapunov equation:

    M = SMT + M_0.    (20)

Here, S, T, M_0 ∈ R^{n×n} are given and we need to solve for M ∈ R^{n×n}. Such equations can be solved in O(n^3) time with freely available black-box solvers such as Matlab's dlyap. Solving the generalized Sylvester equation

    M = ∑_{i=1}^{d} S_i M T_i + M_0    (21)

involves computing generalized simultaneous Schur factorizations of d symmetric matrices (Lathauwer et al., 2004). Although technically involved, (21) can also be solved efficiently, albeit at a higher computational cost. The computational complexity of this generalized factorization is at present unknown.
We now show that for graphs with discrete edge labels, whose weight matrix W_× can be written as (7), the problem of computing the graph kernel (16) can be reduced to solving the following generalized Sylvester equation:

    M = ∑_{i=1}^{d} λ ⁱA' M ⁱA^⊤ + M_0,    (22)

where vec(M_0) = p_×. We begin by flattening (22):

    vec(M) = ∑_{i=1}^{d} λ vec(ⁱA' M ⁱA^⊤) + p_×.    (23)
Using Lemma 12 (which extends (1) into an RKHS) we can rewrite (23) as

    (I − λ ∑_{i=1}^{d} ⁱA ⊗ ⁱA') vec(M) = p_×,    (24)

use (7), and solve (24) for vec(M):

    vec(M) = (I − λW_×)^{-1} p_×.    (25)

Multiplying both sides of (25) by q_×^⊤ yields

    q_×^⊤ vec(M) = q_×^⊤ (I − λW_×)^{-1} p_×.    (26)
The right-hand side of (26) is the graph kernel (16). Given the solution M of the Sylvester equation (22), the graph kernel can be obtained as q_×^⊤ vec(M) in O(n²) time. The same argument applies for unlabeled graphs by simply setting d = 1, which turns (22) into a simple Sylvester equation. Since solving that only takes O(n^3) time, computing the random walk graph kernel in this fashion is much faster than the O(n^6) time required by the direct approach.
One drawback of this strategy is that Sylvester equation solvers are quite sophisticated and typically available only as black-box library routines, which limits their applicability. Matlab's dlyap solver, for instance, does not exploit sparsity, and only handles the cases d = 1 and d = 2. A solver for the simple Sylvester equation (20) can still be used to efficiently compute kernels between labeled graphs, though, by employing the nearest Kronecker product approximation (Section 4.5).
4.2 Conjugate Gradient Methods
Given a matrix M and a vector b, conjugate gradient (CG) methods solve the system of equations
Mx = b efficiently (Nocedal and Wright, 1999). While they are designed for symmetric p.s.d. matrices, CG solvers can also be used to solve other linear systems efficiently. They are particularly efficient if the matrix is rank deficient, or has a small effective rank, that is, number of distinct eigenvalues. Furthermore, if computing matrix-vector products is cheap (because M is sparse, for instance), the CG solver can be sped up significantly (Nocedal and Wright, 1999). Specifically, if computing Mv for an arbitrary vector v requires O(m) time, and the effective rank of M is r, then a CG solver takes O(r) iterations, and hence only O(rm) time, to solve Mx = b.
The graph kernel (16) can be computed by a two-step procedure: First we solve the linear system

    (I − λW_×) x = p_×    (27)

for x, then we compute q_×^⊤ x. The problem here is that I − λW_× is an n²×n² matrix. Naively, multiplying W_× by some vector y inside the CG algorithm requires O(n^4) operations. However, by our extension of the vec-ABC formula (1) into RKHS (Lemma 12), introducing the matrix Y ∈ R^{n×n} with y = vec(Y), and recalling that W_× = Φ(X) ⊗ Φ(X'), we have

    W_× y = (Φ(X) ⊗ Φ(X')) vec(Y) = vec(Φ(X') Y Φ(X)^⊤).    (28)

If φ(·) ∈ R^d then the above matrix-vector product can be computed in O(dn^3) time. If Φ(X) and Φ(X') are sparse, then Φ(X') Y Φ(X)^⊤ can be computed even more efficiently.
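The key step inside the iterative solver is the matrix-vector product (28), which never forms W_× explicitly. A sketch for the unlabeled (d = 1) case, with our own helper names, is shown below; for labeled graphs one would sum this product over the label-filtered adjacency matrices.

```python
import numpy as np

def Wx_times(y, A, A2):
    # Compute (A kron A2) @ y = vec(A2 @ Y @ A.T) without ever forming the
    # nn' x nn' matrix A kron A2 (Equation (28) with Phi(X)=A, Phi(X')=A2).
    n, n2 = A.shape[0], A2.shape[0]
    Y = y.reshape((n2, n), order="F")          # y = vec(Y), column-stacked
    return (A2 @ Y @ A.T).reshape(-1, order="F")
```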
4.3 Fixed-Point Iterations
Fixed-point methods begin by rewriting (27) as

    x = p_× + λ W_× x.    (29)
Now, solving for x is equivalent to finding a fixed point of (29) taken as an iteration (Nocedal and Wright, 1999). Letting x_t denote the value of x at iteration t, we set x_0 := p_×, then compute

    x_{t+1} = p_× + λ W_× x_t    (30)

repeatedly until ||x_{t+1} − x_t|| < ε, where ||·|| denotes the Euclidean norm and ε some pre-defined tolerance. This is guaranteed to converge if all eigenvalues of λW_× lie inside the unit disk; this can be ensured by choosing λ < 1/|ξ_1|, where ξ_1 is the eigenvalue of W_× with largest magnitude. Assuming that each iteration of (30) contracts x to the fixpoint by a factor of λξ_1, we converge to within ε of the fixpoint in k iterations, where

    k = O( ln ε / (ln λ + ln |ξ_1|) ).    (31)
The above is closely related to the power method used to compute the largest eigenvalue of a matrix (Golub and Van Loan, 1996); efficient preconditioners can also be used to speed up convergence (Golub and Van Loan, 1996). Since each iteration of (30) involves computation of the matrix-vector product W_× x_t, all speed-ups for computing the matrix-vector product discussed in Section 4.2 are applicable here. In particular, we exploit the fact that W_× is a sum of Kronecker products to reduce the worst-case time complexity to O(dn^3) per iteration in our experiments, in contrast to Kashima et al. (2004) who computed the matrix-vector product explicitly.
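A compact sketch of the fixed-point iteration (30) for unlabeled graphs, reusing the vec-trick product of Section 4.2, is given below (our own illustration; uniform p_× and q_× are assumed).

```python
import numpy as np

def fixed_point_kernel(A, A2, lam, tol=1e-6, max_iter=1000):
    # Iterate x_{t+1} = p_x + lam * W_x x_t (Equation (30)) until convergence,
    # then return q_x^T x, which equals the geometric kernel (16).
    n, n2 = A.shape[0], A2.shape[0]
    N = n * n2
    p = np.full(N, 1.0 / N)           # uniform starting probabilities
    q = np.full(N, 1.0 / N)           # uniform stopping probabilities
    x = p.copy()
    for _ in range(max_iter):
        Y = x.reshape((n2, n), order="F")
        x_new = p + lam * (A2 @ Y @ A.T).reshape(-1, order="F")
        if np.linalg.norm(x_new - x) < tol:
            return q @ x_new
        x = x_new
    return q @ x
```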
4.4 Spectral Decomposition Method
In the previous two sections we have introduced methods that are efficient for both unlabeled and labeled graphs, but specifically computed the geometric kernel (16), that is, assumed that μ(k) = λ^k. We now turn to a method based on spectral decompositions that can compute the general random walk kernel (8) for any convergent choice of μ(k), but is only efficient for unlabeled graphs. (In fact, it will turn out to be our most efficient method for computing an entire kernel matrix between unlabeled graphs.)
Let W_× = P_× D_× P_×^{-1} denote the spectral decomposition of W_×, that is, the columns of P_× are its eigenvectors and D_× is a diagonal matrix of corresponding eigenvalues. The random walk graph kernel (8) can then be written as

    k(G, G') := ∑_{k=0}^{∞} μ(k) q_×^⊤ (P_× D_× P_×^{-1})^k p_× = q_×^⊤ P_× ( ∑_{k=0}^{∞} μ(k) D_×^k ) P_×^{-1} p_×.    (32)
This simplifies matters in that (32) only takes weighted powers of a diagonal matrix, which decouple into scalar powers of its entries. An implementable graph kernel can then be obtained by employing a power series that is known to converge to a given nonlinear function. The geometric kernel (16), for instance, uses the fact that ∑_{k=0}^{∞} x^k = 1/(1−x); setting μ(k) := λ^k in (32) we thus obtain

    k(G, G') := q_×^⊤ P_× (I − λD_×)^{-1} P_×^{-1} p_×.    (33)
The crucial difference to (16) is that the inverse in (33) is of a diagonal matrix, hence trivial to compute: just take the reciprocal of each entry. To give another example, setting μ(k) := λ^k/k! in (32) yields the exponential kernel (19) by way of spectral decomposition:

    k(G, G') := q_×^⊤ P_× e^{λD_×} P_×^{-1} p_×,    (34)

because e^x = ∑_{k=0}^{∞} x^k/k!. Again, unlike in (19) the matrix exponential here is trivial to compute, because D_× is diagonal. Computing the spectral decomposition of W_× directly does not pay off, however: since W_× is an n²×n² matrix this would result in O(n^6) time complexity per kernel computation.² By leveraging the properties of the Kronecker product, however, we can obtain a far better result for unlabeled (though possibly edge-weighted) graphs:
Theorem 4 The kernel matrix for any random walk kernel (8) between m unlabeled, possibly edge-weighted graphs with n nodes can be computed in O((mp+n) m n²) time via spectral decompositions, where computing the corresponding scalar power series takes O(p) time.

Proof Because the graphs are unlabeled, we have W_× := A_× = A_i ⊗ A_j, where A_i and A_j (i, j ∈ {1, 2, ..., m}) are the adjacency matrices (normalized or not) of individual graphs. Begin by precomputing the spectral decomposition of each graph: (∀i) A_i = P_i D_i P_i^{-1}. Using Propositions 7.1.6, 7.1.7 of Bernstein (2005) we have

    A_i ⊗ A_j = (P_i D_i P_i^{-1}) ⊗ (P_j D_j P_j^{-1}) = (P_i ⊗ P_j)(D_i ⊗ D_j)(P_i ⊗ P_j)^{-1}.    (35)
Proposition 7.1.10 of Bernstein (2005) tells us that in fact D_i ⊗ D_j = D_×, the diagonal matrix of eigenvalues of A_i ⊗ A_j, and that indeed the spectral decomposition of a Kronecker product decomposes into those of its constituents, as seen in (35). We can therefore use Propositions 7.1.6, 7.1.7 of Bernstein (2005) again to rewrite (32) as

    k(G_i, G_j) = (q_i^⊤ P_i ⊗ q_j^⊤ P_j) ( ∑_{k=0}^{∞} μ(k) (D_i ⊗ D_j)^k ) (P_i^{-1} p_i ⊗ P_j^{-1} p_j).    (36)
Computing the central power series here takes O(n² p) time just as in (32), but the cost of calculating the two flanking factors has been reduced from O(n^4) to O(n²) in (36). The entire m×m kernel matrix can thus be obtained in O(m² n² p) time, plus the O(m n^3) time it takes to precompute spectral decompositions of the m individual adjacency matrices.
Note that in practice we will always pick a power series with known limit that is trivial to evaluate (i.e., p = 1), as exemplified by the geometric (33) and exponential (34) kernels. Theorem 4 then gives us a very efficient method to compute entire kernel matrices, albeit only between unlabeled graphs. (It is tempting to try to extend the spectral approach for the exponential kernel to labeled graphs, but this runs into a key technical difficulty: a sum of (label-filtered adjacency) matrices in the exponent cannot be separated unless those matrices commute, that is, generally e^{A+B} ≠ e^A e^B unless AB = BA.)
2. Thus Gärtner et al. (2003) give a time complexity cubic in the O(n²) size of the product graph.
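A sketch of the spectral approach for the geometric kernel between unlabeled graphs follows (33) and (36): decompose each adjacency matrix once, then combine eigenvalues and projected start/stop vectors pairwise. This is our own illustration; it assumes diagonalizable adjacency matrices and uniform p and q.

```python
import numpy as np

def precompute(adjacency_list):
    # One spectral decomposition per graph: A_i = P_i D_i P_i^{-1}.
    decomps = []
    for A in adjacency_list:
        n = A.shape[0]
        evals, P = np.linalg.eig(A)
        p = np.full(n, 1.0 / n)                 # uniform start/stop probabilities
        decomps.append((evals, p @ P, np.linalg.solve(P, p)))
    return decomps

def spectral_geometric_kernel(di, dj, lam):
    # Equation (36) with mu(k) = lam^k: the central power series collapses to
    # 1 / (1 - lam * eigenvalue products), cf. the diagonal inverse in (33).
    ev_i, qP_i, Pinv_p_i = di
    ev_j, qP_j, Pinv_p_j = dj
    ev = np.kron(ev_i, ev_j)                    # eigenvalues of A_i kron A_j
    left = np.kron(qP_i, qP_j)                  # (q_i^T P_i) kron (q_j^T P_j)
    right = np.kron(Pinv_p_i, Pinv_p_j)         # (P_i^{-1} p_i) kron (P_j^{-1} p_j)
    return np.real(left @ (right / (1.0 - lam * ev)))
```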
4.5 Nearest Kronecker Product Approximation
As we have seen above, some of our fast methods for computing random walk graph kernels may become computationally expensive, or not even be available, for labeled graphs, in particular when the number d of distinct labels is large or a general edge kernel is employed. In such cases we can find the nearest Kronecker product to W_×, that is, matrices S and T minimizing ||W_× − S ⊗ T||, then use any of our methods on S ⊗ T as if it were the adjacency matrix of a direct product of unlabeled graphs.
Finding the nearest Kronecker product approximating a matrix such as W_× is a well-studied problem in numerical linear algebra, and efficient algorithms which can exploit the sparsity of W_× are available (Pitsianis, 1992; Van Loan, 2000). Formally, these methods minimize the Frobenius norm ||W_× − S ⊗ T||_F by computing the largest singular value (and corresponding singular vectors) of W̃_×, a permuted version of W_×. We employ the power method³ for this purpose, each iteration of which entails computing the matrix-vector product W̃_× vec(T'), where T' ∈ R^{n×n} is the current approximation of T. The result of the matrix-vector product is then reshaped into an n×n matrix to form the next approximation of T. Computing W̃_× vec(T') naively requires O(n^4) time.
If W_× is a sum of d Kronecker products as in (7), then W̃_× is a corresponding sum of d rank-one matrices (Pitsianis, 1992; Van Loan, 2000), and the cost per iteration hence drops to O(dn²). Furthermore, if the two graphs are sparse with O(n) edges each, then W_× is sparse as well, and this can be exploited to further speed up each iteration. The number of iterations required is

    k = O( ln n / (ln|ξ_1| − ln|ξ_2|) ),    (37)
where ξ_1 and ξ_2 are the eigenvalues of W̃_× with largest resp. second-largest magnitude. In our experiments the nearest Kronecker product provided a reasonable approximation of the weight matrix of labeled graphs: the angle between vec(W_×) and its unlabeled variant was 2.2 resp. 4.7 times greater than that between vec(W_×) and vec(S ⊗ T).
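The sketch below is our own direct rendition of the Pitsianis/Van Loan construction: it forms the permuted matrix explicitly and runs plain power iterations on it, so it only illustrates the idea rather than the fast implementation described above.

```python
import numpy as np

def rearrange(W, n, n2):
    # Van Loan's rearrangement: row (j*n + i) holds vec of the (i, j) block of W,
    # so that ||W - S kron T||_F = ||R - vec(S) vec(T)^T||_F.
    R = np.empty((n * n, n2 * n2))
    for i in range(n):
        for j in range(n):
            block = W[i * n2:(i + 1) * n2, j * n2:(j + 1) * n2]
            R[j * n + i] = block.reshape(-1, order="F")
    return R

def nearest_kronecker(W, n, n2, iters=50):
    # Rank-one approximation of R(W) via power iterations gives vec(S), vec(T).
    R = rearrange(W, n, n2)
    t = np.random.default_rng(0).standard_normal(n2 * n2)
    for _ in range(iters):
        s = R @ t
        s /= np.linalg.norm(s)
        t = R.T @ s
    sigma = np.linalg.norm(t)
    t /= sigma
    S = (sigma * s).reshape((n, n), order="F")
    T = t.reshape((n2, n2), order="F")
    return S, T
```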
5. Experiments
Numerous studies have applied random walk graph kernels to problems such as protein function
prediction (Borgwardt et al., 2005) and chemoinformatics (Kashima et al., 2004). In our experi-
ments we therefore focus on the runtime of computing the kernels, rather than their utility in any
given application. We present three sets of experiments: First, we study the scaling behaviour of our
algorithms on unlabeled random graphs. Second, we assess the practical impact of our algorithmic
improvement on four real-world data sets whose size mandates fast kernel computation. Third, we
3. Lanczos iterations are typically faster but more difficult to handle numerically.
Figure 3: Time to compute a 10×10 kernel matrix on random graphs with n nodes and 3n edges as a function of the graph size n. Left: The Sylvester equation (Sylv.), conjugate gradient (CG), fixed-point iteration (FP), and spectral decomposition (spec.) approaches, compared to the dense and sparse direct method. Thin straight lines indicate O(n^6) (black dots) resp. O(n^3) (red dashes) scaling. Right: Kashima et al.'s (2004) fixed-point iteration (original) compared to our version, which exploits Lemma 12 (vec-trick).
devise novel methods for protein-protein interaction (PPI) network comparison using graph kernels. The algorithmic challenge here is to efficiently compute kernels on large sparse graphs.
The baseline for comparison in all our experiments is the direct approach of Gärtner et al. (2003), implemented via a sparse LU factorization; this already runs orders of magnitude faster on our data sets than a dense (i.e., non-sparse) implementation. Our code was written in Matlab Release 2008a, and all experiments were run under Mac OS X 10.5.5 on an Apple Mac Pro with a 3.0 GHz Intel 8-Core processor and 16 GB of main memory. We employed Lemma 12 to speed up matrix-vector multiplication for both CG and fixed-point methods (cf. Section 4.2), and used the function dlyap from Matlab's control toolbox to solve the Sylvester equation. By default, we used a value of λ = 10^{-4}, and set the convergence tolerance for both CG solver and fixed-point iteration to 10^{-6}. For the real-world data sets, the value of λ was chosen to ensure that the random walk graph kernel converges. Since our methods are exact and produce the same kernel values (to numerical precision), we only report the CPU time of each algorithm.
5.1 Unlabeled Random Graphs
The aim here is to study the scaling behaviour of our algorithms as a function of graph size and sparsity. We generated several sets of unlabeled random graphs. For the first set we began with an empty graph of n = 2^k nodes, where k = 2, 3, ..., 10, randomly added 3n edges, then checked the graph's connectivity. For each k we repeated this process until we had collected 10 strongly connected random graphs.
The time required to compute the 10×10 kernel matrix between these graphs for each value of n is shown in Figure 3 (left). We see that the direct approach scales asymptotically as O(n^6) in both the dense and the sparse implementation. For a graph of 64 nodes the direct approach already takes over half an hour (sparse) resp. 3 hours (dense) of CPU time. Our Sylvester equation (Sylv.),
Figure 4: Time to compute a 10×10 kernel matrix on random graphs as a function of their fill factor. Left: The dense and sparse direct method on 32-node graphs, compared to our Sylvester equation (Sylv.), conjugate gradient (CG), fixed-point iteration (FP), and spectral decomposition (spec.) approaches. Right: Our approaches on larger graphs with 256 nodes, where the direct method is infeasible.

conjugate gradient (CG) and fixed-point iteration (FP) methods, by contrast, all scale as O(n^3), and can thus be applied to far larger graphs. Our spectral decomposition approach (spec.) is the fastest method here; it too scales as O(n^3) as n asymptotically dominates over the fixed kernel matrix size m = 10.
We also examined the impact of Lemma 12 on enhancing the runtime performance of the fixed-point iteration approach as originally proposed by Kashima et al. (2004). For this experiment, we again computed the 10×10 kernel matrix on the above random graphs, once using the original fixed-point iteration, and once using fixed-point iteration enhanced by Lemma 12. As Figure 3 (right) shows, our approach consistently outperforms the original version, sometimes by over an order of magnitude.
For the next set of experiments we fixed the graph size at 32 nodes (the largest size that the direct method could handle comfortably), and randomly added edges until the fill factor (i.e., the fraction of non-zero entries in the adjacency matrix) reached x%, where x = 5, 10, 20, 30, ..., 100. For each x, we generated 10 such graphs and computed the 10×10 kernel matrix between them. Figure 4 (left) shows that as expected, the sparse direct method is faster than its dense counterpart for small fill factors but slower for larger ones. Both however are consistently outperformed by our four methods, which are up to three orders of magnitude faster, with fixed-point iterations (FP) and spectral decompositions (spec.) the most efficient.
To better understand how our algorithms take advantage of sparsity, we generated a set of larger random graphs (with 256 nodes) by the same procedure as before, but with a geometric progression of fill factors: x = 0.1, 0.2, 0.5, 1, 2, 5, 10, 20, 50, 100. The direct methods are infeasible here. The CPU times taken by our algorithms to compute a 10×10 kernel matrix are shown in Figure 4 (right). Both conjugate gradient and fixed-point iteration methods have runtimes roughly proportional to the fill factor. The runtime of the Sylvester equation solver, by contrast, is largely independent of the fill factor because our black-box dlyap solver does not aggressively exploit sparsity in the adjacency matrices. The same holds for our spectral decomposition approach, which however exhibits impressive performance here: although it does not exploit sparsity at all, it is the fastest method by far on all but the sparsest (≤ 2% filled) graphs. Clearly, well-designed sparse implementations of Sylvester equation solvers, and in particular spectral decomposition, could facilitate substantial further gains in efficiency here.
5.2 Real-World Data Sets
Our next set of experiments used four real-world data sets: Two sets of molecular compounds
(MUTAG and PTC), and two data sets describing protein tertiary structure (Protein and Enzyme).
Graph kernels provide useful measures of similarity for all of these.
5.2.1 THE DATA SETS
We now briefly describe each data set, and discuss how graph kernels are applicable.
Chemical Molecules. Toxicity of chemical molecules can be predicted to some degree by compar-
ing their three-dimensional structure. We employed graph kernels to measure similarity between
molecules from the MUTAG and PTC data sets (Toivonen et al., 2003). The average number of
nodes per graph in these data sets is 17.72 resp. 26.70; the average number of edges is 38.76 resp.
52.06.
Protein Graphs. A standard approach to protein function prediction involves classifying proteins
into enzymes and non-enzymes, then further assigning enzymes to one of the six top-level classes
of the Enzyme Commission (EC) hierarchy. Towards this end, Borgwardt et al. (2005) modeled a
data set of 1128 proteins as graphs in which vertices represent secondary structure elements, and
edges represent neighborhood within the 3-D structure or along the amino acid chain, as illustrated
in Figure 1.
Comparing these graphs via a modified random walk graph kernel and classifying them with a Support Vector Machine (SVM) led to function prediction accuracies competitive with state-of-the-art approaches (Borgwardt et al., 2005). We used Borgwardt et al.'s (2005) data to test the efficacy of our methods on a large data set. The average number of nodes and edges per graph in this data is 38.57 resp. 143.75. We used a single label on the edges, and the delta kernel to define similarity between edges.
Enzyme Graphs. We repeated the above experiment on an enzyme graph data set, also due to
Borgwardt et al. (2005). This data set contains 600 graphs, with 32.63 nodes and 124.27 edges on
average. Graphs in this data set represent enzymes from the BRENDA enzyme database
(Schomburg et al., 2004). The biological challenge on this data is to correctly assign the enzymes
to one of the EC top-level classes.
5.2.2 UNLABELED GRAPHS
For this experiment, we computed kernels taking into account only the topology of the graph, that is, we did not consider node or edge labels. Table 2 lists the CPU time required to compute the full kernel matrix for each data set, as well as (for comparison purposes) a 100×100 submatrix. The latter is also shown graphically in Figure 5 (left).
Table 2: CPU time to compute the kernel matrix on unlabeled graphs (100×100 submatrix resp. full matrix; *: extrapolated).

data set       MUTAG            PTC              Enzyme            Protein
nodes/graph    17.7             26.7             32.6              38.6
edges/node     2.2              1.9              3.8               3.7
#graphs        100 | 230        100 | 417        100 | 600         100 | 1128
Sparse         31" | 1'45"      45" | 7'23"      1'52" | 1h21'     23'23" | 2.1d*
Sylvester      10" | 54"        28" | 7'33"      31" | 23'28"      5'25" | 11h29'
Conj. Grad.    23" | 1'29"      26" | 4'29"      14" | 10'00"      45" | 39'39"
Fixed-Point     8" | 43"        15" | 2'38"       5" | 5'44"       43" | 22'09"
Spectral        5" | 27"         7" | 1'54"       7" | 4'32"       27" | 23'52"
5.2.3 LABELED GRAPHS
For this experiment we compared graphs with edge labels. On the two protein data sets we used a linear kernel to measure similarity between edge weights representing distances (in Ångströms) between secondary structure elements; since d = 1 we can use all our methods for unlabeled graphs here. On the two chemical data sets we used a delta kernel to compare edge labels reflecting types of bonds in molecules; for the Sylvester equation and spectral decomposition approaches we then employed the nearest Kronecker product approximation. We report CPU times for the full kernel matrix as well as a 100×100 submatrix in Table 3; the latter is also shown graphically in Figure 5 (right).
On labeled graphs, the conjugate gradient and the fixed-point iteration always outperform the sparse direct approach, more so on the larger graphs and with the linear kernel. As expected, spectral decompositions are inefficient in combination with the nearest Kronecker product approximation.
Table 3: CPU time to compute the kernel matrix on labeled graphs (100×100 submatrix resp. full matrix; *: extrapolated).

kernel         delta, d = 7     delta, d = 22    linear, d = 1     linear, d = 1
data set       MUTAG            PTC              Enzyme            Protein
#graphs        100 | 230        100 | 417        100 | 600         100 | 1128
Sparse         42" | 2'44"      1'07" | 14'22"   1'25" | 57'43"    12'38" | 1.1d*
Sylvester      1'08" | 6'05"    1'06" | 18'20"   2'13" | 76'43"    19'20" | 11h19'
Conj. Grad.    39" | 3'16"      53" | 14'19"     20" | 13'20"      41" | 57'35"
Fixed-Point    25" | 2'17"      37" | 7'55"      10" | 6'46"       25" | 31'09"
Spectral       1'20" | 7'08"    1'40" | 26'54"    8" | 4'22"       26" | 21'23"
To measure not only shared but also missing interactions between two networks, we define a composite graph kernel that additionally compares the complement graphs Ḡ and Ḡ' (which contain exactly those edges not present in G resp. G'):

    k_comp(G, G') := k(G, G') + k(Ḡ, Ḡ').    (38)

This deceptively simple kernel leads to substantial gains in performance in our experiments comparing co-integrated gene expression/protein-protein interaction networks.
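Under the reading of (38) above, with Ḡ the complement graph, the composite kernel only needs one extra kernel evaluation per pair. A minimal sketch (our own illustration; `rw_kernel` stands for any random walk kernel such as the geometric kernel of (16)):

```python
import numpy as np

def complement(A):
    # Adjacency matrix of the complement graph: an edge is present iff it is
    # absent in G (self-loops remain excluded).
    n = A.shape[0]
    return (1 - np.eye(n)) * (A == 0)

def composite_kernel(A1, A2, rw_kernel):
    # Equation (38): k_comp(G, G') = k(G, G') + k(G-bar, G'-bar).
    return rw_kernel(A1, A2) + rw_kernel(complement(A1), complement(A2))
```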
5.3.3 DATA SETS
Leukemia. Bullinger et al. (2004) provide a data set of microarrays of 119 leukemia patients. Since
50 patients survived after a median follow-up time of 334 days, always predicting a lethal outcome
here would result in a baseline prediction accuracy of 1 - 50/119 = 58.0%. Co-integrating this data
with human PPI, we found 2,167 proteins from Rual et al. (2005) for which Bullinger et al. (2004)
report expression levels among the 26,260 genes they examined.
Breast Cancer. This data set consists of microarrays of 78 breast cancer patients, of which 44
had shown no relapse of metastases within 5 years after initial treatment (van't Veer et al., 2002).
Always predicting survival thus gives a baseline prediction accuracy of 44/78 = 56.4% on this data.
When generating co-integrated graphs, we found 2,429 proteins from Rual et al. (2005) for which van't Veer et al. (2002) measure gene expression out of the 24,479 genes they studied.
5.3.4 RESULTS
In Table 4 we contrast the CPU runtimes of our conjugate gradient and fixed-point approaches to graph kernel computation on the cancer patients modeled as labeled graphs with that of the direct sparse method. On both data sets, our fast graph kernel computation methods yield an impressive gain in speed.
Using either the vanilla graph kernel (16) or our composite graph kernel (38), we predict the
survivors by means of a support vector machine (SVM) in 10-fold cross-validation. The vanilla
random walk graph kernel offers slightly higher prediction accuracy than the baseline classier on
one task (Leukemia: 59.2 % vs 58.0 %), and gives identical results on the other (Breast Cancer: both
56.4 %). Our composite graph kernel attains 5 percentage points above baseline in both experiments
(Leukemia: 63.3 %; Breast cancer: 61.5 %).
The vanilla kernel suffers from its inability to measure network discrepancies, the paucity of the
graph model employed, and the fact that only a small minority of genes could be mapped to inter-
acting proteins; due to these problems, its accuracy remains close to the baseline. The composite
kernel, by contrast, also models missing interactions. With it, even our simple graph model, which
only considers 10% of the genes examined in both studies, is able to capture some relevant biolog-
ical information, which in turn leads to better classication accuracy on these challenging data sets
(Warnat et al., 2005).
6. Rational Kernels
Rational kernels (Cortes et al., 2004) were conceived to compute similarity between variable-length
sequences and, more generally, weighted automata. For instance, the output of a large-vocabulary
speech recognizer for a particular input speech utterance is typically a weighted automaton com-
pactly representing a large set of alternative sequences. The weights assigned by the system to each
sequence are used to rank different alternatives according to the models the system is based on. It
is therefore natural to compare two weighted automata by defining a kernel.
As discussed in Section 3, random walk graph kernels have a very different basis: They compute
the similarity between two random graphs by matching random walks. Here the graph itself is the
object to be compared, and we want to find a semantically meaningful kernel. Contrast this with a
weighted automaton, whose graph is merely a compact representation of the set of variable-length
sequences which we wish to compare. Despite these differences we nd rational kernels and random
walk graph kernels to be closely related.
To understand the connection recall that every random walk on a labeled graph produces a se-
quence of edge labels encountered during the walk. Viewing the set of all label sequences generated
by random walks on a graph as a language, one can design a weighted transducer which accepts this
language, with the weight assigned to each label sequence being the probability of a random walk
generating this sequence. (This transducer can be represented by a graph whose adjacency matrix
is the normalized weight matrix of the original graph.)
In this section we formalize this observation and thus establish connections between rational
kernels on transducers (Cortes et al., 2004) and random walk graph kernels. In particular, we show
that composition of transducers is analogous to computing product graphs, and that rational ker-
nels on weighted transducers may be viewed as generalizations of random walk graph kernels to
weighted automata. In order to make these connections explicit we adopt notation commonly used
for describing algebraic path problems, wherein disparate problems related to graphs, automata,
and transducers are described in a common framework using matrices and tensors (Eilenberg, 1974;
Lehmann, 1977; Berstel, 1979; Kuich and Salomaa, 1986).
6.1 Semirings
At the most general level, weighted transducers are defined over semirings. In a semiring, addition and multiplication are generalized to abstract operations ⊕ and ⊙ with the same distributive properties:

Definition 6 (Mohri, 2002) A semiring is a system (K, ⊕, ⊙, 0̄, 1̄) such that
1. (K, ⊕, 0̄) is a commutative monoid in which 0̄ is the identity element for ⊕ (i.e., for any x, y, z ∈ K, we have x ⊕ y ∈ K, (x ⊕ y) ⊕ z = x ⊕ (y ⊕ z), x ⊕ 0̄ = 0̄ ⊕ x = x and x ⊕ y = y ⊕ x);
2. (K, ⊙, 1̄) is a monoid in which 1̄ is the identity operator for ⊙ (i.e., for any x, y, z ∈ K, we have x ⊙ y ∈ K, (x ⊙ y) ⊙ z = x ⊙ (y ⊙ z), and x ⊙ 1̄ = 1̄ ⊙ x = x);
3. ⊙ distributes over ⊕, that is, for any x, y, z ∈ K,
   (x ⊕ y) ⊙ z = (x ⊙ z) ⊕ (y ⊙ z)  and  z ⊙ (x ⊕ y) = (z ⊙ x) ⊕ (z ⊙ y);
4. 0̄ is an annihilator for ⊙: ∀x ∈ K, x ⊙ 0̄ = 0̄ ⊙ x = 0̄.
Thus, a semiring is a ring that may lack negation. (R, +, ·, 0, 1) is the familiar semiring of real numbers. Other examples include
Boolean: ({FALSE, TRUE}, ∨, ∧, FALSE, TRUE);
Logarithmic: (R ∪ {−∞}, ⊕_ln, +, −∞, 0), where ∀x, y ∈ K: x ⊕_ln y := ln(e^x + e^y);
Tropical: (R ∪ {−∞}, max, +, −∞, 0).
Linear algebra operations such as matrix addition and multiplication as well as Kronecker products can be carried over to a semiring in a straightforward manner. For instance, for M, M' ∈ K^{n×n} we have

    [M ⊙ M']_ij = ⊕_{k=1}^{n} M_ik ⊙ M'_kj.    (39)
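To make (39) concrete, the sketch below (our own illustration, not from the paper) implements the semiring matrix product for the tropical semiring (max, +), where it computes maximum-weight paths.

```python
import numpy as np

def tropical_matmul(M1, M2):
    # Semiring matrix product (39) in the tropical semiring (max, +, -inf, 0):
    # [M1 (.) M2]_{ij} = max_k ( M1_{ik} + M2_{kj} ).
    n, m = M1.shape[0], M2.shape[1]
    out = np.full((n, m), -np.inf)
    for k in range(M1.shape[1]):
        out = np.maximum(out, M1[:, [k]] + M2[[k], :])
    return out
```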
The (⊕, ⊙) operations in some semirings can be mapped into ordinary (+, ·) operations by applying an appropriate morphism:

Definition 7 Let (K, ⊕, ⊙, 0̄, 1̄) be a semiring. A function ψ : K → R is a morphism if ψ(x ⊕ y) = ψ(x) + ψ(y), ψ(x ⊙ y) = ψ(x) ψ(y), ψ(0̄) = 0 and ψ(1̄) = 1.
In the following, by morphism we will always mean a morphism from a semiring to the real numbers. Not all semirings have such morphisms: For instance, the logarithmic semiring has a morphism (namely, the exponential function), but the tropical semiring does not have one. If the semiring has a morphism ψ, applying it to the matrix product (39), for instance, yields

    ψ([M ⊙ M']_ij) = ψ( ⊕_{k=1}^{n} M_ik ⊙ M'_kj ) = ∑_{k=1}^{n} ψ(M_ik ⊙ M'_kj) = ∑_{k=1}^{n} ψ(M_ik) ψ(M'_kj).    (40)

As in Appendix A, we can extend the morphism ψ to matrices (and analogously to vectors) by defining [ψ(M)]_ij := ψ(M_ij). We can then write (40) concisely as

    ψ(M ⊙ M') = ψ(M) ψ(M').    (41)
6.2 Weighted Transducers
Loosely speaking, a transducer is a weighted automaton with an input and an output alphabet. We will work with the following slightly specialized definition:⁴

Definition 8 A weighted finite-state transducer T over a semiring (K, ⊕, ⊙, 0̄, 1̄) is a 5-tuple T = (Σ, Q, H, p, q), where Σ is a finite input-output alphabet, Q is a finite set of n states, p ∈ K^n is a vector of initial weights, q ∈ K^n is a vector of final weights, and H is a four-dimensional tensor in K^{n×|Σ|×|Σ|×n} which encodes transitions and their corresponding weights.
For a, b ∈ Σ we will use the shorthand H_ab to denote the n×n slice of the transition tensor which represents all valid transitions on input symbol a emitting the output symbol b. The output weight assigned by T to a pair of strings α = a_1 a_2 ... a_l and β = b_1 b_2 ... b_l is

    [[T]](α, β) = q^⊤ ⊙ H_{a_1 b_1} ⊙ H_{a_2 b_2} ⊙ ... ⊙ H_{a_l b_l} ⊙ p.    (42)

A transducer is said to accept a pair of strings (α, β) if it assigns non-zero output weight to them, that is, [[T]](α, β) ≠ 0̄. A transducer is said to be regulated if the output weight it assigns to any pair of strings is well-defined in K. Since we disallow ε transitions, our transducers are always regulated.
The inverse of T = (Σ, Q, H, p, q), denoted by T^{-1}, is obtained by transposing the input and output labels of each transition. Formally, T^{-1} = (Σ, Q, H^⊤, p, q) where H^⊤_ab := H_ba. The composition of two transducers T = (Σ, Q, H, p, q) and T' = (Σ, Q', H', p', q') is a transducer T'' = T ∘ T' = (Σ, Q'', H'', p'', q''), where Q'' = Q × Q', p'' = p ⊗_⊙ p',⁵ q'' := q ⊗_⊙ q', and (H'')_ab = ⊕_c H_ac ⊗_⊙ H'_cb. It can be shown that

    [[T'']](α, β) = [[T ∘ T']](α, β) = ⊕_γ [[T]](α, γ) ⊙ [[T']](γ, β).    (43)

4. We disallow ε transitions, and use the same alphabet for both input and output. Furthermore, in a departure from tradition, we represent the transition function as a four-dimensional tensor.
5. We use ⊗_⊙ to denote the Kronecker product using the semiring operation ⊙, in order to distinguish it from the regular Kronecker product ⊗.
Composing T with its inverse yields T ∘ T^{-1} = (Σ, Q × Q, H'', p ⊗_⊙ p, q ⊗_⊙ q), where (H'')_ab = ⊕_c H_ac ⊗_⊙ H_bc. There exists a general and efficient algorithm for composing transducers as in (43) which takes advantage of the sparseness of the input transducers (Mohri et al., 1996; Pereira and Riley, 1997).
6.3 Weighted Automata
A weighted automaton is a transducer with identical input and output symbols. The transition matrix of a weighted automaton is therefore a three-dimensional tensor in K^{n×|Σ|×n}. As before, we will use the shorthand H_a to denote the n×n slice of the transition tensor which represents all valid transitions on the input symbol a emitting output symbol a. If Σ contains d symbols, then by specializing (42) it is easy to see that a weighted automaton accepts a string α = a_1 a_2 ... a_l with weight

    [[T]](α) = q^⊤ ⊙ H_{a_1} ⊙ H_{a_2} ⊙ ... ⊙ H_{a_l} ⊙ p.    (44)
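In the real semiring, the weight (44) that an automaton assigns to a string is just a chain of matrix-vector products. A minimal sketch (our own illustration; `H` is assumed to map each symbol to its n×n slice):

```python
import numpy as np

def string_weight(H, p, q, string):
    # [[T]](a_1 ... a_l) = q^T H_{a_1} H_{a_2} ... H_{a_l} p   (Equation (44))
    # over the standard (R, +, *) semiring.
    v = p
    for a in reversed(string):
        v = H[a] @ v        # apply transitions right to left, ending with H_{a_1}
    return q @ v
```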
The composition of two weighted automata T = (Σ, Q, H, p, q) and T' = (Σ, Q', H', p', q') is an automaton T'' = T ∘ T' = (Σ, Q'', H'', p'', q''), where Q'' = Q × Q', p'' = p ⊗_⊙ p', q'' := q ⊗_⊙ q', and (H'')_a = H_a ⊗_⊙ H'_a. The composition operation is also defined for a weighted automaton W and a transducer T:

    [[W ∘ T]](α, β) = [[W]](α) ⊙ [[T]](α, β).    (45)
Every random walk on a labeled graph results in a sequence of edge labels encountered during the walk. The set of all label sequences generated by random walks on a given graph is a language. One can construct a weighted automaton which accepts this language as follows: Use the standard semiring (R, +, ·, 0, 1), let the alphabet Σ consist of the labels {1, ..., d} of the graph, and identify the nodes of the graph with the states of the weighted automaton. Let the starting and stopping probabilities p and q on the graph equal those of the weighted automaton, and complete the construction by identifying for each l ∈ Σ the label-filtered adjacency matrix ˡA of the graph with H_l, the transition tensor of the weighted automaton for that symbol.
Under the above mapping (44) has a natural interpretation: The weight assigned by the automaton to a string of symbols is the probability of encountering the corresponding labels while performing a random walk on the corresponding labeled graph. The composition of weighted automata, when specialized to labeled graphs, is equivalent to computing a direct product graph.
An unlabeled graph corresponds to a weighted automaton whose input-output alphabet contains exactly one symbol, and which therefore only accepts strings of the form a^k = aa...a. The transition matrix of such a graph (equivalently, its adjacency matrix) is a 2-dimensional tensor in K^{n×n}. If A denotes the adjacency matrix of a graph G, then the output weight assigned by G to a^k is [[G]](a^k) = q^⊤ A A ... A p = q^⊤ A^k p.
6.4 The Rational Kernel for Strings
Given a weighted transducer T and a function ψ : K → R, the rational kernel between two strings α = a_1 a_2 ... a_l and β = b_1 b_2 ... b_l is defined as (Cortes et al., 2004):

    k(α, β) := ψ([[T]](α, β)).    (46)
Cortes et al. (2004) show that a generic way to obtain p.s.d. rational kernels is to replace T in (46) by T ∘ T^{-1}, and let ψ be a semiring morphism. We now present an alternate proof which uses properties of the Kronecker product. Since ψ is a semiring morphism, by specializing (42) to T ∘ T^{-1}, we can write k(α, β) = ψ([[T ∘ T^{-1}]](α, β)) as

    ψ( (q ⊗_⊙ q)^⊤ ⊙ ( ⊕_{c_1} H_{a_1 c_1} ⊗_⊙ H_{b_1 c_1} ) ⊙ ... ⊙ ( ⊕_{c_l} H_{a_l c_l} ⊗_⊙ H_{b_l c_l} ) ⊙ (p ⊗_⊙ p) ).    (47)
Rules analogous to (41) give us

    ψ( ⊕_c H_ac ⊗_⊙ H_bc ) = ∑_c ψ(H_ac) ⊗ ψ(H_bc).    (48)
Using (48) we can rewrite (47) as

    ∑_{c_1, c_2, ..., c_l} (ψ(q)^⊤ ⊗ ψ(q)^⊤) (ψ(H_{a_1 c_1}) ⊗ ψ(H_{b_1 c_1})) ... (ψ(H_{a_l c_l}) ⊗ ψ(H_{b_l c_l})) (ψ(p) ⊗ ψ(p)).    (49)
Finally, successively applying (2) to (49) yields

    k(α, β) = ∑_{c_1 c_2 ... c_l} [ψ(q)^⊤ ψ(H_{a_1 c_1}) ... ψ(H_{a_l c_l}) ψ(p)] [ψ(q)^⊤ ψ(H_{b_1 c_1}) ... ψ(H_{b_l c_l}) ψ(p)],    (50)

where the first bracketed factor depends only on α and the second only on β. Each term of (50) thus equals ρ(α) ρ(β) for some scalar function ρ, and is therefore a valid p.s.d. kernel. Since p.s.d. kernels are closed under addition and pointwise limits (Berg et al., 1984), k(α, β) is a valid p.s.d. kernel.
6.5 The Rational Kernel for Weighted Automata
Rational kernels on strings can be naturally extended to weighted automata S and U via (Cortes et al., 2004):

    k(S, U) = ψ( ⊕_{α,β} [[S]](α) ⊙ [[T]](α, β) ⊙ [[U]](β) ) = ψ( ⊕_{α,β} [[S ∘ T ∘ U]](α, β) ),    (51)

where we obtained (51) by using (45) twice. If ψ is a semiring morphism, then we can use Definition 7 to rewrite (51) as

    k(S, U) = ∑_{α,β} ψ([[S ∘ T ∘ U]](α, β)).    (52)

Since p.s.d. kernels are closed under addition and pointwise limits, if ψ([[S ∘ T ∘ U]](α, β)) is a p.s.d. kernel for any given α and β, then so is (52).
6.6 Recovering Random Walk Graph Kernels
In order to recover random walk graph kernels we use the standard (R, +, ·, 0, 1) ring as our semiring, and hence set ψ to be the identity function. Next we set the transducer T to simply transform any input string of length k into an identical output string with weight μ(k) ≥ 0. With these restrictions (52) can be written as

    k(S, U) = ∑_α μ(|α|) [[S ∘ U]](α),    (53)

where |α| denotes the length of α. Let us rearrange (53) to

    k(S, U) = ∑_k μ(k) ( ∑_{a_1, a_2, ..., a_k} [[S ∘ U]](a_1 a_2 ... a_k) ).    (54)
Specializing the definition of ∘ to weighted automata, and letting H_a (resp. H'_a) denote the transition tensor of S (resp. U), we can rewrite (54) as

    k(S, U) = ∑_k μ(k) ∑_{a_1, ..., a_k ∈ Σ^k} (q ⊗ q')^⊤ (H_{a_1} ⊗ H'_{a_1}) ... (H_{a_k} ⊗ H'_{a_k}) (p ⊗ p')
            = ∑_k μ(k) (q ⊗ q')^⊤ ( ∑_{a_1, ..., a_k ∈ Σ^k} (H_{a_1} ⊗ H'_{a_1}) ... (H_{a_k} ⊗ H'_{a_k}) ) (p ⊗ p')
            = ∑_k μ(k) (q ⊗ q')^⊤ ( ∑_{a ∈ Σ} H_a ⊗ H'_a ) ... ( ∑_{a ∈ Σ} H_a ⊗ H'_a ) (p ⊗ p')
            = ∑_k μ(k) (q ⊗ q')^⊤ ( ∑_{a ∈ Σ} H_a ⊗ H'_a )^k (p ⊗ p').    (55)
Next, we identify H_a (resp. H'_a) with the label-filtered adjacency matrix ᵃA (resp. ᵃA') of a graph G (resp. G'). It then follows from (7) that H_× := ∑_a H_a ⊗ H'_a is the weight matrix of the direct product of G and G'. Letting p_× = p ⊗ p' and q_× = q ⊗ q', (55) reduces to

    k(G, G') = ∑_k μ(k) q_×^⊤ H_×^k p_×,    (56)

which recovers the random walk graph kernel (8) with W_× = H_×.
The generality of the rational kernel comes at a computational cost: Even when restricted as in (53), it requires the composition S ∘ U of two transducers, which takes up to O((|Q_S| + |E_S|)(|Q_U| + |E_U|)) time, where |Q| is the number of states and |E| the number of transitions (Cortes et al., 2004, Section 4.1). In our setting |Q| = n, the number of nodes in the graph, and |E| is the number of its edges, which can be of O(n²); the worst-case time complexity of the composition operation is therefore O(n^4). Cortes et al. (2004) showed that the sum in (53) can be computed via a single-source shortest distance algorithm over a semiring. If S ∘ U is acyclic, then this is linear in the size of S ∘ U, which in the worst case is O(n²). In general, however, an all-pairs shortest-distance algorithm must be employed, such as the generalization of the Floyd-Warshall algorithm due to Lehmann (1977). This algorithm is cubic in the size of S ∘ U, thus leading to an O(n^6) worst-case time complexity. Since computing S ∘ U is the same as computing W_×, and the Lehmann (1977) algorithm in effect computes (I − W_×)^{-1} (the transitive closure of W_×), this corresponds to the direct approach to computing random walk graph kernels.
There is one important difference between graph kernels and rational kernels. Graph kernels
can handle arbitrary edge kernels, including continuous edge labels, via the weight matrix W_×. In contrast, rational kernels, which were designed to work with strings and automata, assume that the alphabet (set of labels) is finite. As we saw above, they can incorporate flexible similarity matrices W_× in this setting, but cannot handle continuous edge labels. Furthermore, to date rational kernels
in this setting, but cannot handle continuous edge labels. Furthermore, to date rational kernels
have not been extended to deal with labels mapped to an RKHS.
7. R-convolution Kernels
Haussler's (1999) R-convolution kernels provide a generic way to construct kernels for discrete compound objects. Let x ∈ X be such an object, and x⃗ := (x_1, x_2, ..., x_D) denote a decomposition of x, with each x_i ∈ X_i. We can define a boolean predicate

    R : X × X⃗ → {TRUE, FALSE},    (57)

where X⃗ := X_1 × ... × X_D and R(x, x⃗) is TRUE whenever x⃗ is a valid decomposition of x. Now consider the inverse of (57), the set of all valid decompositions of an object:

    R^{-1}(x) := {x⃗ | R(x, x⃗) = TRUE}.    (58)
Like Haussler (1999) we assume that (58) is countable. We define the R-convolution \star of the kernels \kappa_1, \kappa_2, \ldots, \kappa_D, with \kappa_i : X_i \times X_i \to R, to be

k(x, x') = \kappa_1 \star \kappa_2 \star \ldots \star \kappa_D (x, x') := \sum_{\vec{x} \in R^{-1}(x)} \; \sum_{\vec{x}' \in R^{-1}(x')} \mu(\vec{x}, \vec{x}') \prod_{i=1}^{D} \kappa_i(x_i, x'_i),                                   (59)

where \mu denotes a set of non-negative coefficients on \vec{X} \times \vec{X} which ensures that the sum in (59) converges.
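As a minimal illustration of the structure of (59) (a toy example of ours, not taken from the original text), the sketch below implements the simplest case D = 1, where every valid decomposition of an object is a single part; the R-convolution then reduces to a sum of a base kernel over all pairs of parts.

```python
from itertools import product

def r_convolution_kernel(parts_x, parts_xp, base_kernel, mu=lambda a, b: 1.0):
    """R-convolution kernel (59) with D = 1: sum the base kernel, weighted by mu,
    over all pairs of parts of the two objects."""
    return sum(mu(a, b) * base_kernel(a, b) for a, b in product(parts_x, parts_xp))

# toy example: strings decomposed into characters, delta kernel on characters
delta = lambda a, b: 1.0 if a == b else 0.0
print(r_convolution_kernel("banana", "ananas", delta))  # counts matching character pairs
```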
Haussler (1999) showed that k(x, x') is p.s.d. and hence admissible as a kernel, provided that all the individual \kappa_i are.

7.1 Graph Kernels as R-Convolutions

Gaertner et al. (2003) observed that a graph kernel with an injective feature map could be used to test two graphs G and G' for isomorphism via the induced distance d(G, G'), where d(G, G')^2 = k(G, G) - 2 k(G, G') + k(G', G'); since by definition any structural difference between the graphs would yield a non-zero d(G, G'), the graphs are isomorphic if and only if d(G, G') = 0. Computing such a kernel is therefore at least as hard as deciding graph isomorphism.

Our random walk graph kernel is itself an R-convolution kernel. Recall that the weight matrix (7) of the direct product graph satisfies W_× = \Phi(X) \otimes \Phi(X'), so that

[W_\times]_{(i-1)n'+r,\,(j-1)n'+s} = [\Phi(X) \otimes \Phi(X')]_{(i-1)n'+r,\,(j-1)n'+s} = \langle \phi(v_i, v_j), \phi(v'_r, v'_s) \rangle_H =: \kappa((v_i, v_j), (v'_r, v'_s)),

where \kappa is our edge kernel. We can thus expand (8) by explicitly taking all paths through the repeated matrix products, giving
k(G, G') := \sum_{k=1}^{\infty} \mu(k)\, q_\times^\top W_\times^{k}\, p_\times = \sum_{k=1}^{\infty} \mu(k)\, q_\times^\top \Bigl( \prod_{i=1}^{k} W_\times \Bigr) p_\times

          = \sum_{k=1}^{\infty} \mu(k) \sum_{v_0, v_1, \ldots, v_k \in V} \; \sum_{v'_0, v'_1, \ldots, v'_k \in V'} q_{v_k}\, q'_{v'_k} \Bigl[ \prod_{i=1}^{k} \kappa((v_{i-1}, v_i), (v'_{i-1}, v'_i)) \Bigr] p_{v_0}\, p'_{v'_0}.                                   (60)
This is easily identified as an instance of the R-convolution kernel (59), where the decomposition is into all equal-length sequences of nodes \vec{v} and \vec{v}' from V and V', respectively, and

\mu(\vec{v}, \vec{v}') := \mu(|\vec{v}|)\, q_{v_{|\vec{v}|}}\, q'_{v'_{|\vec{v}|}}\, p_{v_0}\, p'_{v'_0},                                   (61)

where |\cdot| in (61) denotes the length of a sequence. Finally, note that by definition of our edge kernel \kappa, only pairs of sequences that are both actual walks on their respective graphs will make a non-zero contribution to (60).
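The agreement between the walk-pair expansion in (60) and the matrix form (8) can be spot-checked by brute force on tiny unlabeled graphs, where the edge kernel \kappa reduces to a product of adjacency entries. A sketch of such a check (ours), for a single fixed walk length:

```python
import numpy as np
from itertools import product

def walk_sum(A1, A2, p1, p2, q1, q2, k):
    """Explicit double sum over pairs of length-k node sequences, as in (60)."""
    n1, n2 = A1.shape[0], A2.shape[0]
    total = 0.0
    for v in product(range(n1), repeat=k + 1):      # v_0, ..., v_k in V
        for w in product(range(n2), repeat=k + 1):  # v'_0, ..., v'_k in V'
            term = q1[v[-1]] * q2[w[-1]] * p1[v[0]] * p2[w[0]]
            for i in range(1, k + 1):               # product of edge-kernel values
                term *= A1[v[i - 1], v[i]] * A2[w[i - 1], w[i]]
            total += term
    return total

A1 = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)   # triangle
A2 = np.array([[0, 1], [1, 0]], dtype=float)                    # single edge
p1 = q1 = np.full(3, 1 / 3); p2 = q2 = np.full(2, 1 / 2)
k = 3
W = np.kron(A1, A2)
matrix_form = np.kron(q1, q2) @ np.linalg.matrix_power(W, k) @ np.kron(p1, p2)
print(np.isclose(walk_sum(A1, A2, p1, p2, q1, q2, k), matrix_form))  # True
```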
Random walk graph kernels as proposed by Gaertner et al. (2003) likewise decompose a graph into random walks, but then employ a delta kernel between nodes. Borgwardt et al. (2005), on the other hand, use a kernel defined on both nodes and edges. The marginalized graph kernels of Kashima et al. (2004) are closely related but subtly different in that they decompose the graph into all possible label sequences generated by a walk. Mahe et al. (2004) extend this approach in two ways: They enrich the labels via the so-called Morgan index, and modify the kernel definition to prevent tottering, that is, the generation of high similarity scores by multiple, similar, small substructures. Both these extensions are particularly relevant for chemoinformatics applications.

Further afield, Horvath et al. (2004) decompose a graph into cyclic patterns, then count the number of common cyclic patterns which occur in both graphs. Their kernel is plagued by computational issues; in fact they show that computing the cyclic pattern kernel of a general graph is NP-hard. They consequently restrict their attention to practical problem classes where the number of simple cycles is bounded.
Ramon and Gaertner (2003) consider subtree patterns to define graph kernels. Starting from a given node v, a tree is created by adding all the nodes that can be reached from v in 1, ..., h steps, where h is the height of the tree. If more than one walk connects two nodes, then each one of these is used to define a distinct subtree. This means that the same node is counted several times, thus leading to tottering. Furthermore, the number of candidate trees grows exponentially with the height of the subtree under consideration, thus severely limiting the depth of graph structure one can probe at reasonable computational cost.
Borgwardt and Kriegel (2005) define a kernel (62) based on shortest paths. They represent a graph G = (V, E) by a complete graph S = (V, \bar{E}) over the same vertices, wherein the weight of each edge in \bar{E} equals the length of the shortest path between the corresponding nodes in G. Their shortest path kernel is then defined as

k_{sp}(G, G') = \sum_{e \in \bar{E}} \; \sum_{e' \in \bar{E}'} \kappa(e, e'),                                   (62)

where \kappa is any kernel defined on the edges of S and S'.
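A minimal sketch of (62), assuming unweighted graphs, dense adjacency matrices, and a Gaussian base kernel on shortest-path lengths; the Floyd-Warshall step and the particular choice of \kappa are ours, not prescribed by Borgwardt and Kriegel (2005).

```python
import numpy as np

def floyd_warshall(A):
    """All-pairs shortest-path lengths of an unweighted graph with adjacency matrix A."""
    n = A.shape[0]
    D = np.where(A > 0, 1.0, np.inf)
    np.fill_diagonal(D, 0.0)
    for k in range(n):
        D = np.minimum(D, D[:, k, None] + D[None, k, :])
    return D

def shortest_path_kernel(A1, A2, sigma=1.0):
    """k_sp(G, G'): sum of a Gaussian kernel over all pairs of edges of S and S',
    each edge weighted by a shortest-path length, as in (62)."""
    D1, D2 = floyd_warshall(A1), floyd_warshall(A2)
    d1 = D1[np.triu_indices_from(D1, k=1)]          # edge weights of S
    d2 = D2[np.triu_indices_from(D2, k=1)]          # edge weights of S'
    d1, d2 = d1[np.isfinite(d1)], d2[np.isfinite(d2)]
    diff = d1[:, None] - d2[None, :]
    return float(np.exp(-diff**2 / (2 * sigma**2)).sum())

A_triangle = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
A_path     = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
print(shortest_path_kernel(A_triangle, A_path))
```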
Shervashidze et al. (2009) use subgraphs of fixed size to define kernels. Their key idea is to represent the graph by a normalized frequency vector which counts the frequency of occurrence of various fixed-size subgraphs. The kernel is then simply computed as the dot product between these vectors.
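The sketch below is our simplified, brute-force illustration of this idea for 3-node subgraphs, whose isomorphism type is fully determined by their edge count; Shervashidze et al.'s actual method handles larger subgraph sizes and uses sampling to remain tractable.

```python
import numpy as np
from itertools import combinations

def size3_subgraph_histogram(A):
    """Normalized frequencies of the four 3-node subgraph types
    (0, 1, 2, or 3 edges) in the graph with adjacency matrix A."""
    counts = np.zeros(4)
    for i, j, k in combinations(range(A.shape[0]), 3):
        counts[int(A[i, j] + A[i, k] + A[j, k])] += 1
    return counts / counts.sum()

def subgraph_kernel(A1, A2):
    """Dot product of the normalized subgraph frequency vectors."""
    return float(size3_subgraph_histogram(A1) @ size3_subgraph_histogram(A2))

A_triangle = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
A_star     = np.array([[0, 1, 1, 1], [1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0]], dtype=float)
print(subgraph_kernel(A_triangle, A_star))
```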
Other decompositions of graphs which are well suited for particular application domains include molecular fingerprints based on various types of depth-first searches (Ralaivola et al., 2005) and structural elements such as rings or functional groups (Froehlich et al., 2006).
7.2 R-Convolutions in Abstract Semirings

There have been a few attempts to extend the R-convolution kernel (59) to abstract semirings, by defining:

k(x, x') := \bigoplus_{\vec{x} \in R^{-1}(x)} \; \bigoplus_{\vec{x}' \in R^{-1}(x')} \mu(\vec{x}, \vec{x}') \odot \bigodot_{i=1}^{D} \kappa_i(x_i, x'_i).                                   (63)
The optimal assignment graph kernel of Froehlich et al. (2006) is motivated along these lines, using the tropical semiring. It can be defined as

k(x, x') = \max_{\vec{x} \in R^{-1}(x),\; \vec{x}' \in R^{-1}(x')} \Bigl[ \mu(\vec{x}, \vec{x}') + \sum_{i=1}^{D} \kappa_i(x_i, x'_i) \Bigr].                                   (64)
Unfortunately (64) is not always p.s.d. (Vert, 2008). The problem is that the class of p.s.d. kernels
is not closed under the max operation (Berg et al., 1984).
For semirings that have a morphism \psi to the reals, however, we can rewrite (63) as

\psi(k(x, x')) = \sum_{\vec{x} \in R^{-1}(x)} \; \sum_{\vec{x}' \in R^{-1}(x')} \psi(\mu(\vec{x}, \vec{x}')) \prod_{i=1}^{D} \psi(\kappa_i(x_i, x'_i)).                                   (65)
Comparing (65) with (59) makes it clear that \psi \circ k is p.s.d. and hence admissible if all \psi \circ \kappa_i are. This can be used to construct p.s.d. R-convolution kernels in such semirings.
For instance, take the logarithmic semiring (R \cup \{-\infty\}, \oplus_{\ln}, +, -\infty, 0) augmented with an inverse temperature parameter \beta > 0, so that x \oplus_{\ln} y := \ln(e^{\beta x} + e^{\beta y})/\beta. This has the morphism \psi(x) = e^{\beta x}. We can thus specialize (65) to define

k(x, x') := \sum_{\vec{x} \in R^{-1}(x)} \; \sum_{\vec{x}' \in R^{-1}(x')} e^{\beta\, \nu(\vec{x}, \vec{x}')},  where  \nu(\vec{x}, \vec{x}') := \mu(\vec{x}, \vec{x}') + \sum_{i=1}^{D} \kappa_i(x_i, x'_i),                                   (66)

which is a valid p.s.d. kernel if all e^{\beta \kappa_i} are. Note that if \kappa_i is a p.s.d. kernel, then since \beta > 0 so is \beta \kappa_i, and since p.s.d. kernels are closed under exponentiation (Genton, 2001, Equation 5) so is e^{\beta \kappa_i}.
What makes (66) interesting is that when the temperature approaches zero (\beta \to \infty), the augmented logarithmic semiring approaches the tropical semiring, as x \oplus_{\ln} y \to \max(x, y). We thus obtain a kernel that approximates (an exponentiated version of) the optimal assignment kernel (64) yet is provably p.s.d. Since at low temperatures the value of (66) is dominated by the optimal assignment, one might call it the "mostly optimal assignment kernel."
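The toy sketch below (ours) makes the limiting behaviour visible: given a list of candidate scores \nu, it compares the hard maximum of (64) with (1/\beta) \ln of the sum in (66), which approaches that maximum as \beta grows; pushing \beta much further eventually overflows double precision, which is the practical limitation noted next.

```python
import numpy as np

def hard_assignment(scores):
    """Tropical / optimal-assignment value: the maximum score, as in (64)."""
    return max(scores)

def mostly_optimal_assignment(scores, beta):
    """'Mostly optimal assignment' value (66): sum of exp(beta * score)."""
    return float(np.sum(np.exp(beta * np.asarray(scores))))

scores = [0.3, 1.2, 1.1, 0.7]   # nu(x_vec, x_vec') for each pair of decompositions
for beta in (1.0, 10.0, 100.0):
    soft = np.log(mostly_optimal_assignment(scores, beta)) / beta
    print(beta, soft, hard_assignment(scores))   # soft value tends to 1.2 as beta grows
```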
The finite range of floating-point computer arithmetic unfortunately limits how low a temperature (that is, how large a \beta) can be used in (66) in practice, though this can be greatly extended via suitable software, such as the extnum C++ class.^8

8. extnum can be found at http://darwin.nmsu.edu/molb_resources/bioinformatics/extnum/extnum.html.
8. Discussion and Outlook

As evidenced by the large number of recent papers, random walk and marginalized graph kernels have received considerable research attention. Although the connections between these two kernels were hinted at by Kashima et al. (2004), no effort was made to pursue this further. Our aim in presenting a unified framework for random walk and marginalized graph kernels that combines the best features of previous formulations is to highlight the similarities as well as the differences between these approaches. Furthermore, it allows us to use extended linear algebra in an RKHS to efficiently compute all these kernels by exploiting common structure inherent in these problems.

As more and more graph-structured data (e.g., molecular structures and protein interaction networks) becomes available in fields such as biology, web data mining, etc., graph classification will gain importance over the coming years. Hence there is a pressing need to speed up the computation of similarity metrics on graphs. We have shown that sparsity, low effective rank, and Kronecker product structure can be exploited to greatly reduce the computational cost of graph kernels; taking advantage of other forms of structure in W_× remains an open challenge. A further practical difficulty is that the admissible range of the decay factor \lambda depends on the spectrum of W_×. Since this is typically unknown, in practice one often resorts to very low values of \lambda, but this makes the contributions of higher-order terms (corresponding to long walks) to the kernel negligible. In fact, in many applications a naive kernel which simply computes the average kernel between all pairs of edges in the two graphs has performance comparable to the random walk graph kernel.
Trying to remedy this situation by normalizing the matrices involved leads to another phenomenon called tottering (Mahe et al., 2004). Roughly speaking, tottering occurs when short self-repeating walks make a disproportionately large contribution to the kernel value: given two adjacent vertices v and v', walks that merely oscillate back and forth between them accrue many high-similarity terms without exploring any further graph structure.

Appendix A. Extending Linear Algebra to RKHS

Let X be a set endowed with a kernel \kappa and associated feature map \phi, so that \kappa(x, x') = \langle \phi(x), \phi(x') \rangle_H. The natural extension of this so-called feature map to matrices is \Phi : X^{n \times m} \to H^{n \times m} defined by [\Phi(A)]_{ij} := \phi(A_{ij}). In what follows, we use \Phi to lift tensor algebra from X to H, extending various matrix products to the RKHS, and proving some of their useful properties. Straightforward extensions via the commutativity properties of the operators have been omitted for the sake of brevity.
A.1 Matrix Product

Definition 9 Let A \in X^{n \times m}, B \in X^{m \times p}, and C \in R^{m \times p}. The matrix products \Phi(A)\Phi(B) \in R^{n \times p} and \Phi(A)C \in H^{n \times p} are given by

[\Phi(A)\Phi(B)]_{ik} := \sum_j \langle \phi(A_{ij}), \phi(B_{jk}) \rangle_H   and   [\Phi(A)\,C]_{ik} := \sum_j \phi(A_{ij})\, C_{jk}.
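When H is finite-dimensional, say H = R^d, an RKHS matrix \Phi(A) \in H^{n \times m} can be stored as an n × m × d array of feature vectors, and both products of Definition 9 become ordinary tensor contractions. The sketch below (ours, with random vectors standing in for the \phi(A_{ij})) shows the correspondence.

```python
import numpy as np

def phi_matmul_phi(PA, PB):
    """[Phi(A) Phi(B)]_{ik} = sum_j <phi(A_ij), phi(B_jk)>_H: a real (n x p) matrix.
    PA has shape (n, m, d), PB has shape (m, p, d)."""
    return np.einsum('ijd,jkd->ik', PA, PB)

def phi_matmul_real(PA, C):
    """[Phi(A) C]_{ik} = sum_j phi(A_ij) C_jk: an RKHS-valued (n x p x d) array."""
    return np.einsum('ijd,jk->ikd', PA, C)

rng = np.random.default_rng(0)
n, m, p, d = 3, 4, 2, 5
PA = rng.normal(size=(n, m, d))        # entries phi(A_ij) in R^d
PB = rng.normal(size=(m, p, d))        # entries phi(B_jk) in R^d
C  = rng.normal(size=(m, p))
print(phi_matmul_phi(PA, PB).shape)    # (3, 2)
print(phi_matmul_real(PA, C).shape)    # (3, 2, 5)
```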
It is straightforward to show that the usual properties of matrix multiplication (namely associativity, transpose-commutativity, and distributivity with addition) hold for Definition 9 above, with one exception: associativity does not hold if the elements of all three matrices involved belong to the RKHS. In other words, given A \in X^{n \times m}, B \in X^{m \times p}, and C \in X^{p \times q}, in general [\Phi(A)\Phi(B)]\Phi(C) \neq \Phi(A)[\Phi(B)\Phi(C)]. The technical difficulty is that in general

\langle \phi(A_{ij}), \phi(B_{jk}) \rangle_H\, \phi(C_{kl}) \neq \phi(A_{ij})\, \langle \phi(B_{jk}), \phi(C_{kl}) \rangle_H.                                   (67)
Further examples of statements like (67), involving properties which do not hold when extended to an RKHS, can be found for the other matrix products at (69) and (76) below.

Definition 9 allows us to state a first RKHS extension of the vec(ABC) formula (1):

Lemma 10 If A \in R^{n \times m}, B \in X^{m \times p}, and C \in R^{p \times q}, then

vec(A\,\Phi(B)\,C) = (C^\top \otimes A)\, vec(\Phi(B)) \in X^{nq \times 1}.
Proof Analogous to Lemma 12 below.
A.2 Kronecker Product

Definition 11 Let A \in X^{n \times m} and B \in X^{p \times q}. The Kronecker product \Phi(A) \otimes \Phi(B) \in R^{np \times mq} is defined as

[\Phi(A) \otimes \Phi(B)]_{(i-1)p+k,\,(j-1)q+l} := \langle \phi(A_{ij}), \phi(B_{kl}) \rangle_H.
Similarly to (67) above, for matrices in an RKHS

(\Phi(A) \otimes \Phi(B))\,(\Phi(C) \otimes \Phi(D)) = (\Phi(A)\Phi(C)) \otimes (\Phi(B)\Phi(D))                                   (68)

does not necessarily hold. The technical problem with (68) is that generally

\langle \phi(A_{ir}), \phi(B_{ks}) \rangle_H\, \langle \phi(C_{rj}), \phi(D_{sl}) \rangle_H \neq \langle \phi(A_{ir}), \phi(C_{rj}) \rangle_H\, \langle \phi(B_{ks}), \phi(D_{sl}) \rangle_H.                                   (69)
In Section A.3 we show that analogous properties (Lemmas 14 and 15) do hold for the heterogeneous Kronecker product between RKHS and real matrices.

Definition 11 gives us a second extension of the vec(ABC) formula (1) to RKHS:

Lemma 12 If A \in X^{n \times m}, B \in R^{m \times p}, and C \in X^{p \times q}, then

vec(\Phi(A)\, B\, \Phi(C)) = (\Phi(C)^\top \otimes \Phi(A))\, vec(B) \in R^{nq \times 1}.
Proof We begin by rewriting the k-th column of \Phi(A)\, B\, \Phi(C) as

[\Phi(A)\, B\, \Phi(C)]_{*k} = \Phi(A) \sum_j B_{*j}\, \phi(C_{jk}) = \sum_j \phi(C_{jk})\, \Phi(A)\, B_{*j}

= [\phi(C_{1k})\Phi(A),\; \phi(C_{2k})\Phi(A),\; \ldots,\; \phi(C_{pk})\Phi(A)]\, vec(B)

= ([\phi(C_{1k}),\; \phi(C_{2k}),\; \ldots,\; \phi(C_{pk})] \otimes \Phi(A))\, vec(B),                                   (70)

where we used that stacking the columns B_{*1}, B_{*2}, \ldots, B_{*p} on top of each other yields vec(B). To obtain Lemma 12 we stack up the columns of (70):

vec(\Phi(A)\, B\, \Phi(C)) = \left( \begin{bmatrix} \phi(C_{11}) & \phi(C_{21}) & \cdots & \phi(C_{p1}) \\ \vdots & \vdots & \ddots & \vdots \\ \phi(C_{1q}) & \phi(C_{2q}) & \cdots & \phi(C_{pq}) \end{bmatrix} \otimes \Phi(A) \right) vec(B) = (\Phi(C)^\top \otimes \Phi(A))\, vec(B).
Direct computation of the right-hand side of Lemma 12 requires nmpq kernel evaluations; when m, p, and q are all O(n) this is O(n^4). If H is finite-dimensional, however (in other words, if the feature map can be taken to be \phi : X \to R^d with d < \infty), then the left-hand side of Lemma 12 can be obtained in O(n^3 d) operations. Our efficient computation schemes in Section 4 exploit this observation.
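Under the same finite-dimensional representation, this observation and Lemma 12 itself are easy to verify numerically: the sketch below (ours) evaluates vec(\Phi(A)B\Phi(C)) once via two cheap contractions and once by materializing \Phi(C)^\top \otimes \Phi(A) entry by entry, and checks that the results coincide.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, p, q, d = 3, 4, 5, 2, 6
PA = rng.normal(size=(n, m, d))        # Phi(A), with A in X^{n x m}
B  = rng.normal(size=(m, p))
PC = rng.normal(size=(p, q, d))        # Phi(C), with C in X^{p x q}

# left-hand side: two successive contractions, O(n^3 d) when m, p, q are O(n)
AB  = np.einsum('imd,mp->ipd', PA, B)          # Phi(A) B, RKHS-valued n x p
lhs = np.einsum('ipd,pqd->iq', AB, PC)         # (Phi(A) B) Phi(C), real n x q
lhs_vec = lhs.reshape(-1, order='F')           # column-stacking vec()

# right-hand side: build Phi(C)^T (x) Phi(A) from all n*m*p*q inner products
K = np.einsum('pqd,imd->qipm', PC, PA).reshape(q * n, p * m)
rhs_vec = K @ B.reshape(-1, order='F')

print(np.allclose(lhs_vec, rhs_vec))           # True
```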
A.3 Heterogeneous Kronecker Product

Definition 13 Let A \in X^{n \times m} and B \in R^{p \times q}. The heterogeneous Kronecker product \Phi(A) \otimes B \in X^{np \times mq} is given by

[\Phi(A) \otimes B]_{(i-1)p+k,\,(j-1)q+l} = \phi(A_{ij})\, B_{kl}.
Recall that the standard Kronecker product obeys (2); here we prove two extensions:
Lemma 14 If A \in X^{n \times m}, B \in X^{p \times q}, C \in R^{m \times o}, and D \in R^{q \times r}, then

(\Phi(A) \otimes \Phi(B))\,(C \otimes D) = (\Phi(A)\,C) \otimes (\Phi(B)\,D).

Proof Using the linearity of the inner product we directly verify

[(\Phi(A) \otimes \Phi(B))(C \otimes D)]_{(i-1)p+k,\,(j-1)q+l} = \sum_{r,s} \langle \phi(A_{ir}), \phi(B_{ks}) \rangle_H\, C_{rj}\, D_{sl}

= \Bigl\langle \sum_r \phi(A_{ir})\, C_{rj},\; \sum_s \phi(B_{ks})\, D_{sl} \Bigr\rangle_H

= \langle [\Phi(A)C]_{ij},\, [\Phi(B)D]_{kl} \rangle_H

= [(\Phi(A)C) \otimes (\Phi(B)D)]_{(i-1)p+k,\,(j-1)q+l}.
Lemma 15 If A \in X^{n \times m}, B \in R^{p \times q}, C \in X^{m \times o}, and D \in R^{q \times r}, then

(\Phi(A) \otimes B)\,(\Phi(C) \otimes D) = (\Phi(A)\Phi(C)) \otimes (BD).

Proof Using the linearity of the inner product we directly verify

[(\Phi(A) \otimes B)(\Phi(C) \otimes D)]_{(i-1)p+k,\,(j-1)q+l} = \sum_{r,s} \langle \phi(A_{ir})\, B_{ks},\; \phi(C_{rj})\, D_{sl} \rangle_H

= \sum_r \langle \phi(A_{ir}), \phi(C_{rj}) \rangle_H \sum_s B_{ks}\, D_{sl}

= [\Phi(A)\Phi(C)]_{ij}\, [BD]_{kl}

= [(\Phi(A)\Phi(C)) \otimes (BD)]_{(i-1)p+k,\,(j-1)q+l}.
Using the heterogeneous Kronecker product, we can state four more RKHS extensions of the
vec-ABC formula (1):
Lemma 16 If A \in X^{n \times m}, B \in R^{m \times p}, and C \in R^{p \times q}, then

vec(\Phi(A)\,B\,C) = (C^\top \otimes \Phi(A))\, vec(B) \in X^{nq \times 1}.
Proof Analogous to Lemma 12.
Lemma 17 If A \in R^{n \times m}, B \in R^{m \times p}, and C \in X^{p \times q}, then

vec(A\,B\,\Phi(C)) = (\Phi(C)^\top \otimes A)\, vec(B) \in X^{nq \times 1}.
Proof Analogous to Lemma 12.
Lemma 18 If A \in X^{n \times m}, B \in X^{m \times p}, and C \in R^{p \times q}, then

vec(\Phi(A)\,\Phi(B)\,C) = (C^\top \otimes \Phi(A))\, vec(\Phi(B)) \in R^{nq \times 1}.
Proof Analogous to Lemma 12.
Lemma 19 If A \in R^{n \times m}, B \in X^{m \times p}, and C \in X^{p \times q}, then

vec(A\,\Phi(B)\,\Phi(C)) = (\Phi(C)^\top \otimes A)\, vec(\Phi(B)) \in R^{nq \times 1}.
Proof Analogous to Lemma 12.
Note that there is no analogous lemma for vec(\Phi(A)\Phi(B)\Phi(C)) since this term is not well-defined due to non-associativity (67).
A.4 Kronecker Sum

A concept closely related to the Kronecker product is that of the Kronecker sum, which is defined for real matrices A \in R^{n \times m} and B \in R^{p \times q} as

A \oplus B := A \otimes I_{p \times q} + I_{n \times m} \otimes B,                                   (71)

with I_{n \times m} (resp. I_{p \times q}) denoting the n × m (resp. p × q) identity matrix. Many of its properties can be derived from those of the Kronecker product. Unlike the Kronecker product, however, the Kronecker sum of two matrices in an RKHS is a matrix in the RKHS. From Definition 1 and (71) we find that

[A \oplus B]_{(i-1)p+k,\,(j-1)q+l} := A_{ij}\, \delta_{kl} + \delta_{ij}\, B_{kl}.                                   (72)
We can extend (72) to RKHS, defining analogously:

Definition 20 Let A \in X^{n \times m} and B \in X^{p \times q}. The Kronecker sum \Phi(A) \oplus \Phi(B) \in X^{np \times mq} is defined as

[\Phi(A) \oplus \Phi(B)]_{(i-1)p+k,\,(j-1)q+l} := \phi(A_{ij})\, \delta_{kl} + \delta_{ij}\, \phi(B_{kl}).
In other words, in an RKHS the Kronecker sum is defined just as in (71):

\Phi(A) \oplus \Phi(B) = \Phi(A) \otimes I_B + I_A \otimes \Phi(B),                                   (73)

where I_M denotes the real-valued identity matrix of the same dimensions (not necessarily square) as matrix M. In accordance with Definition 13, the result of (73) is an RKHS matrix.
The equivalent of the vec-ABC formula (1) for Kronecker sums is:

(A \oplus B)\, vec(C) = (A \otimes I_B + I_A \otimes B)\, vec(C)

= (A \otimes I_B)\, vec(C) + (I_A \otimes B)\, vec(C)

= vec(I_B\, C\, A^\top) + vec(B\, C\, I_A^\top)                                   (74)

= vec(I_B\, C\, A^\top + B\, C\, I_A^\top).
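The real-valued identity (74) is easy to confirm numerically, and the RKHS variants in Lemmas 21 to 23 below reduce to the same computation once inner products are taken. A small sketch (ours), using square A and B so that I_A and I_B are the usual identity matrices:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 3, 4
A = rng.normal(size=(n, n))
B = rng.normal(size=(p, p))
C = rng.normal(size=(p, n))

# Kronecker sum (71); with square A and B the identities are the ordinary ones
ksum = np.kron(A, np.eye(p)) + np.kron(np.eye(n), B)

lhs = ksum @ C.reshape(-1, order='F')               # (A (+) B) vec(C)
rhs = (C @ A.T + B @ C).reshape(-1, order='F')      # vec(I_B C A^T + B C I_A^T)
print(np.allclose(lhs, rhs))                        # True
```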
This also works for matrices in an RKHS:
Lemma 21 If A \in X^{n \times m}, B \in X^{p \times q}, and C \in X^{q \times m}, then

(\Phi(A) \oplus \Phi(B))\, vec(\Phi(C)) = vec(I_B\, \Phi(C)\, \Phi(A)^\top + \Phi(B)\, \Phi(C)\, I_A^\top) \in R^{np \times 1}.
Proof Analogous to (74), using Lemmas 18 and 19.
Furthermore, we have two valid heterogeneous forms that map into the RKHS:
Lemma 22 If A \in X^{n \times m}, B \in X^{p \times q}, and C \in R^{q \times m}, then

(\Phi(A) \oplus \Phi(B))\, vec(C) = vec(I_B\, C\, \Phi(A)^\top + \Phi(B)\, C\, I_A^\top) \in X^{np \times 1}.
Proof Analogous to (74), using Lemmas 16 and 17.
Lemma 23 If A \in R^{n \times m}, B \in R^{p \times q}, and C \in X^{q \times m}, then

(A \oplus B)\, vec(\Phi(C)) = vec(I_B\, \Phi(C)\, A^\top + B\, \Phi(C)\, I_A^\top) \in X^{np \times 1}.
Proof Analogous to (74), using Lemma 10.
A.5 Hadamard Product
While the extension of the Hadamard (element-wise) product to an RKHS is not required to implement our fast graph kernels, the reader may find it interesting in its own right.
Definition 24 Let A, B \in X^{n \times m} and C \in R^{n \times m}. The Hadamard products \Phi(A) \odot \Phi(B) \in R^{n \times m} and \Phi(A) \odot C \in H^{n \times m} are given by

[\Phi(A) \odot \Phi(B)]_{ij} = \langle \phi(A_{ij}), \phi(B_{ij}) \rangle_H   and   [\Phi(A) \odot C]_{ij} = \phi(A_{ij})\, C_{ij}.
We prove two extensions of (3):
Lemma 25 If A \in X^{n \times m}, B \in X^{p \times q}, C \in R^{n \times m}, and D \in R^{p \times q}, then

(\Phi(A) \otimes \Phi(B)) \odot (C \otimes D) = (\Phi(A) \odot C) \otimes (\Phi(B) \odot D).

Proof Using the linearity of the inner product we directly verify

[(\Phi(A) \otimes \Phi(B)) \odot (C \otimes D)]_{(i-1)p+k,\,(j-1)q+l} = \langle \phi(A_{ij}), \phi(B_{kl}) \rangle_H\, C_{ij}\, D_{kl}

= \langle \phi(A_{ij})\, C_{ij},\; \phi(B_{kl})\, D_{kl} \rangle_H

= \langle [\Phi(A) \odot C]_{ij},\, [\Phi(B) \odot D]_{kl} \rangle_H

= [(\Phi(A) \odot C) \otimes (\Phi(B) \odot D)]_{(i-1)p+k,\,(j-1)q+l}.
Lemma 26 If A \in X^{n \times m}, B \in R^{p \times q}, C \in X^{n \times m}, and D \in R^{p \times q}, then

(\Phi(A) \otimes B) \odot (\Phi(C) \otimes D) = (\Phi(A) \odot \Phi(C)) \otimes (B \odot D).

Proof Using the linearity of the inner product we directly verify

[(\Phi(A) \otimes B) \odot (\Phi(C) \otimes D)]_{(i-1)p+k,\,(j-1)q+l} = \langle \phi(A_{ij})\, B_{kl},\; \phi(C_{ij})\, D_{kl} \rangle_H

= \langle \phi(A_{ij}), \phi(C_{ij}) \rangle_H\, B_{kl}\, D_{kl}

= [\Phi(A) \odot \Phi(C)]_{ij}\, [B \odot D]_{kl}

= [(\Phi(A) \odot \Phi(C)) \otimes (B \odot D)]_{(i-1)p+k,\,(j-1)q+l}.
As before,

(\Phi(A) \otimes \Phi(B)) \odot (\Phi(C) \otimes \Phi(D)) = (\Phi(A) \odot \Phi(C)) \otimes (\Phi(B) \odot \Phi(D))                                   (75)

does not necessarily hold, the difficulty with (75) being that in general

\langle \phi(A_{ij}), \phi(B_{kl}) \rangle_H\, \langle \phi(C_{ij}), \phi(D_{kl}) \rangle_H \neq \langle \phi(A_{ij}), \phi(C_{ij}) \rangle_H\, \langle \phi(B_{kl}), \phi(D_{kl}) \rangle_H.                                   (76)
References
Christian Berg, Jens P. R. Christensen, and Paul Ressel. Harmonic Analysis on Semigroups. Springer, New York, 1984.

Helen M. Berman, John Westbrook, Zukang Feng, Gary Gilliland, T. N. Bhat, Helge Weissig, Ilya N. Shindyalov, and Philip E. Bourne. The protein data bank. Nucleic Acids Research, 28:235-242, 2000.

Dennis S. Bernstein. Matrix Mathematics. Princeton University Press, 2005.

Jean Berstel. Transductions and Context-Free Languages. Teubner, 1979.

Nitin Bhardwaj and Hui Lu. Correlation between gene expression profiles and protein-protein interactions within and across genomes. Bioinformatics, 21(11):2730-2738, June 2005.

Danail Bonchev and Dennis H. Rouvray, editors. Chemical Graph Theory: Introduction and Fundamentals, volume 1. Gordon and Breach Science Publishers, London, UK, 1991.

Karsten M. Borgwardt and Hans-Peter Kriegel. Shortest-path kernels on graphs. In Proceedings of the International Conference on Data Mining, pages 74-81, 2005.

Karsten M. Borgwardt, Cheng Soon Ong, Stefan Schoenauer, S. V. N. Vishwanathan, Alexander J. Smola, and Hans-Peter Kriegel. Protein function prediction via graph kernels. In Proceedings of Intelligent Systems in Molecular Biology (ISMB), Detroit, USA, 2005. http://www.stat.purdue.edu/~vishy/papers/BorOngSchVisetal05.pdf.

Karsten M. Borgwardt, Hans-Peter Kriegel, S. V. N. Vishwanathan, and Nicol N. Schraudolph. Graph kernels for disease outcome prediction from protein-protein interaction networks. In Russ B. Altman, A. Keith Dunker, Lawrence Hunter, Tiffany Murray, and Teri E. Klein, editors, Proceedings of the Pacific Symposium of Biocomputing 2007, Maui, Hawaii, January 2007. World Scientific.

Lars Bullinger, Konstanze Doehner, Eric Bair, Stefan Froehling, Richard F. Schlenk, Robert Tibshirani, Hartmut Doehner, and Jonathan R. Pollack. Use of gene-expression profiling to identify prognostic subclasses in adult acute myeloid leukemia. New England Journal of Medicine, 350(16):1605-1616, April 2004.

Corinna Cortes, Patrick Haffner, and Mehryar Mohri. Rational kernels. In Suzanna Becker, Sebastian Thrun, and Klaus Obermayer, editors, Advances in Neural Information Processing Systems 15, volume 14, Cambridge, MA, 2002. MIT Press.

Corinna Cortes, Patrick Haffner, and Mehryar Mohri. Positive definite rational kernels. In Bernhard Schoelkopf and Manfred K. Warmuth, editors, Proceedings of the Annual Conference on Computational Learning Theory, pages 41-56, 2003.

Corinna Cortes, Patrick Haffner, and Mehryar Mohri. Rational kernels: Theory and algorithms. Journal of Machine Learning Research, 5:1035-1062, 2004.

Samuel Eilenberg. Automata, Languages and Machines, volume A. Academic Press, 1974.

Hunter B. Fraser, Aaron E. Hirsh, Dennis P. Wall, and Michael B. Eisen. Coevolution of gene expression among interacting proteins. Proceedings of the National Academy of Science USA, 101(24):9033-9038, June 2004.

Holger Froehlich, Joerg K. Wegner, Florian Siker, and Andreas Zell. Kernel functions for attributed molecular graphs: a new similarity-based approach to ADME prediction in classification and regression. QSAR and Combinatorial Science, 25(4):317-326, 2006.
Judith D. Gardiner, Alan J. Laub, James J. Amato, and Cleve B. Moler. Solution of the Sylvester matrix equation AXB^T + CXD^T = E. ACM Transactions on Mathematical Software, 18(2):223-231, 1992.

vishy/papers/Vishwanathan02.pdf.
S. V. N. Vishwanathan, Karsten Borgwardt, and Nicol N. Schraudolph. Fast computation of graph kernels. In B. Schoelkopf, J. Platt, and T. Hofmann, editors, Advances in Neural Information Processing Systems 19, Cambridge, MA, 2007. MIT Press.

Patrick Warnat, Roland Eils, and Benedikt Brors. Cross-platform analysis of cancer microarray data improves gene expression based classification of phenotypes. BMC Bioinformatics, 6:265, November 2005.

Takashi Washio and Hiroshi Motoda. State of the art of graph-based data mining. SIGKDD Explorations, 5(1):59-68, 2003.

Xiang Yao, Dong Wei, Cylburn Soden Jr., Michael F. Summers, and Dorothy Beckett. Structure of the carboxy-terminal fragment of the apo-biotin carboxyl carrier subunit of Escherichia coli acetyl-CoA carboxylase. Biochemistry, 36:15089-15100, 1997.