The Kendall and Mallows Kernels for Permutations

Yunlong Jiao and Jean-Philippe Vert

MINES ParisTech, PSL Research University, CBIO – Centre for Computational Biology, 77300 Fontainebleau, France
Institut Curie, 75248 Paris Cedex, France
INSERM, U900, 75248 Paris Cedex, France
Ecole Normale Supérieure, Department of Mathematics and their Applications, 75005 Paris, France

Emails: [name].[surname]@mines-paristech.fr
Abstract
We show that the widely used Kendall tau correlation coefficient, and the related
Mallows kernel, are positive definite kernels for permutations. They offer computa-
tionally attractive alternatives to more complex kernels on the symmetric group to
learn from rankings, or learn to rank. We show how to extend these kernels to par-
tial rankings, multivariate rankings and uncertain rankings. Examples are presented
on how to formulate typical problems of learning from rankings such that they can
be solved with state-of-the-art kernel algorithms. We demonstrate promising results
on clustering heterogeneous rank data and high-dimensional classification problems in
biomedical applications.
1 Introduction
A permutation is a 1-to-1 mapping from a finite set into itself. Assuming the finite set is
ordered, a permutation can equivalently be represented by a total ranking of the elements of
the set. Permutations are ubiquitous in many applications involving preferences, rankings
or matching, such as modeling and analyzing data describing the preferences or votes of
a population [18, 52], learning or tracking correspondences between sets of objects [32],
or estimating a consensus ranking that best represents a collection of individual rankings
[19, 1, 3]. Another potentially rich source of rank data comes from real-valued vectors in
which the relative ordering of the values of multiple features is more important than their
absolute magnitude. For example, in the case of high-dimensional gene expression data,
[24] showed that simple classifiers based on binary comparisons between the expression of
different genes in a sample show competitive prediction accuracy with much more complex
classifiers built on quantitative gene expression levels, a line of thought that has been
further investigated by [71, 79, 49]. In these approaches, an n-dimensional feature vector is
first transformed into a vector of ranks by sorting its entries, which is then used as input
to train a classifier.
Working with permutations is, however, computationally challenging. There are n! per-
mutations over n items, suggesting that various simplifications or approximations are neces-
sary in pursuit of efficient algorithms to analyze or learn from permutations. Such simplifications
include, for example, reducing ranks to a series of binary decisions [1, 6], or estimating a
parametric distribution over permutations [47, 31, 32].
Kernel algorithms form a class of methods that have proven successful in numerous
applications and enjoy great popularity in the machine learning community [13, 75, 60, 65].
The essential idea behind these methods is to define a symmetric positive definite kernel
K : X × X → R over an input space X, which encodes our prior belief about the similarity between
pairs of points in the input space, and which implicitly defines an embedding Φ : X → F of
the input space X into a Hilbert space F in which the kernel becomes an inner product:
$$\forall x, x' \in \mathcal{X}, \qquad K(x, x') = \langle \Phi(x), \Phi(x')\rangle_{\mathcal{F}}.$$
Key to kernel methods is the fact that kernel algorithms only manipulate data through
evaluations of the kernel function, allowing us to work implicitly in the potentially high- or even
infinite-dimensional space F. This kernel trick is particularly interesting when K(x, x′) is
inexpensive to evaluate, compared to Φ(x) and Φ(x′). In particular, kernel methods have
found many applications where the input data are discrete or structured, such as strings or
graphs, thanks to the development of numerous kernels for these data [30, 36, 23, 65, 63, 76].
In this context, it is surprising that relatively little attention has been paid to the problem
of defining positive definite kernels between permutations, which could pave the way to ben-
efiting from computationally efficient kernel methods in problems involving permutations.
A notable exception is the work of [41, 43], who exploit the fact that the right-invariant pos-
itive definite kernels on the symmetric group are fully characterized by Bochner’s theorem
[41, 22]. They derive interesting kernels, such as a diffusion kernel for rankings or partial
rankings, and demonstrate that kernel methods are flexible to handle rank data of diverse
types. However, the kernels proposed in their papers have typically a computational com-
plexity that grows exponentially with the number of items to rank, and remain prohibitive
to compute for more than a few items.
In this paper we study new computationally attractive positive definite kernels for per-
mutations and rankings. Our main contribution is to show that two widely-used and compu-
tationally efficient measures of similarity between permutations, the Kendall tau correlation
coefficient and the Mallows kernel, are positive definite. Although these measures compare
two permutations of n items in terms of $\binom{n}{2}$ pairwise comparisons, they can be computed in
O(n log n), which allows us to use kernel methods for problems involving rank data over a
large number of items. We show how these kernels can be extended to partial rankings, multivariate
rankings, and uncertain rankings, which are particularly relevant when the rankings
are obtained by sorting a real-valued vector where ties or almost-ties occur. We illustrate
the benefit of kernel learning with the new kernels on two applications: one concerning the
unsupervised clustering of rank data with kernel k-means, the other focusing on the supervised
classification of genomic data with Support Vector Machines (SVMs), reaching in both cases
state-of-the-art performances.
The paper is organized as follows. In Section 2, we prove our main theorem showing
that the Kendall and Mallows kernels are positive definite. We extend them to partial,
multivariate and uncertain rankings in Sections 3, 4 and 5, respectively. We highlight the
relation to the diffusion kernel of [43] in Section 6. Finally we illustrate the relevance of
kernel methods for unsupervised (Section 7) and supervised (Section 8) tasks. Data and
R codes for generating all the plots in this paper and reproducing more experiments are
available via https://github.com/YunlongJiao/kendallkernel_demo.
2 The Kendall and Mallows kernels

Given a set of n items {x₁, . . . , xₙ}, a total ranking is a strict ordering of the items of the form
$$x_{i_1} \succ x_{i_2} \succ \cdots \succ x_{i_n}, \qquad (1)$$
where {i₁, . . . , iₙ} are distinct indices in {1, 2, . . . , n} =: [1, n]. A permutation is a 1-to-1
mapping from a finite set into itself, i.e., σ : [1, n] → [1, n] such that σ(i) ≠ σ(j) for i ≠ j.
Each total ranking can be equivalently represented by a permutation σ in the sense that
σ(i) = j indicates that a ranker assigns rank j to item i, where a higher rank coincides with a higher
preference.
For example, the ranking x₂ ≻ x₄ ≻ x₃ ≻ x₁ is associated to the permutation
$$\sigma = \begin{pmatrix} 1 & 2 & 3 & 4 \\ 1 & 4 & 2 & 3 \end{pmatrix},$$
meaning σ(1) = 1, σ(2) = 4, σ(3) = 2 and σ(4) = 3. There are
n! different total rankings, and we denote by Sn the set of all permutations over n items.
Endowed with the composition operation (σ1 σ2 )(i) = σ1 (σ2 (i)), Sn is a group called the
symmetric group.
Given two permutations σ, σ′ ∈ Sn, the numbers of concordant and discordant pairs
between σ and σ′ are respectively
$$n_c(\sigma,\sigma') = \sum_{i<j}\Big[\mathbf{1}_{\{\sigma(i)<\sigma(j)\}}\mathbf{1}_{\{\sigma'(i)<\sigma'(j)\}} + \mathbf{1}_{\{\sigma(i)>\sigma(j)\}}\mathbf{1}_{\{\sigma'(i)>\sigma'(j)\}}\Big],$$
$$n_d(\sigma,\sigma') = \sum_{i<j}\Big[\mathbf{1}_{\{\sigma(i)<\sigma(j)\}}\mathbf{1}_{\{\sigma'(i)>\sigma'(j)\}} + \mathbf{1}_{\{\sigma(i)>\sigma(j)\}}\mathbf{1}_{\{\sigma'(i)<\sigma'(j)\}}\Big].$$
As their names suggest, nc(σ, σ′) and nd(σ, σ′) count how many pairs of items are respectively
in the same or opposite order in the two rankings σ and σ′. nd is frequently used as a distance
between permutations, often under the name Kendall tau distance, and underlies two popular
similarity measures between permutations:
• The Mallows kernel, defined for any λ ≥ 0 as
$$K_M^\lambda(\sigma,\sigma') = e^{-\lambda\, n_d(\sigma,\sigma')}. \qquad (2)$$
• The Kendall kernel, defined as
$$K_\tau(\sigma,\sigma') = \frac{n_c(\sigma,\sigma') - n_d(\sigma,\sigma')}{\binom{n}{2}}. \qquad (3)$$
The Mallows kernel plays a role on the symmetric group similar to the Gaussian kernel on
Euclidean space, for example for statistical modeling of permutations [50, 15, 21, 53] or
nonparametric smoothing [47], and the Kendall kernel [38, 39] is probably the most widely-used
rank correlation coefficient. In spite of their pervasiveness, to the best of
our knowledge the following property has been overlooked:
Theorem 1. The Mallows kernel $K_M^\lambda$, for any λ ≥ 0, and the Kendall kernel $K_\tau$ are positive
definite.

Proof. Consider the mapping $\Phi : S_n \to \mathbb{R}^{\binom{n}{2}}$ defined by
$$\Phi(\sigma) = \frac{1}{\sqrt{\binom{n}{2}}}\Big(\mathbf{1}_{\{\sigma(i)>\sigma(j)\}} - \mathbf{1}_{\{\sigma(i)<\sigma(j)\}}\Big)_{1\le i<j\le n}.$$
One checks directly that $K_\tau(\sigma,\sigma') = \Phi(\sigma)^\top\Phi(\sigma')$, so that $K_\tau$ is positive definite, and that
$$n_d(\sigma,\sigma') = \frac{\binom{n}{2}}{4}\,\big\|\Phi(\sigma)-\Phi(\sigma')\big\|^2, \qquad (4)$$
showing that nd is conditionally positive definite and therefore that $K_M^\lambda$ is positive definite
for all λ ≥ 0 [59].
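To make the construction concrete, here is a small Python sketch (illustrative only; the code released by the authors is in R, and NumPy/SciPy are assumed available here). It builds the explicit embedding Φ, checks that Φ(σ)ᵀΦ(σ′) matches the Kendall tau correlation returned by `scipy.stats.kendalltau`, and evaluates the Mallows kernel (2) from n_d.

```python
import itertools
import numpy as np
from scipy.stats import kendalltau

def phi(sigma):
    """Explicit Kendall embedding: one +/-1 entry per item pair, normalized (sigma[i] = rank of item i)."""
    pairs = list(itertools.combinations(range(len(sigma)), 2))
    return np.array([np.sign(sigma[i] - sigma[j]) for i, j in pairs]) / np.sqrt(len(pairs))

def n_discordant(s1, s2):
    """Number of discordant pairs n_d between two permutations (brute force, O(n^2))."""
    return sum(1 for i, j in itertools.combinations(range(len(s1)), 2)
               if (s1[i] - s1[j]) * (s2[i] - s2[j]) < 0)

def mallows_kernel(s1, s2, lam=1.0):
    """Mallows kernel (2): exponential of the Kendall tau distance."""
    return float(np.exp(-lam * n_discordant(s1, s2)))

rng = np.random.default_rng(0)
s1, s2 = rng.permutation(8), rng.permutation(8)

k_tau = float(phi(s1) @ phi(s2))   # Kendall kernel as an inner product of embeddings
tau, _ = kendalltau(s1, s2)        # classic Kendall tau correlation (no ties here)
assert np.isclose(k_tau, tau)
print(k_tau, mallows_kernel(s1, s2, lam=0.1))
```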
Although the Kendall and Mallows kernels correspond respectively to a linear and a Gaussian
kernel on a $\binom{n}{2}$-dimensional embedding of $S_n$, they can in particular be computed without
explicitly computing the embedding, in O(n log n) time, using a merge-sort-like algorithm to count
the number of discordant pairs [40].
3 Extension to partial rankings
In this section we show how the Kendall and Mallows kernels can efficiently be adapted to
partial rankings, a situation frequently encountered in practice. For example, in a movie
recommender system, each user only grades a few movies that he has watched based on
personal interest. As another example, in a chess tournament, each game results in a relative
ordering between two contestants, and one would typically like to find a single ranking of all
players that globally best represents the large collection of binary outcomes.
As opposed to a total ranking (1), partial rankings arise in diverse forms, which can
generally be described by
$$X_1 \succ X_2 \succ \cdots \succ X_k,$$
where X₁, . . . , Xₖ are k disjoint subsets of the n items {x₁, . . . , xₙ}. For example, {x₂, x₄}
≻ x₆ ≻ {x₃, x₈} in a social survey could represent the fact that items 2 and 4 are ranked higher
by an interviewee than item 6, which itself is ranked higher than items 3 and 8. Note that it
gives no information about the relative order between items 2 and 4, nor about how item 1 is rated. For
ease of analysis, a partial ranking is often associated with a subset R ⊂ Sn of permutations
which are compatible with all partial orders described by the partial ranking. In this study,
two particularly interesting types are:
(i) Interleaving partial rankings. Such a partial ranking is of the form
$$x_{i_1} \succ x_{i_2} \succ \cdots \succ x_{i_k}, \qquad k \le n,$$
where we have a total ranking for k out of the n items. This type of partial ranking is frequently
encountered in real life; for example, in a social survey an interviewee may be unable to rank
all the items listed, so that some positions remain unobserved and interleaved with the observed ones. The interleaving partial
ranking corresponds to the set of permutations compatible with it:
$$A_{i_1,\dots,i_k} = \{\sigma\in S_n \,|\, \sigma(i_a) > \sigma(i_b) \text{ if } a < b,\ a, b \in [1, k]\}. \qquad (5)$$
(ii) Top-k partial rankings. Such a partial ranking is of the form
$$x_{i_1} \succ x_{i_2} \succ \cdots \succ x_{i_k} \succ X_{\text{rest}}, \qquad k \le n,$$
where we have a total ranking for k out of the n items and also know that these k items are
ranked higher than all the other items. For example, the top k hits returned by a search
engine lead to a top-k partial ranking; under some voting systems, voters express
their vote by ranking some (or all) of the candidates in order of preference. The top-k partial
ranking corresponds to the set of compatible permutations:
$$B_{i_1,\dots,i_k} = \{\sigma \in S_n \,|\, \sigma(i_a) = n + 1 - a,\ a \in [1, k]\}. \qquad (6)$$
To extend any kernel K over Sn to a kernel over the set of partial rankings, we propose
to represent a partial ranking by its compatible subset R ⊂ Sn of permutations, and define
a kernel between two partial rankings R and R′ by adopting the convolution kernel, written
with a slight abuse of notation as
$$K(R, R') = \frac{1}{|R||R'|}\sum_{\sigma\in R}\sum_{\sigma'\in R'} K(\sigma, \sigma'). \qquad (7)$$
As a convolution kernel, it is positive definite as long as K is positive definite [30]. However,
a naive implementation of (7) typically requires O((n − k)!(n − k′)!) operations
when the numbers of observed items in the partial rankings R, R′ are respectively k, k′ < n, which
can quickly become prohibitive. Fortunately Theorem 2 guarantees that we can circumvent
the computational burden of naively implementing (7) with the Kendall kernel Kτ on at
least the two particular cases of partial rankings (5) or (6).
Theorem 2. The Kendall kernel Kτ of the form (7) between two interleaving partial rankings
with respectively k and m observed items, or between a top-k partial ranking and a top-m partial ranking,
can be computed in O(k log k + m log m) operations.
Proof and explicit algorithms are postponed to the appendices. Note that the convolution
kernel (7) with the Mallows kernel $K_M^\lambda$ is not straightforward to evaluate, which will be
further discussed in Section 6. However, since we have extended the Kendall kernel to
partial rankings, an exponential kernel can be constructed trivially following (4), for which
the computation remains just as simple as for the extended Kendall kernel. Since this technique
also applies in the following sections, we focus mainly on extending the Kendall kernel henceforth.
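To illustrate definition (7) on a toy example, the following brute-force Python sketch (illustration only, not the fast algorithms of Theorem 2 given in the appendices) enumerates the compatible sets (6) for two top-k partial rankings over n = 5 items and averages the Kendall kernel over all pairs of compatible permutations.

```python
import itertools
import numpy as np

def kendall(s1, s2):
    """Kendall kernel (3) between two permutations given as rank vectors (ranks 0..n-1)."""
    pairs = list(itertools.combinations(range(len(s1)), 2))
    return sum(np.sign((s1[i] - s1[j]) * (s2[i] - s2[j])) for i, j in pairs) / len(pairs)

def compatible_topk(items, n):
    """All permutations of S_n compatible with the top-k ranking items[0] > items[1] > ...,
    i.e. the set B in (6), with rank n-1 denoting the most preferred item."""
    return [perm for perm in itertools.permutations(range(n))
            if all(perm[it] == n - 1 - a for a, it in enumerate(items))]

def conv_kernel(R1, R2):
    """Convolution kernel (7): average base kernel over all pairs of compatible permutations."""
    vals = [kendall(s1, s2) for s1 in R1 for s2 in R2]
    return sum(vals) / len(vals)

n = 5
R1 = compatible_topk([2, 0], n)   # partial ranking: item 2 > item 0 > all remaining items
R2 = compatible_topk([2, 4], n)   # partial ranking: item 2 > item 4 > all remaining items
print(len(R1), len(R2), conv_kernel(R1, R2))   # |R1| = |R2| = (n-2)! = 6
```

The cost of this naive computation grows factorially with the number of unobserved items, which is exactly what Theorem 2 avoids.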
4 Extension to multivariate rankings

In some applications, each observation is described not by a single ranking but by a collection of p
(partial) rankings over the same items, for example the votes cast by one country in p successive
editions of a contest (Section 7.3). For such multivariate rankings R = (R₁, . . . , Rₚ) we consider
the weighted-sum kernel
$$K(\mathbf{R}, \mathbf{R}') = \sum_{j=1}^{p} \mu_j\, K(R_j, R'_j), \qquad \mu_j \ge 0, \qquad (8)$$
where a kernel K for partial rankings has been defined in (7). Since (8) is a convex combination of
positive definite kernels, it is itself positive definite. A practically simple approach
would be to set the weights µⱼ = 1/p for 1 ≤ j ≤ p in (8), but the weights can be learned as
well through multiple kernel learning in an appropriate setting [45, 4, 69, 26].
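A correspondingly simple sketch of the weighted-sum kernel (8), here using total rankings as the per-coordinate rankings and the plain Kendall kernel as the base kernel (an illustrative assumption; any positive definite kernel on rankings would do):

```python
import itertools
import numpy as np

def kendall(s1, s2):
    """Kendall kernel (3) between two rank vectors."""
    pairs = list(itertools.combinations(range(len(s1)), 2))
    return sum(np.sign((s1[i] - s1[j]) * (s2[i] - s2[j])) for i, j in pairs) / len(pairs)

def multivariate_kernel(R, Rp, base_kernel=kendall, weights=None):
    """Weighted-sum kernel (8) between two multivariate rankings R = (R_1, ..., R_p)."""
    p = len(R)
    if weights is None:
        weights = [1.0 / p] * p                     # uniform weights mu_j = 1/p
    return sum(w * base_kernel(r, rp) for w, r, rp in zip(weights, R, Rp))

# Two "voters", each described by p = 2 total rankings over 4 items (e.g. two contest years).
voter_a = ((0, 3, 1, 2), (2, 0, 3, 1))
voter_b = ((1, 3, 0, 2), (2, 1, 3, 0))
print(multivariate_kernel(voter_a, voter_b))
```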
5 Extension to uncertain rankings
When data to analyze are n-dimensional real-valued quantitative vectors, converting them
to permutations in Sn by ranking their entries can be beneficial in cases where we trust more
the relative ordering of the values than their absolute magnitudes. For example, in social
surveys or recommender systems, users are sometimes asked to give a score to each item
individually instead of providing a preference order on the item set. The scale of ratings
usually varies according to the personal calibration of each user, and it can therefore be safer
to adopt ranking-based methods to analyze such score-based rating data [35]. As another
example, an interesting line of work in the analysis of gene expression data promotes the
development of classifiers based on relative gene expression within a sample, based on the
observations that gene expression measurements are subject to various measurement errors
such as technological biases and normalization issues, while assessing whether a gene is more
expressed than another gene is generally a more robust task [24, 71, 79, 49]. This suggests
that the Kendall kernel can be relevant for analyzing quantitative vectors.
The Kendall kernel for quantitative vectors now takes exactly the same form as for
permutations, i.e.,
$$K_\tau(x, x') = \Phi(x)^\top \Phi(x'), \qquad (9)$$
where $\Phi : \mathbb{R}^n \to \mathbb{R}^{\binom{n}{2}}$ is defined for $x = (x_1, \ldots, x_n)^\top \in \mathbb{R}^n$ by
$$\Phi(x) = \frac{1}{\sqrt{\binom{n}{2}}}\Big(\mathbf{1}_{\{x_i > x_j\}} - \mathbf{1}_{\{x_i < x_j\}}\Big)_{1\le i<j\le n}. \qquad (10)$$
In this case, the interpretation of the Kendall kernel in terms of concordant and discordant
pairs (3) is still valid, with the caveat that, in the presence of ties between entries of x, say
two coordinates i and j such that x_i = x_j, the tied pair {x_i, x_j} is neither concordant nor
discordant. This implies in particular that if x or x′ has ties, then |Kτ(x, x′)| < 1
strictly. Notably, even in the presence of ties, the fast implementation of the Kendall kernel still
applies to quantitative vectors in O(n log n) time [40]. However, the feature mapping (10) is by
construction very sensitive to the presence of entry pairs that are tied or almost tied. In fact,
each entry of Φ(x) is, up to a normalization constant, a Heaviside-type step function taking
discrete values in {−1, 0, +1}, and can thus change abruptly even when x changes only slightly,
if the change reverses the ordering of two entries whose values are close. In addition to the pairwise
relative ordering used in (10), it can therefore be wise to also exploit the information given by
the pairwise absolute differences in the feature values.
We propose to make the mapping more robust by assuming that a random noise ε ∼ P is added
to the feature vector x, and by considering where Φ(x + ε) lies on average (similarly to, e.g., [55]).
In other words, we consider a smoother mapping $\Psi : \mathbb{R}^n \to \mathbb{R}^{\binom{n}{2}}$ defined by
$$\Psi(x) = \mathbb{E}_{\epsilon\sim P}\,\Phi(x + \epsilon), \qquad (11)$$
together with the corresponding kernel
$$G(x, x') = \Psi(x)^\top\Psi(x') = \mathbb{E}\,K_\tau(\tilde{x}, \tilde{x}'), \qquad (12)$$
where x̃ and x̃′ are independently noise-perturbed versions of x and x′. In fact, we can
deduce from (10) that Ψ is equivalently written as
$$\Psi(x) = \frac{1}{\sqrt{\binom{n}{2}}}\Big(P(\tilde{x}_i > \tilde{x}_j) - P(\tilde{x}_i < \tilde{x}_j)\Big)_{1\le i<j\le n}.$$
Depending on the noise distribution, various kernels are thus obtained. For example, assuming
specifically that $\epsilon \sim (\mathcal{U}[-\tfrac{a}{2}, \tfrac{a}{2}])^n$, the n-dimensional uniform noise of window size a
centered at 0, the (i, j)-th entry of Ψ(x) for all i < j becomes
$$\Psi_{ij}(x) = \frac{1}{\sqrt{\binom{n}{2}}}\, g_a(x_i - x_j), \qquad (13)$$
where
$$g_a(t) := \begin{cases} 1 & t \ge a,\\[2pt] 2\left(\frac{t}{a}\right) - \left(\frac{t}{a}\right)^2 & 0 \le t \le a,\\[2pt] 2\left(\frac{t}{a}\right) + \left(\frac{t}{a}\right)^2 & -a \le t \le 0,\\[2pt] -1 & t \le -a. \end{cases}$$
Note that g_a is odd, continuous, piecewise quadratic on [−a, a] and constant at ±1 elsewhere,
and can thus be viewed as a smoothed version of the Heaviside step function used to compare
any two entries x_i and x_j through their difference x_i − x_j (Figure 1).
Figure 1: Smooth approximation Ψij (in red) of the Heaviside function Φij (in black) used to define the mapping (13), plotted as a function of x_i − x_j for a = 1.
Although the smoothed kernel (12) can be an interesting alternative to the Kendall
kernel (9), we unfortunately lose for G the computational trick that allows us to compute Kτ
in O(n log n). Specifically, we have two ways to compute G:
(i) Exact evaluation. The first alternative is to compute explicitly the $\binom{n}{2}$-vector representation
Ψ in the feature space, and then take the dot product to obtain G. While the
kernel evaluation is exact, an analytic form of the smoothed mapping (11) is required and
the computational cost is linear in the dimension of the feature space, i.e., O(n²).
(ii) Monte Carlo approximation. The second alternative requires the observation that
the smoothed mapping Ψ(x) = EΦ(x̃) appears in the form of an expectation and can thus be
approximated by a D-sample mean of jittered points mapped by Φ into the feature space:
$$\Psi_D(x) = \frac{1}{D}\sum_{j=1}^{D}\Phi(\tilde{x}^j),$$
where $\tilde{x}^1, \ldots, \tilde{x}^D$ are i.i.d. noisy versions of x. The dot product induces a kernel
$$G_D(x, x') = \Psi_D(x)^\top\Psi_D(x') = \frac{1}{D^2}\sum_{i=1}^{D}\sum_{j=1}^{D} K_\tau\big(\tilde{x}^i, \tilde{x}'^j\big), \qquad (14)$$
which is a D²-sample empirical estimate of $G(x, x') = \mathbb{E}\,K_\tau(\tilde{x}, \tilde{x}')$ when x, x′ are independently
jittered with identically distributed noise. Since Kτ is of computational complexity
O(n log n), computing G_D requires O(D² n log n).
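To make the two alternatives concrete, the following Python sketch (illustrative only, not the authors' R implementation; assumes NumPy and SciPy) evaluates the smoothed kernel both ways for uniform jitter: exactly through the closed-form mapping (13), and approximately through the D²-sample Monte Carlo estimate (14).

```python
import itertools
import numpy as np
from scipy.stats import kendalltau

def g(t, a):
    """Smoothed pairwise comparison g_a: odd, piecewise quadratic on [-a, a], +/-1 outside."""
    u = np.clip(t / a, -1.0, 1.0)
    return 2 * u - np.sign(u) * u**2

def G_exact(x, xp, a):
    """Exact smoothed kernel (12) via the explicit feature map (13), in O(n^2)."""
    pairs = list(itertools.combinations(range(len(x)), 2))
    psi  = np.array([g(x[i]  - x[j],  a) for i, j in pairs]) / np.sqrt(len(pairs))
    psip = np.array([g(xp[i] - xp[j], a) for i, j in pairs]) / np.sqrt(len(pairs))
    return float(psi @ psip)

def G_monte_carlo(x, xp, a, D, rng):
    """Monte Carlo estimate (14): average Kendall kernel over D jittered copies of each vector."""
    xt  = x  + rng.uniform(-a / 2, a / 2, size=(D, x.size))
    xpt = xp + rng.uniform(-a / 2, a / 2, size=(D, xp.size))
    return float(np.mean([kendalltau(u, v)[0] for u in xt for v in xpt]))

rng = np.random.default_rng(0)
x  = np.array([0.10, 0.11, 0.80, 0.30])     # two nearly tied entries
xp = np.array([0.11, 0.10, 0.75, 0.35])
a = 0.05
print("exact:", G_exact(x, xp, a))
for D in (1, 10, 100):
    print("D =", D, G_monte_carlo(x, xp, a, D, rng))
```

As D grows, the Monte Carlo values concentrate around the exact evaluation, in line with the discussion that follows.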
Note that the second alternative is faster to compute than the first one as long as, up to
constants, D² < n / log n, so that small values of D are favored from a computational point of view.
In that case, however, the approximation quality can suffer.
To better understand the trade-off between the two alternatives, we must address the question of
how large D should be so that the approximation error is not detrimental
to the performance of a learning algorithm when the approximate kernel G_D is used instead of
G. Lemma 1 provides a first answer to this question, showing that the approximation error
of the kernel is upper bounded by $O(1/\sqrt{D})$ with high probability.
Suppose that, instead of training the SVM with the smoothed feature mapping on the original
points {Ψ(x_i)}_{i=1,...,m}, we first randomly jitter each point x_i D times, resulting
in $\{\tilde{x}_i^j\}_{i=1,\dots,m;\,j=1,\dots,D}$, and then replace each Ψ(x_i) by the D-sample empirical average of
the jittered points mapped by Φ into the feature space, that is
$$\Psi_D(x_i) := \frac{1}{D}\sum_{j=1}^{D}\Phi(\tilde{x}_i^j).$$
Note that $\Psi_D(x_i)^\top\Psi_D(x_j) = G_D(x_i, x_j)$, hence training an SVM with the Monte Carlo
approximation G_D instead of the exact kernel G is equivalent to solving (15) with {Ψ_D(x_i)}_{i=1,...,m}
in the hinge loss instead of {Ψ(x_i)}_{i=1,...,m}. Theorem 3 quantifies the approximation
in terms of the objective function F, which helps answer the question of the trade-off
between G and G_D in computational complexity and learning accuracy.
Theorem 3. For any 0 ≤ δ ≤ 1, the solution $\hat{w}_D$ of the SVM trained with the Monte Carlo
approximation (14), with D randomly jittered samples for each training point, satisfies, with
probability greater than 1 − δ,
$$F(\hat{w}_D) \le \min_{w} F(w) + \sqrt{\frac{8}{\lambda D}}\left(2 + \sqrt{8\log\frac{m}{\delta}}\right).$$
The proof is postponed to the appendices. It is known that, compared to the exact solution of
(15), an O(m^{-1/2})-approximate solution is sufficient to reach the optimal statistical accuracy
[9]. This accuracy can be attained in our analysis when D = O(m/λ), and since typically
λ ∼ m^{-1/2} [70], this suggests that it is sufficient to take D of order m^{3/2}. Going back to
the comparison of the two alternatives G and G_D, we see that the computational
cost of computing the full m × m Gram matrix with the exact evaluation is O(m²n²), while
the cost of computing the approximate Gram matrix with D = O(m^{3/2}) random samples is
O(m²D²n log n) = O(m⁵ n log n). This shows that, up to constants and logarithmic terms,
the Monte Carlo approximation is interesting when m = o(n^{1/3}); otherwise the exact evaluation
using explicit computation in the feature space is preferable.
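In practice, either Gram matrix can be plugged into any kernel machine. The following sketch (an illustration under the assumption that scikit-learn is available, with toy data and the plain Kendall kernel standing in for G or G_D) shows the mechanics of training an SVM from a precomputed Gram matrix.

```python
import numpy as np
from scipy.stats import kendalltau
from sklearn.svm import SVC

def kendall_gram(X, Y):
    """Gram matrix of the Kendall kernel between rows of X and rows of Y."""
    return np.array([[kendalltau(x, y)[0] for y in Y] for x in X])

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 10))                      # toy quantitative feature vectors
y = (X[:, 0] > X[:, 1]).astype(int)                # label depends on a pairwise comparison
X_train, X_test = X[:30], X[30:]
y_train, y_test = y[:30], y[30:]

clf = SVC(C=1.0, kernel="precomputed")
clf.fit(kendall_gram(X_train, X_train), y_train)         # m x m training Gram matrix
acc = clf.score(kendall_gram(X_test, X_train), y_test)   # test-vs-train Gram matrix
print(acc)
```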
Interestingly, we can view the extension (12) of the Kendall kernel to uncertain rankings from
the perspective of Hilbert space embeddings of probability distributions [68]. In fact, for x
fixed, the smoothed mapping Ψ(x) = EΦ(x + ε) is exactly the embedding of the shifted noise
distribution P + x in the reproducing kernel Hilbert space (RKHS) associated with the
Kendall kernel. As a consequence, the idea of the smoothed kernel G(x, x′) for x, x′ ∈ X is
essentially equivalent to how [55, Lemma 4] defines kernels on two probability distributions
from {P + x | x ∈ X}, using the Kendall kernel as the level-1 embedding kernel and the linear
inner product as the level-2 kernel in the feature space. As a result, given a fixed training set
D, training an SVM with G in place of Kτ is equivalent to training a Flex-SVM instead of an
ordinary SVM with Kτ [55]. In this case, Theorem 3 provides an error bound relative to the
optimal accuracy when training a Flex-SVM, if the exact evaluation of G is intractable
and its Monte Carlo approximation G_D is employed instead. This yields a trade-off between
computational complexity and approximation accuracy, which is particularly interesting when
working in high dimensions.
6 Relation to the diffusion kernel on Sn
It is interesting to relate the Mallows kernel (2) to the diffusion kernel on the symmetric
group proposed by [43], which is the diffusion kernel [42] on the Cayley graph of Sn generated
by adjacent transpositions with left-multiplication. This graph, illustrated for the specific case
of n = 4 in Figure 2, is defined as G = (V, E) with vertex set V = Sn and undirected edge
set $E = \big\{\{\sigma, \pi\sigma\} : \sigma \in S_n,\ \pi \in Q\big\}$, where $Q = \{(i, i+1) \,|\, i = 1, \ldots, n-1\}$ is the set of all
adjacent transpositions. Note that Q is symmetric in the sense that π ∈ Q ⇔ π⁻¹ ∈ Q, and
the graph adjacency relation is right-invariant, that is, σ ∼ σ′ ⇔ σ′σ⁻¹ ∈ Q. The
corresponding graph Laplacian is the matrix Δ with
$$\Delta_{\sigma,\sigma'} = \begin{cases} 1 & \text{if } \sigma \sim \sigma',\\ -(n-1) & \text{if } \sigma = \sigma',\\ 0 & \text{otherwise,}\end{cases}$$
where n − 1 is the degree of vertex σ (the number of edges incident to σ), and the
diffusion kernel on Sn is finally defined as
$$K_{\mathrm{dif}}^\beta(\sigma, \sigma') = \big[e^{\beta\Delta}\big]_{\sigma,\sigma'} \qquad (16)$$
for some diffusion parameter β ∈ R, where $e^{\beta\Delta}$ is the matrix exponential. $K_{\mathrm{dif}}^\beta$ is a right-invariant
kernel on the symmetric group [43, Proposition 2], and we denote by $\kappa_{\mathrm{dif}}^\beta$ the positive
definite function induced by $K_{\mathrm{dif}}^\beta$, such that $K_{\mathrm{dif}}^\beta(\sigma, \sigma') = \kappa_{\mathrm{dif}}^\beta(\sigma'\sigma^{-1})$. Since the Mallows
kernel $K_M^\lambda$ is straightforwardly right-invariant, we denote by $\kappa_M^\lambda$ the positive definite function
it induces in the same way. Interestingly, the Kendall tau distance $n_d$ coincides with the
shortest-path (geodesic) distance $d_G$ on this Cayley graph:
$$n_d(\sigma, \sigma') = d_G(\sigma, \sigma').$$
In contrast to the diffusion kernel, for which communication between permutations is a diffusion
process over the graph, the Mallows kernel
$$K_M^\lambda(\sigma, \sigma') = e^{-\lambda\, n_d(\sigma,\sigma')} = e^{-\lambda\, d_G(\sigma,\sigma')}$$
considers exclusively the shortest path over the graph when expressing the similarity between
permutations σ, σ′.
Figure 2: Cayley graph of S4 , generated by the transpositions (1 2) in blue, (2 3) in green,
and (3 4) in red.
A notable advantage of the Mallows kernel over the diffusion kernel is that the Mallows
kernel enjoys faster evaluation. On one hand, if data instances are total rankings, i.e., σ, σ′ ∈
Sn, evaluating $K_{\mathrm{dif}}^\beta(\sigma, \sigma')$ would require exponentiating an n!-dimensional Laplacian matrix
in a naive implementation; this can be reduced to exponentiating matrices of smaller sizes by
careful analysis in Fourier space, but remains problematic when the working dimension
n is large [43]. In contrast, evaluating $K_M^\lambda(\sigma, \sigma')$ only takes O(n log n) time. On the other
hand, if data instances are partial rankings of size k ≪ n, i.e., R, R′ ⊂ Sn, and we use the
convolution kernel (7) to extend the two kernels, then exploiting the sparsity of
the Fourier coefficients of the group algebra elements associated with partial rankings R, R′ of size k reduces the
evaluation of both the diffusion kernel and the Mallows kernel to O((2k)^{2k+3}) time, provided
that the corresponding kernel Fourier matrices $\hat{\kappa}(\mu)$ (for the relevant partitions µ) are precomputed before any kernel
evaluations take place [43, Theorem 13].
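For intuition, the following brute-force Python sketch (feasible only for tiny n, here n = 4 with 24 × 24 matrices; not the Fourier-space machinery of [43]; assumes NumPy and SciPy) builds the Cayley-graph Laplacian, computes the diffusion kernel (16) by matrix exponentiation, and compares it with the Mallows kernel evaluated from n_d.

```python
import itertools
import numpy as np
from scipy.linalg import expm

n = 4
perms = list(itertools.permutations(range(n)))          # S_n, |S_n| = n!
index = {p: i for i, p in enumerate(perms)}

def left_multiply_adjacent(p, i):
    """Left-multiply p by the adjacent transposition (i, i+1): swap values i and i+1 in p."""
    return tuple(i + 1 if v == i else i if v == i + 1 else v for v in p)

# Graph Laplacian of the Cayley graph generated by adjacent transpositions.
N = len(perms)
Delta = -(n - 1) * np.eye(N)
for p in perms:
    for i in range(n - 1):
        Delta[index[p], index[left_multiply_adjacent(p, i)]] = 1.0

beta = 0.5
K_dif = expm(beta * Delta)                              # diffusion kernel (16)

def n_d(p, q):
    """Kendall tau distance (number of discordant pairs)."""
    return sum(1 for a, b in itertools.combinations(range(n), 2)
               if (p[a] - p[b]) * (q[a] - q[b]) < 0)

lam = 0.5
K_mal = np.array([[np.exp(-lam * n_d(p, q)) for q in perms] for p in perms])
print(K_dif[0, :4])
print(K_mal[0, :4])
```

Both matrices are right-invariant; the Mallows kernel depends on σ, σ′ only through the geodesic distance, while the diffusion kernel aggregates contributions from all paths on the graph.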
7 Clustering rank data

Suppose we observe a heterogeneous collection of permutations σ₁, . . . , σₘ ∈ Sn, for instance the votes of a population, and we wish to partition it into c clusters S₁, . . . , S_c. A classic approach is k-means with the Kendall tau distance, which jointly seeks the clusters and prototype permutations π₁, . . . , π_c ∈ Sn by solving
$$\operatorname*{arg\,min}_{S_j,\ \pi_j\in S_n}\ \sum_{j=1}^{c}\sum_{i:\sigma_i\in S_j} n_d(\sigma_i, \pi_j). \qquad (17)$$
In the context of election data, for instance, such a clustering allows us to
explore the heterogeneity and identify typical sub-groups of voters with a common behavior
to understand, for example, their political support for various parties [28, 29, 52].
Note that (17) reduces to a single-ranking aggregation problem when c = 1, where the
center π is commonly known as Kemeny consensus [37] which is NP-hard to find [7]. With
the objective in (17) being non convex, Lloyd’s algorithm is usually employed to find local
minima in an iterative manner consisting of two steps: the assignment step assigns each point
to its closest cluster, and the update step updates each of the c cluster centers using the points
assigned to that cluster; the algorithm repeats until all the cluster centers remain unchanged
in an iteration. While the assignment step is usually fast, the update step is indeed equivalent
to solving a Kemeny consensus problem for each cluster, i.e., $\arg\min_{\pi_j\in S_n}\sum_{i:\sigma_i\in S_j} n_d(\sigma_i, \pi_j)$.
Since the exact Kemeny-optimal ranking is difficult to find, approximate techniques are
usually employed in practice such as Borda Count [16] or Copeland’s method [11].
As the Kendall tau distance is conditionally positive definite, we propose as an
alternative to use the kernel k-means approach [25, 80], which relaxes the assumption that
the cluster centers are permutations and instead works implicitly in the feature space, where
cluster centers can be any vector in $\mathbb{R}^{\binom{n}{2}}$, by considering the problem
$$\operatorname*{arg\,min}_{S_j,\ \mu_j\in\mathbb{R}^{\binom{n}{2}}}\ \sum_{j=1}^{c}\sum_{i:\sigma_i\in S_j}\|\Phi(\sigma_i) - \mu_j\|^2,$$
for which local minima can be found efficiently by Algorithm 1. Note that µj does not
correspond to a true permutation πj ∈ Sn in general, and the Kemeny consensus problem in the
update step is thus bypassed. It is worth noting that the algorithm is not limited to
clustering permutations: kernel k-means can be applied to total, partial, multivariate or
uncertain rankings once appropriate kernels are defined.
Algorithm 1 Kernel k-means for clustering heterogeneous rank data.
Input: a collection of permutations {σi}_{i=1}^m and a kernel function K over Sn, or a kernel matrix evaluated between pairwise data points K = (K(σi, σj))_{1≤i,j≤m}; the number of clusters c ≤ m.
1: Randomly initialize the cluster assignment of each data point and form c clusters S₁, . . . , S_c.
2: For each data point, find its new cluster assignment, i.e., for i = 1, . . . , m,
$$j^*(\sigma_i) = \operatorname*{arg\,min}_{j=1,\dots,c} d_{ij},$$
where
$$d_{ij} := \Big\|\Phi(\sigma_i) - \frac{1}{|S_j|}\sum_{\sigma_\ell\in S_j}\Phi(\sigma_\ell)\Big\|^2 = K(\sigma_i, \sigma_i) - \frac{2}{|S_j|}\sum_{\sigma_\ell\in S_j} K(\sigma_i, \sigma_\ell) + \frac{1}{|S_j|^2}\sum_{\sigma_v,\sigma_\ell\in S_j} K(\sigma_v, \sigma_\ell).$$
3: Update the clusters: Sj = {σi : j = j*(σi), i = 1, . . . , m}.
4: Repeat Steps 2–3 until the cluster assignments no longer change.
Output: Cluster assignments {Sj}_{j=1}^c.
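A minimal Python sketch of Algorithm 1, operating purely on a precomputed kernel matrix (illustrative only, not the authors' R implementation; assumes NumPy):

```python
import numpy as np

def kernel_kmeans(K, c, n_iter=100, seed=0):
    """Kernel k-means (Algorithm 1) from an m x m kernel matrix K, returning cluster labels."""
    m = K.shape[0]
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, c, size=m)                 # step 1: random initial assignment
    for _ in range(n_iter):
        D = np.zeros((m, c))
        for j in range(c):
            members = labels == j
            if not members.any():                       # keep empty clusters out of the way
                D[:, j] = np.inf
                continue
            Kj = K[:, members]
            # d_ij = K(s_i, s_i) - 2/|S_j| sum_l K(s_i, s_l) + 1/|S_j|^2 sum_{v,l} K(s_v, s_l)
            D[:, j] = (np.diag(K) - 2 * Kj.mean(axis=1)
                       + K[np.ix_(members, members)].mean())
        new_labels = D.argmin(axis=1)                   # step 2: reassign to nearest center
        if np.array_equal(new_labels, labels):          # step 4: stop when assignments are stable
            break
        labels = new_labels
    return labels

# Usage: K = (K_tau(sigma_i, sigma_j))_{i,j}, precomputed as in the earlier sketches.
```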
An alternative approach to clustering is to fit a mixture model, which posits a probabilistic model
for rankings in each homogeneous sub-population. Mixture models not only allow us to cluster
data, but more generally to estimate a distribution on the space of permutations that can then
be used for other purposes, such as combining evidence. One popular choice of probability
distribution over Sn is the Mallows model [50], which expresses the probability of observing σ as
$$f(\sigma \,|\, \pi, \lambda) = C(\lambda)\exp[-\lambda\, n_d(\sigma, \pi)], \qquad (18)$$
where the central ranking π ∈ Sn and the precision λ ≥ 0 are model parameters, and the
normalization constant $C(\lambda) = 1/\sum_{\sigma'\in S_n}\exp[-\lambda\, n_d(\sigma', \pi)]$ is chosen so that f(·|π, λ) is a
valid probability mass function over Sn. Notably, C(λ) does not depend on the center π due
to the symmetry of Sn.
We follow the mixture modeling setup of [56]. Now suppose that a population consists
of c sub-populations. A Mallows mixture model assumes that an observation comes from
group j with probability pj ≥ 0 for j = 1, . . . , c and that, given that the observation belongs to
sub-population j, it is generated from a Mallows model with central ranking πj and precision
λj, i.e., the probability of observing σ under the Mallows mixture model is
$$f(\sigma) = \sum_{j=1}^{c} p_j\, f(\sigma \,|\, \pi_j, \lambda_j) = \sum_{j=1}^{c} p_j\, C(\lambda_j)\exp[-\lambda_j\, n_d(\sigma, \pi_j)]. \qquad (19)$$
Denoting $\pi = \{\pi_j\}_{j=1}^c$, $\lambda = \{\lambda_j\}_{j=1}^c$ and $p = \{p_j\}_{j=1}^c$ with $\sum_{j=1}^c p_j = 1$, the log-likelihood of
the observed data $\sigma = \{\sigma_i\}_{i=1}^m$ is
$$L(\pi, \lambda, p \,|\, \sigma) = \sum_{i=1}^{m}\log f(\sigma_i) = \sum_{i=1}^{m}\log\left\{\sum_{j=1}^{c} p_j\, C(\lambda_j)\exp[-\lambda_j\, n_d(\sigma_i, \pi_j)]\right\}. \qquad (20)$$
The Mallows mixture model is usually fitted by maximum likelihood using the EM algorithm.
Specifically, by introducing latent (membership) variables z = {zij : i = 1, . . . , m, j = 1, . . . , c}
where zij = 1 if σi belongs to group j and 0 otherwise, the complete log-likelihood of data is
$$L_C(\pi, \lambda, p \,|\, \sigma, z) = \sum_{i=1}^{m}\sum_{j=1}^{c} z_{ij}\big[\log p_j + \log C(\lambda_j) - \lambda_j\, n_d(\sigma_i, \pi_j)\big].$$
The EM algorithm can be implemented to find local maximum likelihood estimates following
two steps iteratively until convergence: the E-step calculates the expected value of member-
ship variables ẑ conditioned on the current estimates of the model parameters π, λ, p, and
the M-step updates the model parameters π, λ, p by maximizing the expected complete log-
likelihood L̂C = LC (π, λ, p|σ, ẑ) where membership variables are replaced by their expected
values. The final estimate ẑ_{ij} quantifies our belief that σ_i belongs to group j, and can thus
be used to form clusters $\{S_j\}_{j=1}^c$ providing a partition of the data, where
$$S_j = \Big\{\sigma_i : \hat{z}_{ij} = \max_{\ell}\hat{z}_{i\ell},\ i = 1, \ldots, m\Big\}. \qquad (21)$$
A closer look at the EM algorithm reveals that optimizing L̂_C with respect to π alone in
the M-step is indeed equivalent to finding a (weighted) Kemeny consensus for each group,
i.e., solving $\arg\min_{\pi_j\in S_n}\sum_{i=1}^{m} \hat{z}_{ij}\, n_d(\sigma_i, \pi_j)$, for which an exact solution is difficult, as
mentioned above in the context of k-means clustering. Similarly to the idea of kernel k-means
in contrast to classic k-means, we propose to bypass the Kemeny consensus
problem by working in the feature space instead. Note that the Mallows probability mass
function (18) can be equivalently written as $f(\sigma\,|\,\pi, \lambda) \propto \exp\big(-\lambda\|\Phi(\sigma) - \Phi(\pi)\|^2\big)$, up to a
constant scaling of λ, by using (4). We therefore propose to relax the constraint that the center has to
be a true permutation π ∈ Sn, and consider the following two alternatives in place of f,
following the mixture modeling approach stated above:
(i) Kernel Mallows. The Mallows probability mass function (18) over Sn is generalized to
admit any point $\mu \in \mathbb{R}^{\binom{n}{2}}$ in the feature space as the population center, i.e.,
$$g(\sigma \,|\, \mu, \lambda) = C(\mu, \lambda)\exp\big(-\lambda\|\Phi(\sigma) - \mu\|^2\big), \qquad (22)$$
where the normalization constant $C(\mu, \lambda) = 1/\sum_{\sigma'\in S_n}\exp\big(-\lambda\|\Phi(\sigma') - \mu\|^2\big)$ is chosen so
that g(·|µ, λ) is a valid probability mass function over Sn. Notably, C(µ, λ) now depends on
the center µ as well.
If we replace the probability mass function of the classic Mallows model f in (20) by that of kernel
Mallows g, the Kemeny consensus problem is averted when the EM algorithm is used to find
a local maximum likelihood estimate. However, another computational setback arises: the
expected complete log-likelihood L̂_C to maximize in the M-step of the EM algorithm
is concave with respect to µ or λ separately, but not jointly. Hence alternating
optimization is often used in practice, with the caveats of intensive computation and no
guarantee of attaining a global optimum in the M-step at each iteration.
(ii) Kernel Gaussian. Note that (22) has a form similar to the Gaussian density; we therefore
consider, for σ ∈ Sn,
$$g^\dagger(\sigma \,|\, \mu, \lambda) = \left(\frac{\lambda}{\pi}\right)^{\binom{n}{2}/2}\exp\big(-\lambda\|\Phi(\sigma) - \mu\|^2\big), \qquad (23)$$
which is exactly $\mathcal{N}\big(\Phi(\sigma) \,|\, \mu, (2\lambda)^{-1}I\big)$, i.e., the $\binom{n}{2}$-dimensional Gaussian density with
mean µ and isotropic covariance matrix (2λ)⁻¹I evaluated at Φ(σ). Notably, g†(·|µ, λ) is not
a valid probability mass function over Sn.
The mixture modeling approach stated above using g† instead of f is in fact equivalently
stated in Algorithm 2. It is worth noting that the algorithm also applies to
total, partial, multivariate or uncertain rankings with appropriate kernels defined, as [77, Table
2] provides the counterpart of Algorithm 2 for the case where a kernel matrix evaluated between
data points is given instead of explicit feature vectors. However, since g† itself is not a valid
probability mass function over Sn, an evident drawback is that we lose the probabilistic
interpretation of the mixture distribution as in (19).
7.3 Experiments
Clustering 1980 APA election data. In the 1980 American Psychological Association
(APA) presidential election, voters were asked to rank 5 candidates in order of preference,
and 5738 votes in the form of total rankings were reported and thus used in our experiment.
The dataset was thoroughly studied by [18].
Algorithm 2 Kernel-trick-embedded Gaussian mixture model for clustering heterogeneous rank data.
Input: a collection of permutations {σi}_{i=1}^m and a kernel function K over Sn; the number of clusters c ≤ m.
1: Compute the feature points $\Phi(\sigma_i) \in \mathbb{R}^{\binom{n}{2}}$ mapped by the Kendall embedding.
2: Fit a Gaussian mixture model to $\{\Phi(\sigma_i)\}_{i=1}^m$ in $\mathbb{R}^{\binom{n}{2}}$ by maximum likelihood with the EM algorithm, under the constraint of an isotropic covariance matrix, i.e., Σ = (2λ)⁻¹I.
3: Use the membership estimates ẑ to form clusters by (21).
Output: Cluster assignments {Sj}_{j=1}^c.
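When the explicit Kendall embedding can be afforded (its dimension is $\binom{n}{2}$), Algorithm 2 can be realized with an off-the-shelf Gaussian mixture routine. The sketch below assumes scikit-learn, whose "spherical" covariance type corresponds to the isotropic constraint Σ = (2λ)⁻¹I (with one variance, i.e., one λ_j, per component); the data here are random permutations used purely for illustration.

```python
import itertools
import numpy as np
from sklearn.mixture import GaussianMixture

def phi(sigma):
    """Kendall embedding of a permutation (sigma[i] = rank of item i)."""
    pairs = list(itertools.combinations(range(len(sigma)), 2))
    return np.array([np.sign(sigma[i] - sigma[j]) for i, j in pairs]) / np.sqrt(len(pairs))

rng = np.random.default_rng(0)
sigmas = [rng.permutation(6) for _ in range(200)]        # toy collection of permutations
X = np.vstack([phi(s) for s in sigmas])                  # step 1: map to the feature space

gmm = GaussianMixture(n_components=3, covariance_type="spherical", random_state=0)
labels = gmm.fit_predict(X)                              # steps 2-3: EM fit + hard assignment
print(np.bincount(labels))
```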
We first use k-means approaches to cluster the data. We compare the proposed kernel
k-means algorithm (Algorithm 1 with the Kendall kernel Kτ) to the classic k-means algorithm
formulated as (17). For classic k-means, where cluster centers are required to be prototype
permutations, three methods are employed in the center-update step at each iteration:
brute-force search for the Kemeny-optimal ranking, and approximate rankings induced by Borda
Count and by Copeland's method. In each case, we vary the number of clusters from
2 to 10, and the algorithm is repeated 50 times with randomly initialized configurations for
each fixed number of clusters. We observe from Figure 3 (Left) that kernel k-means and
classic k-means with approximate centers run much faster than optimal k-means, for which
the Kemeny-optimal ranking is time-consuming to find by brute-force search. Further,
Figure 3 (Middle) shows that kernel k-means outperforms all three methods based on
classic k-means in terms of the average silhouette scores of the clustering results, which indicates
that kernel k-means splits the data into more consistent sub-groups, in the sense
that instances are on average (as measured by the Kendall tau distance) more similar within the same
cluster and more dissimilar across clusters. We also observe that kernel k-means returns
more robust clusters under perturbations of the data (see appendices).
Figure 3: Across different numbers of clusters: Left: computational time (in seconds) per run of the k-means algorithms. Middle: average silhouette scores of the k-means methods. Right: average silhouette scores of the Mallows mixture modeling methods. (Methods compared: bruteKmeans, bordaKmeans, copelandKmeans, kernelKmeans; bruteMallows, bordaMallows, copelandMallows, kernelMallows, kernelGaussian.)
Mixture modeling is then used to fit the data, and a partition of the votes is derived
from the fitted models, forming a clustering result. Baseline models are the Mallows mixture
models fitted by the EM algorithm [56] using three different center-update algorithms at each
iteration: brute-force search for the Kemeny-optimal ranking, and approximate rankings induced by
Borda Count and by Copeland's method. As proposed in this paper, we embed the kernel trick
in Mallows mixture modeling with the two alternatives g (22) and g† (23) in place of f (18).
In each case, we vary the number of clusters from 2 to 10, and the algorithm is
repeated 50 times with randomly initialized configurations for each fixed number of clusters.
As shown in Figure 3 (Right), fitting a Gaussian mixture to the data in the feature space,
or equivalently using g† instead of f, provides a preferable split of the data into sub-groups,
with higher average silhouette scores across the different numbers of clusters selected a priori.
Clustering ESC voting data. We finally perform clustering on a dataset of multivariate
partial rankings. In the finale of Eurovision Song Contest (ESC), each participating country
casts one top-k vote over the finalists who represent their home country. Taken from [34],
the dataset consists of 34 multivariate ranking instances, each being a series of 6 partial
votes over top 8 finalists from 2007 to 2012 respectively.
In comparison with the mixture of Insertion Sorting Rank (ISR) models for clustering
multivariate partial rank data proposed by [34], we implement the kernel k-means algorithm
(Algorithm 1) with the Kendall kernel extended to multivariate rankings (8) and equal
weights µj = 1/p, where p = 6 corresponds to the six contests across years. For each
fixed number of clusters, the kernel k-means algorithm is repeated 100 times with randomly
initialized configurations, and the ISR mixture modeling approach 10 times. We vary
the number of clusters from 2 to 6; the optimal number is selected to be 2
for kernel k-means with respect to the highest average silhouette score, and 5 for the ISR
mixture model with respect to the highest BIC value. Fitting the
ISR mixture model for clustering takes 2 hours in total, while forming the partition
of the data with kernel k-means takes less than 10 seconds. Although it is beyond the scope of this study
to further explore meaningful voting blocs, the colored map of Europe and Asia showing the
clustering of the participating countries according to their voting behavior
(Figure 4, Left) suggests that there exist interesting geographical alliances between countries
in the voting data. For example, the country-clusters returned by both algorithms show
signs of strong amity within Eastern Europe. Silhouette plots for both algorithms are shown
in Figure 4 (Right). Despite the relatively small number of clusters selected for kernel
k-means, the silhouette plot (Figure 4a, Right) attests that the underlying clusters are well
formed. Note that both silhouette plots use the same distance as kernel k-means,
which may bias the comparison against a clustering scheme based on probabilistic modeling with ISR
mixtures. However, the two approaches behave distinctly in identifying subgroups. For
example, the ISR mixture model distinguishes Portugal as a singleton among all countries,
although interpreting such clustering results remains to be studied. On the other hand, the
k-means-based approach tends to find more evenly distributed subgroups, in the sense that
the number of individuals in each subgroup is more consistent. Therefore, kernel k-means
clustering is favored if the interest of the study lies in the behavior of large groups of voters
rather than in potential outlier individuals. Notably, the detection of outliers can be performed with other kernel
algorithms (Section 9).
(a) Country-clusters returned by kernel k-means, where the number of clusters 2 is selected with respect to the highest silhouette score averaged over all countries.

(b) Country-clusters returned by ISR mixture modeling, where the number of clusters 5 (including in particular "Portugal" as a singleton) is selected with respect to the highest fitted BIC value.

Figure 4: Clustering results of participating countries to the ESC according to their voting behavior, illustrated by a geographic map (Left) and a silhouette plot (Right).
Table 1: Summary of biomedical datasets.
preprocessing to remove missing values as follows: first we removed two samples (both labeled
“relapse”) from the training set that have around 10% and 45% of missing gene values; next
we discarded any gene whose value was missing in at least one sample, amounting to a total
of 3.5% of all genes.
We compare the Kendall kernel to three standard kernels, namely the linear kernel, the homogeneous
2nd-order polynomial kernel and the Gaussian RBF kernel with bandwidth set by the "median
trick", using SVM (with regularization parameter C) and Kernel Fisher Discriminant
(KFD, without tuning parameter) as classifiers. In addition, we include in the benchmark
classifiers based on Top Scoring Pairs (TSP) [24], namely (1-)TSP, k-TSP [71]¹ and APMV
(all-pairs majority votes, i.e., $\binom{n}{2}$-TSP). Finally, we also test SVMs with various kernels using
as input only top features selected by TSP [66].
In all experiments, each kernel is centered (on the training set) and scaled to unit norm
in the feature space. For KFD-based models, we add 10⁻³ to the diagonal of the centered
and scaled kernel matrix, as suggested by [54]. The Kendall kernel we use in practice is a
soft version of (9), in the sense that the extreme values ±1 can still be attained in the presence of
ties; specifically, we use
$$K_\tau(x, x') = \frac{n_c(x, x') - n_d(x, x')}{\sqrt{(n_0 - n_1)(n_0 - n_2)}},$$
where $n_0 = \binom{n}{2}$ and n₁, n₂ are the numbers of tied pairs in x and x′ respectively.
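For reference, this tie-corrected similarity is exactly the Kendall tau-b statistic, so in a Python setting (the experiments themselves were run in R) one could compute it directly with `scipy.stats.kendalltau`:

```python
import numpy as np
from scipy.stats import kendalltau

x  = np.array([5.0, 5.0, 2.0, 1.0, 3.0])   # one tied pair in x  (n1 = 1)
xp = np.array([4.0, 3.0, 3.0, 1.0, 2.0])   # one tied pair in xp (n2 = 1)
tau_b, _ = kendalltau(x, xp)               # (n_c - n_d) / sqrt((n0 - n1)(n0 - n2))
print(tau_b)
```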
For the three datasets that are split into training and test sets, we report the performance
on the test set; otherwise we perform a 5-fold cross-validation repeated 10 times and report
the mean performance over the 5×10 = 50 splits to evaluate the performance of the different
methods. In addition, on each training set, an internal 5-fold cross-validation is performed
to tune parameters, namely the C parameter of SVM-based models (optimized over a grid
ranging from 10−2 to 103 in log scale), and the number k of k-TSP in case of feature selection
(ranging from 1 to 5000 in log scale).
Table 2 and Figure 5 (Left) summarize the performance of each model across the datasets.
¹While the original k-TSP algorithm selects only the top k disjoint pairs with the constraint that k is less than 10, we do not restrict ourselves to either of these two conditions, since we consider k-TSP in this study essentially as a feature-pair scoring algorithm.
Table 2: Prediction accuracy (%) of different methods across biomedical datasets. Models are
named after the candidate methods (SVM or KFD) and candidate kernels, namely the linear kernel
(linear), 2nd-order homogeneous polynomial kernel (poly), Gaussian RBF kernel (rbf) or
Kendall kernel (kdt), and whether feature selection is applied (TOP) or not (ALL).
An SVM with the Kendall kernel achieves the highest average prediction accuracy overall
(79.39%), followed by a linear SVM trained on a subset of features selected from the top
scoring pairs (77.16%) and a standard linear SVM (76.09%). The SVM with Kendall kernel
outperforms all the other methods at a P-value of 0.07 according to a Wilcoxon rank test.
We note that even though models based on KFD are generally less accurate than those
based on SVMs, the relative order of the different kernels is consistent between KFD and
SVM, adding evidence that the Kendall kernel provides an interesting alternative to other
kernels in this context. The performances of TSP and k-TSP, based on majority vote rules,
are comparatively worse than SVMs using the same features, as already observed by [66].
Figure 5 further shows how the performance of the different kernels depends on the choice
of the C parameter of the SVM (Middle), and on the number of features used (Right), on
some representative datasets. We observe that compared to other kernels, an SVM with
the Kendall kernel is relatively insensitive to hyper-parameter C especially when C is large,
which corresponds to a hard-margin SVM. This may explain in part the success of SVMs in
this setting, since the risk of choosing a bad C during training is reduced. Regarding the
number of features used in case of feature selection, we notice that it does not seem to be
beneficial to perform feature selection in this problem, explaining why the Kendall kernel
which uses all pairwise comparisons between features outperforms other kernels restricted to
a subset of these pairs. In particular, the feature space of the Kendall and Mallows kernels
is precisely the space of binary pairwise comparisons defined by [24], and the results show
that instead of selecting a few features in this space as the Top Scoring Pairs (TSP)-family
classifiers do [24, 71, 79, 49], one can simply work with all pairs with the kernel trick.
Finally, as a proof of concept we empirically compare on one dataset the smooth alter-
native (12) and its Monte Carlo approximate (14) with the original Kendall kernel. Figure
6 shows how the performance varies with the amount of noise added to the samples (Left),
[Figure 5: prediction accuracy (acc) of the different methods on representative datasets (e.g. BC1, PC1). The compared models include SVM and KFD with linear, polynomial, RBF and Kendall (kdt) kernels, with (TOP) or without (ALL) feature selection, as well as TSP, k-TSP and APMV.]
and how the performance varies with the number of samples in the Monte Carlo scheme for
a given amount of noise (Right). It confirms that the smooth alternative (12) can improve
the performance of the Kendall kernel, and that the amount of noise (window size) should
be considered as a parameter of the kernel to be optimized. Although the D²-sample Monte
Carlo approximate kernel (14) mainly serves as a fast estimate of the exact evaluation of
(12), the results show that the idea of jittering the input with a suitable noise can also bring an appealing
benefit for data analysis with the Kendall kernel, even when D is small. This also justifies the
motivation of our proposed smooth alternative (12). Last but not least, despite the fact that
the convergence of the D²-sample Monte Carlo approximation to the exact kernel evaluation
is guaranteed at a certain rate by Theorem 3, experiments show that the convergence in practice is typically
faster than the theoretical bound, and even faster when the window size a is small.
This is due to the fact that the convergence rate also depends on the observed data
distribution in the input space, about which we have not made any specific assumption in our
analysis.
[Figure 6: cross-validated accuracy (cvacc) on the MB dataset for the smooth alternative: exact evaluation (SVMkdtALLalt--exact) and Monte Carlo approximations (SVMkdtALLalt--MCapprox) for D = 1, 3, 5, 7, 9, compared with the original Kendall kernel (SVMkdtALL); Left: as a function of the noise window size; Right: as a function of D for a fixed window size (a = 10204).]
9 Discussion

We have shown that the Kendall tau correlation coefficient and the related Mallows kernel are positive definite kernels for permutations that can be evaluated efficiently, extended them to partial, multivariate and uncertain rankings, and illustrated their benefits on clustering heterogeneous rank data and classifying biomedical data, demonstrating that for highly noisy data the Kendall kernel is competitive with, or even outperforms, other state-of-the-art kernels.
We believe that computationally efficient kernels over the symmetric group pave the way
to numerous applications beyond the ones we pursued in this paper. In unsupervised data
mining, kernel density estimation, for example, can be applied to model the distribution
underlying a collection of rankings; by the representer theorem, the resulting estimate depends
solely on the observed data points, circumventing the exponentially large cardinality
of the symmetric group, and a consensus ranking that best represents the data can be taken as the
one with the highest estimated probability. In more complex settings, there is much interest beyond
finding a single consensus ranking, typically in the context of political votes or social choice:
homogeneous sub-populations in the data can be clustered by algorithms such as kernel
k-means or spectral clustering [20]; dependencies or principal structural factors in the data can
be found by kernel canonical correlation analysis [44] or kernel principal component analysis
[61]; and outliers in a collection of rank data can be detected with one-class SVMs [62, 72]. In a
more predictive setting, Support Vector Machines and kernel ridge regression are representative
methods for solving classification and regression problems, among many other kernel
algorithms [60]. Notably, the input/output kernel formalism allows us to predict rankings
as well as learn from rankings, with a wealth of algorithms such as multi-class SVMs or
structural SVMs [14, 73, 5] ready to suit the problem at hand.
A deeper understanding of the Kendall and Mallows kernels calls for more theoretical work
on the proposed kernels. In particular, a detailed analysis of the Fourier spectra of the Kendall
and Mallows kernels is provided in [51]. Those authors also introduced a tractable family of
normalized polynomial kernels of degree p that interpolates between Kendall (degree one)
and Mallows (infinite degree) kernels.
There are many interesting extensions of the current work. One direction would be to
include higher-order comparisons in measuring the similarity between permutations. Since the
fast computation of the Kendall and Mallows kernels is balanced by the fact that they only
rely on pairwise statistics between the ranks, computationally tractable extensions to higher-order
statistics, such as three-way comparisons, could potentially enhance the discriminative
power of the proposed kernels. Another interesting direction would be to extend the proposed
kernels to rankings over a partially ordered set. In fact, the current work relies on the assumption
that a (strict) total order can be associated with the (finite) set of items given to rank,
{x₁, . . . , xₙ}, which is implicitly presumed when we label the items by the subscripts [1, n]
and then define the Kendall and Mallows kernels by comparing all item pairs (i, j) for i < j
(Section 2). However, there are cases where the item set is intrinsically associated with a
(strict) partial order, so that some item pairs are conceptually incomparable. In that case,
we can collect all comparable item pairs into a set denoted by E and define the kernels
by comparing only those item pairs (i, j) in E, as sketched below. Notably, evaluating the extended kernels is
still fast, as we can simply replace the merge sort algorithm for total orders (Section 2) by a
topological sort algorithm for partial orders [12, Section 22.4]. We leave further investigations
of this generalization to future work.
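As a toy illustration of this extension, here is a brute-force sketch over an assumed set E of comparable pairs (illustration only; the fast topological-sort implementation mentioned above is not reproduced here):

```python
import numpy as np

def kendall_partial_order(s1, s2, E):
    """Kendall-type kernel comparing only the item pairs in E (the comparable pairs)."""
    return sum(np.sign((s1[i] - s1[j]) * (s2[i] - s2[j])) for i, j in E) / len(E)

# Four items with a partial order in which items 1 and 2 are incomparable.
E = [(0, 1), (0, 2), (0, 3), (1, 3), (2, 3)]
s1 = (3, 1, 2, 0)      # rank vectors over the four items
s2 = (3, 2, 1, 0)
print(kendall_partial_order(s1, s2, E))
```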
Acknowledgments
This work was supported by the European Union 7th Framework Program through the
Marie Curie ITN MLPM grant No 316861, the European Research Council grant ERC-
SMAC-280032, the Miller Institute for Basic Research in Science [to JPV]; and the Fulbright
Foundation [to JPV]. We thank anonymous reviewers for interesting comments, in particular
the possibility to extend the kernels to partial orders.
References
[1] N. Ailon, M. Charikar, and A. Newman. Aggregating inconsistent information: ranking
and clustering. J. ACM, 55(5):23:1–23:27, 2008.
[3] K. J. Arrow. Social Choice and Individual Values, volume 12. Yale Univ Press, 2012.
[4] F. R. Bach, G. R. G. Lanckriet, and M. I. Jordan. Multiple kernel learning, conic duality,
and the SMO algorithm. In Proceedings of the Twenty-First International Conference
on Machine Learning, page 6, New York, NY, USA, 2004. ACM.
[7] J. Bartholdi III, C. A. Tovey, and M. A. Trick. Voting schemes for which it can be
difficult to tell who won the election. Social Choice and welfare, 6(2):157–165, 1989.
[9] L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In J.C. Platt, D. Koller,
Y. Singer, and S. Roweis, editors, Adv. Neural Inform. Process. Syst., volume 20, pages
161–168. Curran Associates, Inc., 2008.
[13] C. Cortes and V. Vapnik. Support-vector networks. Mach. Learn., 20(3):273–297, 1995.
ISSN 0885-6125.
[15] D. E. Critchlow. Metric methods for analyzing partially ranked data. Springer, 1985.
[16] J. C. de Borda. Mémoire sur les élections au scrutin. Histoire de l’Academie Royale des
Sciences, 1781.
[19] C. Dwork, R. Kumar, M. Naor, and D. Sivakumar. Rank aggregation methods for the
web. In Proceedings of the Tenth International Conference on World Wide Web, pages
613–622. ACM, 2001.
[20] M. Filippone, F. Camastra, F. Masulli, and S. Rovetta. A survey of kernel and spectral
methods for clustering. Pattern Recogn., 41(1):176–190, 2008.
[21] M. A. Fligner and J. S. Verducci. Distance based ranking models. J. R. Stat. Soc. Ser.
B, 48(3):359–369, 1986.
[23] T. Gärtner, J.W. Lloyd, and P.A. Flach. Kernels and distances for structured data.
Mach. Learn., 57(3):205–232, 2004.
[25] M. Girolami. Mercer kernel-based clustering in feature space. IEEE Trans. Neural
Network, 13(3):780–784, 2002.
[26] M. Gönen and E. Alpaydın. Multiple kernel learning algorithms. J. Mach. Learn. Res.,
12:2211–2268, 2011.
[28] I. C. Gormley and T. B. Murphy. Analysis of Irish third-level college applications data.
J. R. Stat. Soc. Ser. A, 169(2):361–379, 2006.
[29] I. C. Gormley and T. B. Murphy. Exploring voting blocs within the Irish electorate: A
mixture modeling approach. J. Am. Stat. Assoc., 103(483):1014–1027, 2008.
[32] J. Huang, C. Guestrin, and L. Guibas. Fourier theoretic probabilistic inference over
permutations. J. Mach. Learn. Res., 10:997–1070, 2009.
[34] J. Jacques and C. Biernacki. Model-based clustering for multivariate partial ranking
data. J. Stat. Plann. Infer., 149:201–217, 2014.
[36] H. Kashima, K. Tsuda, and A. Inokuchi. Marginalized kernels between labeled graphs.
In T. Faucett and N. Mishra, editors, Proceedings of the Twentieth International Con-
ference on Machine Learning, pages 321–328, New York, NY, USA, 2003. AAAI Press.
[37] J. G. Kemeny and J. L. Snell. Mathematical models in the social sciences, volume 9.
Ginn New York, 1962.
[40] W. R. Knight. A computer method for calculating Kendall’s tau with ungrouped data.
J. Am. Stat. Assoc., 61(314):436–439, 1966.
[41] I. R. Kondor. Group theoretical methods in machine learning. PhD thesis, Columbia
University, 2008.
[42] I. R. Kondor and J. Lafferty. Diffusion kernels on graphs and other discrete input
spaces. In Proceedings of the Nineteenth International Conference on Machine Learning,
volume 2, pages 315–322, San Francisco, CA, USA, 2002. Morgan Kaufmann Publishers
Inc.
[43] R. I. Kondor and M. S. Barbosa. Ranking with kernels in Fourier space. In A. T. Kalai
and M. Mohri, editors, COLT 2010 - The 23rd Conference on Learning Theory, Haifa,
Israel, June 27-29, 2010, pages 451–463. Omnipress, 2010.
[44] P. L. Lai and C. Fyfe. Kernel and nonlinear canonical correlation analysis. Int. J.
Neural Syst., 10(05):365–377, 2000.
[47] G. Lebanon and Y. Mao. Non-parametric modeling of partially ranked data. J. Mach.
Learn. Res., 9:2401–2429, 2008.
[48] J. Li, H. Liu, and L. Wong. Mean-entropy discretized features are effective for classify-
ing high-dimensional biomedical data. In M. J. Zaki, J. T.-L. Wang, and H. Toivonen,
editors, Proceedings of the 3rd ACM SIGKDD Workshop on Data Mining in Bioin-
formatics (BIOKDD 2003), August 27th, 2003, Washington, DC, USA, pages 17–24,
2003.
[50] C. L. Mallows. Non-null ranking models. I. Biometrika, 44(1/2):114–130, 1957.
[52] J. I. Marden. Analyzing and modeling rank data. CRC Press, 1996.
[53] M. Meilă, K. Phadnis, A. Patterson, and J. Bilmes. Consensus ranking under the exponential model. In Proceedings of the Twenty-Third Annual Conference on Uncertainty in Artificial Intelligence (UAI-07), pages 285–294, Corvallis, Oregon, 2007. AUAI Press.
[54] S. Mika, G. Rätsch, J. Weston, B. Schölkopf, and K.R. Müller. Fisher discriminant
analysis with kernels. In Y.-H. Hu, J. Larsen, E. Wilson, and S. Douglas, editors,
Neural Networks for Signal Processing IX, pages 41–48. IEEE, 1999.
[56] T. B. Murphy and D. Martin. Mixtures of distance-based models for ranking data.
Comput. Stat. Data Anal., 41(3):645–655, 2003.
[59] I. J. Schoenberg. Metric spaces and positive definite functions. Trans. Am. Math. Soc.,
44(3):522–536, 1938.
[60] B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regu-
larization, Optimization, and Beyond. MIT Press, Cambridge, MA, 2002.
[63] B. Schölkopf, K. Tsuda, and J.-P. Vert. Kernel Methods in Computational Biology. MIT Press, Cambridge, Massachusetts, 2004.
[65] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge
University Press, New York, NY, USA, 2004.
[66] P. Shi, S. Ray, Q. Zhu, and M. A. Kon. Top scoring pairs for feature selection in machine
learning and applications to cancer outcome prediction. BMC Bioinformatics, 12:375,
2011.
[68] A. Smola, A. Gretton, L. Song, and B. Schölkopf. A Hilbert space embedding for distributions, pages 13–31. Springer, Berlin, Heidelberg, 2007. ISBN 978-3-540-75225-7. doi: 10.1007/978-3-540-75225-7_5.
[69] S. Sonnenburg, G. Rätsch, C. Schäfer, and B. Schölkopf. Large scale multiple kernel
learning. J. Mach. Learn. Res., 7:1531–1565, 2006.
[70] I. Steinwart. Consistency of support vector machines and other regularized kernel clas-
sifiers. IEEE Trans. Inform. Theory, 51(1):128–142, 2005.
[71] A. C. Tan, D. Q. Naiman, L. Xu, R. L. Winslow, and D. Geman. Simple decision rules
for classifying human cancers from gene expression profiles. Bioinformatics, 21(20):
3896–3904, 2005.
[72] D. M. Tax and R. P. Duin. Support vector data description. Mach. Learn., 54(1):45–66,
2004.
[73] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for
structured and interdependent output variables. J. Mach. Learn. Res., 6:1453–1484,
2005.
[77] J. Wang, J. Lee, and C. Zhang. Kernel trick embedded Gaussian mixture model. In Algorithmic Learning Theory, pages 159–174. Springer, 2003.
[80] R. Zhang and A. Rudnicky. A large scale clustering scheme for kernel k-means. In
Proceedings of the Sixteenth International Conference on Pattern Recognition, volume 4,
pages 289–292. IEEE, 2002.
A Proofs of theorems
Proof of Theorem 2. The proof is constructive. We show explicitly how to compute the Kendall kernel between two interleaving partial rankings; the idea is similar for top-k partial rankings. Denote by [1, n] the item set to be ranked and by A_{i_1,...,i_k}, A_{j_1,...,j_m} ⊂ S_n two interleaving partial rankings of size k and m respectively, whose subsets of item indices are denoted by I := {i_1, ..., i_k} and J := {j_1, ..., j_m}. We lighten the notation by writing A_I := A_{i_1,...,i_k} and A_J := A_{j_1,...,j_m}, and recall that, by definition, A_I and A_J are the subsets of S_n compatible with the two partial rankings respectively. In particular, |A_I| = n!/k! and |A_J| = n!/m!. Note that every item that does not appear in the partial ranking corresponding to A_I (or A_J) can be interleaved in any possible order with the other items for some permutation in that set.
The key observation in our proof is the "symmetry" of A_I (or A_J), in the sense that (i) for every item pair {i, j} with i, j ∈ I, all permutations in A_I agree on the relative order of items i and j; (ii) for every item pair {i, j} with i, j ∉ I, each π ∈ A_I is paired with a unique permutation ρ = (i, j) ◦ π ∈ A_I, obtained by swapping the ranks of items i and j in π, such that (π(i) − π(j))(ρ(i) − ρ(j)) < 0 while ρ coincides with π on the ranks of all other items.
By the definition of the convolution kernel and Theorem 1, we have
$$
\begin{aligned}
K_\tau(A_I, A_J) &= \frac{1}{|A_I|\,|A_J|} \sum_{\substack{\pi \in A_I \\ \pi' \in A_J}} \frac{1}{\binom{n}{2}} \sum_{1 \le i < j \le n} \operatorname{sign}\big(\pi(i)-\pi(j)\big)\operatorname{sign}\big(\pi'(i)-\pi'(j)\big) \\
&= \frac{k!\,m!}{(n!)^2 \binom{n}{2}} \sum_{\substack{\pi \in A_I \\ \pi' \in A_J}} \; \sum_{1 \le i < j \le n} \operatorname{sign}\big(\pi(i)-\pi(j)\big)\operatorname{sign}\big(\pi'(i)-\pi'(j)\big).
\end{aligned}
\tag{24}
$$
As we will always regard the item set [1, n] as the universe, we write the complement of a set S ⊂ [1, n] as S^c := [1, n] \ S. Since the item set can be divided into four disjoint subsets, [1, n] = (I ∩ J) ⊔ (I \ J) ⊔ (J \ I) ⊔ (I ∪ J)^c, any (unordered) item pair {i, j} falls into exactly one of ten cases:
1. both items in I ∩ J;
2. one item in I ∩ J, the other in I \ J;
3. one item in I ∩ J, the other in J \ I;
4. one item in I ∩ J, the other in (I ∪ J)^c;
5. one item in I \ J, the other in J \ I;
6. both items in I \ J;
7. both items in J \ I;
8. both items in (I ∪ J)^c;
9. one item in I \ J, the other in (I ∪ J)^c;
10. one item in J \ I, the other in (I ∪ J)^c.
We can now split the additive terms in (24) and regroup them case by case into ten parts. Denote by s_1 to s_10 the subtotals corresponding to cases 1 to 10, i.e.,
$$
K_\tau(A_I, A_J) = \sum_{l=1}^{10} s_l := \sum_{l=1}^{10} \left\{ \frac{k!\,m!}{(n!)^2 \binom{n}{2}} \sum_{\{i,j\}\ \text{in case}\ l} \; \sum_{\substack{\pi \in A_I \\ \pi' \in A_J}} \operatorname{sign}\big(\pi(i)-\pi(j)\big)\operatorname{sign}\big(\pi'(i)-\pi'(j)\big) \right\}.
$$
It is straightforward to see that s_6 to s_10 are all equal to 0, due to the symmetry of A_I and/or A_J. For example, for every item pair {i, j} in case 6, since both items i and j appear in I, their relative order is fixed, in the sense that sign(π(i) − π(j)) remains constant over all π ∈ A_I; since both items are absent from J, we can pair up permutations π', ρ' ∈ A_J such that sign(π'(i) − π'(j)) = −sign(ρ'(i) − ρ'(j)). As a result, the additive terms in s_6 cancel each other out and thus s_6 = 0.
We now compute s_1 to s_5. For every item pair {i, j} in case 1, that is i, j ∈ I ∩ J, since i, j ∈ I their relative order is the same for all π ∈ A_I; let us denote by τ ∈ S_{|I∩J|} the total ranking of the observed items indexed by I ∩ J with respect to A_I. Since also i, j ∈ J, we similarly denote by τ' ∈ S_{|I∩J|} the total ranking of the observed items indexed by I ∩ J with respect to A_J. Therefore we have
$$
\begin{aligned}
s_1 &= \frac{k!\,m!}{(n!)^2 \binom{n}{2}} \sum_{\substack{1 \le i < j \le n \\ i,j \in I \cap J}} \; \sum_{\substack{\pi \in A_I \\ \pi' \in A_J}} \operatorname{sign}\big(\pi(i)-\pi(j)\big)\operatorname{sign}\big(\pi'(i)-\pi'(j)\big) \\
&= \frac{1}{\binom{n}{2}} \sum_{\substack{1 \le i < j \le n \\ i,j \in I \cap J}} \operatorname{sign}\big(\tau(i)-\tau(j)\big)\operatorname{sign}\big(\tau'(i)-\tau'(j)\big) \\
&= \frac{\binom{|I \cap J|}{2}}{\binom{n}{2}} \, K_\tau(\tau, \tau') ,
\end{aligned}
$$
where the last line is by the definition of the Kendall kernel between τ and τ' restricted to the common items in I ∩ J. In particular, when I = J = [1, n] the two partial rankings are in fact total rankings, case 1 is the only non-empty case, and K_τ(A_I, A_J) = s_1 = K_τ(τ, τ'), as expected.
For every item pair {i, j} in case 2, we may assume without loss of generality that i ∈ I ∩ J and j ∈ I \ J, or equivalently i, j ∈ I with i ∈ J and j ∉ J. The relative order of i and j is thus determined by σ for every π ∈ A_I, but is not fixed across π' ∈ A_J, where we denote by σ ∈ S_k the total ranking of the k observed items in A_I and by σ' ∈ S_m the total ranking of the m observed items in A_J. There are (m + 1) possible positions at which j can be interleaved in some π' ∈ A_J, and the number of those positions giving j a lower rank than i is σ'(i). Therefore we have
$$
\begin{aligned}
s_2 &= \frac{k!\,m!}{(n!)^2 \binom{n}{2}} \sum_{\substack{i \in I \cap J \\ j \in I \setminus J}} \; \sum_{\substack{\pi \in A_I \\ \pi' \in A_J}} \operatorname{sign}\big(\pi(i)-\pi(j)\big)\operatorname{sign}\big(\pi'(i)-\pi'(j)\big) \\
&= \frac{1}{\binom{n}{2}} \sum_{\substack{i \in I \cap J \\ j \in I \setminus J}} \operatorname{sign}\big(\sigma(i)-\sigma(j)\big) \, \frac{m!}{n!} \sum_{\pi' \in A_J} \operatorname{sign}\big(\pi'(i)-\pi'(j)\big) \\
&= \frac{1}{\binom{n}{2}} \sum_{\substack{i \in I \cap J \\ j \in I \setminus J}} \operatorname{sign}\big(\sigma(i)-\sigma(j)\big) \, \frac{m!}{n!} \cdot \frac{n!}{(m+1)!} \Big( \sigma'(i) - \big((m+1) - \sigma'(i)\big) \Big) \\
&= \frac{1}{\binom{n}{2}(m+1)} \sum_{\substack{i \in I \cap J \\ j \in I \setminus J}} \big(2\sigma'(i) - m - 1\big) \operatorname{sign}\big(\sigma(i)-\sigma(j)\big) \\
&= \frac{1}{\binom{n}{2}(m+1)} \sum_{i \in I \cap J} \big(2\sigma'(i) - m - 1\big) \big(2(\sigma(i) - \tau(i)) - k + |I \cap J|\big) ,
\end{aligned}
$$
where the last line uses the fact that, for fixed i ∈ I ∩ J, summing sign(σ(i) − σ(j)) over j ∈ I \ J counts the items of I \ J ranked below i minus those ranked above i, which equals 2(σ(i) − τ(i)) − k + |I ∩ J|. Similarly, for case 3 we have
$$
s_3 = \frac{1}{\binom{n}{2}(k+1)} \sum_{i \in I \cap J} \big(2\sigma(i) - k - 1\big) \big(2(\sigma'(i) - \tau'(i)) - m + |I \cap J|\big) .
$$
For every item pair {i, j} in case 4, we may assume without loss of generality that i ∈ I ∩ J and j ∈ (I ∪ J)^c. Since j is absent from I (resp. J), there are (k + 1) (resp. (m + 1)) possible positions at which j can be interleaved in some π ∈ A_I (resp. π' ∈ A_J), and the number of those positions giving j a lower rank than i is σ(i) (resp. σ'(i)). Over all pairs of interleaved positions of j in π ∈ A_I and π' ∈ A_J, the number of pairs for which (π(i) − π(j))(π'(i) − π'(j)) > 0 is σ(i)σ'(i) + (k + 1 − σ(i))(m + 1 − σ'(i)), and the number of pairs for which (π(i) − π(j))(π'(i) − π'(j)) < 0 is σ(i)(m + 1 − σ'(i)) + σ'(i)(k + 1 − σ(i)). Therefore we have
$$
\begin{aligned}
s_4 &= \frac{k!\,m!}{(n!)^2 \binom{n}{2}} \sum_{\substack{i \in I \cap J \\ j \in (I \cup J)^c}} \; \sum_{\substack{\pi \in A_I \\ \pi' \in A_J}} \operatorname{sign}\big(\pi(i)-\pi(j)\big)\operatorname{sign}\big(\pi'(i)-\pi'(j)\big) \\
&= \frac{k!\,m!}{(n!)^2 \binom{n}{2}} \, \big|(I \cup J)^c\big| \sum_{i \in I \cap J} \frac{(n!)^2}{(k+1)!\,(m+1)!} \Big\{ \sigma(i)\sigma'(i) + \big(k+1-\sigma(i)\big)\big(m+1-\sigma'(i)\big) \\
&\qquad\qquad - \Big[ \sigma(i)\big(m+1-\sigma'(i)\big) + \sigma'(i)\big(k+1-\sigma(i)\big) \Big] \Big\} \\
&= \frac{\big|(I \cup J)^c\big|}{\binom{n}{2}(k+1)(m+1)} \sum_{i \in I \cap J} \big(2\sigma(i) - k - 1\big)\big(2\sigma'(i) - m - 1\big) .
\end{aligned}
$$
For case 5, a derivation similar to that of case 4, now interleaving i in A_J and j in A_I, leads to
$$
\begin{aligned}
s_5 &= \frac{k!\,m!}{(n!)^2 \binom{n}{2}} \sum_{\substack{i \in I \setminus J \\ j \in J \setminus I}} \; \sum_{\substack{\pi \in A_I \\ \pi' \in A_J}} \operatorname{sign}\big(\pi(i)-\pi(j)\big)\operatorname{sign}\big(\pi'(i)-\pi'(j)\big) \\
&= \sum_{i \in I \setminus J} \sum_{j \in J \setminus I} \frac{k!\,m!}{(n!)^2 \binom{n}{2}} \cdot \frac{(n!)^2}{(k+1)!\,(m+1)!} \Big\{ \sigma(i)\big(m+1-\sigma'(j)\big) + \sigma'(j)\big(k+1-\sigma(i)\big) \\
&\qquad\qquad - \Big[ \sigma(i)\sigma'(j) + \big(k+1-\sigma(i)\big)\big(m+1-\sigma'(j)\big) \Big] \Big\} \\
&= \frac{-1}{\binom{n}{2}(k+1)(m+1)} \sum_{i \in I \setminus J} \sum_{j \in J \setminus I} \big(2\sigma(i) - k - 1\big)\big(2\sigma'(j) - m - 1\big) .
\end{aligned}
$$
Finally, K_τ(A_{i_1,...,i_k}, A_{j_1,...,j_m}) = s_1 + s_2 + s_3 + s_4 + s_5, which concludes the proof. The resulting procedures are summarized in Algorithm 3 for interleaving partial rankings and in Algorithm 4 for top-k rankings. Note that in both algorithms the first step is the computationally most intensive one: we need to identify the total rankings restricted to the items present in the partial rankings. This can be achieved with any sorting algorithm, giving both algorithms a time complexity of O(k log k + m log m).
An R implementation of both algorithms can be found at https://github.com/YunlongJiao/kernrank.
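For illustration, the following minimal R sketch (independent of the kernrank package, whose interface may differ; encoding partial rankings as named rank vectors over their observed items is an assumption made here) shows this first step, recovering the restricted rankings σ, σ', τ and τ' used in Algorithms 3 and 4 through R's sorting-based rank() function.

```r
# Restricted rankings for two partial rankings, as used in Algorithms 3 and 4.
# x, y: named numeric vectors of ranks (or scores) over the observed items only,
#       e.g. x = c(a = 1, b = 3, c = 2) ranks item b above c above a (no ties assumed).
restricted_rankings <- function(x, y) {
  common <- intersect(names(x), names(y))   # items indexed by the intersection of I and J
  list(
    sigma       = rank(x),                  # ranking of the k items observed in x
    sigma_prime = rank(y),                  # ranking of the m items observed in y
    tau         = rank(x[common]),          # ranking of the common items induced by x
    tau_prime   = rank(y[common])           # ranking of the common items induced by y
  )
}

# Example: rk$tau and rk$tau_prime rank the common items {b, c}.
rk <- restricted_rankings(c(a = 1, b = 3, c = 2), c(b = 1, c = 2, d = 3))
```

Since rank() sorts its argument, this step costs O(k log k + m log m), as claimed above.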
Proof of Lemma 1. For any x ∈ R^n, note that ‖Φ(x)‖ ≤ 1. We can therefore apply [10, Example 6.3] to the random vectors X_j = Φ(x̃^j) − Ψ(x), which satisfy E[X_j] = 0 and ‖X_j‖ ≤ 2 a.s., to get, for any u ≥ 2/√D,
$$
P\big(\|\Psi_D(x) - \Psi(x)\| \ge u\big) \le \exp\left(-\frac{\big(u\sqrt{D} - 2\big)^2}{8}\right) .
$$
We recover (a) by setting the right-hand side equal to δ and solving for u. Part (b) then follows by a simple union bound.
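For concreteness, carrying out this last step explicitly (a small algebraic sketch; the constants in the statement of Lemma 1 in the main text may be arranged slightly differently), we obtain
$$
\exp\left(-\frac{(u\sqrt{D}-2)^2}{8}\right) = \delta
\quad\text{with } u \ge 2/\sqrt{D}
\quad\Longrightarrow\quad
u = \frac{2 + 2\sqrt{2\log(1/\delta)}}{\sqrt{D}} ,
$$
so that with probability at least 1 − δ, ‖Ψ_D(x) − Ψ(x)‖ ≤ (2 + 2√(2 log(1/δ)))/√D.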
Proof of Theorem 3. Let ŵ be a solution to the original SVM optimization problem, and ŵ_D a solution to the perturbed SVM, i.e., a solution of
$$
\min_{w} \; F_D(w) = \frac{\lambda}{2}\|w\|^2 + \widehat{R}_D(w), \tag{25}
$$
with $\widehat{R}_D(w) = \frac{1}{m}\sum_{i=1}^{m} \ell\big(y_i w^\top \Psi_D(x_i)\big)$. Since the hinge loss is 1-Lipschitz, i.e., |ℓ(a) − ℓ(b)| ≤ |a − b| for any a, b ∈ R, we obtain that for any $u \in \mathbb{R}^{\binom{n}{2}}$:
$$
\big|\widehat{R}(u) - \widehat{R}_D(u)\big| \le \frac{1}{m}\sum_{i=1}^{m} \big|u^\top\big(\Psi(x_i) - \Psi_D(x_i)\big)\big| \le \|u\| \sup_{i=1,\dots,m} \big\|\Psi_D(x_i) - \Psi(x_i)\big\| . \tag{26}
$$
Algorithm 3 Kendall kernel for two interleaving partial rankings.
Input: two partial rankings A_{i_1,...,i_k}, A_{j_1,...,j_m} ⊂ S_n, corresponding to subsets of item indices I := {i_1, ..., i_k} and J := {j_1, ..., j_m}.
1: Let σ ∈ S_k be the total ranking corresponding to the k observed items in A_{i_1,...,i_k}, and σ' ∈ S_m be the total ranking corresponding to the m observed items in A_{j_1,...,j_m}.
2: Let τ ∈ S_{|I∩J|} be the total ranking of the observed items indexed by I ∩ J in A_{i_1,...,i_k}, and τ' ∈ S_{|I∩J|} be the total ranking of the observed items indexed by I ∩ J in A_{j_1,...,j_m}.
3: Initialize s_1 = s_2 = s_3 = s_4 = s_5 = 0.
4: If |I ∩ J| ≥ 2, update
$$ s_1 = \frac{\binom{|I \cap J|}{2}}{\binom{n}{2}} K_\tau(\tau, \tau') . $$
5: If |I ∩ J| ≥ 1 and |I \ J| ≥ 1, update
$$ s_2 = \frac{1}{\binom{n}{2}(m+1)} \sum_{i \in I \cap J} \big(2\sigma'(i) - m - 1\big)\big(2(\sigma(i) - \tau(i)) - k + |I \cap J|\big) . $$
6: If |I ∩ J| ≥ 1 and |J \ I| ≥ 1, update
$$ s_3 = \frac{1}{\binom{n}{2}(k+1)} \sum_{i \in I \cap J} \big(2\sigma(i) - k - 1\big)\big(2(\sigma'(i) - \tau'(i)) - m + |I \cap J|\big) . $$
7: If |I ∩ J| ≥ 1 and |(I ∪ J)^c| ≥ 1, update
$$ s_4 = \frac{|(I \cup J)^c|}{\binom{n}{2}(k+1)(m+1)} \sum_{i \in I \cap J} \big(2\sigma(i) - k - 1\big)\big(2\sigma'(i) - m - 1\big) . $$
8: If |I \ J| ≥ 1 and |J \ I| ≥ 1, update
$$ s_5 = \frac{-1}{\binom{n}{2}(k+1)(m+1)} \sum_{i \in I \setminus J} \sum_{j \in J \setminus I} \big(2\sigma(i) - k - 1\big)\big(2\sigma'(j) - m - 1\big) . $$
9: Output K_τ(A_{i_1,...,i_k}, A_{j_1,...,j_m}) = s_1 + s_2 + s_3 + s_4 + s_5.
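As an illustration only (this is an independent sketch written for this appendix, not the kernrank implementation; the encoding of a partial ranking as a named rank vector over its observed items, together with the total number of items n, is an assumption made here), the five terms of Algorithm 3 can be computed directly from the formulas above:

```r
# Kendall kernel between two total rankings of the same items (naive O(p^2) version).
kendall_total <- function(a, b) {
  b <- b[names(a)]                             # align the two rankings by item name
  p <- length(a)
  if (p < 2) return(0)
  s <- 0
  for (i in 1:(p - 1)) for (j in (i + 1):p)
    s <- s + sign(a[i] - a[j]) * sign(b[i] - b[j])
  as.numeric(s) / choose(p, 2)
}

# Kendall kernel between two interleaving partial rankings (Algorithm 3).
# x, y: named rank vectors over the observed items; n: total number of items.
kendall_interleaving <- function(x, y, n) {
  k <- length(x); m <- length(y)
  I <- names(x); J <- names(y)
  common <- intersect(I, J)
  sigma <- rank(x); sigma_p <- rank(y)              # step 1
  tau <- rank(x[common]); tau_p <- rank(y[common])  # step 2
  n_out <- n - length(union(I, J))                  # size of the complement of I union J
  s1 <- s2 <- s3 <- s4 <- s5 <- 0                   # step 3
  if (length(common) >= 2)                          # step 4
    s1 <- choose(length(common), 2) / choose(n, 2) * kendall_total(tau, tau_p)
  if (length(common) >= 1 && length(setdiff(I, J)) >= 1)        # step 5
    s2 <- sum((2 * sigma_p[common] - m - 1) *
              (2 * (sigma[common] - tau[common]) - k + length(common))) /
          (choose(n, 2) * (m + 1))
  if (length(common) >= 1 && length(setdiff(J, I)) >= 1)        # step 6
    s3 <- sum((2 * sigma[common] - k - 1) *
              (2 * (sigma_p[common] - tau_p[common]) - m + length(common))) /
          (choose(n, 2) * (k + 1))
  if (length(common) >= 1 && n_out >= 1)                        # step 7
    s4 <- n_out * sum((2 * sigma[common] - k - 1) * (2 * sigma_p[common] - m - 1)) /
          (choose(n, 2) * (k + 1) * (m + 1))
  if (length(setdiff(I, J)) >= 1 && length(setdiff(J, I)) >= 1) # step 8
    s5 <- -(sum(2 * sigma[setdiff(I, J)] - k - 1) *
            sum(2 * sigma_p[setdiff(J, I)] - m - 1)) /
          (choose(n, 2) * (k + 1) * (m + 1))
  s1 + s2 + s3 + s4 + s5                            # step 9
}
```

Two remarks on this sketch: the double sum defining s_5 factorizes into a product of two single sums, which is what the last assignment uses, and the naive O(p^2) loop in kendall_total could be replaced by an O(p log p) count of discordant pairs [40] if needed.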
Algorithm 4 Kendall kernel for a top-k partial ranking and a top-m partial ranking.
Input: a top-k partial ranking and a top-m partial ranking B_{i_1,...,i_k}, B_{j_1,...,j_m} ⊂ S_n, corresponding to subsets of item indices I := {i_1, ..., i_k} and J := {j_1, ..., j_m}.
1: Let σ ∈ S_k be the total ranking corresponding to the k observed items in B_{i_1,...,i_k}, and σ' ∈ S_m be the total ranking corresponding to the m observed items in B_{j_1,...,j_m}.
2: Let τ ∈ S_{|I∩J|} be the total ranking of the observed items indexed by I ∩ J in B_{i_1,...,i_k}, and τ' ∈ S_{|I∩J|} be the total ranking of the observed items indexed by I ∩ J in B_{j_1,...,j_m}.
3: Initialize s_1 = s_2 = s_3 = s_4 = s_5 = 0.
4: If |I ∩ J| ≥ 2, update
$$ s_1 = \frac{\binom{|I \cap J|}{2}}{\binom{n}{2}} K_\tau(\tau, \tau') . $$
5: If |I ∩ J| ≥ 1 and |I \ J| ≥ 1, update
$$ s_2 = \frac{1}{\binom{n}{2}} \sum_{i \in I \cap J} \big(2(\sigma(i) - \tau(i)) - k + |I \cap J|\big) . $$
6: If |I ∩ J| ≥ 1 and |J \ I| ≥ 1, update
$$ s_3 = \frac{1}{\binom{n}{2}} \sum_{i \in I \cap J} \big(2(\sigma'(i) - \tau'(i)) - m + |I \cap J|\big) . $$
7: If |I ∩ J| ≥ 1 and |(I ∪ J)^c| ≥ 1, update
$$ s_4 = \frac{|I \cap J| \cdot |(I \cup J)^c|}{\binom{n}{2}} . $$
8: If |I \ J| ≥ 1 and |J \ I| ≥ 1, update
$$ s_5 = \frac{-\,|I \setminus J| \cdot |J \setminus I|}{\binom{n}{2}} . $$
9: Output K_τ(B_{i_1,...,i_k}, B_{j_1,...,j_m}) = s_1 + s_2 + s_3 + s_4 + s_5.
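Under the same assumptions and encoding as in the previous sketch (and reusing its helper kendall_total), Algorithm 4 admits the following even simpler R transcription; whether rank values are counted from the top or from the bottom of a top-k list must of course follow the convention of the main text.

```r
# Kendall kernel between a top-k and a top-m partial ranking (Algorithm 4).
# x, y: named rank vectors over the observed (top) items; n: total number of items.
# Requires kendall_total() from the previous sketch.
kendall_top <- function(x, y, n) {
  k <- length(x); m <- length(y)
  I <- names(x); J <- names(y)
  common <- intersect(I, J)
  sigma <- rank(x); sigma_p <- rank(y)
  tau <- rank(x[common]); tau_p <- rank(y[common])
  n_out <- n - length(union(I, J))             # size of the complement of I union J
  s1 <- s2 <- s3 <- s4 <- s5 <- 0
  if (length(common) >= 2)
    s1 <- choose(length(common), 2) / choose(n, 2) * kendall_total(tau, tau_p)
  if (length(common) >= 1 && length(setdiff(I, J)) >= 1)
    s2 <- sum(2 * (sigma[common] - tau[common]) - k + length(common)) / choose(n, 2)
  if (length(common) >= 1 && length(setdiff(J, I)) >= 1)
    s3 <- sum(2 * (sigma_p[common] - tau_p[common]) - m + length(common)) / choose(n, 2)
  if (length(common) >= 1 && n_out >= 1)
    s4 <- length(common) * n_out / choose(n, 2)
  if (length(setdiff(I, J)) >= 1 && length(setdiff(J, I)) >= 1)
    s5 <- -length(setdiff(I, J)) * length(setdiff(J, I)) / choose(n, 2)
  s1 + s2 + s3 + s4 + s5
}
```

For small n, both sketches can be validated against a brute-force average of the Kendall kernel over all pairs of compatible permutations, which is the definition used in the proof of Theorem 2.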
B Stability study of k-means algorithms
Good clustering algorithms are supposed to be robust to perturbations of the data, in the sense that clusters formed by running an algorithm on bootstrap replicas of the original data should be similar. In other words, if we bootstrap the complete dataset twice and form a clustering with respect to each replica, the two clustering assignments should be close to each other. To measure the similarity of two clustering assignments, we use the (adjusted) Rand index [33], which scores the percentage of instance pairs on which the two assignments agree, that is, pairs placed in the same cluster by both assignments or in different clusters by both assignments.
In Section 7.3 of the main paper, we performed clustering on the 1980 APA election data with k-means approaches, including the proposed kernel k-means and several classic k-means algorithms. Under the same experimental setting, we now compare their stability. Specifically, for each fixed number of clusters, we repeatedly draw bootstrap replicas of the dataset, run each k-means algorithm on them to obtain centroids, and partition the original dataset with these centroids. The Rand index between two such clustering assignments is then computed, and the computation is repeated 100 times to account for the randomness of bootstrapping.
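The following R sketch outlines this protocol for a generic clustering routine; the functions cluster_fun and assign_fun are hypothetical placeholders standing for any of the k-means variants compared here (for kernel k-means, the assignment step would itself be carried out in the feature space induced by the Kendall kernel). The Rand index is computed directly from its definition; an adjusted version correcting for chance agreement is available, e.g., as mclust::adjustedRandIndex.

```r
# Rand index between two clustering assignments of the same observations.
rand_index <- function(c1, c2) {
  n <- length(c1)
  agree <- 0
  for (i in 1:(n - 1)) for (j in (i + 1):n)
    agree <- agree + ((c1[i] == c1[j]) == (c2[i] == c2[j]))
  agree / choose(n, 2)
}

# Bootstrap stability of a clustering routine, following the protocol above.
# data: one observation per row; nclust: number of clusters;
# cluster_fun(data, nclust) returns centroids; assign_fun(data, centroids) returns labels.
# (Both functions are placeholders for the k-means variants being compared.)
bootstrap_stability <- function(data, nclust, cluster_fun, assign_fun, B = 100) {
  replicate(B, {
    idx1 <- sample(nrow(data), replace = TRUE)   # first bootstrap replica
    idx2 <- sample(nrow(data), replace = TRUE)   # second bootstrap replica
    lab1 <- assign_fun(data, cluster_fun(data[idx1, , drop = FALSE], nclust))
    lab2 <- assign_fun(data, cluster_fun(data[idx2, , drop = FALSE], nclust))
    rand_index(lab1, lab2)                       # agreement on the original dataset
  })
}
```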
[Figure 7: boxplots of the Rand index ("rand", vertical axis) versus the number of clusters ("nclust", horizontal axis, 2 to 10) for bordaKmeans, bruteKmeans, copelandKmeans and kernelKmeans.]
Figure 7: Rand index between clustering assignments obtained by running the k-means algorithms on bootstrap replicas of the 1980 APA election data, across different numbers of clusters. For each fixed number of clusters, the boxplot shows the variability over 100 repeated runs.
Results are shown in Figure 7. We observe that, for each fixed number of clusters, kernel k-means generally attains higher stability scores than the classic k-means algorithms. Notably, the gap between kernel k-means and the other methods becomes even sharper as the number of clusters grows. This evidence again supports the use of kernel k-means over classic k-means algorithms for clustering rank data.