

Metric and Kernel Learning using a Linear Transformation Prateek Jain Brian Kulis Jason V. Davis Inderjit S. Dhillon arXiv:0910.5932v1 [cs.LG] 30 Oct 2009 October 30, 2009 Abstract Metric and kernel learning are important in several machine learning applications. However, most existing metric learning algorithms are limited to learning metrics over low-dimensional data, while existing kernel learning algorithms are often limited to the transductive setting and do not generalize to new data points. In this paper, we study metric learning as a problem of learning a linear transformation of the input data. We show that for high-dimensional data, a particular framework for learning a linear transformation of the data based on the LogDet divergence can be efficiently kernelized to learn a metric (or equivalently, a kernel function) over an arbitrarily high dimensional space. We further demonstrate that a wide class of convex loss functions for learning linear transformations can similarly be kernelized, thereby considerably expanding the potential applications of metric learning. We demonstrate our learning approach by applying it to large-scale real world problems in computer vision and text mining. 1 Introduction One of the basic requirements of many machine learning algorithms (e.g., semi-supervised clustering algorithms, nearest neighbor classification algorithms) is the ability to compare two objects to compute a similarity or distance between them. In many cases, off-the-shelf distance or similarity functions such as the Euclidean distance or cosine similarity are used; for example, in text retrieval applications, the cosine similarity is a standard function to compare two text documents. However, such standard distance or similarity functions are not appropriate for all problems. Recently, there has been significant effort focused on learning how to compare data objects. One approach has been to learn a distance metric between objects given additional side information such as pairwise similarity and dissimilarity constraints over the data. One class of distance metrics that has shown excellent generalization properties is the Mahalanobis distance function [DKJ+ 07, XNJR02, WBS05, GR05, SSSN04]. The Mahalanobis distance can be viewed as a method in which data is subject to a linear transformation, and then distances in this transformed space are computed via the standard squared Euclidean distance. Despite their simplicity and generalization ability, Mahalanobis distances suffer from two major drawbacks: 1) the number of parameters grows quadratically with the dimensionality of the data, making it difficult to learn distance functions over high-dimensional data, 2) learning a linear transformation is inadequate for data sets with non-linear decision boundaries. To address the latter shortcoming, kernel learning algorithms typically attempt to learn a kernel matrix over the data. Limitations of linear methods can be overcome by employing a non-linear input kernel, which effectively maps the data non-linearly to a high-dimensional feature space. However, many existing kernel learning methods are still limited in that the learned kernels do not generalize to new points [KT03, KSD06, TRW05]. These methods are restricted to learning in the transductive setting where all the data (labelled and unlabeled) is assumed to be given upfront. 
1 There has been some work on learning kernels that generalize to new points, most notably work on hyperkernels [OSW03], but the resulting optimization problems are expensive and cannot be scaled to large or even medium-sized data sets. In this paper, we explore metric learning with linear transformations over arbitrarily highdimensional spaces; as we will see, this is equivalent to learning a parameterized kernel function φ(x)T W φ(y) given an input kernel function φ(x)T φ(y). In the first part of the paper, we focus on a particular loss function called the LogDet divergence, for learning the positive definite matrix W . This loss function is advantageous for several reasons: it is defined only over positive definite matrices, which makes the optimization simpler, as we will be able to effectively ignore the positive definiteness constraint on W . The loss function has precedence in optimization [Fle91] and statistics [JS61]. An important advantage of our method is that the proposed optimization algorithm is scalable to very large data sets of the order of millions of data objects. But perhaps most importantly, the loss function permits efficient kernelization, allowing the learning of a linear transformation in kernel space. As a result, unlike transductive kernel learning methods, our method easily handles out-of-sample extensions, i.e., it can be applied to unseen data. Later in the paper, we extend our result on kernelization of the LogDet formulation to other convex loss functions for learning W , and give conditions for which we are able to compute and evaluate the learned kernel functions. Our result is akin to the representer theorem for reproducing kernel Hilbert spaces, where the optimal parameters can be expressed purely in terms of the training data. In our case, even though the matrix W may be infinite-dimensional, it can be fully represented in terms of the constrained data points, making it possible to compute the learned kernel function value over arbitrary points. Finally, we apply our algorithm to a number of challenging learning problems, including ones from the domains of computer vision and text mining. Unlike existing techniques, we can learn linear transformation-based distance or kernel functions over these domains, and we show that the resulting functions lead to improvements over state-of-the-art techniques for a variety of problems. 2 Related Work Most of the existing work in metric learning has been done in the Mahalanobis distance (or metric) learning paradigm, which has been found to be a sufficiently powerful class of metrics for a variety of different data. One of the earliest papers on metric learning [XNJR02] proposes a semidefinite programming formulation under similarity and dissimilarity constraints for learning a Mahalanobis distance, but the resulting formulation is slow to optimize and has been outperformed by more sophisticated techniques. More recently, [WBS05] formulate the metric learning problem in a large margin setting, with a focus on k-NN classification. They also formulate the problem as a semidefinite programming problem and consequently solve it using a method that combines sub-gradient descent and alternating projections. [GR05] proceed to learn a linear transformation in the fully supervised setting. Their formulation seeks to ‘collapse classes’ by constraining within-class distances to be zero while maximizing between-class distances. 
While each of these algorithms was shown to yield improved classification performance over the baseline metrics, their constraints do not generalize outside of their particular problem domains; in contrast, our approach allows arbitrary linear constraints on the Mahalanobis matrix. Furthermore, these algorithms all require eigenvalue decompositions or semi-definite programming, an operation that is cubic in the dimensionality of the data. Other notable work where the authors present methods for learning Mahalanobis metrics includes [SSSN04] (online metric learning), Relevant Components Analysis (RCA) [SHWP02] (similar to discriminant analysis), locally-adaptive discriminative methods [HT96], and learning from rela2 tive comparisons [SJ03]. In particular, the method of [SSSN04] provided the first demonstration of Mahalanobis distance learning in kernel space. Their construction, however, is expensive to compute, requiring cubic time per iteration to update the parameters. As we will see, our LogDet-based algorithm can be implemented more efficiently. Non-linear transformation based metric learning methods have also been proposed, though these methods usually suffer from suboptimal performance, non-convexity, or computational complexity. Some example methods include neighborhood component analysis (NCA) [GRHS04] that learns a distance metric specifically for nearest-neighbor based classification; the convolutional neural net based method of [CHL05]; and a general Riemannian metric learning method [Leb06]. There have been several recent papers on kernel learning. As mentioned in the introduction, much of the research is limited to learning in the transductive setting, e.g. [KT03, KSD06, TRW05]. Research on kernel learning that does generalize to new data points includes multiple kernel learning [LCB+ 04], where a linear combination of base kernel functions are learned; this approach has proven to be useful for a variety of problems, such as object recognition in computer vision. Another approach to kernel learning is to use hyperkernels [OSW03], which consider functions between kernels, and learn in the appropriate reproducing kernel Hilbert space between such functions. In both cases, semidefinite programming is used, making the approach impractical for large-scale learning problems. Recently, some work has been done on making hyperkernel learning more efficient via second-order cone programming [TK06], however this formulation still cannot be applied to large data sets. Concurrent to our work in showing kernelization for a wide class of convex loss functions, a recent paper considers kernelization of other Mahalanobis distance learning algorithms such as LMNN and NCA [CKTK08]. The latter paper, which appeared after the conference version of the results in our paper, presents a representer-type theorem and can be seen as complementary to the general kernelization results (see Section 4) we present in this paper. The research in this paper extends work done in [DKJ+ 07], [KSD06], and [DD08]. While the focus in [DKJ+ 07] and [DD08] was solely on the LogDet divergence, in this work we characterize kernelization of a wider class of convex loss functions. Furthermore, we provide a more detailed analysis of kernelization for the Log Determinant loss, and include experimental results on large scale kernel learning. 
We extend the work in [KSD06] to the inductive setting; the main goal in [KSD06] was to demonstrate the computational benefits of using the LogDet and von Neumann divergences for learning low-rank kernel matrices. Finally in this paper, we do not consider online models for metric and kernel learning, however interested readers can refer to [JKDG08]. 3 Metric and Kernel Learning via the LogDet Divergence In this section, we introduce the LogDet formulation for linearly transforming the data given a set of pairwise distance constraints. As discussed below, this is equivalent to a Mahalanobis metric learning problem. We then discuss kernelization issues of the formulation and present efficient optimization algorithms. Finally, we address limitations of the method when the amount of training data is large, and propose a modified algorithm to efficiently learn a kernel under such circumstances. 3.1 Mahalanobis Distances and Parameterized Kernels First we introduce the framework for metric and kernel learning that is employed in this paper. Given a data set of objects X = [x1 , ..., xn ], xi ∈ Rd (when working in kernel space, the data matrix will be represented as X = [φ(x1 ), ..., φ(xn )], where φ is the mapping to feature space), we are interested in finding an appropriate distance function to compare two objects. We consider 3 the Mahalanobis distance, parameterized by a positive definite matrix W ; the squared distance between two points xi and xj is given by dW (xi , xj ) = (xi − xj )T W (xi − xj ). This distance function can be viewed as learning a linear transformation of the data and measuring the squared Euclidean distance in the transformed space. This is seen by factorizing the matrix W = GT G and observing that dW (xi , xj ) = kGxi − Gxj k22 . However, if the data is not linearly separable in the input space, then the resulting distance function may not be powerful enough for the desired application. As a result, we are interested in working in kernel space; that is, we can express the Mahalanobis distance in kernel space after applying an appropriate mapping φ from input to feature space: dW (xi , xj ) = (φ(xi ) − φ(xj ))T W (φ(xi ) − φ(xj )). As is standard with kernel-based algorithms, we require that this distance be computable given the ability to compute the kernel function κ0 (x, y) = φ(x)T φ(y). We can therefore equivalently pose the problem as learning a parameterized kernel function κ(x, y) = φ(x)T W φ(y) given some input kernel function κ0 (x, y) = φ(x)T φ(y). To learn the resulting metric/kernel, we assume that we are given constraints on the desired distance function. In this paper, we assume that pairwise similarity and dissimilarity constraints are given over the data—that is, pairs of points that should be similar under the learned metric/kernel, and pairs of points that should be dissimilar under the learned metric/kernel. Such constraints are natural in many settings; for example, given class labels over the data, points in the same class should be similar to one another and dissimilar to points in different classes. However, our approach is general and can accommodate other potential constraints over the distance function, such as relative distance constraints. 
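To make this parameterization concrete, the following NumPy sketch (an illustration only; the variable names and random data are not from the paper) verifies that the squared Mahalanobis distance dW(xi, xj) = (xi − xj)T W (xi − xj) with W = GT G coincides with the squared Euclidean distance after the linear transformation G.

```python
import numpy as np

def mahalanobis_sq(x_i, x_j, W):
    """Squared Mahalanobis distance d_W(x_i, x_j) = (x_i - x_j)^T W (x_i - x_j)."""
    diff = x_i - x_j
    return float(diff @ W @ diff)

# A Mahalanobis distance is a squared Euclidean distance after the
# linear transformation G, where W = G^T G.
rng = np.random.default_rng(0)
d = 5
G = rng.standard_normal((d, d))            # any linear transformation
W = G.T @ G                                # induced positive semidefinite W
x_i, x_j = rng.standard_normal(d), rng.standard_normal(d)

d_W = mahalanobis_sq(x_i, x_j, W)
d_transformed = np.sum((G @ x_i - G @ x_j) ** 2)
assert np.isclose(d_W, d_transformed)
```

In the kernelized setting the same quantity is computed from kernel evaluations alone, as developed in Sections 3.3 and 3.4.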
The main challenge is in finding an appropriate loss function for learning the matrix W so that 1) the resulting algorithm is scalable and efficiently computable in kernel space, 2) the resulting metric/kernel yields improved performance on the underlying machine learning problem, such as classification, semi-supervised clustering etc. We now move on to the details. 3.2 LogDet Metric Learning The LogDet divergence between two positive definite matrices1 W , W0 ∈ Rd×d is defined to be Dℓd (W, W0 ) = tr(W W0−1 ) − log det(W W0−1 ) − d. We are interested in finding W that is closest to W0 as measured by the LogDet divergence but that satisfies our desired constraints. When W0 = I, this formulation can be interpreted as a maximum entropy problem. Given a set of similarity constraints S and dissimilarity constraints D, we propose the following problem: min W 0 Dℓd (W, I) s.t. dW (xi , xj ) ≤ u, (i, j) ∈ S, dW (xi , xj ) ≥ ℓ, (i, j) ∈ D. (3.1) 1 The definition of LogDet divergence can be extended to the case when W0 and W are rank deficient by appropriate use of the pseudo-inverse. The interested reader may refer to [KSD06]. 4 The above problem was considered in [DKJ+ 07]. LogDet has many important properties that make it useful for machine learning and optimization, including scale-invariance and preservation of the range space. Please see [KSD08] for a detailed discussion on the properties of LogDet. Beyond this, we prefer LogDet over other loss functions (including the squared Frobenius loss as used in [SSSN04] or a linear objective as in [WBS05]) due to the fact that the resulting algorithm turns out to be simple and efficiently kernelizable. We note that formulation (3.1) minimizes the LogDet divergence to the identity matrix I. This can be generalized to arbitrary positive definite matrices W0 , however −1/2 −1/2 without loss of generality we can consider W0 = I since Dℓd (W, W0 ) = Dℓd (W0 W W0 , I). Further, formulation (3.1) considers simple similarity and dissimilarity constraints over the learned Mahalanobis distance, but other linear constraints are possible. Finally, the above formulation assumes that there exists a feasible solution to the proposed optimization problem; extensions to the infeasible case involving slack variables are discussed later (see Section 3.5). 3.3 Kernelizing the Problem We now consider the problem of kernelizing the metric learning problem. Subsequently, we will present an efficient algorithm and discuss generalization to new points. Given a set of n constrained data points, let K0 denote the input kernel matrix for the data, i.e. K0 (i, j) = κ(xi , xj ) = φ(xi )T φ(xj ). Note that the squared Mahalanobis distance in kernel space may be written as dW (φ(xi ), φ(xj )) = K(xi , xi ) + K(xj , xj ) − 2K(xi , xj ), where K is the learned kernel matrix; equivalently, we may write the squared distance as tr(K(ei − ej )(ei − ej )T ), where ei is the i-th canonical basis vector. Consider the following problem to find K: min K0 s.t. Dℓd (K, K0 ) tr(K(ei − ej )(ei − ej )T ) ≤ u T tr(K(ei − ej )(ei − ej ) ) ≥ ℓ (i, j) ∈ S, (3.2) (i, j) ∈ D. This kernel learning problem was first proposed in the transductive setting in [KSD06], though no extensions to the inductive case were considered. Note that problem (3.1) optimizes over a d × d matrix W , while the kernel learning problem (3.2) optimizes over an n × n matrix K. We now present our key theorem connecting problems (3.1) and (3.2). Theorem 3.1. 
Let W ∗ be the optimal solution to problem (3.1) and let K ∗ be the optimal solution to problem (3.2). Then the optimal solutions are related by the following: K ∗ = X T W ∗ X, W ∗ = I + XM X T , where M = K0−1 (K ∗ − K0 )K0−1 , K0 = X T X, X = [φ(x1 ), φ(x2 ), . . . , φ(xn )] . To prove this theorem, we first prove a lemma for general Bregman matrix divergences, of which the LogDet divergence is a special case. Consider the following general optimization problem: min W s.t. Dφ (W, W0 ) tr(W Ri ) ≤ si , W  0, ∀1 ≤ i ≤ m, (3.3) 5 where Dφ (W, W0 ) is a Bregman matrix divergence [KSD06] generated by a real-valued strictly convex function over symmetric matrices φ : Rn×n → R, i.e., Dφ (W, W0 ) = φ(W ) − φ(W0 ) − tr((W − W0 )T ∇φ(W0 )). (3.4) Note that the LogDet divergence is generated by φ(W ) = − log det W . Lemma 3.2. The solution to the dual of the primal formulation (3.3) is given by: max W,λ,Z s.t. where s(λ) = Pm i=1 λi si φ(W ) − φ(W0 ) − tr(W ∇φ(W )) + tr(W0 ∇φ(W0 )) − s(λ) ∇φ(W ) = ∇φ(W0 ) − R(λ) + Z, (3.5) λ ≥ 0, (3.6) Z  0, and R(λ) = Pm i=1 λi Ri . Proof. First, consider the Lagrangian of (3.3): L(W, λ, Z) = Dφ (W, W0 ) + tr(W R(λ)) − s(λ) − tr(W Z), m m X X λi si , Z  0, λ ≥ 0. λi Ri , s(λ) = where R(λ) = (3.7) ∇W Dφ (W, W0 ) = ∇φ(W ) − ∇φ(W0 ). (3.8) i=1 i=1 Now, note that Setting the gradient of the Lagrangian with respect to W to be zero and using (3.8), we get: ∇φ(W ) − ∇φ(W0 ) + R(λ) − Z = 0, and so, tr(W ∇φ(W0 )) = tr(W ∇φ(W )) + tr(W R(λ)) − tr(W Z). (3.9) (3.10) Now, substituting (3.10) into the Lagrangian, we get: L(W, λ, Z) = φ(W ) − φ(W0 ) − tr(W ∇φ(W )) + tr(W0 ∇φ(W0 )) − s(λ), where ∇φ(W ) = ∇φ(W0 ) − R(λ) + Z. The lemma now follows directly. To prove Theorem 3.1, we will also need the following well-known lemma: Lemma 3.3. det(I + AB) = det(I + BA) for all A ∈ Rm×n , B ∈ Rn×m . We are now ready to prove Theorem 3.1. Proof. of Theorem 3.1. First we observe that the squared Mahalanobis distances from the constraints in (3.1) may be written as dW (xi , xj ) = tr(W (xi − xj )(xi − xj )T ) = tr(W X(ei − ej )(ei − ej )T X T ). The objective in problem (3.1), Dℓd (W, I), is defined only for positive definite W and is a convex function of W , hence using Slater’s optimality condition, Z = 0 (in Lemma 3.2) and may be removed from the constraints. Further, note that the LogDet divergence Dℓd (·, ·) is a Bregman 6 matrix divergence with generating function φ(W ) = − log det(W ). Thus using ∇φ(W ) = −W −1 and Lemma 3.2, the dual of problem (3.1) is given by: min W,λ s.t. log det W + b(λ) W −1 = I + XC(λ)X T , (3.11) λ ≥ 0, P P T T (i,j)∈S λij (ei −ej )(ei −ej ) − (i,j)∈D λij (ei −ej )(ei −ej ) and b(λ) = (i,j)∈S λij u− P where C(λ) = P (i,j)∈D λij ℓ. Now, for matrices W feasible for problem (3.11), log det W = − log det W −1 = − log det(I + XC(λ)X T ) = − log det(I + C(λ)K0 ), where the last equality follows from Lemma 3.3 (recall that K0 = X T X). Since, log det(AB) = log det A + log det B for square matrices A and B, (3.11) may be rewritten as min λ − log det(K0−1 + C(λ)) + b(λ), s.t. λ ≥ 0. (3.12) Writing K −1 = K0−1 + C(λ), the above can be written as: min K,λ s.t. log det K + b(λ), K −1 = K0−1 + C(λ), λ ≥ 0. (3.13) The above problem can be seen by inspection to be identical to the dual problem of (3.2) as given by Lemma 3.2. Hence, since their dual problems are identical, problems (3.1) and (3.2) are equivalent. 
Using (3.11) and the Sherman-Morrison-Woodbury formula, the form of the optimal W ∗ is: W ∗ = I − X(C(λ∗ )−1 + K0 )−1 X T = I + XM X T , where λ∗ is the dual optimal and M = −(C(λ∗ )−1 + K0 )−1 . Similarly, using (3.13), the optimal K ∗ is given by: K ∗ = K0 − K0 (C(λ∗ )−1 + K0 )−1 K0 = X T W ∗ X We can explicitly solve for M as M = K0−1 (K ∗ − K0 )K0−1 by simplification of these expressions using the fact that K0 = X T X. This proves the theorem. We now generalize the above theorem to regularize against arbitrary positive definite matrices W0 . Corollary 3.4. Consider the following problem: min W 0 s.t. Dℓd (W, W0 ) dW (xi , xj ) ≤ u (i, j) ∈ S, dW (xi , xj ) ≥ ℓ (i, j) ∈ D. (3.14) Let W ∗ be the optimal solution to problem (3.14) and let K ∗ be the optimal solution to problem (3.2). Then the optimal solutions are related by the following: K ∗ = X T W ∗X W ∗ = W0 + W0 XM X T W0 , where M = K0−1 (K ∗ − K0 )K0−1 , K0 = X T W0 X, 7 X = [φ(x1 ), φ(x2 ), . . . , φ(xn )] −1/2 Proof. Note that Dℓd (W, W0 ) = Dℓd (W0 is now equivalent to: −1/2 W W0 −1/2 f=W , I). Let W 0 f , I) Dℓd (W s.t. dW f (x̃i , x̃j ) ≤ u (i, j) ∈ S, dW f (x̃i , x̃j ) ≥ ℓ (i, j) ∈ D, min f 0 W −1/2 W W0 . Problem (3.14) (3.15) f = W −1/2 W W −1/2 , X e = W 1/2 X and X e = [x̃1 , x̃2 , . . . , x̃n ]. Now using Theorem 3.1, the where W 0 0 0 f ∗ of problem (3.15) is related to the optimal K ∗ of problem (3.2) by K ∗ = optimal solution W eT W f∗X e = X T W 1/2 W −1/2 W ∗ W −1/2 W 1/2 X = X T W ∗ X. Similarly, W ∗ = W 1/2 W f ∗ W 1/2 = W0 + X 0 0 0 0 0 0 W0 XM X T W0 where M = K0−1 (K ∗ − K0 )K0−1 . Since the kernelized version of LogDet metric learning can be posed as a linearly constrained optimization problem with a LogDet objective, similar algorithms can be used to solve either problem. This equivalence implies that we can implicitly solve the metric learning problem by instead solving for the optimal kernel matrix K ∗ . Note that using LogDet divergence as objective function has two significant benefits over many other popular loss functions: 1) the metric and kernel learning problems (3.1), (3.2) are both equivalent and hence solving the kernel learning formulation directly provides an out of sample extension (see Section 3.4 for details), 2) projection with respect to the LogDet divergence onto a single distance constraint has a closed form solution, thus making it amenable to an efficient cyclic projection algorithm (refer to Section 3.5). 3.4 Generalizing to New Points In this section, we see how to generalize to new points using the learned kernel matrix K ∗ . Suppose that we have solved the kernel learning problem for K ∗ (from now on, we will drop the ∗ superscript and assume that K and W are at optimality). The distance between two points φ(x ) i and φ(xj ) that are in the training set can be computed directly from the learned kernel matrix as K(i, i) + K(j, j) − 2K(i, j). We now consider the problem of computing the learned distance between two points φ(z1 ) and φ(z2 ) that may not be in the training set. In Theorem 3.1, we showed that the optimal solution to the metric learning problem can be expressed as W = I + XM X T . To compute the Mahalanobis distance in kernel space, we see that the inner product φ(z1 )T W φ(z2 ) can be computed entirely via inner products between points: φ(z1 )T W φ(z2 ) = φ(z1 )T (I + XM X T )φ(z2 ) = φ(z1 )T φ(z2 ) + φ(z1 )T XM X T φ(z2 ) = κ(z1 , z2 ) + k1T M k2 , where ki = [κ(zi , x1 ), ..., κ(zi , xn )]T . 
The expression above, equation (3.16), can be used to evaluate kernelized distances with respect to the learned kernel function between arbitrary data objects. In summary, the connection between kernel learning and metric learning allows us to generalize our metrics to new points in kernel space. This is performed by first solving the kernel learning problem for K, then using the learned kernel matrix and the input kernel function to compute learned distances via (3.16).

Algorithm 1 Metric/Kernel Learning with the LogDet Divergence
Input: K0: input n × n kernel matrix, S: set of similar pairs, D: set of dissimilar pairs, u, ℓ: distance thresholds, γ: slack parameter
Output: K: output kernel matrix
1. K ← K0, λij ← 0 ∀ ij
2. ξij ← u for (i, j) ∈ S; otherwise ξij ← ℓ
3. repeat
  3.1. Pick a constraint (i, j) ∈ S or D
  3.2. p ← (ei − ej)T K(ei − ej)
  3.3. δ ← 1 if (i, j) ∈ S, −1 otherwise
  3.4. α ← min(λij, (δγ/(γ + 1))(1/p − 1/ξij))
  3.5. β ← δα/(1 − δαp)
  3.6. ξij ← γξij/(γ + δαξij)
  3.7. λij ← λij − α
  3.8. K ← K + βK(ei − ej)(ei − ej)T K
4. until convergence
return K

3.5 Kernel Learning Algorithm

Given the connection between the Mahalanobis metric learning problem for the d × d matrix W and the kernel learning problem for the n × n kernel matrix K, we would like to develop an algorithm for efficiently performing metric learning in kernel space. Specifically, we provide an algorithm (see Algorithm 1) for solving the kernelized LogDet metric learning problem, as given in (3.2). First, to avoid problems with infeasibility, we incorporate slack variables into our formulation. These provide a tradeoff between minimizing the divergence between K and K0 and satisfying the constraints. Note that our earlier results (see Theorem 3.1) easily generalize to the slack case:

min_{K, ξ}  Dℓd(K, K0) + γ · Dℓd(diag(ξ), diag(ξ0))
s.t.  tr(K(ei − ej)(ei − ej)T) ≤ ξij,  (i, j) ∈ S,
      tr(K(ei − ej)(ei − ej)T) ≥ ξij,  (i, j) ∈ D.    (3.17)

The parameter γ above controls the tradeoff between satisfying the constraints and minimizing Dℓd(K, K0), and the entries of ξ0 are set to u for the similarity constraints and ℓ for the dissimilarity constraints. To solve problem (3.17), we employ the technique of Bregman projections, as discussed in the transductive setting [KSD06, KSD08]. At each iteration, we choose a constraint (i, j) from S or D. We then apply a Bregman projection such that K satisfies the constraint after projection; note that the projection is not an orthogonal projection but is rather tailored to the particular function that we are optimizing. Algorithm 1 details the steps for Bregman's method on this optimization problem. Each update is given by a rank-one update K ← K + βK(ei − ej)(ei − ej)T K, where β is an appropriate projection parameter that can be computed in closed form (see Algorithm 1). Algorithm 1 has a number of key properties which make it useful for various kernel learning tasks. First, the Bregman projections can be computed in closed form, assuring that the projection updates are efficient (O(n²)). Note that if the feature space dimensionality d is less than n, then a similar algorithm can be used directly in the feature space (see [DKJ+07]). Instead of LogDet, if we use the von Neumann divergence, another potential loss function for this problem, O(n²) updates are possible, but they are much more complicated and require use of the fast multipole method, which cannot be employed easily in practice.
Secondly, the projections maintain positive definiteness, which avoids any eigenvector computation or semidefinite programming. This is in stark contrast with the Frobenius loss, which requires additional computation to maintain positive definiteness, leading to O(n3 ) updates. 3.6 Metric/Kernel Learning with Large Datasets In Sections 3.1 and 3.3 we proposed a LogDet divergence based Mahalanobis metric learning problem (3.1) and an equivalent kernel learning problem (3.2). The number of parameters involved in these problems is O(min(n2 , d2 )), where n is the number of training points and d is the dimensionality of the data. This quadratic dependency effects not only the running time for both training and testing, but also poses tremendous challenges in estimating a quadratic number of parameters. For example, a data set with 10,000 dimensions leads to a Mahalanobis matrix with 100 million values. This represents a fundamental limitation of existing approaches, as many modern data mining problems possess relatively high dimensionality. In this section, we present a method for learning structured Mahalanobis distance (kernel) functions that scale linearly with the dimensionality (or training set size). Instead of representing the Mahalanobis distance/kernel matrix as a full d × d (or n × n) matrix with O(min(n2 , d2 )) parameters, our methods use compressed representations, admitting matrices parameterized by O(min(n, d)) values. This enables the Mahalanobis distance/kernel function to be learned, stored, and evaluated efficiently in the context of high dimensionality and large training set size. In particular, we propose a method to efficiently learn an identity plus low-rank Mahalanobis distance matrix and its equivalent kernel function. Now, we formulate the high-dimensional identity plus low-rank (IPLR) metric learning problem. Consider a low-dimensional subspace in Rd and let the columns of U form an orthogonal basis of this subspace. We will constrain the learned Mahalanobis distance matrix to be of the form: W = I d + Wl = I d + U LU T , (3.18) k×k with where I d is the d × d identity matrix, Wl denotes the low-rank part of W and L ∈ S+ k ≪ min(n, d). Analogous to (3.1), we propose the following problem to learn an identity plus low-rank Mahalanobis distance function: min W,L0 Dℓd (W, I d ) s.t. dW (xi , xj ) ≤ u (i, j) ∈ S, dW (xi , xj ) ≥ ℓ (i, j) ∈ D, (3.19) W = I d + U LU T . Note that the above problem is identical to (3.1) except for the added constraint W = I d + U LU T . Let F = I k + L. Now we have Dℓd (W, I d ) = tr(I d + U LU T ) − log det(I d + U LU T ) − d, = tr(I k + L) + d − k − log det(I k + L) − d, = Dℓd (F, I k ), (3.20) 10 where the second equality follows from the fact that tr(AB) = tr(BA) and Lemma 3.3. Also note that for all C ∈ Rn×n , tr(W XCX T ) = tr((I d + U LU T )XCX T ), = tr(XCX T ) + tr(LU T XCX T U ), T T = tr(XCX T ) − tr(X ′ CX ′ ) + tr(F X ′ CX ′ ), where X ′ = U T X is the reduced-dimensional representation of X. Hence, dW (xi , xj ) = tr(W X(ei − ej )(ei − ej )T X T ) = dI (xi , xj ) − dI (x′i , x′j ) + dF (x′i , x′j ). (3.21) Using (3.20) and (3.21), problem (3.19) is equivalent to the following: min F 0 s.t. Dℓd (F, I k ) dF (x′i , x′j ) ≤ u − dI (xi , xj ) + dI (x′i , x′j ) (i, j) ∈ S, dF (x′i , x′j ) ≥ ℓ − dI (xi , xj ) + dI (x′i , x′j ) (i, j) ∈ D. (3.22) Note that the above formulation is an instance of problem (3.1) and can be solved using an algorithm similar to Algorithm 1. 
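For reference, the sketch below transcribes the updates of Algorithm 1 in NumPy together with the out-of-sample rule (3.16). It is a minimal, unoptimized illustration: constraint scheduling, convergence testing, and numerical safeguards are simplified, and the helper names are our own. The same routine can be run on the reduced k × k problem (3.22) by passing in the reduced kernel and adjusted thresholds.

```python
import numpy as np

def learn_kernel_logdet(K0, similar, dissimilar, u, ell, gamma, n_sweeps=50):
    """Sketch of Algorithm 1: cyclic Bregman projections for the slack
    formulation (3.17). `similar` and `dissimilar` are lists of index pairs.
    Assumes p > 0 and xi > 0 at every step (a small epsilon helps in practice)."""
    K = K0.copy()
    constraints = [(i, j, +1) for (i, j) in similar] + \
                  [(i, j, -1) for (i, j) in dissimilar]
    lam = {(i, j): 0.0 for (i, j, _) in constraints}
    xi = {(i, j): (u if delta > 0 else ell) for (i, j, delta) in constraints}

    for _ in range(n_sweeps):
        for (i, j, delta) in constraints:
            e = np.zeros(K.shape[0]); e[i], e[j] = 1.0, -1.0
            p = e @ K @ e                      # current learned distance for (i, j)
            alpha = min(lam[(i, j)],
                        delta * gamma / (gamma + 1.0) * (1.0 / p - 1.0 / xi[(i, j)]))
            beta = delta * alpha / (1.0 - delta * alpha * p)
            xi[(i, j)] = gamma * xi[(i, j)] / (gamma + delta * alpha * xi[(i, j)])
            lam[(i, j)] -= alpha
            Ke = K @ e
            K = K + beta * np.outer(Ke, Ke)    # rank-one Bregman projection
    return K

def learned_kernel(z1, z2, X_train, K0, K, kappa0):
    """Out-of-sample evaluation (3.16): kappa(z1, z2) = kappa0(z1, z2) + k1^T M k2,
    where M = K0^{-1} (K - K0) K0^{-1} and k_i = [kappa0(z_i, x_1), ..., kappa0(z_i, x_n)]."""
    K0_inv = np.linalg.inv(K0)
    M = K0_inv @ (K - K0) @ K0_inv
    k1 = np.array([kappa0(z1, x) for x in X_train])
    k2 = np.array([kappa0(z2, x) for x in X_train])
    return kappa0(z1, z2) + k1 @ M @ k2
```

The learned squared distance between new points then follows as d(z1, z2) = κ(z1, z1) + κ(z2, z2) − 2κ(z1, z2), exactly as in Section 3.4.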
Furthermore, the above problem solves for a k ×k matrix rather than a d×d matrix seemingly required by (3.19). The optimal W ∗ is obtained as W ∗ = I d + U (F ∗ − I k )U T . Next, we show that problem (3.22) and equivalently (3.19) can be solved efficiently in feature space by selecting an appropriate basis R (U = R(RT R)−1/2 ). Let R = XJ, where J ∈ Rn×k . Note that U = XJ(J T K0 J)−1/2 and X ′ = U T X = (J T K0 J)−1/2 J T K0 , i.e., X ′ ∈ Rk×n can be computed efficiently in the feature space (requiring inversion of only a k × k matrix). Hence, problem (3.22) can be solved efficiently in feature space using Algorithm 1 and the optimal kernel K ∗ is given by K ∗ = X T W ∗ X = K0 + K0 J(J T K0 J)−1/2 (F ∗ − I k )(J T K0 J)−1/2 J T K0 . Note that problem (3.22) can be solved via Algorithm 1 using O(k2 ) computational steps per iteration. Additionally, O(min(n, d)k) steps are required to prepare the data. Also, the optimal solution W ∗ (or K ∗ ) can be stored implicitly in O(min(n, d)k) steps and similarly, the Mahalanobis distance between any two points can be computed in time O(min(n, d)k) steps. The metric learning problem presented here depends critically on the basis selected. For the case when d is not significantly larger than n and feature space vectors X are available explicitly, the basis R can be selected by using one of the following heuristics (see Section 5, [DD08] for more details): • Using the top k singular vectors of X. • Clustering the columns of X and using the mean vectors as the basis R. • For the fully-supervised case, if the number of classes (c) is greater than the required dimensionality (k) then cluster the class-mean vectors into k clusters and use the obtained cluster centers for forming the basis R. If c < k then cluster each class into k/c clusters and use the cluster centers to form R. For learning the kernel function, the basis R = XJ can be selected by: 1) using a randomly sampled coefficient matrix J, 2) clustering X using kernel k-means or a spectral clustering method, 3) choosing a random subset of X, i.e, the columns of J are random indicator vectors. A more careful selection of the basis R should further improve accuracy of our method and is left as a topic for future research. 11 4 Kernelization with Other Convex Loss Functions One of the key benefits to using the LogDet divergence for metric learning is its ability to efficiently learn a linear mapping for high-dimensional kernelized data. A natural question is whether one can kernelize metric learning with other loss functions, such as those considered previously in the literature. To this end, the work of [CKTK08] showed how to kernelize some popular metric learning algorithms such as MCML [GR05] and LMNN [WBS05]. In this section, we show a complementary result that shows how to kernelize a class of metric learning algorithms that learns a linear map in input or feature space. Consider the following (more) general optimization problem that may be viewed as a generalization of (3.1) for learning a linear transformation matrix G, where W = GT G: min W s.t. tr(f (W )) tr(W XCi X T ) ≤ bi , ∀1 ≤ i ≤ m W  0, (4.1) d×d , X ∈ Rd×n , and each Ci ∈ Rn×n where f : Rd×d → Rd×d , tr(f (W )) is a convex function, W ∈ S+ is a symmetric matrix. Note that we have generalized both the loss function and the constraints. For example, the LogDet divergence can be viewed as a special case, since we may write Dℓd (X, Y ) = tr(XY −1 − log(XY −1 ) − I). 
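The spectral extension of a scalar convex function that underlies this framework is straightforward to compute. The following short sketch (our illustration, with randomly generated test data) forms f(W) = U f(Λ)U^T from an eigendecomposition and checks that f(x) = x − log x − 1 recovers Dℓd(W, I); the matrix-logarithm comparison assumes SciPy is available.

```python
import numpy as np
from scipy.linalg import logm

def spectral_apply(f, W):
    """Spectral extension of a scalar f to a symmetric matrix: f(W) = U f(Lambda) U^T."""
    lam, U = np.linalg.eigh(W)
    return U @ np.diag(f(lam)) @ U.T

def tr_f(f, W):
    """tr(f(W)) = sum_j f(lambda_j), the generic loss in problem (4.1)."""
    return float(np.sum(f(np.linalg.eigvalsh(W))))

# f(x) = x - log(x) - 1 recovers the LogDet divergence to the identity:
# tr(f(W)) = tr(W) - log det(W) - d = D_ld(W, I).
f_ld = lambda lam: lam - np.log(lam) - 1.0

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
W = A @ A.T + np.eye(4)                        # positive definite test matrix
d = W.shape[0]

assert np.isclose(tr_f(f_ld, W), np.trace(W) - np.log(np.linalg.det(W)) - d)
# The matrix function itself agrees with the direct formula W - log(W) - I.
assert np.allclose(spectral_apply(f_ld, W), W - logm(W) - np.eye(d))
```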
The loss function f (W ) regularizes the learned transformation W against the baseline Euclidean distance metric, i.e., W0 = I. Hence, a desirable property of f would be: tr(f (W )) ≥ 0 with tr(f (W )) = 0 iff W = I. In this section we show that for a large and important class of functions f , problem (4.1) can be solved for W implicitly in the feature space, i.e., the problem (4.1) is kernelizable. We assume that the kernel function K0 (x, y) = φ(x)T φ(y) between any two data points can be computed in O(1) time. Denote W ∗ as an optimal solution for (4.1). Now, we formally define kernelizable metric learning problems. Definition 4.1. An instance of metric learning problem (4.1) is kernelizable if the following conditions hold: • Problem (4.1) is solvable efficiently in time poly(n, m) without explicit use of feature space vectors X. • tr(W ∗ Y CY T ), where Y ∈ Rd×N is the feature space representation of any given data points, can be computed in time poly(N ) for all C ∈ RN ×N . Theorem 4.2. Let f : R → R be a function defined over the reals such that: • f (x) is a convex function. • A sub-gradient of f (x) can be computed efficiently in O(1) time. • f (x) ≥ 0 ∀x with f (η) = 0 for some η ≥ 0. Consider the extension of f to the spectrum of W ∈ Sd+ , i.e. f (W ) = U f (Λ)U T , where W = U ΛU T is the eigenvalue decomposition of W (Definition 1.2, [Hig08]). Assuming X to be full-rank, i.e., K0 = X T X is invertible, problem (4.1) is kernelizable (Definition 4.1). To prove the above theorem, we need the following two lemmas: 12 Lemma 4.3. Assuming f satisfies the conditions stated in Theorem 4.2 and X is full-rank, ∃S ∗ ∈ Rn×n such that W ∗ = ηI + XS ∗ X T is an optimal solution to (4.1). P Proof. Let W = U ΛU T = j λj uj uTj be the eigenvalue decomposition of W , where λ1 ≥ λ2 ≥ · · · ≥ λd ≥ 0. Consider a linear constraint tr(W XCi X T ) ≤ bi as specified in problem (4.1). Note P that tr(W XCi X T ) = j λj uTj XCi X T uj . Note that if the j-th eigenvector uj of W is orthogonal to the range space of X, i.e. X T uj = 0, then the corresponding eigenvalue λj is not constrained (except for the non-negativity constraint imposed by the positive semi-definiteness constraint). Since the range space of X is at most n-dimensional, without loss of generality we can assume that λj ≥ 0, ∀j > n are not constrained by the linear inequality constraints in (4.1). P Furthermore, by the definition of a spectral function (Definition 1.2, [Hig08]), tr(f (W )) = j f (λj ). Since f satisfies the conditions of Theorem 4.2, f (η) = minx f (x) = 0. In order to minimize tr(f (W )), we can select λ∗j = η ≥ 0, ∀j > n (note that the non-negativity constraint is satisfied for this choice of λj ). Furthermore, eigenvectors uj , ∀j ≤ n, lie in the range space of X, i.e., ∀j ≤ n, uj = Xαj for some αj ∈ Rn . Hence, W∗ = n X λ∗i u∗j u∗T j +η = u∗j u∗T j , j=n+1 j=1 = d X d n X X u∗j u∗T (λ∗i − η)u∗j u∗T + η j , j j=1 n X j=1 T d X((λ∗j − η)α∗j α∗T j )X + ηI , j=1 = XS ∗ X T + ηI d , where S ∗ = Pn ∗ j=1 (λj − η)α∗j α∗T j . Lemma 4.4. If n < d and X ∈ Rd×n has full column rank, i.e., X T X is invertible then: XSX T  0 ⇐⇒ S  0. Proof. =⇒ XSX T  0 =⇒ v T XSX T v ≥ 0, ∀v ∈ Rd . Since X has full column rank, ∀q ∈ Rn ∃v ∈ Rd s.t. X T v = q. Hence, q T Sq = v T XSX T v ≥ 0, ∀q ∈ Rn =⇒ S  0 ⇐= Now ∀v ∈ Rd , v T XSX T v ≥ 0 as S  0. Thus XSX T  0. We now present a proof of Theorem 4.2. The key idea is to prove that (4.1) can solved implicitly by solving for S ∗ of Lemma 4.3. Proof. 
[Theorem 4.2] Using Lemma 4.3, W ∗ is of the form W ∗ = ηI d +XS ∗ X T . Assuming X is full-rank, i.e., all the data points xi are linearly independent, then there is a one-to-one mapping between W ∗ and S ∗ . Hence, solving for W ∗ is equivalent to solving for S ∗ . So, now our goal is to reformulate problem (4.1) in terms of S ∗ . 13 Let X = UX ΣX VXT be the SVD of X. Then, W = ηI d + XSX T , T = ηI d + UX ΣX VXT SVX ΣX UX , #" # " T UX ΣX VXT SVX ΣX + ηI n 0 , = [UX U⊥ ] 0 ηI n−d U⊥T (4.2) where U⊥T U = 0. Now, consider f (W ) = f (ηI d + XSX T ). Using (4.2): f (W ) = f (ηI d + XSX T ), # " #! " T UX ΣX VXT SVX ΣX + ηI n 0 = f [UX U⊥ ] , 0 ηI n−d U⊥T #! " # " T UX ΣX VXT SVX ΣX + ηI n 0 , = [UX U⊥ ] f 0 ηI n−d U⊥T #" # "  T f ΣX VXT SVX ΣX + ηI n 0 UX , = [UX U⊥ ] U⊥T 0 0  T = UX f ΣX VXT SVX ΣX + ηI n UX , where the second equality follows from the property that f (QZQT ) = Qf (Z)QT for anorthogonal  A 0 Q and a spectral function f . The third equality follows from the property that f = 0 B   f (A) 0 and the fact that f (η) = 0. Hence, 0 f (B)  tr(f (W )) = f ΣX VXT SVX ΣX + ηI n . (4.3) Next, consider the constraint tr(W XCi X T ) ≤ bi . Note that tr(W XCi X T ) = tr((ηI d + XSX T )XCi X T ) = tr(ηCi K0 + Ci K0 SK0 ). (4.4) Hence, the constraint tr(W XCi X T ) ≤ bi reduces to: tr(ηCi K0 + Ci K0 SK0 ) ≤ bi . (4.5) Finally, consider the constraint W  0. Using (4.2), we see that this is equivalent to: ηI n + ΣX VXT SVX ΣX  0, S  −ηK0−1 , (4.6) where K0 = X T X = VX Σ2X VXT . Using (4.3), (4.5), and (4.6) we get the following problem which is equivalent to (4.1):  min f ΣX VXT SVX ΣX + ηI n S s.t. tr(ηCi K0 + Ci K0 SK0 ) ≤ bi , S −ηK0−1 . ∀1 ≤ i ≤ m (4.7) 14 Note that the objective function is a strictly convex function of a linear transformation of S, and hence is strictly convex in S. Furthermore, all the constraints are linear in S. As a result, problem (4.7) is a convex program. Also, both ΣX and VX can be computed efficiently in O(n3 ) steps using eigenvalue decomposition of K0 = X T X. Hence, problem (4.1) can be solved efficiently in poly(n, m) steps using standard convex optimization methods such as the ellipsoid method [GLS88]. 5 Special Cases In the previous section, we proved a general result on kernelization of metric learning. In this section, we further consider a few special cases of interest: the von Neumann divergence, the squared Frobenius norm and semi-definite programming. For each of the cases, we derive the required optimization problem to be solved and mention the relevant optimization algorithms that can be used. 5.1 von Neumann Divergence The von Neumann divergence is a generalization of the well known KL-divergence to matrices. It is used extensively in quantum computing to compare density matrices of two different systems [NC00]. It is also used in the exponentiated matrix gradient method by [TRW05], online-PCA method by [WK08] and fast SVD solver by [AK07]. The von Neumann divergence between W and W0 is defined to be: DvN (W, W0 ) = tr(W log W − W log W0 − W + W0 ), where both W and W0 are positive definite. The metric learning problem that corresponds to (4.1) is: min W s.t. DvN (W, I) tr(W XCi X T ) ≤ bi , ∀1 ≤ i ≤ m, W  0. (5.1) It is easy to see that DvN (W, I) = tr(fvN (W )), where fvN (W ) = W log W − W + I = U fvN (Λ)U T , where W = U ΛU T is the eigenvalue decomposition of W and fvN : R → R, fvN (x) = x log x − x + 1. Also, note that fvN (x) is a strictly convex function with argminx fvN (x) = 1 and fvN (1) = 0. 
Hence, using Theorem 4.2, problem (5.1) is kernelizable since DvN (W, I) satisfies the required conditions. Using (4.7), the optimization problem to be solved is given by: min DvN ΣX VXT SVX ΣX + I n , I n S s.t. tr(Ci K0 + Ci K0 SK0 ) ≤ bi ,  ∀1 ≤ i ≤ m S  −K0−1 , (5.2) Next, we derive a simplified version of the above optimization problem. 15 Note that DvN (·, ·) is defined only for positive semi-definite matrices. Hence, the constraint S  −K0−1 should be satisfied if the above problem is feasible. Thus, the reduced optimization problem is given by:  min DvN ΣX VXT SVX ΣX + I n , I n S s.t. tr(Ci K0 + Ci K0 SK0 ) ≤ bi , ∀1 ≤ i ≤ m. (5.3) Note that the von-Neumann divergence is a Bregman matrix divergence (see Equation (3.4)) with the generating function φ(X) = tr(X log X − X). Now using Lemma 3.2 and simplifying using the fact that ∂ tr(X∂Xlog X) = log X, we get the following dual for problem (5.1): max λ − tr(exp(−ΣX VXT C(λ)VX ΣX )) − b(λ) s.t. λ ≥ 0, (5.4) P P where C(λ) = i λi Ci and b(λ) = i λi bi . Now, using VX Σ2X VXT = K0 we see that: tr(−ΣX VXT C(λ)VX ΣX )k ) = tr((−C(λ)K0 )k ). Next, using the Taylor series expansion for the matrix exponential: ! ∞ X (−ΣX VXT C(λ)VX ΣX )i T tr(exp(−ΣX VX C(λ)VX ΣX )) = tr i! i=0  ∞ X tr (−ΣX VXT C(λ)VX ΣX )i = i! i=0  ∞ X tr (−C(λ)K0 )i = tr(exp(−C(λ)K0 )). = i! i=0 Hence, the resulting dual problem is given by: min F (λ) = tr(exp(−C(λ)K0 )) + b(λ) λ s.t. λ ≥ 0. (5.5) ∂F Also, ∂λ = tr(exp(−C(λ)K0 )Ci K0 ) + bi . Hence, any first order smooth optimization method can i be used to solve the above dual problem. Also, similar to [KSD06], a Bregman’s cyclic projection method can be used to solve the primal problem (5.3). 5.2 Squared Frobenius Divergence The squared Frobenius norm divergence is defined as: Dfrob (W, W0 ) = 1 kW − W0 k2F , 2 and is a popular measure of distance between matrices. Consider the following instance of (4.1) with the squared Frobenius divergence as the objective function: min W s.t. Dfrob (W, ηI) tr(W XCi X T ) ≤ bi , W  0. ∀1 ≤ i ≤ m, (5.6) 16 Note that for η = 0 and Ci = (ea −eb )(ea −eb )T −(ea −ec )(ea −ec )T (relative distance constraints), the above problem (5.6) is the same as the one proposed by [SSSN04]. Below we see that, similar to [SSSN04], Theorem 4.2 in Section 4 guarantees kernelization for a more general class of Frobenius divergence based objective functions. It is easy to see that Dfrob (W, ηI) = tr(ffrob (W )), where ffrob (W ) = (W − ηI)T (W − ηI) = U ffrob (Λ)U T , W = U ΛU T is the eigenvalue decomposition of W and ffrob : R → R, ffrob (x) = (x − η)2 . Note that ffrob (x) is a strictly convex function with argminx ffrob (x) = η and ffrob (η) = 0. Hence, using Theorem 4.2, problem (5.1) is kernelizable since Dfrob (W, ηI) satisfies the required conditions. Using (4.7), the optimization problem to be solved is given by: min S s.t. kΣX VXT SVX ΣX k2F tr(ηCi K0 + Ci K0 SK0 ) ≤ bi , S ∀1 ≤ i ≤ m −ηK0−1 , (5.7) Also, note that kΣX VXT SVX ΣX k2F = tr(K0 SK0 S). The above problem can be solved using standard convex optimization techniques like interior point methods. 5.3 SDPs In this section we consider the case when the objective function in (4.1) is a linear function. A similar formulation for metric learning was proposed by [WBS05]. We consider the following generic semidefinite program (SDP) to learn a linear transformation W : tr(XC0 X T W ) min W tr(W XCi X T ) ≤ bi , s.t. ∀1 ≤ i ≤ m W  0. (5.8) Here we show that this problem can be efficiently solved for high dimensional data in its kernel space. 
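As a numerical sanity check on the squared Frobenius case above: with W = ηI + XSX^T (the form guaranteed by Lemma 4.3), both the objective and the linear constraint values depend on the data only through K0 = X^T X. The short script below uses random stand-in data (not from the paper) to verify the identities behind (5.7).

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, eta = 50, 8, 1.0
X = rng.standard_normal((d, n))                  # feature-space data (columns)
S = rng.standard_normal((n, n)); S = (S + S.T) / 2
C = rng.standard_normal((n, n)); C = (C + C.T) / 2

K0 = X.T @ X
W = eta * np.eye(d) + X @ S @ X.T                # W = eta*I + X S X^T (Lemma 4.3)

# Objective: D_frob(W, eta*I) = 0.5 * ||W - eta*I||_F^2 = 0.5 * tr(K0 S K0 S).
assert np.isclose(0.5 * np.linalg.norm(W - eta * np.eye(d), 'fro') ** 2,
                  0.5 * np.trace(K0 @ S @ K0 @ S))

# Constraint value: tr(W X C X^T) = tr(eta*C*K0 + C*K0*S*K0), as in (4.4)-(4.5).
assert np.isclose(np.trace(W @ X @ C @ X.T),
                  np.trace(eta * C @ K0 + C @ K0 @ S @ K0))
```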
Theorem 5.1. Problem (5.8) is kernelizable. Proof. (5.8) has a linear objective, i.e., it is a non-strict convex problem that may have multiple solutions. A variety of regularizations can be considered that lead to slightly different solutions. Here, we consider two regularizations: • Frobenius norm: We add a squared Frobenius norm regularization to (5.8) so as to find the minimum Frobenius norm solution to (5.8) (when γ is sufficiently small): min W s.t. γ kW k2F 2 tr(W XCi X T ) ≤ bi , ∀1 ≤ i ≤ m, tr(XC0 X T W ) + W  0. (5.9) 17 Consider the following variational formulation of the problem: t + γkW k2F min min t W s.t. tr(W XCi X T ) ≤ bi , ∀1 ≤ i ≤ m T tr(XC0 X W ) ≤ t W  0. (5.10) Note that for constant t, the inner minimization problem in the above problem is similar to (5.6) and hence can be kernelized. Corresponding optimization problem is given by: min t + γ tr(K0 SK0 S) S,t s.t. tr(Ci K0 SK0 ) ≤ bi , ∀1 ≤ i ≤ m tr(C0 K0 SK0 ) ≤ t S  0, (5.11) Similar to (5.7), the above problem can be solved using convex optimization methods. • Log determinant: In this case we seek the solution to (5.8) with minimum determinant. To this effect, we add a log-determinant regularization: min W s.t. tr(XC0 X T W ) − γ log det W tr(W XCi X T ) ≤ bi , ∀1 ≤ i ≤ m, W  0. (5.12) The above regularization was also considered by [KSD09], which provided a fast projection algorithm for the case when each Ci is a one-rank matrix and discussed conditions for which the optimal solution to the regularized problem is an optimal solution to the original SDP. Consider the following variational formulation of (5.12): min min t W s.t. t − γ log det W tr(W XCi X T ) ≤ bi , ∀1 ≤ i ≤ m, T tr(XC0 X W ) ≤ t, W  0. (5.13) Note that the objective function of the inner optimization problem of (5.13) satisfies the conditions of Theorem 4.2, and hence (5.13) or equivalently (5.12) is kernelizable. 6 Experimental Results In Section 3, we presented metric learning as a constrained LogDet optimization problem to learn a linear transformation, and we showed that the problem can be efficiently kernelized. Kernelization yields two fundamental advantages over standard non-kernelized metric learning. First, a nonlinear kernel can be used to learn non-linear decision boundaries common in applications such as 18 0.35 LogDet Gaussian LogDet Linear LogDet Online Euclidean Inv. Covariance MCML LMNN 0.3 k−NN Error 0.25 0.2 0.15 0.1 0.05 0 Wine Ionosphere Balance Scale Iris Soybean Figure 1: Results over benchmark UCI data sets. LogDet metric learning was run with in input space (LogDet Linear) as well as in kernel space with a Gaussian kernel (LogDet Gaussian). image analysis. Second, in Section 3.6, we showed that the kernelized problem can be learned with respect to a reduced basis of size k, admitting a learned kernel parameterized by O(k2 ) values. When the number of training examples n is large, this represents a substantial improvement over optimizing over the entire O(n2 ) kernel matrix, both in terms of computationally efficiency as well as statistical robustness. In this section, we present experiments from two domains: text analysis and imaging processing. As mentioned, image data sets tend to have highly non-linear decision boundaries. To this end, we learn a kernel matrix when the baseline kernel K0 is the pyramid match kernel, a method specifically designed for object/image recognition [GD05]. 
In contrast, text data sets tend to perform quite well with linear models, and the text experiments presented here have large training sets. We show that high quality metrics can be learned using a relatively small set of basis vectors. We evaluate performance of our learned distance metrics in the context of classification accuracy for the k-nearest neighbor algorithm. Our k-nearest neighbor classifier uses k = 10 nearest neighbors (except for section 6.2 where we use k = 1), breaking ties arbitrarily. We select the value of k arbitrarily and expect to get slightly better accuracies using cross-validation. Accuracy is defined as the number of correctly classified examples divided by the total number of classified examples. For our proposed algorithms, pairwise constraints are inferred from true class labels. For each class i, 100 pairs of points are randomly chosen from within class i and are constrained to be similar, and 100 pairs of points are drawn from classes other than i to form dissimilarity constraints. Given c classes, this results in 100c similarity constraints, and 100c dissimilarity constraints, for a total of 200c constraints. The upper and lower bounds for the similarity and dissimilarity constraints are determined empirically as the 1st and 99th percentiles of the distribution of distances computed using a baseline Mahalanobis distance parameterized by W0 . Finally, the slack penalty parameter γ used by our algorithms is cross-validated using values {.01, .1, 1, 10, 100, 1000}. All metrics are trained using data only in the training set. Test instances are drawn from the test set and are compared to examples in the training set using the learned distance function. The test and training sets are established using a standard two-fold cross validation approach. For experiments in which a baseline distance metric is evaluated (for example, the squared Euclidean distance), nearest neighbor searches are again computed from test instances to only those instances in the training set. 19 6.1 Low-Dimensional Data Sets First we evaluate our metric learning method on the standard UCI datasets in the low-dimensional (non-kernelized) setting, to directly compare with several existing metric learning methods. In Figure 1, we compare LogDet Linear (K0 equals the linear kernel) and the LogDet Gaussian (K0 equals Gaussian kernel in kernel space) algorithms against existing metric learning methods for kNN classification. We use the squared Euclidean distance, d(x, y) = (x − y)T (x − y) as a baseline method. We also use a Mahalanobis distance parameterized by the inverse of the sample covariance matrix. This method is equivalent to first performing a standard PCA whitening transform over the feature space and then computing distances using the squared Euclidean distance. We compare our method to two recently proposed algorithms: Maximally Collapsing Metric Learning [GR05] (MCML), and metric learning via Large Margin Nearest Neighbor [WBS05] (LMNN). Consistent with existing work such as [GR05], we found the method of [XNJR02] to be very slow and inaccurate, so the latter was not included in our experiments. As seen in Figure 1, LogDet Linear and LogDet Gaussian algorithms obtain somewhat higher accuracy for most of the datasets. 
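The constraint-generation protocol described earlier in this section is easy to reproduce. The sketch below builds the 200c pairwise constraints and the thresholds u and ℓ from class labels and a matrix of baseline distances; the sampling details (e.g., how repeated pairs are handled) are our own simplifications.

```python
import numpy as np

def make_constraints(labels, D_baseline, pairs_per_class=100, rng=None):
    """Sample constraints as in Section 6: for each class, 100 within-class pairs
    (similar) and 100 across-class pairs (dissimilar). Thresholds u and ell are
    the 1st and 99th percentiles of the baseline pairwise distances."""
    rng = rng or np.random.default_rng(0)
    labels = np.asarray(labels)
    similar, dissimilar = [], []
    for c in np.unique(labels):
        idx_in = np.flatnonzero(labels == c)
        idx_out = np.flatnonzero(labels != c)
        for _ in range(pairs_per_class):
            i, j = rng.choice(idx_in, size=2, replace=False)
            similar.append((i, j))
            i, k = rng.choice(idx_in), rng.choice(idx_out)
            dissimilar.append((i, k))
    # Use off-diagonal baseline distances to set the distance thresholds.
    tri = D_baseline[np.triu_indices_from(D_baseline, k=1)]
    u, ell = np.percentile(tri, 1), np.percentile(tri, 99)
    return similar, dissimilar, u, ell
```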
0.4 0.2 Error 0.3 0.45 LogDet Linear Euclidean MCML LMNN 0.35 0.3 0.1 0.25 0.2 0.0 Error 0.5 LogDet Linear LogDet−Inverse Covariance Euclidean Inverse Covariance MCML LMNN 5 Latex Mpg321 Foxpro Iptables (a) Clarify Datasets 10 15 20 Number of Dimensions 25 (b) Latex Figure 2: Classification error rates for k-nearest neighbor software support via different learned metrics. We see in figure (a) that LogDet Linear is the only algorithm to be optimal (within the 95% confidence intervals) across all datasets. LogDet is also robust at learning metrics over higher dimensions. In (b), we see that the error rate for the Latex dataset stays relatively constant for LogDet Linear. In addition to our evaluations on standard UCI datasets, we also apply our algorithm to the recently proposed problem of nearest neighbor software support for the Clarify system [HRD+ 07]. The basis of the Clarify system lies in the fact that modern software design promotes modularity and abstraction. When a program terminates abnormally, it is often unclear which component should be responsible for (or is capable of) providing an error report. The system works by monitoring a set of predefined program features (the datasets presented use function counts) during program runtime which are then used by a classifier in the event of abnormal program termination. Nearest neighbor searches are particularly relevant to this problem. Ideally, the neighbors returned should not only have the correct class label, but should also represent those with similar program configurations 20 Table 1: Training time (in seconds) for the Dataset LogDet Linear Latex 0.0517 Mpg321 0.0808 Foxpro 0.0793 Iptables 0.149 results presented in Figure 2(b). MCML LMNN 19.8 0.538 0.460 0.253 0.152 0.189 0.0838 4.19 Table 2: Unsupervised k-means clustering error using the baseline squared Euclidean distance, along with semi-supervised clustering error with 50 constraints. Dataset Unsupervised LogDet Linear HMRF-KMeans Ionosphere 0.314 0.113 0.256 Digits-389 0.226 0.175 0.286 or program inputs. Such a matching can be a powerful tool to help users diagnose the root cause of their problem. The four datasets we use correspond to the following softwares: Latex (the document compiler, 9 classes), Mpg321 (an mp3 player, 4 classes), Foxpro (a database manager, 4 classes), and Iptables (a Linux kernel application, 5 classes). Our experiments on the Clarify system, like the UCI data, are over fairly low-dimensional data. It was shown [HRD+ 07] that high classification accuracy can be obtained by using a relatively small subset of available features. Thus, for each dataset, we use a standard information gain feature selection test to obtain a reduced feature set of size 20. From this, we learn metrics for k-NN classification using the methods developed in this paper. Results are given in Figure 2(b). The LogDet Linear algorithm yields significant gains for the Latex benchmark. Note that for datasets where Euclidean distance performs better than using the inverse covariance metric, the LogDet Linear algorithm that normalizes to the standard Euclidean distance yields higher accuracy than that regularized to the inverse covariance matrix (LogDet-Inverse Covariance). In general, for the Mpg321, Foxpro, and Iptables datasets, learned metrics yield only marginal gains over the baseline Euclidean distance measure. Figure 2(c) shows the error rate for the Latex datasets with a varying number of features (the feature sets are again chosen using the information gain criteria). 
We see here that LogDet Linear is surprisingly robust. Euclidean distance, MCML, and LMNN all achieve their best error rates for five dimensions. LogDet Linear, however, attains its lowest error rate of .15 at d = 20 dimensions. In Table 1, we see that LogDet Linear generally learns metrics significantly faster than other metric learning algorithms. The implementations for MCML and LMNN were obtained from their respective authors. The timing tests were run on a dual processor 3.2 GHz Intel Xeon processor running Ubuntu Linux. Time given is in seconds and represents the average over 5 runs. We also present some semi-supervised clustering results for two of the UCI data sets. Note that both MCML and LMNN are not amenable to optimization subject to pairwise distance constraints. Instead, we compare our method to the semi-supervised clustering algorithm HMRFKMeans [BBM04]. We use a standard 2-fold cross validation approach for evaluating semi-supervised clustering results. Distances are constrained to be either similar or dissimilar, based on class values, and are drawn only from the training set. The entire dataset is then clustered into c clusters using k-means (where c is the number of classes) and error is computed using only the test set. Table 2 provides results for the baseline k-means error, as well as semi-supervised clustering results with 50 constraints. 21 Caltech 101: Comparison to Existing Methods 80 70 mean recognition rate per class 60 50 40 ML+SUM ML+CORR ML+PMK Frome et al. (ICCV07) Zhang et al.(CVPR06) Lazebnik et al. (CVPR06) Berg (thesis) Mutch & Lowe(CVPR06) Grauman & Darrell(ICCV 2005) Berg et al.(CVPR05) Wang et al.(CVPR06) Holub et al.(ICCV05) Serre et al.(CVPR05) Fei−Fei et al. (ICCV03) SSD baseline 30 20 10 0 5 10 15 number of training examples per class 20 25 Figure 3: Caltech-101: Comparison of LogDet based metric learning method with other stateof-the-art object recognition methods. Our method outperforms all other single metric/kernel approaches. ML+SUM refers to our learned kernel when the average of four kernels (PMK [GD05], SPMK [LSP06], Geoblur-1, Geoblur-2 [BM01]) is the base kernel, ML+PMK refers to the learned kernel over the pyramid match [GD05] as the base kernel, and ML+CORR refers to the learned kernel when the correspondence kernel of [ZBMM06] is the base kernel. 6.2 Metric Learning for Object Recognition Next we evaluate our method over high-dimensional data applied to the object-recognition task using Caltech-101 [Cal04], a common benchmark for this task. The goal is to predict the category of the object in the given image using a k-NN classifier. We compute distances between images using learning kernels with three different base image kernels: 1) PMK: Grauman and Darrell’s pyramid match kernel [GD05] applied to SIFT features, 2) CORR: the kernel designed by [ZBMM06] applied to geometric blur features , and 3) SUM: the average of four image kernels, namely, PMK [GD05], Spatial PMK [LSP06], Geoblur-1, and Geoblur-2 [BM01]. Note that the underlying dimensionality of these embeddings are typically in the millions of dimensions. We evaluate the effectiveness of metric/kernel learning on this dataset. We pose a k-NN classification task, and evaluate both the original (SUM, PMK or CORR) and learned kernels. We set k = 1 for our experiments; this value was chosen arbitrarily. 
We vary the number of training examples T per class for the database, using the remainder as test examples, and measure accuracy in terms of the mean recognition rate per class, as is standard practice for this dataset.

[Figure 4: "Caltech 101: Gains over Baseline" — mean recognition rate per class versus number of training examples per class (5 to 25), comparing ML+Sum (PMK, SPMK, Geoblur), ML+Zhang et al. (CVPR06), and ML+PMK against NN+Sum (PMK, SPMK, Geoblur), NN+Zhang et al. (CVPR06), and NN+PMK.]

Figure 4: Object recognition on the Caltech-101 dataset. Our learned kernels significantly improve NN recognition accuracy relative to their non-learned counterparts: the SUM (average of four kernels), CORR, and PMK kernels.

Figure 3 shows our results relative to several other existing techniques that have been applied to this dataset. Our approach outperforms all existing single-kernel classifier methods when using the learned CORR kernel: we achieve 61.0% accuracy for T = 15 and 69.6% accuracy for T = 30. Our learned PMK achieves 52.2% accuracy for T = 15 and 62.1% accuracy for T = 30. Similarly, our learned SUM kernel achieves 73.7% accuracy for T = 15. Figure 4 compares the learned kernels directly to the original baseline kernels for NN classification. The plot reveals gains in 1-NN classification accuracy; notably, our learned kernels with simple NN classification also outperform the baseline kernels when used with SVMs [ZBMM06, GD05].

6.3 Metric Learning for Text Classification

Next we present results in the text domain. Our text datasets are created using standard bag-of-words Tf-Idf representations. Words are stemmed using a standard Porter stemmer, common stop words are removed, and the text models are limited to the 5,000 words with the largest document frequency counts. We provide experiments for two data sets: CMU Newsgroups [CMU08] and Classic3 [Cla08]. Classic3 is a relatively small 3-class problem with 3,891 instances. The newsgroup data set is much larger, having 20 different classes from various newsgroup categories and 20,000 instances. As mentioned earlier, our text experiments use a linear kernel, and we use a set of basis vectors that is constructed from the class labels via the following procedure. Let c be the number of distinct classes and let k be the size of the desired basis. If k = c, then each class mean $r_i$ is computed to form the basis $R = [r_1 \ldots r_c]$. If k < c, a similar process is used but restricted to a randomly selected subset of k classes. If k > c, instances within each class are clustered into approximately $k/c$ clusters, and each cluster's mean vector is then computed to form the set of low-rank basis vectors R.

[Figure 5: two panels plotting classification accuracy versus basis size for LogDet Linear, LSA, LMNN, and Euclidean: (a) Classic3 (basis sizes 2 to 5), (b) 20-Newsgroups (kernel basis sizes 5 to 25).]

Figure 5: Classification accuracy for our Mahalanobis metrics learned over bases of different dimensionality. Overall, our method (LogDet Linear) significantly outperforms existing methods.

Figure 5 shows classification accuracy across bases of varying sizes for the Classic3 dataset, along with the newsgroup data set. As baseline measures, the standard squared Euclidean distance is shown, along with Latent Semantic Analysis (LSA) [DDL+90], which works by projecting the data via principal components analysis (PCA) and computing distances in this projected space.
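For concreteness, the basis-construction procedure described above can be sketched as follows. This is a minimal sketch under the stated rules (class means for k = c, a random subset of class means for k < c, within-class cluster means for k > c); the use of scikit-learn's KMeans and of a dense feature matrix are our own assumptions, not details from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_class_basis(X, y, k):
    """Construct roughly k basis vectors from labeled data (X: n x d dense, y: labels)."""
    classes = np.unique(y)
    c = len(classes)
    if k <= c:
        # k == c: one mean per class; k < c: means of a random subset of k classes.
        chosen = classes if k == c else np.random.choice(classes, size=k, replace=False)
        basis = [X[y == cls].mean(axis=0) for cls in chosen]
    else:
        # k > c: cluster each class into roughly k/c clusters and keep the cluster means.
        # (The total number of centers may slightly exceed k due to rounding.)
        per_class = int(np.ceil(k / c))
        basis = []
        for cls in classes:
            Xc = X[y == cls]
            n_clusters = min(per_class, len(Xc))
            km = KMeans(n_clusters=n_clusters, n_init=10).fit(Xc)
            basis.extend(km.cluster_centers_)
    return np.array(basis).T   # columns are the basis vectors R = [r_1 ... r_k]
```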
Comparing our algorithm to the baseline Euclidean measure, we see that for smaller bases the accuracy of our algorithm is similar to that of the Euclidean measure; as the size of the basis increases, our method obtains significantly higher accuracy than the baseline.

7 Conclusions

In this paper, we have considered the general problem of learning a linear transformation of the input data and applied it to the problem of implicitly learning a metric over high-dimensional data or feature spaces, i.e., learning a kernel of the form $\phi(x_i)^T A \phi(x_j)$. We first showed that the LogDet divergence is a useful loss for learning a linear transformation (or, equivalently, performing metric learning), as the resulting algorithm can easily be generalized to work in kernel space. We then proposed an algorithm based on Bregman projections to learn a kernel function over the data points efficiently. We also showed that the learned metric can be efficiently restricted to a small-dimensional basis, hence scaling our method to large datasets with high-dimensional feature spaces. We then considered a larger class of convex loss functions for learning the metric/kernel via a linear transformation of the data; we saw that many loss functions can lead to kernelization, though the resulting optimizations may be more expensive to solve than the simpler LogDet formulation. Finally, we presented experiments on benchmark data, high-dimensional vision, and text classification problems, comparing our method to several existing state-of-the-art techniques.

There are several directions for future work. To facilitate even larger data sets than the ones considered in this paper, online learning methods are one promising research direction; in [JKDG08], an online learning algorithm was proposed based on LogDet regularization, and this remains a part of our ongoing efforts. Recently, there has also been some interest in learning multiple local metrics over the data; [WS08] considered this problem. We plan to explore this setting with the LogDet divergence, with a focus on scalability to very large data sets.

Acknowledgements

This research was supported by NSF grant CCF-0728879. We would also like to acknowledge Suvrit Sra for various helpful discussions.

References

[AK07] S. Arora and S. Kale. A combinatorial, primal-dual approach to semidefinite programs. In ACM Symposium on Theory of Computing (STOC), pages 227–236. 2007.

[BBM04] S. Basu, M. Bilenko, and R. J. Mooney. A probabilistic framework for semi-supervised clustering. In ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD). 2004.

[BM01] A. C. Berg and J. Malik. Geometric blur for template matching. In IEEE International Conference on Computer Vision and Pattern Recognition (CVPR). 2001.

[Cal04] Caltech-101 Data Set. Public Dataset, 2004. [link].

[CHL05] S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively, with application to face verification. In IEEE International Conference on Computer Vision and Pattern Recognition (CVPR). 2005.

[CKTK08] R. Chatpatanasiri, T. Korsrilabutr, P. Tangchanachaianan, and B. Kijsirikul. On kernelization of supervised Mahalanobis distance learners. ArXiv, 2008. http://arxiv.org/pdf/0804.1441.

[Cla08] Classic3 Data Set. ftp.cs.cornell.edu/pub/smart, 2008.

[CMU08] CMU 20-Newsgroups Data Set. http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.html, 2008.

[DD08] J. V. Davis and I. S. Dhillon. Structured metric learning for high dimensional problems. In ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD), pages 195–203. 2008.
[DDL+90] S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407, 1990.

[DKJ+07] J. Davis, B. Kulis, P. Jain, S. Sra, and I. Dhillon. Information-theoretic metric learning. In Int. Conf. on Machine Learning (ICML). 2007.

[Fle91] R. Fletcher. A new variational result for quasi-Newton formulae. SIAM Journal on Optimization, 1(1), 1991.

[GD05] K. Grauman and T. Darrell. The Pyramid Match Kernel: Discriminative Classification with Sets of Image Features. In International Conference on Computer Vision (ICCV). 2005.

[GLS88] M. Grötschel, L. Lovász, and A. Schrijver. Geometric Algorithms and Combinatorial Optimization. Springer-Verlag, 1988.

[GR05] A. Globerson and S. Roweis. Metric learning by collapsing classes. In Adv. in Neural Inf. Proc. Sys. (NIPS). 2005.

[GRHS04] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov. Neighbourhood components analysis. In Adv. in Neural Inf. Proc. Sys. (NIPS). 2004.

[Hig08] N. J. Higham. Functions of Matrices: Theory and Computation. SIAM, 2008.

[HRD+07] J. Ha, C. Rossbach, J. Davis, I. Roy, D. Chen, H. Ramadan, and E. Witchel. Improved error reporting for software that uses black box components. In Programming Language Design and Implementation (PLDI). 2007.

[HT96] T. Hastie and R. Tibshirani. Discriminant adaptive nearest neighbor classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18:607–616, 1996.

[JKDG08] P. Jain, B. Kulis, I. S. Dhillon, and K. Grauman. Online metric learning and fast similarity search. In Adv. in Neural Inf. Proc. Sys. (NIPS), pages 761–768. 2008.

[JS61] W. James and C. Stein. Estimation with quadratic loss. In Fourth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 361–379. Univ. of California Press, 1961.

[KSD06] B. Kulis, M. Sustik, and I. S. Dhillon. Learning low-rank kernel matrices. In Int. Conf. on Machine Learning (ICML). 2006.

[KSD08] B. Kulis, M. Sustik, and I. Dhillon. Low-rank kernel learning with Bregman matrix divergences. Journal of Machine Learning Research, 2008.

[KSD09] B. Kulis, S. Sra, and I. S. Dhillon. Convex perturbations for scalable semidefinite programming. In International Conference on Artificial Intelligence and Statistics (AISTATS). 2009.

[KT03] J. Kwok and I. Tsang. Learning with idealized kernels. In Int. Conf. on Machine Learning (ICML). 2003.

[LCB+04] G. R. G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. I. Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 2004.

[Leb06] G. Lebanon. Metric learning for text documents. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(4):497–508, 2006.

[LSP06] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), pages 2169–2178. 2006.

[NC00] M. A. Nielsen and I. L. Chuang. Quantum Computation and Quantum Information. Cambridge University Press, 2000.

[OSW03] C. S. Ong, A. J. Smola, and R. C. Williamson. Hyperkernels. In Adv. in Neural Inf. Proc. Sys. (NIPS). 2003.

[SHWP02] N. Shental, T. Hertz, D. Weinshall, and M. Pavel. Adjustment learning and relevant component analysis. In European Conference on Computer Vision (ECCV). Copenhagen, DK, 2002.

[SJ03] M. Schultz and T. Joachims. Learning a distance metric from relative comparisons. In Adv. in Neural Inf. Proc. Sys. (NIPS). 2003.
[SSSN04] S. Shalev-Shwartz, Y. Singer, and A. Y. Ng. Online and batch learning of pseudo-metrics. In Int. Conf. on Machine Learning (ICML). 2004.

[TK06] I. W. Tsang and J. T. Kwok. Efficient hyperkernel learning using second-order cone programming. IEEE Transactions on Neural Networks, 17(1):48–58, 2006.

[TRW05] K. Tsuda, G. Rätsch, and M. Warmuth. Matrix exponentiated gradient updates for online learning and Bregman projection. Journal of Machine Learning Research, 6:995–1018, 2005.

[WBS05] K. Q. Weinberger, J. Blitzer, and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. In Adv. in Neural Inf. Proc. Sys. (NIPS). 2005.

[WK08] M. K. Warmuth and D. Kuzmin. Randomized online PCA algorithms with regret bounds that are logarithmic in the dimension. Journal of Machine Learning Research, 9:2287–2320, 2008.

[WS08] K. Q. Weinberger and L. K. Saul. Fast solvers and efficient implementations for distance metric learning. In Int. Conf. on Machine Learning (ICML). 2008.

[XNJR02] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell. Distance metric learning with application to clustering with side-information. In Adv. in Neural Inf. Proc. Sys. (NIPS), volume 14. 2002.

[ZBMM06] H. Zhang, A. Berg, M. Maire, and J. Malik. SVM-KNN: Discriminative Nearest Neighbor Classification for Visual Category Recognition. In IEEE International Conference on Computer Vision and Pattern Recognition (CVPR). 2006.