Academia.eduAcademia.edu

Vanishing Component Analysis

2013

The vanishing ideal of a set of points, S ⊂ R n , is the set of all polynomials that attain the value of zero on all the points in S. Such ideals can be compactly represented using a small set of polynomials known as generators of the ideal. Here we describe and analyze an efficient procedure that constructs a set of generators of a vanishing ideal. Our procedure is numerically stable, and can be used to find approximately vanishing polynomials. The resulting polynomials capture nonlinear structure in data, and can for example be used within supervised learning. Empirical comparison with kernel methods show that our method constructs more compact classifiers with comparable accuracy.

Vanishing Component Analysis Roi Livni∗ roi.livni@mail.huji.ac.il Interdiscplinary Center for Neural Computation Edmond and Lily Safra Center for Brain Sciences, The Hebrew University of Jerusalem Givat Ram, Jerusalem 91904, Israel David Lehavi∗ , Sagi Schein, Hila Nachlieli david.lehavi, sagi.schein, hila.nachlieli@hp.com Hewlett-Packard Laboratories Israel Ltd. Technion City Haifa 32000 Israel Shai Shalev-Shwartz, Amir Globerson shais, gamir@cs.huji.ac.il Rachel and Selim Benin School of Computer Science and Engineering, The Hebrew University of Jerusalem Givat Ram, Jerusalem 91904, Israel Abstract The vanishing ideal of a set of points, S ⊂ Rn , is the set of all polynomials that attain the value of zero on all the points in S. Such ideals can be compactly represented using a small set of polynomials known as generators of the ideal. Here we describe and analyze an efficient procedure that constructs a set of generators of a vanishing ideal. Our procedure is numerically stable, and can be used to find approximately vanishing polynomials. The resulting polynomials capture nonlinear structure in data, and can for example be used within supervised learning. Empirical comparison with kernel methods show that our method constructs more compact classifiers with comparable accuracy. 1. Introduction Classification algorithms can only be as good as the features they work with. For example, in linear classification, high accuracy will only be obtained if our features are such that the classes are linearly separable. The problem of feature extraction has thus traditionally attracted considerable interest in the machine learning literature. One conceptually simple approach to describing a set of points S is to find a set of equations that each data point x ∈ S should (approximately) satisfy. In other words, we seek a set of functions f1 (x), . . . , fk (x) such * These authors contributed equally. Proceedings of the 30 th International Conference on Machine Learning, Atlanta, Georgia, USA, 2013. JMLR: W&CP volume 28. Copyright 2013 by the author(s). that fi (x) ≈ 0 for all i and x ∈ S. If the set of points S belongs to a particular class (say the digit eight) then these functions may provide a succinct characterization of elements in the class. We can thus extract such functions for all classes and use them as features in classification. Clearly the complexity of the functions fi (x) should be restricted so that the description is both compact and interpretable. A natural class which we focus on here is degree bounded polynomials. The goal of our work is to find a small set of such polynomials for a given set of points. The set of all polynomials f (x) that attain a value of zero on a set S is known as the vanishing ideal of S and is denoted by I(S) (i.e., f ∈ I(S) iff f (x) = 0 ∀x ∈ S). This set would qualify as a description of S. However, it contains an infinite number of polynomials and we would rather represent it with finitely many polynomials if possible. The first key observation is that if f (x) ∈ I(S) and h(x) is any polynomial, then hf is also in I(S). Thus, it is natural to ask whether there are f1 (x), . . . , fk (x) such P that any g ∈ I(S) can be represented as g(x) = i fi (x)hi (x) for some polynomials hi (x). Such a set of fi (x) is known as a set of generators of the ideal I(S). Luckily, Hilbert’s basis theorem (Cox et al., 2007) tells us that a finite set of generators exists for any ideal. A finite set of generators is an attractive mechanism for describing I(S) since all elements in I(S) can be derived from the set of generators. Thus, we turn our attention to finding such a finite set of generators, whose elements we call Vanishing Components. Current machine learning approaches do not offer a solution to the above problem. One seemingly relevant Vanishing Component Analysis approach is kernel PCA (Schölkopf et al., 1998). However, as we argue later, kernel PCA cannot be used to find vanishing components, since the kernel trick is inapplicable in this case. Linear PCA can be used to find vanishing components only in the case where these are linear, which is not expected to hold generally. The work that is closest in spirit to ours is the approximately vanishing ideal (AVI) algorithm in (Heldt et al., 2009). AVI requires an additional lexical order on the original features, and is also geared towards functions with a small number of monomials, which is often not the case. Our vanishing component analysis (VCA) algorithm takes as input a set S and outputs a set of polynomials V = {f1 (x), . . . , fk (x)}. VCA has the following attractive properties: • The set V generates I(S). • The algorithm is polynomial in |S| and in the original dimension n. Furthermore, fi (x) can be evaluated in time polynomial in |S| and n. • The algorithm does not depend on any lexical ordering of the variables. Thus, we achieve the goal of efficiently finding a generator set of I(S) and obtain a compact description of S. In practice, due to noisy data, we might wish to find a set of polynomials that only approximately vanish over the the set. To address this issue, our algorithm is dependent on an ǫ-tolerance parameter, and we search for polynomials that approximately vanish on the set S. In section 7 we illustrate its use as a feature learning procedure for classification. 2. Preliminaries and Problem Setup We begin with basic definitions. n Definition 1 (Monomials). A function f : RQ → R is n i called a monomial if it is of the form f (x) = i=1 xα i where each αi Q is a non-negative integer. We also use n i denote xα P = i=1 xα i . The degree of the monomial n is kαk1 = i=1 αi . We denote the set of monomials over n variables by T n , and the set of monomials of total degree up to d by Tdn . Definition 2 (Polynomials). A function f : Rn → R is called a polynomial if it is a weighted sum of monoP (j) mials. That is, f (x) = j βj xα , where each βj ∈ R (j) and each αi is a non-negative integer. The degree of f is the maximal degree of its monomials. Definition 3 (The polynomial ring). The polynomial ring in n variables over R, denoted by R [x1 , . . . , xn ], is the set of all polynomials in n variables over the reals of finite degree. The addition and multiplication operators over the ring are equivalent to addition and multiplication of functions. That is, if h = f + g then for all x, h(x) = f (x) + g(x) and if h = f g then for all x, h(x) = f (x)g(x). Definition 4 (Ideal). A set of polynomials I is an ideal if it is a sub-group with respect to addition (meaning that it is closed under addition and negation and contains the zero polynomial) and it “absorbs multiplication”, meaning that for any f ∈ I, g ∈ R [x1 , . . . , xn ] we have f g ∈ I. Definition 5 (Set of Generators). Given an ideal I. A set of polynomials {f1 , . . . , fk } ⊆ I is said to generate I, if ∀f ∈P I there exist g1 , . . . , gk ∈ R [x1 , . . . , xn ] such that f = i gi fi . In this case we denote the ideal by I(F ) where F = {f1 , . . . , fk }. Note that this should not be confused with I(S). Definition 6 (Vanishing Ideal). Given a set S ⊂ Rn , the vanishing ideal of S is the set of polynomials that vanish on S. We denote it by I(S) (it’s easy to see that it’s an ideal). That is, for all x ∈ S and f ∈ I(S) we have f (x) = 0. Definition 7 (Algebraic Set). A set V ⊂ Rn , is called an algebraic set if there is a finite set of polynomials {pi }ki=1 , such that V are the common roots of {pi }ki=1 . Our problem setup is as follows. We are given a sample (i) ∈ Rn . Our goal is Sm of points {x(i) }m i=1 where x to find a set of generators of I(Sm ). As mentioned earlier this is desirable since it succinctly captures all polynomials that vanish on Sm . To make the algorithm practical, we seek a method with a polynomial running time in m and n. Additionally, since real world data is noisy, we allow some tolerance by looking for polynomials that “almost” vanish on Sm . In Section 4 we describe our Vanishing Component Analysis (VCA). Before presenting VCA, we first discuss an approach for finding a set of generators that is simple to understand but is exponential in the sample size m. We also show that the kernel trick cannot be used to overcome this difficulty. 3. A Simple but Impractical Approach One approach to finding generators for I(Sm ) is to use a linear algebraic method. Assume for simplicity that we know there is a set of generators of I(Sm ) of maximal degree D. Now consider the set of monomials TDn and construct a matrix A of size (m, |TDn |) as follows: Aij = tj (x(i) ) , where tj (x) is the j th monomial in TDn . Vanishing Component Analysis Note that the number of columns in A is exponential in D. We now claim that the null space of A can be used to obtain a set of generators of I(Sm ). Proposition 3.1. Denote by V a set of vectors v 1 , . . . , v k which are a basis of the null space of A. Namely, for all i = 1, . . . , k we have Av i = 0, and any vector v such that Av = 0 can be written as a linear combination of the v i . Then the polynomiPn als fi (x) = j=1 vij tj (x) form a set of generators of I(Sm ). Proof. Clearly fi (x(j) ) = 0 for all sample points x(j) . Thus fi (x) ∈ I(Sm ). Next, we show that for any set of generators of I(Sm ) of maximal degree D, any polynomial in the set of generators can be obtained as a linear combination of fi . Consider such a polynomial g(x). By our assumption it is of max degree D, and hence can be written as a linear combination of the monomials in TDn . Denote the vector of the corresponding n coefficients by z ∈ R|Td | . Then Az = 0 and hence z is in the null space of A and is spanned by v i . Thus the polynomial g(x) can be written as a linear combination of the fi (x) polynomials. Thus any polynomial in a set of generators of I(Sm ) can be (linearly) generated by fi (x) and we conclude that fi (x) are generators of I(Sm ). The above procedure achieves the goal of finding a set of generators of I(Sm ). However, it does so at a cost exponential in D, which is impractical even for modest D values. Furthermore, the value of D, the maximum degree of the generator set, may be O(m).1 Thus the cost of the above algorithm can be exponential in m. Next, we show why the standard use of the kernel trick cannot be used to overcome this difficulty. 3.1. Kernels Can’t Help! The kernel trick is an elegant method for avoiding working in high dimensional feature spaces explicitly. For example it can be used to perform non-linear PCA, and to find non linear polynomial separators using kernel SVM (e.g., see Scholkopf and Smola, 2001). It may seem like the kernel trick can be used to find vanishing components without explicitly calculating all monomials in Tdn . However, perhaps surprisingly, this is not 1 As an example where D = m, consider the following: Let p be a polynomial of degree m in R, and let {r1 , . . . , rm } be its real roots. Choose some random unit vector v in Rn and let Sm = {r1 v, r2 v, . . . , rm v}. I(Sm ) is the ideal generated by the polynomials {p(v T x), uT1 x, . . . , uTn−1 x}, where v ⊥ u1 , . . . , un−1 . There is no alternative generator of I(Sm ) that does not contain a polynomial of total degree at least m. the case, as we argue next. We begin by recalling the kernel trick idea for the polynomial case, as used for kernel PCA (KPCA). The goal in KPCA is to perform PCA in the (exponentially large) vector of monomials in x of degree d, and find the projection on these components. To do this efficiently, one considers the kernel function k(x, y) = (1 + x · y)d . Then it can be shown that if the principal components correspond to non-zero eigenvalues (as they always do) the projection on the j th principal P component is given by a polynomial of the form i αij k(x, x(i) ), where αij are eigenvectors of the kernel matrix. Following a similar rationale one might posit that the vanishing polynomials are also of this form. However, as the following result shows, the vanishing polynomials cannot be expressed in this fashion. Theorem 3.2. Let k be a reproducing kernel and f ∈ span(k(·, x(i) )) such that f vanishes on all x(i) . Then f is the zero function.   c vT , where v K Ki,j := k(x(i) , x(j) ), c := k(x, x) and v i := k(x(i) , x). Under these notations, we need to prove that for every α in the null space of K, we have αT v = 0. Proof. Given x, define K̃ := The reproducing property ensures that K̃ is a positive semidefinite matrix. The Schur complement of c in K̃ is defined as A = K − 1c vv T . It is known that if K̃ is positive semidefinite then, c = 0 implies v = 0, and if c > 0 then A must be positive semidefinite (Boyd and Vandenberghe, 2004). But for any α in the null space of K we will have |αT v|2 = −cαT Aα ≤ 0. and thus αT v = 0. The above theorem says that we cannot use the kernel trick (i.e., use the kernel matrix instead of the explicit monomial vector) to find the vanishing components. 4. The VCA algorithm Recall that our goal is to find a set of generators for I(Sm ). Since we are dealing with noisy data, it is unreasonable to seek generators that exactly vanish on Sm , and in our VCA procedure we use a tolerance parameter to allow generators to approximately vanish. In what follows, we give a step by step description of the algorithm and its rationale. The procedure itself is described in Figures 1 and 2 and its properties are analyzed in Section 5. We can think of each polynomial f both in the usual sense, i.e. as a function from Rn to R, but also as a Vanishing Component Analysis vector in Rm containing the evaluations of f on the  sample Sm , namely, f (Sm ) = f (x(1) ), . . . , f (x(m) ) ∈ Rm . A polynomial f vanishes on Sm if and only if f (Sm ) = 0. To motivate the construction, let us first recall the case of the most simple polynomials — linear functions. Suppose we would like to find a set V of linear functions such that for each f ∈ V and x(i) ∈ Sm we have f (x(i) ) = 0. Each linear function is described Pn by a vector β ∈ Rn+1 , such that f (x) = β0 + j=1 βj xj . We can rewrite the linear function as a combination of base polynomials. Indeed, let f0 be the constant polyno√ mial, f0 (x) = 1/ m for all x. Let C1 = {f1 , . . . , fn } be a set of polynomials, where for all i, fi (x) = xi . Now, we can rewrite any linear function as a linear combination of polynomials from C1 ∪ {f0 }. That is, each linear function is of the form: f (x) = β0 + n X i=1 βi xi = n X βi fi (x) . i=0 It follows P that for any such polynomial we have n f (Sm ) = i=0 βi fi (Sm ). Therefore, a linear function vanishes on Sm if and only if f (Sm ) = 0 ∈ Rm . This amounts to requiring that the vector β would be in the null space of the m × (n + 1) matrix A1 = [f0 (Sm ), . . . , fn (Sm )]. To find the null space of A1 , we can follow the GramSchmidt procedure. As we show later, using the Singular Value Decomposition is preferable, since it provides us with a stable method for finding an approximate null space. However, for the sake of clarity, let us first describe the Gram-Schmidt approach. We maintain two sets: V for the vanishing polynomials and F for the non-vanishing polynomials. We use the notation F (Sm ) = {f (Sm ) : f ∈ F } ⊂ Rm to denote the vectors in Rm corresponding to evaluations of nonvanishing polynomials in F on Sm . We will construct F such that F (Sm ) is a set of orthonormal vectors in Rm . Since f0 is clearly non-vanishing we can initialize F = {f0 } and V = ∅. Now, at round t, consider the remainder of ft (Sm ) after projecting on the orthonormal set F (Sm ). That is, rt (Sm ) = ft (Sm ) − P hf (S Note that rt (Sm ) is t m ), f (Sm )i f (Sm ). f ∈F the evaluation of the polynomial rt (x) = ft (x) − P hf (S ), f (S )i f (x), on S . t m m m f ∈F Now, if rt (Sm ) is the zero vector, then rt is vanishing on Sm , so we update: V ← V ∪ {rt }. Otherwise, we update F ← F ∪ {rt /krt (Sm )k}, where the normalization ensures that all the vectors in F (Sm ) are of unit norm. At the end of this process, F contains a set of linear polynomials which are non-vanishing on Sm and V contains a set of linear polynomials that vanish on Sm . Furthermore, F (Sm ) is an orthonormal basis of the range of A1 . Let us call F1 and V1 the values of F and V after dealing with polynomials of degree 1. Next, consider polynomials of degree 2. Consider the set of polynomials C2 = {fi,j }ni,j=1 , where for all i, j, fi,j (x) = xi xjP . Each polynomial Pn of degree 2 takes the n form f (x) = i=0 βi fi (x) + i,j=1 βi,j fi,j (x) . As before, we can find vanishing 2nd order polynomials via the null space of the matrix: A2 = [A1 , f1,1 (Sm ), . . . , fn,n (Sm )]. To find the null space of the matrix A2 , we could simply continue the GramSchmidt procedure we have already performed for the columns of A1 . However, we now need to consider n2 columns, and as the degree goes up the number of columns increases exponentially. To overcome this obstacle, we rely on the underlying structure of the vanishing ideal, and in particular its absorbedness property. P Take some f of degree 2. Then, f = i gi hi , where gi , hi are all of degree at most 1. Without loss of generality, assume that for i ≤ i1 we have that both gi and hi are non vanishing on Sm , and that for i > i1 either gi or hi vanishes. It follows that for all i > i1 we have that the polynomial gi hi vanishes. Therefore, the P polynomial fˆ = i≤i1 gi hi satisfies fˆ(Sm ) = f (Sm ). In other words, f vanishes on Sm if and only if fˆ vanishes on Sm . Since, by our construction, f − fˆ is generated by V , it suffices to deal with fˆ(Sm ). Using our construction of F1 , we know that for all i ≤ i1 , both gi and hi are in the span of F1 . Denoting F1 = {p1 , . . . , pk }, we can write   XX X X X fˆ = αji pj βsi ps = pj ps  αji βsi  i≤i1 j s j,s i≤i1 It follows that fˆ is in the span of F1 × F1 and thus to construct F2 and V2 , it suffices to find the null space and range on the set of candidate polynomials in F1 × F1 . Formally, let us redefine C2 to be the set C2 = {f = gh : g, h ∈ F1 } , and let C2 (Sm ) = {f (Sm ) : f ∈ C2 } ⊂ Rm be the corresponding set of vectors obtained by evaluating all polynomials in C2 on Sm . We will construct F2 and V2 by continuing the Gram-Schmidt process on the candidate vectors in C2 (Sm ), as follows. Choose a polynomial f ∈ C2 and let f (Sm ) be the corresponding vector in C2 (Sm ). The remainder of f after projecting it on the current setPF is the polynomial r defined by r(x) = f (x) − g∈F hf (Sm ), g(Sm )ig(x). If Vanishing Component Analysis VCA parameters: Tolerance ǫ ; “FindRangeNull” procedure initialize: √ F = {f (·) = 1/ m}, V = ∅ C1 = {f1 , . . . , fn } where fi (x) = xi (F1 , V1 ) = FindRangeNull(F, C1 , Sm , ǫ) F = F ∪ F1 , V = V ∪ V1 for t = 2, 3 . . . Ct = {gh : g ∈ Ft−1 , h ∈ F1 } if Ct = ∅ break (Ft , Vt ) = FindRangeNull(F, Ct , Sm , ǫ) F = F ∪ Ft , V = V ∪ Vt output: F, V Figure 1. Our Vanishing Component Analysis (VCA) algorithm for finding a generator set for I(Sm ). r(Sm ) = 0, we add it to V2 and to V . Otherwise, we add r/kr(Sm )k to F2 and to F . This process continues to higher degrees. At iteration t, we construct the set of candidate polynomials to be Ft−1 ×F1 . Then, we construct Ft and Vt (and update F and V accordingly) by performing the Gram-Schmidt procedure on this set of candidate polynomials. The VCA procedure is described in Figures 1 and 2. We make one crucial modification to the above description. When constructing the new polynomials in Ft and Vt out of the candidates in Ct , we use the Singular Valued Decomposition procedure to find an approximate null space instead of performing the GramSchmidt procedure. This is obtained by calling a sub-procedure (Ft , Vt ) = FindRangeNull(F, Ct , Sm , ǫ), which receives the current set F , a list of candidates, Ct , the set of examples Sm , and a tolerance parameter ǫ. When ǫ = 0, the procedure can be implemented by initializing a Gram-Schmidt procedure with the orthonormal basis F (Sm ), and then continuing performing Gram-Schmidt orthonormalization on the candidate vectors f (Sm ) with f ∈ Ct . When ǫ > 0, we rely on the SVD procedure, as described below. As we will formally prove in the next section, if one sets ǫ = 0, then VCA is guaranteed to construct a set of generators of the vanishing ideal. We next describe how to implement the FindRangeNull procedure. It is known that finding an orthonormal basis using the Gram-Schmidt method can be numerically instable, and a preferred approach is by using the Singular Value Decomposition (SVD) approach. FindRangeNull (using SVD) input: F, C, Sm , ǫ denote k = |C| and C = {f1 , . . . , fk } for t = 1, . . . , k P let f˜i = fi − g∈F hfi (Sm ), g(Sm )ig let A = [f˜1 (Sm ), . . . , f˜k (Sm )] ∈ Rm,k decompose A = LDU ⊤ using SVD for i = 1, . . . , k Pk let gi = j=1 Uj,i f˜j output: F1 = {gi /kgi (Sm )k : Di,j > ǫ} V1 = {gi : Di,j ≤ ǫ} Figure 2. The implementation of the FindRangeNull function in Figure 1. First, given a candidate set C = {f1 , . . . , fk }, and a current set of polynomials F , we calculate f˜1 , . . . , f˜k to be the remainder of the polynomials in C after projecting them on F . Next, we use SVD to find the (approximate) range and null space of the matrix A = [f˜1 (Sm ), . . . , f˜k (Sm )], by calculating the right singular vectors of A. Finally, we define the corresponding polynomials based on the right singular vectors. 5. Analysis In this section we analyze the efficiency and correctness of the VCA procedure. We use the notation F t = ∪i≤t Ft and V t = ∪i≤t Vt . Our first theorem addresses with the efficiency of the VCA procedure. Theorem 5.1. If VCA is run with any ǫ then: 1. VCA stops after at most t ≤ m + 1 iterations. 2. All polynomials in V, F are of degree at most m. 3. |F | ≤ m and |V | ≤ |F |2 · min{|F |, n}. 4. It is possible to evaluate all the polynomials in F in time O(|F |2 ), and polynomials in V in tim O(|F |2 + |F | · |V |). Proof. 1. By the stopping rule of VCA, it will terminate at round t whenever Ft−1 is empty (since this will imply Ct = ∅). Therefore, if we didn’t stop at round t, then |F t | > t − 1, and therefore we also have that |F t (Sm )| ≥ t − 1. On the other hand, by construction, F t (Sm ) is a set of orthonormal vectors in Rm , hence its size is at most m. It follows that t − 1 ≤ m. Vanishing Component Analysis 2. It follows immediately that all the polynomials in V and F are of degree at most m. 3. We already argued that |F | ≤ m. Also, we only add polynomials to V in the first |F | iterations. Furthermore, whenever we add a polynomial to V , it is constructed from a set of candidates, Ct = Ft−1 × F1 , whose size is at most |F | · min{|F |, n}. Therefore, |V | ≤ |F |2 · min{|F |, n}. 4. Let us enumerate the polynomials in F = {g1 , . . . , g|F | } according to the order in which they were inserted into F . Since g1 is the constant polynomial, it can be evaluated in time O(1). Suppose we have evaluated g1 , . . . , gr−1 . By construction, gr can be written as a product of two polynomials from g1 , . . . , gr−1 minus a linear combination of g1 , . . . , gr−1 . Therefore, it can be evaluated in time O(r). It follows that we can evaluate all the polynomials in F in time O(|F |2 ). A similar argument shows that once we evaluated all polynomials in F , we can evaluate each polynomial in V in O(|F |). We next prove that VCA finds a generator of I(Sm ). Theorem 5.2. If VCA is run with ǫ = 0 its output satisfies: 1. V generates I(Sm ). 2. Any polynomial f can be written as f = g + h where g ∈ span(F ) and h ∈ I(Sm ). The proof of the theorem relies on the following lemma. Lemma 5.3. If VCA is run with ǫ = 0, then for all t, any polynomial f of degree at most t can be written as f = g + h where g ∈ span(F t ) and h ∈ I(V t ). Proof. We prove this claim by induction on t. The base of the induction, i.e. t = 1, follows by standard results from linear algebra. So, from now on we assume that t > 1. By our inductive assumption, for each i < t, any polynomial of degree at most i can be written as f = g + h where g ∈ span(F i ) and h ∈ I(V i ). Now, consider a polynomial f of degree t. Clearly, if each monomial of f can be decomposed to a sum of a vector from span(F t ) and a vector from I(V t ), then the claim holds for t as well. Furthermore, by our inductive assumption, any monomial of f whose degree is strictly smaller than t can be decomposed as required. Therefore, from now on we can assume that f is a single monomial of degree exactly t, which we can write f = pq where p is of degree t − 1 and q is of degree 1. Using our inductive assumption, we can write p = pF + pV where pF ∈ span(F t−1 ) and pV ∈ I(V t−1 ). Similarly, q = qF + qV where qF ∈ span(F 1 ) and qV ∈ I(V 1 ). Let us further rewrite pF = pF t−2 +pFt−1 , where pF t−2 ∈ span(F t−2 ) and pFt−1 ∈ span(Ft−1 ). Similarly, qF = qF 0 + qF1 , where qF 0 ∈ span(F 0 ) and qF1 ∈ span(F1 ). Therefore, f = pq = (pF t−2 + pFt−1 + pV )(qF 0 + qF1 + qV ) = pFt−1 qF1 + pV q + pqV + pF t−2 q + pqF 0 . {z } | {z } | ∈I(V t ) of degree <t Finally, since pFt−1 qF1 ∈ span(Ft−1 × F1 ), by our construction, it can be written as aF + aV where aF ∈ span(Ft ) and aV ∈ span(Vt ), which concludes our inductive argument. Proof of Theorem 5.2. Equipped with Lemma 5.3, we only need to show that V generates the vanishing ideal of Sm . Take any vanishing polynomial f . By Lemma 5.3, we can rewrite f = g + h where g ∈ span(F ) and h ∈ I(V ). Therefore, 0 = f (Sm ) = g(Sm ) + h(Sm ) = g(Sm ). Let us write F =P{f1 , . . . , fk }. Since g ∈ span(F ) we i αi fi . Therefore, P can write g = 0 = g(Sm ) = i αi fi (Sm ). But, by the construction of F , the set of vectors {f1 (Sm ), . . . , fk (Sm )} is orthonormal, hence all the αi must be zero. It follows that g is the zero polynomial, thus f = h ∈ I(V ), which concludes our proof of the first part. The second part of Theorem 5.2 follows immediately from the first part which states that I(V ) is the vanishing ideal of Sm and Lemma 5.3. 6. Related work Finding a set of generators for an ideal is a classical problem in algebraic geometry. A key breakthrough in this field has been the introduction of Gröbner bases by Buchberger (2006). A Gröbner base is a set of generators of an ideal with particular properties that allow operations such as polynomial division with unique remainders. Later Möller and Buchberger also presented a method for finding a Gröbner base for the vanishing ideal I(Sm ) as we do here (Möller and Buchberger, 1982). However, their goal was to find a base that vanishes exactly at these points which is of little practical use. Approximately vanishing ideals have been much less studied. The most relevant work on approximately vanishing ideals is the AVI algorithm (Heldt et al., 2009). AVI constructs an approximate border base for I(Sm ). A border base is also a generator set for I(Sm ), but with different properties from a Gröbner one. Unlike our approach, the AVI algorithm requires a lexical Vanishing Component Analysis ordering on the variables, and its output depends on this ordering . This is clearly undesirable since there is typically no a priori reason to order the variables in any way. Furthermore, AVI adds only monomial terms to the non-vanishing components F . Thus it will only generate compact descriptions of V when V contains few monomials. In contradistinction, VCA expands F using F1 , which can be non-sparse in the variables, and hence the resulting polynomials are non-sparse. Finally, AVI often generates redundant polynomials. For example, say we have two variables (x1 , x2 ) and x1 = 0 for all sample points. AVI will find the following vanishing components x1 , x1 x2 , x1 x22 , . . . , x1 x2m−1 , whereas VCA will find only the first one. In Section 8 we provide an example illustrating that these shortcomings of AVI result in superior generalization performance of VCA. 7. Application to Classification In this section we show how VCA can be used as a feature learning procedure for classification. Consider a classification problem, where the goal is to learn a mapping from Rn to {1, . . . , k}. To construct features for this task, we run VCA on the training sample for each class. The output of VCA is a set of polynomials which generate the vanishing ideal of each class. Let {pl1 (x), . . . , plnl (x)} be a set of generators of the vanishing ideal for class l. Then, for any training example in class l we have that |plj (x)| should be close to zero. In contrast, for points in other classes, we expect at least some of the polynomials not to vanish. If this is indeed the case, we can represent each point x using features of the form |plj (x)|. As we formalize below, if the points of each class belong to different algebraic sets, then in the representation by the aforementioned features, the data becomes linearly separable. Theorem 7.1. Let S = {(xi , yi )}m i=1 , be a set of labeled examples. Let S l be the set of examples with yi = l. Assume there are algebraic sets {U l }kl=1 whose intersection is empty in pairs such that S l ⊆ U l . For each l, let {pl1 (x), . . . , plnl (x)} be a set of generators of the ideal I(U l ). Then, S is linearly separable in the feature space: x 7→ (|p11 (x)|, . . . , |pknk (x)|). (1) Proof. Choose some point x, that belongs to class l, and hence x ∈ U l . Assume that x vanishes on all the polynomials in the set of generators of I(U j ), for some j 6= l. Since U j is defined as the common roots of some set of polynomials {hi } who are all in I(U j ), it follows immediately that x is a common root of the {hi }’s, hence x ∈ U j , but this contradicts the fact that U l ∩ U j = ∅. Hence, for any j, there must be some pji such that |pji (x)| > ǫ. On the other hand, for all i, |pli (x)| < ǫ. Therefore, the linear classifier that puts positive weights on the features corresponding to class l, and puts negative weights on the rest of the features, must separate between class l and the rest of the classes, which concludes our proof. Thus, we use VCA in classification as follows: calculate the polynomials pij above and use the feature map in Equation 1 to generate a new training dataset from S. Train a linear classifier on the new dataset. Since we are doing linear classification we can use ℓ1 regularization which has the added benefit of choosing a small number of features. Taking U l to be the algebraic set S l , theorem 7.1 states that the set S is linearly separable is the new feature space. In fact, it is enough to choose those polynomials that generate algebraic sets U l whose intersection is empty. 8. Experiments The following experiments use VCA as a feature extraction method, and use it in classification as described in Section 7 (we use VCA to refer to the overall classification approach). Linear classification (for VCA and Kernel SVM) is done using the LIBSVM (Chang and Lin, 2011) and LIBLINEAR (Fan et al., 2008) packages. We compared the VCA approach to the popular Kernel SVM classification (KSVM) method with a polynomial kernel (Scholkopf and Smola, 2001), and to AVI feature extraction. In the experiments in Section 8.2 we measured both the accuracy of the learned classifier as well as the test runtime. We evaluate runtime using the number of operations required for prediction. For KSVM this corresponds to the number of support vectors. For VCA based classification, it is the number of operations needed to compute the vanishing components. Hyperparameters for all algorithms were tuned using cross validation. For KSVM the parameters were the polynomial degree, and the regularization coefficient. For VCA, it was the tolerance ǫ and the overall number of components, as well as the regularization coefficient for ℓ1 regularized learning. 8.1. Toy Example We begin with a toy example where the two classes correspond to algebraic manifolds and hence VCA is expected to work well. The first class is shaped as a flattened sphere, corresponding to the polynomial: x21 + 0.01x22 + x23 = 1. The second is a cylinder, corresponding to the polynomial x21 + x23 = 1.4. The co- Vanishing Component Analysis VCA. Results in Table 1 show that while the accuracy of the methods is comparable, the runtime of VCA is up to two orders of magnitude better. This is a result of the compact representation that VCA achieves. 35 VCA AVI SVM 30 25 20 Data Set 15 Pendigits (5996) Letter (12000) USPS (5833) MNist (48000) 10 5 Error Rate Test Runtime VCA KSVM VCA KSVM 0.42 4.8 1.5 2.2 0.42 4.3 1.4 2 2.8e+003 1.1e+003 2.6e+003 4e+03 9.6e+003 7e+004 3.8e+005 3.1e+06 0 0 50 100 150 200 250 300 350 400 Figure 3. Toy dataset: The error percentage is depicted as a function of the sample size (solid lines are test errors, dashed lines are training errors). efficients are needed to make the manifolds separable. We added Gaussian noise with σ = 0.1 to both sets, and embedded them in R10 using a random matrix with correlated columns. This results in two sets that are approximately vanishing on polynomials that are highly non sparse in their monomial representation. We compared VCA to KSVM and to feature extraction using AVI. AVI chooses at each iteration a set of monomials that are non vanishing. These are analogous to the set Fd constructed by VCA. In AVI, the monomials are chosen by a lexicographical order (here we used DegRevLex). These monomials are then used to create new features and to represent the vanishing polynomials. When linear transformations are applied to the data, the representation of the vanishing polynomials in their monomial representation becomes less stable. This, in turn, affects the numerical stability of the algorithm. VCA on the other hand, represents each degree by the principle components of the former degrees, and this adds numerical stability even under linear embeddings. Figure 3 shows test accuracy as a function of the training sample size. Since VCA finds the vanishing polynomial for each class, it can separate the data with only two features, and thus has a much better training curve than KSVM. VCA also outperforms AVI for most data sizes due to the stability reasons mentioned above. 8.2. Real Datasets We next turn to experiments on real data of digit and character recognition tasks (obtained from the LIBSVM website). As mentioned earlier, we compare both the accuracy and the test runtime of the KSVM and Table 1. Error rate in (%) and test runtime (in number of operations) for VCA and KernelSVM (training size in brackets). Results are averaged over 10 random 80%/20% train/test partitions. 9. Discussion Algebraic geometry is a deep and fascinating field in mathematics, which deals with the structure of polynomial equations. Recently, methods in algebraic geometry and commutative algebra have been applied to study various problems in statistics. Specifically the field of “algebraic statistics” is concerned with statistical models that can be described via polynomials. For an introduction on the subject see for example (Gibilisco et al., 2009; Drton et al., 2008; Watanabe, 2009). In the machine learning literature, Király et al. (2012) proposes a method for approximating an arbitrary system of polynomial equations by a system of a particular type. This method was applied to the problem of finding a linear projection that makes a set of random variable identically distributed. However, to date, algebraic geometry has seen surprisingly few applications to the problem of classification and generative models. One reason is that many algebraic results and methods address noise free scenarios (e.g., what is the polynomial that exactly vanishes on a set of points). While there has been some recent interest in approximation approaches, these have not been applied to learning. Here we present an approach that is motivated by the notion of a polynomial ideal as a mechanism for compact description of a set of points. We show how to find a compact description of such an ideal, via a method that is also computationally stable. We believe such approaches have considerable potentials across machine learning, and could benefit from the deep known results in algebraic geometry, facilitating a synergistic interaction between the fields. Acknowledgments: This research is supported by the HP Labs Innovation Research Program. Vanishing Component Analysis References S. Boyd and L. Vandenberghe. Convex optimization. Cambridge university press, 2004. Bruno Buchberger. Bruno Buchberger’s PhD thesis 1965: An algorithm for finding the basis elements of the residue class ring of a zero dimensional polynomial ideal. Journal of Symbolic Computation, 41 (3-4):475 – 511, 2006. Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2: 27:1–27:27, 2011. Software available at http://www. csie.ntu.edu.tw/~cjlin/libsvm. D.A. Cox, J. Little, and D. O’Shea. Ideals, varieties, and algorithms: an introduction to computational algebraic geometry and commutative algebra, volume 10. Springer, 2007. M. Drton, B. Sturmfels, and S. Sullivant. Lectures on algebraic statistics. Birkhauser, 2008. Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, XiangRui Wang, and Chih-Jen Lin. Liblinear: A library for large linear classification. J. Mach. Learn. Res., 9:1871–1874, June 2008. ISSN 15324435. URL http://dl.acm.org/citation.cfm? id=1390681.1442794. P. Gibilisco, E. Riccomagno, M.P. Rogantin, and H.P. Wynn. Algebraic and geometric methods in statistics. Cambridge University Press, 2009. Daniel Heldt, Martin Kreuzer, Sebastian Pokutta, and Hennie Poulisse. Approximate computation of zerodimensional polynomial ideals. J. Symb. Comput., 44(11):1566–1591, 2009. F.J. Király, P. Buenau, J.S. Müller, D.A.J. Blythe, F. Meinecke, and K.R. Müller. Regression for sets of polynomial equations. JMLR Workshop and Conference Proceedings, 22:628–637, 2012. H Möller and B Buchberger. The construction of multivariate polynomials with preassigned zeros. Computer Algebra, pages 24–31, 1982. B. Schölkopf, A. Smola, and K.R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural computation, 10(5):1299–1319, 1998. Bernhard Scholkopf and Alexander J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, USA, 2001. ISBN 0262194759. Sumio Watanabe. Algebraic Geometry and Statistical Learning Theory. Cambridge Monographs on Applied and Computational Mathematics. Cambridge University Press, United Kingdom, 2009.