Vanishing Component Analysis
Roi Livni∗
roi.livni@mail.huji.ac.il
Interdisciplinary Center for Neural Computation, Edmond and Lily Safra Center for Brain Sciences, The Hebrew
University of Jerusalem Givat Ram, Jerusalem 91904, Israel
David Lehavi∗ , Sagi Schein, Hila Nachlieli
david.lehavi, sagi.schein, hila.nachlieli@hp.com
Hewlett-Packard Laboratories Israel Ltd. Technion City Haifa 32000 Israel
Shai Shalev-Shwartz, Amir Globerson
shais, gamir@cs.huji.ac.il
Rachel and Selim Benin School of Computer Science and Engineering, The Hebrew University of Jerusalem Givat
Ram, Jerusalem 91904, Israel

∗ These authors contributed equally.
Abstract
The vanishing ideal of a set of points, S ⊂
Rn , is the set of all polynomials that attain
the value of zero on all the points in S. Such
ideals can be compactly represented using a
small set of polynomials known as generators
of the ideal. Here we describe and analyze
an efficient procedure that constructs a set
of generators of a vanishing ideal. Our procedure is numerically stable, and can be used
to find approximately vanishing polynomials.
The resulting polynomials capture nonlinear
structure in data, and can for example be
used within supervised learning. Empirical comparison with kernel methods shows that our method constructs more compact classifiers with comparable accuracy.
1. Introduction
Classification algorithms can only be as good as the
features they work with. For example, in linear classification, high accuracy will only be obtained if our
features are such that the classes are linearly separable. The problem of feature extraction has thus traditionally attracted considerable interest in the machine
learning literature.
One conceptually simple approach to describing a set
of points S is to find a set of equations that each data
point x ∈ S should (approximately) satisfy. In other
words, we seek a set of functions f1 (x), . . . , fk (x) such
that fi (x) ≈ 0 for all i and x ∈ S. If the set of points
S belongs to a particular class (say the digit eight)
then these functions may provide a succinct characterization of elements in the class. We can thus extract
such functions for all classes and use them as features
in classification.
Clearly the complexity of the functions fi (x) should
be restricted so that the description is both compact
and interpretable. A natural class which we focus on
here is degree bounded polynomials. The goal of our
work is to find a small set of such polynomials for a
given set of points.
The set of all polynomials f (x) that attain a value of
zero on a set S is known as the vanishing ideal of S and
is denoted by I(S) (i.e., f ∈ I(S) iff f (x) = 0 ∀x ∈ S).
This set would qualify as a description of S. However,
it contains an infinite number of polynomials and we
would rather represent it with finitely many polynomials if possible. The first key observation is that if
f (x) ∈ I(S) and h(x) is any polynomial, then hf is
also in I(S). Thus, it is natural to ask whether there
are f_1(x), ..., f_k(x) such that any g ∈ I(S) can be represented as g(x) = ∑_i f_i(x) h_i(x) for some polynomials h_i(x). Such a set of f_i(x) is known as a set of
generators of the ideal I(S). Luckily, Hilbert’s basis
theorem (Cox et al., 2007) tells us that a finite set of
generators exists for any ideal. A finite set of generators is an attractive mechanism for describing I(S)
since all elements in I(S) can be derived from the set
of generators. Thus, we turn our attention to finding
such a finite set of generators, whose elements we call
Vanishing Components.
Current machine learning approaches do not offer a solution to the above problem. One seemingly relevant
approach is kernel PCA (Schölkopf et al., 1998). However, as we argue later, kernel PCA cannot be used
to find vanishing components, since the kernel trick
is inapplicable in this case. Linear PCA can be used
to find vanishing components only in the case where
these are linear, which is not expected to hold generally. The work that is closest in spirit to ours is
the approximately vanishing ideal (AVI) algorithm in
(Heldt et al., 2009). AVI requires an additional lexical order on the original features, and is also geared
towards functions with a small number of monomials,
which is often not the case.
Our vanishing component analysis (VCA) algorithm
takes as input a set S and outputs a set of polynomials V = {f1 (x), . . . , fk (x)}. VCA has the following
attractive properties:
• The set V generates I(S).
• The algorithm is polynomial in |S| and in the original dimension n. Furthermore, fi (x) can be evaluated in time polynomial in |S| and n.
• The algorithm does not depend on any lexical ordering of the variables.
Thus, we achieve the goal of efficiently finding a generator set of I(S) and obtain a compact description
of S. In practice, due to noisy data, we might wish to find a set of polynomials that only approximately vanish on the set. To address this, our algorithm depends on a tolerance parameter ε, and we search for polynomials that approximately vanish on the set S. In Section 7 we illustrate its use as a
feature learning procedure for classification.
2. Preliminaries and Problem Setup
We begin with basic definitions.
Definition 1 (Monomials). A function f : R^n → R is called a monomial if it is of the form f(x) = ∏_{i=1}^n x_i^{α_i}, where each α_i is a non-negative integer. We also use x^α to denote ∏_{i=1}^n x_i^{α_i}. The degree of the monomial is ‖α‖_1 = ∑_{i=1}^n α_i. We denote the set of monomials over n variables by T^n, and the set of monomials of total degree up to d by T_d^n.
Definition 2 (Polynomials). A function f : R^n → R is called a polynomial if it is a weighted sum of monomials. That is, f(x) = ∑_j β_j x^{α^{(j)}}, where each β_j ∈ R and each α_i^{(j)} is a non-negative integer. The degree of f is the maximal degree of its monomials.
Definition 3 (The polynomial ring). The polynomial
ring in n variables over R, denoted by R [x1 , . . . , xn ],
is the set of all polynomials in n variables over the
reals of finite degree. The addition and multiplication
operators over the ring are equivalent to addition and
multiplication of functions. That is, if h = f + g then
for all x, h(x) = f (x) + g(x) and if h = f g then for
all x, h(x) = f (x)g(x).
Definition 4 (Ideal). A set of polynomials I is an
ideal if it is a sub-group with respect to addition (meaning that it is closed under addition and negation and
contains the zero polynomial) and it “absorbs multiplication”, meaning that for any f ∈ I, g ∈ R [x1 , . . . , xn ]
we have f g ∈ I.
Definition 5 (Set of Generators). Given an ideal I, a set of polynomials {f_1, ..., f_k} ⊆ I is said to generate I if for every f ∈ I there exist g_1, ..., g_k ∈ R[x_1, ..., x_n] such that f = ∑_i g_i f_i. In this case we denote the ideal by I(F), where F = {f_1, ..., f_k}. Note that this should not be confused with I(S).
Definition 6 (Vanishing Ideal). Given a set S ⊂ Rn ,
the vanishing ideal of S is the set of polynomials that
vanish on S. We denote it by I(S) (it’s easy to see
that it’s an ideal). That is, for all x ∈ S and f ∈ I(S)
we have f (x) = 0.
Definition 7 (Algebraic Set). A set V ⊂ R^n is called an algebraic set if there is a finite set of polynomials {p_i}_{i=1}^k such that V is the set of common roots of {p_i}_{i=1}^k.
Our problem setup is as follows. We are given a sample S_m of points {x^{(i)}}_{i=1}^m, where x^{(i)} ∈ R^n. Our goal is to find a set of generators of I(S_m). As mentioned earlier, this is desirable since it succinctly captures all polynomials that vanish on S_m.
To make the algorithm practical, we seek a method
with a polynomial running time in m and n. Additionally, since real world data is noisy, we allow some tolerance by looking for polynomials that “almost” vanish
on Sm .
In Section 4 we describe our Vanishing Component Analysis (VCA) algorithm. Before presenting VCA, we first discuss an approach for finding a set of generators that is
simple to understand but is exponential in the sample
size m. We also show that the kernel trick cannot be
used to overcome this difficulty.
3. A Simple but Impractical Approach
One approach to finding generators for I(S_m) is to use a linear-algebraic method. Assume for simplicity that we know there is a set of generators of I(S_m) of maximal degree D. Now consider the set of monomials T_D^n and construct a matrix A of size m × |T_D^n| as follows: A_{ij} = t_j(x^{(i)}), where t_j(x) is the j-th monomial in T_D^n.
Note that the number of columns in A is exponential
in D. We now claim that the null space of A can be
used to obtain a set of generators of I(Sm ).
Proposition 3.1. Denote by V a set of vectors v_1, ..., v_k which are a basis of the null space of A. Namely, for all i = 1, ..., k we have A v_i = 0, and any vector v such that A v = 0 can be written as a linear combination of the v_i. Then the polynomials f_i(x) = ∑_j v_{ij} t_j(x) form a set of generators of I(S_m).
Proof. Clearly f_i(x^{(j)}) = 0 for all sample points x^{(j)}. Thus f_i ∈ I(S_m). Next, we show that any polynomial in a set of generators of I(S_m) of maximal degree D can be obtained as a linear combination of the f_i. Consider such a polynomial g(x). By our assumption it is of degree at most D, and hence can be written as a linear combination of the monomials in T_D^n. Denote the vector of the corresponding coefficients by z ∈ R^{|T_D^n|}. Then A z = 0, hence z is in the null space of A and is spanned by the v_i. Thus the polynomial g(x) can be written as a linear combination of the f_i(x). It follows that any polynomial in a set of generators of I(S_m) can be (linearly) generated by the f_i(x), and we conclude that the f_i(x) are generators of I(S_m).
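To make the construction concrete, the following NumPy sketch (our own illustration, not part of the original paper; the function names are ours) builds the matrix A of all monomials of total degree at most D evaluated on the sample and reads generator coefficients off its (approximate) null space. The number of columns, |T_D^n|, is what makes this approach impractical, as discussed next.

```python
# A minimal NumPy sketch of the construction in Proposition 3.1.
import itertools
import numpy as np


def monomial_exponents(n, D):
    """All exponent vectors alpha with ||alpha||_1 <= D, i.e. the set T_D^n."""
    return [a for a in itertools.product(range(D + 1), repeat=n) if sum(a) <= D]


def vanishing_generators_naive(S, D, tol=1e-8):
    """Coefficient vectors (over T_D^n) of polynomials that vanish on S."""
    exps = monomial_exponents(S.shape[1], D)
    # A_ij = t_j(x^(i)): each column is one monomial evaluated on all sample points.
    A = np.column_stack([np.prod(S ** np.array(a), axis=1) for a in exps])
    _, s, Vh = np.linalg.svd(A, full_matrices=True)
    sing = np.concatenate([s, np.zeros(A.shape[1] - len(s))])
    return exps, Vh[sing <= tol]        # rows span the (approximate) null space of A


# Example: points on the line x2 = 2*x1; a degree-1 vanishing polynomial exists.
S = np.array([[0.0, 0.0], [1.0, 2.0], [2.0, 4.0]])
exps, gens = vanishing_generators_naive(S, D=2)
print(len(exps), gens.shape[0])         # |T_2^2| = 6 monomials, several generators
```

Each returned row v corresponds to the polynomial f(x) = ∑_j v_j t_j(x), as in Proposition 3.1.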
The above procedure achieves the goal of finding a set of generators of I(S_m). However, it does so at a cost exponential in D, which is impractical even for modest values of D. Furthermore, the value of D, the maximum degree of the generator set, may be as large as O(m).¹ Thus the cost of the above algorithm can be exponential in m. Next, we show why the standard use of the kernel trick cannot be used to overcome this difficulty.

¹ As an example where D = m, consider the following. Let p be a polynomial of degree m in one variable, and let {r_1, ..., r_m} be its real roots. Choose some random unit vector v in R^n and let S_m = {r_1 v, r_2 v, ..., r_m v}. Then I(S_m) is the ideal generated by the polynomials {p(v^T x), u_1^T x, ..., u_{n−1}^T x}, where v ⊥ u_1, ..., u_{n−1}. There is no alternative set of generators of I(S_m) that does not contain a polynomial of total degree at least m.
3.1. Kernels Can’t Help!
The kernel trick is an elegant method for avoiding
working in high dimensional feature spaces explicitly.
For example, it can be used to perform nonlinear PCA and to find nonlinear polynomial separators using kernel SVM (e.g., see Scholkopf and Smola, 2001). It may seem that the kernel trick can be used to find vanishing components without explicitly calculating all monomials in T_d^n. However, perhaps surprisingly, this is not
the case, as we argue next.
We begin by recalling the kernel trick idea for the
polynomial case, as used for kernel PCA (KPCA).
The goal in KPCA is to perform PCA on the (exponentially large) vector of monomials in x of degree up to d, and to find the projections on the principal components. To do this efficiently, one considers the kernel function k(x, y) = (1 + x·y)^d. It can then be shown that if the principal components correspond to non-zero eigenvalues (as they always do), the projection on the j-th principal component is given by a polynomial of the form ∑_i α_{ij} k(x, x^{(i)}), where the vectors α_j = (α_{1j}, ..., α_{mj}) are eigenvectors of the kernel matrix. Following a similar rationale, one might posit that the vanishing polynomials are also of this form. However, as the following result shows, the vanishing polynomials cannot be expressed in this fashion.
Theorem 3.2. Let k be a reproducing kernel and let f ∈ span({k(·, x^{(i)})}) be such that f vanishes on all the x^{(i)}. Then f is the zero function.

Proof. Given x, define the (m+1) × (m+1) block matrix K̃ := [ c  v^T ; v  K ], where K_{i,j} := k(x^{(i)}, x^{(j)}), c := k(x, x) and v_i := k(x^{(i)}, x). Under this notation, we need to prove that for every α in the null space of K we have α^T v = 0.

The reproducing property ensures that K̃ is a positive semidefinite matrix. The Schur complement of c in K̃ is defined as A = K − (1/c) v v^T. It is known that if K̃ is positive semidefinite then c = 0 implies v = 0, and if c > 0 then A must be positive semidefinite (Boyd and Vandenberghe, 2004). But for any α in the null space of K we have |α^T v|^2 = −c α^T A α ≤ 0, and thus α^T v = 0.
The above theorem says that we cannot use the kernel
trick (i.e., use the kernel matrix instead of the explicit
monomial vector) to find the vanishing components.
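As a quick numerical sanity check of Theorem 3.2 (our own illustration; the particular kernel, sample size, and thresholds below are arbitrary choices, not from the paper), one can take coefficient vectors α in the null space of the kernel matrix K and verify that the corresponding kernel expansion is numerically zero at fresh points as well, so it carries no information about the vanishing structure:

```python
# Numerical illustration of Theorem 3.2 (assumed setup, not from the paper).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))                    # sample points x^(i) in R^2
kernel = lambda a, b: (1.0 + a @ b) ** 2        # degree-2 polynomial kernel

K = np.array([[kernel(a, b) for b in X] for a in X])
_, s, Vh = np.linalg.svd(K)
alphas = Vh[s <= 1e-8 * s.max()]                # basis of the null space of K

x_new = rng.normal(size=2)                      # an arbitrary new point
v = np.array([kernel(xi, x_new) for xi in X])
# f(x) = sum_i alpha_i k(x, x^(i)) vanishes on the sample (K alpha = 0),
# and, as the theorem predicts, it is (numerically) zero at x_new too.
print([abs(alpha @ v) for alpha in alphas])
```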
4. The VCA algorithm
Recall that our goal is to find a set of generators for
I(Sm ). Since we are dealing with noisy data, it is unreasonable to seek generators that exactly vanish on
Sm , and in our VCA procedure we use a tolerance parameter to allow generators to approximately vanish.
In what follows, we give a step by step description of
the algorithm and its rationale. The procedure itself
is described in Figures 1 and 2 and its properties are
analyzed in Section 5.
We can think of each polynomial f both in the usual sense, i.e., as a function from R^n to R, and also as a vector in R^m containing the evaluations of f on the sample S_m, namely f(S_m) = (f(x^{(1)}), ..., f(x^{(m)})) ∈ R^m. A polynomial f vanishes on S_m if and only if f(S_m) = 0.
To motivate the construction, let us first recall the case of the simplest polynomials: linear functions. Suppose we would like to find a set V of linear functions such that for each f ∈ V and x^{(i)} ∈ S_m we have f(x^{(i)}) = 0. Each linear function is described by a vector β ∈ R^{n+1}, such that f(x) = β_0 + ∑_{j=1}^n β_j x_j. We can rewrite the linear function as a combination of base polynomials. Indeed, let f_0 be the constant polynomial, f_0(x) = 1/√m for all x. Let C_1 = {f_1, ..., f_n} be a set of polynomials, where for all i, f_i(x) = x_i. Now, we can rewrite any linear function as a linear combination of polynomials from C_1 ∪ {f_0}. That is, each linear function is of the form

f(x) = β_0 + ∑_{i=1}^n β_i x_i = ∑_{i=0}^n β_i f_i(x).

It follows that for any such polynomial we have f(S_m) = ∑_{i=0}^n β_i f_i(S_m). Therefore, a linear function vanishes on S_m if and only if f(S_m) = 0 ∈ R^m. This amounts to requiring that the vector β be in the null space of the m × (n + 1) matrix A_1 = [f_0(S_m), ..., f_n(S_m)].
To find the null space of A_1, we can follow the Gram-Schmidt procedure. As we show later, using the Singular Value Decomposition is preferable, since it provides us with a stable method for finding an approximate null space. However, for the sake of clarity, let us first describe the Gram-Schmidt approach.
We maintain two sets: V for the vanishing polynomials
and F for the non-vanishing polynomials. We use the
notation F (Sm ) = {f (Sm ) : f ∈ F } ⊂ Rm to denote
the vectors in Rm corresponding to evaluations of nonvanishing polynomials in F on Sm . We will construct
F such that F (Sm ) is a set of orthonormal vectors in
Rm .
Since f_0 is clearly non-vanishing, we can initialize F = {f_0} and V = ∅. Now, at round t, consider the remainder of f_t(S_m) after projecting it on the orthonormal set F(S_m). That is,

r_t(S_m) = f_t(S_m) − ∑_{f∈F} ⟨f_t(S_m), f(S_m)⟩ f(S_m).

Note that r_t(S_m) is the evaluation on S_m of the polynomial r_t(x) = f_t(x) − ∑_{f∈F} ⟨f_t(S_m), f(S_m)⟩ f(x). Now, if r_t(S_m) is the zero vector, then r_t vanishes on S_m, so we update V ← V ∪ {r_t}. Otherwise, we update F ← F ∪ {r_t/‖r_t(S_m)‖}, where the normalization ensures that all the vectors in F(S_m) are of unit norm. At the end of this process, F contains a set of linear polynomials which are non-vanishing on S_m and V contains a set of linear polynomials that vanish on S_m. Furthermore, F(S_m) is an orthonormal basis of the range of A_1. Let us call F_1 and V_1 the values of F and V after dealing with polynomials of degree 1.
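The degree-1 step just described can be summarized in a few lines of NumPy (a sketch under our own naming, not the paper's code): Gram-Schmidt is run over the columns f_0(S_m), x_1(S_m), ..., x_n(S_m); columns whose remainder is (nearly) zero correspond to vanishing linear polynomials.

```python
# A sketch of the degree-1 step (our own code; names are not from the paper).
import numpy as np


def degree_one_step(S, eps=1e-8):
    m, n = S.shape
    # Candidate columns: the constant polynomial f0 and the coordinates x_1..x_n.
    cols = [np.full(m, 1.0 / np.sqrt(m))] + [S[:, i] for i in range(n)]
    basis, F1_idx, V1_idx = [], [], []
    for j, c in enumerate(cols):
        r = c - sum(np.dot(c, b) * b for b in basis)   # remainder after projection
        if np.linalg.norm(r) <= eps:
            V1_idx.append(j)                           # (approximately) vanishing
        else:
            basis.append(r / np.linalg.norm(r))        # extend the orthonormal set
            F1_idx.append(j)
    return F1_idx, V1_idx


# Points on the hyperplane x1 + x2 = 1: one linear relation vanishes on the sample.
S = np.array([[0.0, 1.0], [0.3, 0.7], [1.0, 0.0], [0.5, 0.5]])
print(degree_one_step(S))     # ([0, 1], [2]): the x2 column is linearly redundant
```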
Next, consider polynomials of degree 2. Consider the set of polynomials C_2 = {f_{i,j}}_{i,j=1}^n, where for all i, j, f_{i,j}(x) = x_i x_j. Each polynomial of degree 2 takes the form f(x) = ∑_{i=0}^n β_i f_i(x) + ∑_{i,j=1}^n β_{i,j} f_{i,j}(x).
As before, we can find vanishing second-order polynomials via the null space of the matrix A_2 = [A_1, f_{1,1}(S_m), ..., f_{n,n}(S_m)]. To find the null space of A_2, we could simply continue the Gram-Schmidt procedure we have already performed for the columns of A_1. However, we now need to consider n^2 additional columns, and as the degree goes up the number of columns increases exponentially. To overcome this obstacle, we rely on the underlying structure of the vanishing ideal, and in particular on the fact that it absorbs multiplication (Definition 4).
Take some f of degree 2. Then f = ∑_i g_i h_i, where the g_i, h_i are all of degree at most 1. Without loss of generality, assume that for i ≤ i_1 both g_i and h_i are non-vanishing on S_m, and that for i > i_1 either g_i or h_i vanishes. It follows that for all i > i_1 the polynomial g_i h_i vanishes. Therefore, the polynomial f̂ = ∑_{i≤i_1} g_i h_i satisfies f̂(S_m) = f(S_m). In other words, f vanishes on S_m if and only if f̂ vanishes on S_m. Since, by our construction, f − f̂ is generated by V, it suffices to deal with f̂(S_m). Using our construction of F_1, we know that for all i ≤ i_1, both g_i and h_i are in the span of F_1. Denoting F_1 = {p_1, ..., p_k}, we can write

f̂ = ∑_{i≤i_1} (∑_j α_{ji} p_j)(∑_s β_{si} p_s) = ∑_{j,s} p_j p_s ∑_{i≤i_1} α_{ji} β_{si}.

It follows that f̂ is in the span of F_1 × F_1, and thus to construct F_2 and V_2 it suffices to find the null space and range of the set of candidate polynomials in F_1 × F_1. Formally, let us redefine C_2 to be the set
C_2 = {f = gh : g, h ∈ F_1},

and let C_2(S_m) = {f(S_m) : f ∈ C_2} ⊂ R^m be the corresponding set of vectors obtained by evaluating all polynomials in C_2 on S_m. We will construct F_2 and V_2 by continuing the Gram-Schmidt process on the candidate vectors in C_2(S_m), as follows. Choose a polynomial f ∈ C_2 and let f(S_m) be the corresponding vector in C_2(S_m). The remainder of f after projecting it on the current set F is the polynomial r defined by r(x) = f(x) − ∑_{g∈F} ⟨f(S_m), g(S_m)⟩ g(x).
VCA
parameters: tolerance ε; a FindRangeNull procedure
initialize:
  F = {f(·) = 1/√m}, V = ∅
  C_1 = {f_1, ..., f_n} where f_i(x) = x_i
  (F_1, V_1) = FindRangeNull(F, C_1, S_m, ε)
  F = F ∪ F_1, V = V ∪ V_1
for t = 2, 3, ...
  C_t = {gh : g ∈ F_{t−1}, h ∈ F_1}
  if C_t = ∅, break
  (F_t, V_t) = FindRangeNull(F, C_t, S_m, ε)
  F = F ∪ F_t, V = V ∪ V_t
output: F, V

Figure 1. Our Vanishing Component Analysis (VCA) algorithm for finding a generator set for I(S_m).
If r(S_m) = 0, we add r to V_2 and to V. Otherwise, we add r/‖r(S_m)‖ to F_2 and to F.
This process continues to higher degrees. At iteration
t, we construct the set of candidate polynomials to be
Ft−1 ×F1 . Then, we construct Ft and Vt (and update F
and V accordingly) by performing the Gram-Schmidt
procedure on this set of candidate polynomials.
The VCA procedure is described in Figures 1 and 2. We make one crucial modification to the above description. When constructing the new polynomials in F_t and V_t out of the candidates in C_t, we use the Singular Value Decomposition (SVD) to find an approximate null space instead of performing the Gram-Schmidt procedure. This is obtained by calling a sub-procedure (F_t, V_t) = FindRangeNull(F, C_t, S_m, ε), which receives the current set F, a list of candidates C_t, the set of examples S_m, and a tolerance parameter ε. When ε = 0, the procedure can be implemented by initializing a Gram-Schmidt procedure with the orthonormal basis F(S_m) and then performing Gram-Schmidt orthonormalization on the candidate vectors f(S_m) with f ∈ C_t. When ε > 0, we rely on the SVD procedure, as described below. As we will formally prove in the next section, if one sets ε = 0, then VCA is guaranteed to construct a set of generators of the vanishing ideal.
We next describe how to implement the FindRangeNull procedure. It is known that finding an orthonormal basis using the Gram-Schmidt method can be numerically unstable, and a preferred approach is to use the Singular Value Decomposition (SVD).
FindRangeNull (using SVD)
input: F, C, S_m, ε
denote k = |C| and C = {f_1, ..., f_k}
for i = 1, ..., k
  let f̃_i = f_i − ∑_{g∈F} ⟨f_i(S_m), g(S_m)⟩ g
let A = [f̃_1(S_m), ..., f̃_k(S_m)] ∈ R^{m×k}
decompose A = L D U^⊤ using SVD
for i = 1, ..., k
  let g_i = ∑_{j=1}^k U_{j,i} f̃_j
output:
  F_1 = {g_i / ‖g_i(S_m)‖ : D_{i,i} > ε}
  V_1 = {g_i : D_{i,i} ≤ ε}

Figure 2. The implementation of the FindRangeNull function in Figure 1.
First, given a candidate set C = {f_1, ..., f_k} and a current set of polynomials F, we calculate f̃_1, ..., f̃_k, the remainders of the polynomials in C after projecting them on F. Next, we use SVD to find the (approximate) range and null space of the matrix A = [f̃_1(S_m), ..., f̃_k(S_m)] by calculating its right singular vectors. Finally, we define the corresponding polynomials based on the right singular vectors.
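To make Figures 1 and 2 concrete, here is a compact NumPy sketch of the full procedure. It is our own reconstruction under the stated assumptions, not the authors' code: polynomials are represented as callables that map a data matrix to a vector of evaluations, and the optional max_degree cap is an addition that does not appear in Figure 1.

```python
# A NumPy sketch of VCA (Figure 1) with an SVD-based FindRangeNull (Figure 2).
import numpy as np


def find_range_null(F, C, S, eps):
    """Split candidates C into non-vanishing (F_new) and approximately
    vanishing (V_new) polynomials on the sample S, as in Figure 2."""
    if not C:
        return [], []
    F_evals = [g(S) for g in F]                         # orthonormal vectors in R^m
    tildes = []
    for f in C:                                         # remainders after projecting on F
        coeffs = np.array([np.dot(f(S), g_S) for g_S in F_evals])
        def f_tilde(X, f=f, coeffs=coeffs, F_snap=tuple(F)):
            return f(X) - sum(c * g(X) for c, g in zip(coeffs, F_snap))
        tildes.append(f_tilde)
    A = np.column_stack([t(S) for t in tildes])         # the m x k matrix of Figure 2
    _, s, Vh = np.linalg.svd(A, full_matrices=True)     # A = L diag(s) U^T, Vh = U^T
    F_new, V_new = [], []
    for i in range(len(tildes)):
        w = Vh[i, :]                                    # i-th right singular vector
        def g_i(X, w=w, tildes=tuple(tildes)):
            return sum(c * t(X) for c, t in zip(w, tildes))
        sigma = s[i] if i < len(s) else 0.0
        if sigma > eps:
            norm = np.linalg.norm(g_i(S))
            F_new.append(lambda X, g=g_i, nrm=norm: g(X) / nrm)
        else:
            V_new.append(g_i)
    return F_new, V_new


def vca(S, eps=0.1, max_degree=None):
    """VCA (Figure 1): returns non-vanishing polynomials F and approximately
    vanishing polynomials V for the m x n sample matrix S."""
    m, n = S.shape
    F = [lambda X: np.full(X.shape[0], 1.0 / np.sqrt(m))]    # the constant polynomial
    V = []
    C1 = [lambda X, i=i: X[:, i] for i in range(n)]          # degree-1 candidates x_i
    F1, V1 = find_range_null(F, C1, S, eps)
    F, V, F_prev = F + F1, V + V1, F1
    t = 2
    while F_prev and (max_degree is None or t <= max_degree):
        Ct = [lambda X, g=g, h=h: g(X) * h(X) for g in F_prev for h in F1]
        Ft, Vt = find_range_null(F, Ct, S, eps)
        F, V, F_prev = F + Ft, V + Vt, Ft
        t += 1
    return F, V


# Example: points on the unit circle; V should contain a polynomial close to
# a rescaling of x1^2 + x2^2 - 1.
theta = np.linspace(0, 2 * np.pi, 30, endpoint=False)
S = np.column_stack([np.cos(theta), np.sin(theta)])
F, V = vca(S, eps=0.05)
print(len(F), len(V), max(float(np.abs(v(S)).max()) for v in V))
```

With ε = 0 this mirrors the exact procedure described above; with ε > 0 it returns approximately vanishing polynomials, matching the tolerance discussion.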
5. Analysis
In this section we analyze the efficiency and correctness of the VCA procedure. We use the notation F^t = ∪_{i≤t} F_i and V^t = ∪_{i≤t} V_i. Our first theorem addresses the efficiency of the VCA procedure.
Theorem 5.1. If VCA is run with any ε then:
1. VCA stops after at most t ≤ m + 1 iterations.
2. All polynomials in V, F are of degree at most m.
3. |F| ≤ m and |V| ≤ |F|^2 · min{|F|, n}.
4. It is possible to evaluate all the polynomials in F in time O(|F|^2), and all the polynomials in V in time O(|F|^2 + |F|·|V|).
Proof.
1. By the stopping rule of VCA, it terminates at round t whenever F_{t−1} is empty (since this implies C_t = ∅). Therefore, if we did not stop at round t, then |F^t| > t − 1, and therefore we also have |F^t(S_m)| ≥ t − 1. On the other hand, by construction, F^t(S_m) is a set of orthonormal vectors in R^m, hence its size is at most m. It follows that t − 1 ≤ m.
2. It follows immediately that all the polynomials in
V and F are of degree at most m.
3. We already argued that |F | ≤ m. Also, we only
add polynomials to V in the first |F | iterations.
Furthermore, whenever we add a polynomial to
V , it is constructed from a set of candidates, Ct =
Ft−1 × F1 , whose size is at most |F | · min{|F |, n}.
Therefore, |V | ≤ |F |2 · min{|F |, n}.
4. Let us enumerate the polynomials in F =
{g1 , . . . , g|F | } according to the order in which they
were inserted into F . Since g1 is the constant
polynomial, it can be evaluated in time O(1).
Suppose we have evaluated g1 , . . . , gr−1 . By construction, gr can be written as a product of two
polynomials from g1 , . . . , gr−1 minus a linear combination of g1 , . . . , gr−1 . Therefore, it can be evaluated in time O(r). It follows that we can evaluate all the polynomials in F in time O(|F |2 ). A
similar argument shows that once we have evaluated all the polynomials in F, we can evaluate each polynomial in V in time O(|F|).
We next prove that VCA finds a set of generators of I(S_m).
Theorem 5.2. If VCA is run with ε = 0, its output satisfies:
1. V generates I(Sm ).
2. Any polynomial f can be written as f = g + h
where g ∈ span(F ) and h ∈ I(Sm ).
The proof of the theorem relies on the following lemma.
Lemma 5.3. If VCA is run with ε = 0, then for all t, any polynomial f of degree at most t can be written as f = g + h, where g ∈ span(F^t) and h ∈ I(V^t).
Proof. We prove this claim by induction on t. The
base of the induction, i.e. t = 1, follows by standard
results from linear algebra. So, from now on we assume
that t > 1.
By our inductive assumption, for each i < t, any polynomial of degree at most i can be written as f = g + h where g ∈ span(F^i) and h ∈ I(V^i). Now, consider a polynomial f of degree t. Clearly, if each monomial of f can be decomposed into a sum of a polynomial from span(F^t) and a polynomial from I(V^t), then the claim holds for t as well. Furthermore, by our inductive assumption, any monomial of f whose degree is strictly smaller than t can be decomposed as required. Therefore, from now on we can assume that f is a single monomial of degree exactly t, which we can write as f = pq, where p is of degree t − 1 and q is of degree 1.

Using our inductive assumption, we can write p = p_F + p_V, where p_F ∈ span(F^{t−1}) and p_V ∈ I(V^{t−1}). Similarly, q = q_F + q_V, where q_F ∈ span(F^1) and q_V ∈ I(V^1). Let us further rewrite p_F = p_{F^{t−2}} + p_{F_{t−1}}, where p_{F^{t−2}} ∈ span(F^{t−2}) and p_{F_{t−1}} ∈ span(F_{t−1}). Similarly, q_F = q_{F^0} + q_{F_1}, where q_{F^0} ∈ span(F^0) and q_{F_1} ∈ span(F_1). Therefore,

f = pq = (p_{F^{t−2}} + p_{F_{t−1}} + p_V)(q_{F^0} + q_{F_1} + q_V)
       = p_{F_{t−1}} q_{F_1} + [p_V q + p q_V] + [p_{F^{t−2}} q + p q_{F^0}],

where the terms in the first bracket belong to I(V^t) and the terms in the second bracket are of degree < t. Finally, since p_{F_{t−1}} q_{F_1} ∈ span(F_{t−1} × F_1), by our construction it can be written as a_F + a_V, where a_F ∈ span(F_t) and a_V ∈ span(V_t), which concludes our inductive argument.
Proof of Theorem 5.2. Equipped with Lemma 5.3, we
only need to show that V generates the vanishing ideal
of Sm . Take any vanishing polynomial f . By Lemma
5.3, we can rewrite f = g + h where g ∈ span(F )
and h ∈ I(V ). Therefore, 0 = f (Sm ) = g(Sm ) +
h(S_m) = g(S_m). Let us write F = {f_1, ..., f_k}. Since g ∈ span(F) we can write g = ∑_i α_i f_i. Therefore, 0 = g(S_m) = ∑_i α_i f_i(S_m). But, by the construction of F, the set of vectors {f_1(S_m), ..., f_k(S_m)} is orthonormal, hence all the α_i must be zero. It follows that g is the zero polynomial, thus f = h ∈ I(V), which concludes the proof of the first part. The second part of Theorem 5.2 follows immediately from the first part, which states that I(V) is the vanishing ideal of S_m, and from Lemma 5.3.
6. Related work
Finding a set of generators for an ideal is a classical
problem in algebraic geometry. A key breakthrough in
this field has been the introduction of Gröbner bases
by Buchberger (2006). A Gröbner base is a set of generators of an ideal with particular properties that allow
operations such as polynomial division with unique remainders. Later, Möller and Buchberger presented a method for finding a Gröbner base for the vanishing ideal I(S_m) of a finite set of points, as we do here (Möller and Buchberger, 1982). However, their goal was to find a base that vanishes exactly at these points, which is of little practical use.
Approximately vanishing ideals have been much less
studied. The most relevant work on approximately
vanishing ideals is the AVI algorithm (Heldt et al.,
2009). AVI constructs an approximate border base for
I(Sm ). A border base is also a generator set for I(Sm ),
but with different properties from a Gröbner one. Unlike our approach, the AVI algorithm requires a lexical
ordering on the variables, and its output depends on this ordering. This is clearly undesirable since there
is typically no a priori reason to order the variables in
any way. Furthermore, AVI adds only monomial terms
to the non-vanishing components F . Thus it will only
generate compact descriptions of V when V contains
few monomials. In contradistinction, VCA expands
F using F1 , which can be non-sparse in the variables,
and hence the resulting polynomials are non-sparse.
Finally, AVI often generates redundant polynomials.
For example, say we have two variables (x_1, x_2) and x_1 = 0 for all sample points. AVI will find the vanishing components x_1, x_1 x_2, x_1 x_2^2, ..., x_1 x_2^{m−1}, whereas VCA will find only the first one. In Section 8 we provide an example illustrating that these shortcomings of AVI lead to superior generalization performance for VCA.
7. Application to Classification
In this section we show how VCA can be used as a
feature learning procedure for classification. Consider
a classification problem, where the goal is to learn a
mapping from Rn to {1, . . . , k}. To construct features
for this task, we run VCA on the training sample for
each class. The output of VCA is a set of polynomials
which generate the vanishing ideal of each class. Let
{p_1^l(x), ..., p_{n_l}^l(x)} be a set of generators of the vanishing ideal for class l. Then, for any training example in class l, |p_j^l(x)| should be close to zero. In contrast, for points in other classes, we expect at least some of the polynomials not to vanish. If this is indeed the case, we can represent each point x using features of the form |p_j^l(x)|. As we formalize below, if the points of each class belong to different algebraic sets, then in the representation by the aforementioned features the data becomes linearly separable.
Theorem 7.1. Let S = {(x_i, y_i)}_{i=1}^m be a set of labeled examples, and let S^l be the set of examples with y_i = l. Assume there are pairwise disjoint algebraic sets {U^l}_{l=1}^k such that S^l ⊆ U^l. For each l, let {p_1^l(x), ..., p_{n_l}^l(x)} be a set of generators of the ideal I(U^l). Then S is linearly separable in the feature space

x ↦ (|p_1^1(x)|, ..., |p_{n_k}^k(x)|).    (1)
Proof. Choose some point x that belongs to class l, and hence x ∈ U^l. Assume that all the polynomials in the set of generators of I(U^j) vanish on x, for some j ≠ l. Since U^j is defined as the set of common roots of some set of polynomials {h_i}, which are all in I(U^j), it follows immediately that x is a common root of the h_i, hence x ∈ U^j. But this contradicts the fact that U^l ∩ U^j = ∅. Hence, for any j ≠ l, there must be some p_i^j such that |p_i^j(x)| > ε. On the other hand, for all i, |p_i^l(x)| < ε. Therefore, the linear classifier that puts positive weights on the features corresponding to class l and negative weights on the rest of the features separates class l from the rest of the classes, which concludes our proof.
Thus, we use VCA in classification as follows: calculate the polynomials p_i^j above and use the feature map in Equation (1) to generate a new training dataset from S, then train a linear classifier on the new dataset. Since we are doing linear classification, we can use ℓ1 regularization, which has the added benefit of selecting a small number of features. Taking U^l to be the algebraic set S^l, Theorem 7.1 states that the set S is linearly separable in the new feature space. In fact, it is enough to choose polynomials that generate algebraic sets U^l whose pairwise intersections are empty.
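As a sketch of how this pipeline could be wired up (assuming a vca(S, eps) function like the one sketched in Section 4, with scikit-learn's ℓ1-regularized logistic regression standing in for the ℓ1-regularized linear classifier; none of these names come from the paper):

```python
# Hedged sketch of the classification scheme of Section 7; `vca` is assumed
# to be the function sketched in Section 4 and is not defined here.
import numpy as np
from sklearn.linear_model import LogisticRegression


def vca_features(X, class_polys):
    """Map each point to (|p_1^1(x)|, ..., |p_{n_k}^k(x)|), as in Eq. (1)."""
    return np.column_stack([np.abs(p(X)) for polys in class_polys for p in polys])


def train_vca_classifier(X_train, y_train, eps=0.1, C=1.0):
    classes = np.unique(y_train)
    # One VCA run per class: keep its (approximately) vanishing polynomials.
    # eps should be tuned so that each class yields a non-empty set.
    class_polys = [vca(X_train[y_train == c], eps)[1] for c in classes]
    # l1-regularized linear classifier on the new feature representation.
    clf = LogisticRegression(penalty="l1", C=C, solver="liblinear")
    clf.fit(vca_features(X_train, class_polys), y_train)
    return clf, class_polys


# Prediction on new data:
#   y_pred = clf.predict(vca_features(X_test, class_polys))
```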
8. Experiments
The following experiments use VCA as a feature extraction method, and use it in classification as described in Section 7 (we use VCA to refer to the overall classification approach). Linear classification (for
VCA and Kernel SVM) is done using the LIBSVM
(Chang and Lin, 2011) and LIBLINEAR (Fan et al.,
2008) packages.
We compared the VCA approach to the popular Kernel
SVM classification (KSVM) method with a polynomial
kernel (Scholkopf and Smola, 2001), and to AVI feature extraction. In the experiments in Section 8.2 we
measured both the accuracy of the learned classifier as
well as the test runtime. We evaluate runtime using
the number of operations required for prediction. For
KSVM this corresponds to the number of support vectors. For VCA based classification, it is the number
of operations needed to compute the vanishing components. Hyperparameters for all algorithms were tuned
using cross validation. For KSVM, the parameters were the polynomial degree and the regularization coefficient. For VCA, they were the tolerance ε, the overall number of components, and the regularization coefficient for the ℓ1-regularized learning.
8.1. Toy Example
We begin with a toy example where the two classes
correspond to algebraic manifolds and hence VCA is
expected to work well. The first class is shaped as
a flattened sphere, corresponding to the polynomial x_1^2 + 0.01 x_2^2 + x_3^2 = 1. The second is a cylinder, corresponding to the polynomial x_1^2 + x_3^2 = 1.4. The coefficients are needed to make the manifolds separable.

[Figure 3. Toy dataset: the error percentage as a function of the sample size for VCA, AVI and SVM (solid lines are test errors, dashed lines are training errors).]
We added Gaussian noise with σ = 0.1 to both sets, and embedded them in R^10 using a random matrix with correlated columns. This results in two sets whose approximately vanishing polynomials are highly non-sparse in their monomial representation.
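For concreteness, the toy data could be generated roughly as follows (our own reconstruction; the exact sampling scheme, noise placement, and embedding matrix are assumptions not specified in the paper):

```python
# Sketch of a possible generation of the toy dataset (assumptions, not the
# authors' exact protocol).
import numpy as np

rng = np.random.default_rng(0)
m = 200

# Class 1: flattened sphere x1^2 + 0.01*x2^2 + x3^2 = 1 (rescale random points).
Z = rng.normal(size=(m, 3))
scale = np.sqrt(Z[:, 0] ** 2 + 0.01 * Z[:, 1] ** 2 + Z[:, 2] ** 2)
class1 = Z / scale[:, None]

# Class 2: cylinder x1^2 + x3^2 = 1.4, with x2 free.
phi = rng.uniform(0, 2 * np.pi, size=m)
class2 = np.column_stack([np.sqrt(1.4) * np.cos(phi),
                          rng.uniform(-1.0, 1.0, size=m),
                          np.sqrt(1.4) * np.sin(phi)])

# Add Gaussian noise (sigma = 0.1) and embed in R^10 with a random matrix
# whose columns share a common component (hence are correlated).
common = rng.normal(size=(3, 1))
B = 0.6 * rng.normal(size=(3, 10)) + 0.8 * common
X = (np.vstack([class1, class2]) + 0.1 * rng.normal(size=(2 * m, 3))) @ B
y = np.array([0] * m + [1] * m)
```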
We compared VCA to KSVM and to feature extraction using AVI.
AVI chooses at each iteration a set of monomials that are non-vanishing. These are analogous to the set F_d constructed by VCA. In AVI, the monomials are chosen by a lexicographical order (here we used DegRevLex). These monomials are then used to create new features and to represent the vanishing polynomials. When linear transformations are applied to the data, the monomial representation of the vanishing polynomials becomes less stable. This, in turn, affects the numerical stability of the algorithm. VCA, on the other hand, represents each degree by the principal components of the former degrees, which adds numerical stability even under linear embeddings.
Figure 3 shows test accuracy as a function of the training sample size. Since VCA finds the vanishing polynomial for each class, it can separate the data with
only two features, and thus has a much better training curve than KSVM. VCA also outperforms AVI for
most data sizes due to the stability reasons mentioned
above.
8.2. Real Datasets

We next turn to experiments on real digit and character recognition datasets (obtained from the LIBSVM website). As mentioned earlier, we compare both the accuracy and the test runtime of KSVM and VCA. Results in Table 1 show that while the accuracy of the methods is comparable, the runtime of VCA is up to two orders of magnitude better. This is a result of the compact representation that VCA achieves.

Table 1. Error rate (in %) and test runtime (in number of operations) for VCA and kernel SVM (training size in brackets). Results are averaged over 10 random 80%/20% train/test partitions.

Data Set           Error rate (%)       Test runtime (# operations)
                   VCA      KSVM        VCA          KSVM
Pendigits (5996)   0.42     0.42        2.8e+003     9.6e+003
Letter (12000)     4.8      4.3         1.1e+003     7e+004
USPS (5833)        1.5      1.4         2.6e+003     3.8e+005
MNist (48000)      2.2      2           4e+03        3.1e+06
9. Discussion
Algebraic geometry is a deep and fascinating field in
mathematics, which deals with the structure of polynomial equations. Recently, methods in algebraic geometry and commutative algebra have been applied to
study various problems in statistics. Specifically, the field of “algebraic statistics” is concerned with statistical models that can be described via polynomials. For an introduction to the subject see, for example, (Gibilisco et al., 2009; Drton et al., 2008; Watanabe, 2009).
In the machine learning literature, Király et al. (2012) propose a method for approximating an arbitrary system of polynomial equations by a system of a particular type. This method was applied to the problem of finding a linear projection that makes a set of random variables identically distributed.
However, to date, algebraic geometry has seen surprisingly few applications to the problem of classification
and generative models. One reason is that many algebraic results and methods address noise free scenarios
(e.g., what is the polynomial that exactly vanishes on a
set of points). While there has been some recent interest in approximation approaches, these have not been
applied to learning. Here we present an approach that
is motivated by the notion of a polynomial ideal as a
mechanism for compact description of a set of points.
We show how to find a compact description of such
an ideal, via a method that is also computationally
stable. We believe such approaches have considerable potential across machine learning and could benefit from the deep results known in algebraic geometry, facilitating a synergistic interaction between the fields.
Acknowledgments: This research is supported by
the HP Labs Innovation Research Program.
References
S. Boyd and L. Vandenberghe. Convex optimization.
Cambridge university press, 2004.
Bruno Buchberger. Bruno Buchberger’s PhD thesis
1965: An algorithm for finding the basis elements
of the residue class ring of a zero dimensional polynomial ideal. Journal of Symbolic Computation, 41(3-4):475–511, 2006.
Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A
library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:
27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
D.A. Cox, J. Little, and D. O’Shea. Ideals, varieties,
and algorithms: an introduction to computational
algebraic geometry and commutative algebra, volume 10. Springer, 2007.
M. Drton, B. Sturmfels, and S. Sullivant. Lectures on
algebraic statistics. Birkhauser, 2008.
Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874, June 2008. ISSN 1532-4435. URL http://dl.acm.org/citation.cfm?id=1390681.1442794.
P. Gibilisco, E. Riccomagno, M.P. Rogantin, and H.P.
Wynn. Algebraic and geometric methods in statistics. Cambridge University Press, 2009.
Daniel Heldt, Martin Kreuzer, Sebastian Pokutta, and
Hennie Poulisse. Approximate computation of zerodimensional polynomial ideals. J. Symb. Comput.,
44(11):1566–1591, 2009.
F.J. Király, P. Buenau, J.S. Müller, D.A.J. Blythe,
F. Meinecke, and K.R. Müller. Regression for sets
of polynomial equations. JMLR Workshop and Conference Proceedings, 22:628–637, 2012.
H. Möller and B. Buchberger. The construction of multivariate polynomials with preassigned zeros. Computer Algebra, pages 24–31, 1982.
B. Schölkopf, A. Smola, and K.R. Müller. Nonlinear
component analysis as a kernel eigenvalue problem.
Neural computation, 10(5):1299–1319, 1998.
Bernhard Scholkopf and Alexander J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press,
Cambridge, MA, USA, 2001. ISBN 0262194759.
Sumio Watanabe. Algebraic Geometry and Statistical
Learning Theory. Cambridge Monographs on Applied and Computational Mathematics. Cambridge
University Press, United Kingdom, 2009.