Shape Quantization and Recognition With Randomized Trees
Yali Amit
Department of Statistics, University of Chicago, Chicago, IL, 60637, U.S.A.
Donald Geman
Department of Mathematics and Statistics, University of Massachusetts,
Amherst, MA 01003, U.S.A.
1 Introduction
features of the image data that are constructed from local topographic codes
(tags). A large sample of small subimages of fixed size is recursively
partitioned based on individual pixel values. The tags are simply labels
for the cells of each successive partition, and each pixel in the image is
assigned all the labels of the subimage centered there. As a result, the tags do
not involve detecting distinguished points along curves, special topological
structures, or any other complex attributes whose very definition can be
problematic due to locally ambiguous data. In fact, the tags are too primitive
and numerous to classify the shapes.
Although the mere existence of a tag conveys very little information, one
can begin discriminating among shape classes by investigating just a few
spatial relationships among the tags, for example, asking whether there is a
tag of one type north of a tag of another type. Relationships are specified
by coarse constraints on the angles of the vectors connecting pairs of tags
and on the relative distances among triples of tags. No absolute location or
scale constraints are involved. An image may contain one or more instances
of an arrangement, with significant variations in location, distances, angles,
and so forth. There is one binary feature (query) for each such spatial
arrangement; the response is positive if a collection of tags consistent with
the associated constraints is present anywhere in the image. Hence a query
involves an extensive disjunction (ORing) operation.
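As an illustration, such a disjunctive query can be sketched in a few lines. The tag encoding as (type, column, row) triples and the "north" convention are assumptions made for the example, not the paper's data structures:

```python
from itertools import product

def query_north(tags, type_a, type_b):
    """Binary query: is there any tag of type_a lying north of
    (i.e., strictly above, smaller row index) any tag of type_b?
    The response ORs over every pair of tag instances in the image."""
    a_sites = [(x, y) for t, x, y in tags if t == type_a]
    b_sites = [(x, y) for t, x, y in tags if t == type_b]
    # Positive if at least one consistent pair exists anywhere in the image.
    return any(ya < yb for (xa, ya), (xb, yb) in product(a_sites, b_sites))

tags = [(3, 10, 2), (7, 11, 9)]   # (tag type, column, row)
print(query_north(tags, 3, 7))    # a type-3 tag occurs above a type-7 tag -> True
```

The extensive ORing is visible in the `any(...)` over all pairs: the query is positive no matter where in the image the consistent pair occurs.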
Two images that answer the same to every query must have very similar
shapes. In fact, it is reasonable to assume that the shape class is determined
by the full feature set; that is, the theoretical Bayes error rate is zero. But no
classifier based on the full feature set can be evaluated, and it is impossible
to determine a priori which arrangements are informative. Our approach is
to select informative features and build tree classifiers (Breiman, Friedman,
Olshen, & Stone, 1984; Casey & Nagy, 1984; Quinlan, 1986) at the same time
by inductive learning. In effect, each tree provides an approximation to
the full posterior where the features chosen depend on the branch that is
traversed.
There is a natural partial ordering on the queries that results from re-
garding each tag arrangement as a labeled graph, with vertex labels corre-
sponding to the tag types and edge labels to angle and distance constraints
(see Figures 6 and 7). In this way, the features are ordered according to
increasing structure and complexity. A related attribute is semi-invariance,
which means that a large fraction of those images of a given class that an-
swer the same way to a given query will also answer the same way to any
query immediately succeeding it in the ordering. This leads to nearly invari-
ant classification with respect to many of the transformations that preserve
shape, such as scaling, translation, skew and small, nonlinear deformations
of the type shown in Figure 2.
Due to the partial ordering, tree construction with an infinite-dimensional
feature set is computationally efficient. During training, multiple trees
(Breiman, 1994; Dietterich & Bakiri, 1995; Shlien, 1990) are grown, and a
Figure 2: (Top) Perturbed LaTeX symbols. (Bottom) Training data for one symbol.
New data can be incorporated without regrowing the trees, simply by updating a counter in each tree for each new data
point.
The separation between tree making and parameter estimation, and the
possibility of using different training samples for each phase, opens the
way to selecting the queries based on either unlabeled samples (i.e., unsu-
pervised learning) or samples from only some of the shape classes. Both of
these perform surprisingly well compared with ordinary supervised learn-
ing.
Our recognition strategy differs from those based on true invariants (al-
gebraic, differential, etc.) or structural features (holes, endings, etc.). These
methods certainly introduce prior knowledge about shape and structure,
and we share that emphasis. However, invariant features usually require
image normalization or boundary extraction, or both, and are generally
sensitive to shape distortion and image degradation. Similarly, structural
features can be difficult to express as well-defined functions of the image
(as opposed to model) data. In contrast, our queries are stable and prim-
itive, precisely because they are not truly invariant and are not based on
distinguished points or substructures.
A popular approach to multiclass learning problems in pattern recog-
nition is based on ANNs, such as feedforward, multilayer perceptrons
(Dietterich & Bakiri, 1995; Fukushima & Miyake, 1982; Knerr, Personnaz,
& Dreyfus, 1992; Martin & Pitman, 1991). For example, the best rates on
handwritten digits are reported in LeCun et al. (1990). Classification trees
and neural networks certainly have aspects in common; for example, both
rely on training data, are fast online, and require little storage (see Brown,
Corruble, & Pittard, 1993; Gelfand & Delp, 1991). However, our approach to
invariance and generalization is, by comparison, more direct in that certain
properties are acquired by hardwiring rather than depending on learning
or image normalization. With ANNs, the emphasis is on parallel and local
processing and a limited degree of disjunction, in large part due to assump-
tions regarding the operation of the visual system. However, only a limited
degree of invariance can be achieved with such models. In contrast, the fea-
tures here involve extensive disjunction and more global processing, thus
achieving a greater degree of invariance. This comparison is pursued in
section 12.
The article is organized as follows. Other approaches to invariant shape
recognition are reviewed in section 2; synthesized random deformations of
293 basic LaTeX symbols (see Figures 1 and 2) provide a controlled experi-
mental setting for an empirical analysis of invariance in a high-dimensional
shape space. The basic building blocks of the algorithm, namely the tags
and the tag arrangements, are described in section 3. In section 4 we ad-
dress the fundamental question of how to exploit the discriminating power
of the feature set; we attempt to motivate the use of multiple decision trees
in the context of the ideal Bayes classifier and the trade-off between ap-
proximation error and estimation error. In section 5 we explain the roles
of the partial ordering and randomization for both supervised and unsu-
pervised tree construction; we also discuss and quantify semi-invariance.
Multiple decision trees and the full classification algorithm are presented in
section 6, together with an analysis of the dependence on the training set.
In section 7 we calculate some rough performance bounds, for both indi-
vidual and multiple trees. Generalization experiments, where the training
and test samples represent different populations, are presented in section 8,
and incremental learning is addressed in section 9. Fast indexing, another
possible role for shape quantization, is considered in section 10. We then
apply the method in section 11 to a real problem, classifying handwritten
digits, using the National Institute of Standards and Technology (NIST)
database for training and testing, achieving state-of-the-art error rates. In
section 12 we develop the comparison with ANNs in terms of invariance,
generalization error, and connections to observed functions in the visual
system. We conclude in section 13 by assessing extensions to other visual
recognition problems.
2 Invariant Recognition
methods which reduce variability and yet preserve information can lead to
improved performance of any classifier; we shall see an example of this in
regard to slant correction for handwritten digits.
Template matching is another approach. One estimates a transformation
from x for each of the prototypes in the library. Classification is then based on
the collection of estimated transformations. This requires explicit modeling
of the prototypes and extensive computation at the estimation stage (usually
involving relaxation methods) and appears impractical with large numbers
of shape classes.
A third approach, closer in spirit to ours, is to search for invariant functions Φ(x), meaning that P(Φ(x) = φ_c | Y = c) = 1 for some constants φ_c, c = 1, . . . , K. The discriminating power of Φ depends on the extent to which the values φ_c are distinct. Many invariants for planar objects (based on single views) and nonplanar objects (based on multiple views) have been discovered and proposed for recognition (see Reiss, 1993, and the references
therein). Some invariants are based on Fourier descriptors and image mo-
ments; for example, the magnitude of Zernike moments (Khotanzad & Lu,
1991) is invariant to rotation. Most invariants require computing tangents
from estimates of the shape boundaries (Forsyth et al., 1991; Sabourin &
Mitiche, 1992). Examples of such invariants include inflexions and discon-
tinuities in curvature. In general, the mathematical level of this work is
advanced, borrowing ideas from projective, algebraic, and differential ge-
ometry (Mundy & Zisserman, 1992).
Other successful treatments of invariance include geometric hashing
(Lamdan, Schwartz, & Wolfson, 1988) and nearest-neighbor classifiers based
on affine invariant metrics (Simard et al., 1994). Similarly, structural fea-
tures involving topological shape attributes (such as junctions, endings,
and loops) or distinguished boundary points (such as points of high curva-
ture) have some invariance properties, and many authors (e.g., Lee, Srihari,
& Gaborski, 1991) report much better results with such features than with
standardized raw data.
In our view, true invariant features of the form above might not be suf-
ficiently stable for intensity-based recognition because the data structures
are often too crude to analyze with continuum-based methods. In particu-
lar, such features are not invariant to nonlinear deformations and depend
heavily on preprocessing steps such as normalization and boundary extrac-
tion. Unless the data are of very high quality, these steps may result in a lack
of robustness to distortions of the shapes, due, for example, to digitization,
noise, blur, and other degrading factors (see the discussion in Reiss, 1993).
Structural features are difficult to model and to extract from the data in a
stable fashion. Indeed, it may be more difficult to recognize a "hole" than to recognize an "8." (Similar doubts about hand-crafted features and distinguished points are expressed in Jung & Nagy, 1995.) In addition, if one
could recognize the components of objects without recognizing the objects
themselves, then the choice of classifier would likely be secondary.
3 Shape Queries
3.1 Tags. We employ primitive local features called tags, which pro-
vide a coarse description of the local topography of the intensity surface in
the neighborhood of a pixel. Instead of trying to manually characterize lo-
cal configurations of interest (for example, trying to define local operators to identify gradients in the various directions), we adopt an information-theoretic approach and code a microworld of subimages by a process very
similar to tree-structured vector quantization. In this way we sidestep the
Figure 3: (Left) Three curves corresponding to the digit 3. (Middle) Three tan-
gent configurations determining these shapes via spline interpolation. (Right)
Graphical description of relations between locations of derivatives consistent
with all three configurations.
issues of boundary detection and gradients in the discrete world and allow
for other forms of local topographies. This approach has been extended to
gray level images in Jedynak and Fleuret (1996).
The basic idea is to reassign symbolic values to each pixel based on exam-
ining a few pixels in its immediate vicinity; the symbolic values are the tag
types and represent labels for the local topography. The neighborhood we
choose is the 4 × 4 subimage containing the pixel at the upper left corner. We
cluster the subimages with binary splits corresponding to adaptively choos-
ing the five most informative locations of the sixteen sites of the subimage.
Note that the size of the subimages used must depend on the resolution
at which the shapes are imaged. The 4 × 4 subimages are appropriate for
a certain range of resolutions, roughly 10 × 10 through 70 × 70 in our
experience. The size must be adjusted for higher-resolution data, and the
ultimate performance of the classifier will suffer if the resolution of the
test data is not approximately the same as that of the training data. The
best approach would be one that is multiresolution, something we have not
done in this article (except for some preliminary experiments in section 11)
but which is carried out in Jedynak and Fleuret (1996) in the context of
gray-level images and 3D objects.
A large sample of 4 × 4 subimages is randomly extracted from the training
data. The corresponding shape classes are irrelevant and are not retained.
The reason is that the purpose of the sample is to provide a representative
database of microimages and to discover the biases at that scale; the statistics
of that world are largely independent of global image attributes, such as
symbolic labels. This family of subimages is then recursively partitioned
with binary splits. There are 4 × 4 = 16 possible questions: "Is site (i, j)
black?" for i, j = 1, 2, 3, 4. The criterion for choosing a question at a node
Figure 5: (Top) All instances of the four two-bit tags. (Bottom) All instances of
the eight three-bit tags.
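The recursive partitioning that produces the tags can be sketched as follows. This is a simplified stand-in, not the authors' code: the splitting criterion line is cut off in the text, so an entropy-balance criterion over individual sites is assumed, and subimages are flattened to 16-tuples of binary pixels:

```python
import math, random

def entropy(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def grow_tag_tree(subimages, sites, depth=0, max_depth=5):
    """Recursively partition a sample of 4x4 binary subimages
    (flattened to 16-tuples) with binary splits on single sites.
    Each cell of the final partition is one tag type; a pixel is
    assigned the tags of the subimage anchored there."""
    if depth == max_depth or len(subimages) < 2 or not sites:
        return {"size": len(subimages)}                 # a leaf = one tag type
    # Assumed criterion: the most informative site is the one whose
    # black/white split over the current cell is closest to 50/50.
    best = max(sites, key=lambda s: entropy(sum(im[s] for im in subimages) / len(subimages)))
    ones = [im for im in subimages if im[best]]
    zeros = [im for im in subimages if not im[best]]
    if not ones or not zeros:
        return {"size": len(subimages)}
    rest = [s for s in sites if s != best]
    return {"site": best,
            "yes": grow_tag_tree(ones, rest, depth + 1, max_depth),
            "no":  grow_tag_tree(zeros, rest, depth + 1, max_depth)}

random.seed(0)
sample = [tuple(random.randint(0, 1) for _ in range(16)) for _ in range(500)]
tree = grow_tag_tree(sample, list(range(16)))
print("root splits on site", tree["site"])
```

With `max_depth=5` the partition uses at most five of the sixteen sites along any path, matching the "five most informative locations" mentioned above.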
0 1 2 3 4 5 6 7 8 9
.13 .003 .03 .08 .04 .07 .23 0 .26 .16
angle of the vector u − v is within π/4 of kπ/4. Let A denote the set of all
possible arrangements, and let Q = {Q_A : A ∈ A}, our feature set.
There are many other binary and ternary relations that have discriminat-
ing power. For example, there is an entire family of metric relationships
that are, like the directional relationships above, completely scale and trans-
lation invariant. Given points u, v, w, z, one example of a ternary relation is
‖u − v‖ < ‖u − w‖, which inquires whether u is closer to v than to w. With
four points we might ask if ‖u − v‖ < ‖w − z‖.
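These directional and metric relations are simple to state in code; the following sketch assumes points are given as (x, y) pairs:

```python
import math

def angle_in_sector(u, v, k, tol=math.pi / 4):
    """Directional relation: is the angle of the vector u - v
    within pi/4 of k*pi/4 (k = 0, ..., 7)?  Scale and translation
    invariant, since only the direction of u - v matters."""
    dx, dy = u[0] - v[0], u[1] - v[1]
    diff = (math.atan2(dy, dx) - k * math.pi / 4) % (2 * math.pi)
    return min(diff, 2 * math.pi - diff) <= tol

def closer_to(u, v, w):
    """Ternary metric relation: is u closer to v than to w?
    Also scale invariant when used to compare relative distances."""
    return math.dist(u, v) < math.dist(u, w)

print(angle_in_sector((2, 2), (0, 0), k=1))  # 45-degree vector, sector k=1 -> True
print(closer_to((0, 0), (1, 0), (5, 0)))     # -> True
```

Doubling all coordinates leaves both responses unchanged, which is the scale invariance claimed in the text.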
along each branch from root to leaf is chosen sequentially and based on the
current information content given the observed values of the previously
chosen features. The classifier based on T is then
\[
\hat{Y}_S = \arg\max_c \frac{1}{N}\sum_{n=1}^{N} P_{\mathcal{L}}(Y = c \mid T_n).
\]
However, this is impractical for reasonably sized training sets for the same
reasons that a single deep tree is impractical (see section 6.4 for some nu-
merical experiments). The trade-off between AE and EE is related to the
trade-off between bias and variance, which is discussed in section 6.2, and
the relative error rates among all these classifiers is analyzed in more detail
in section 6.4 in the context of parameter estimation.
Figure 7: Examples of node splitting. All six images lie in the same node and
have a pending arrangement with three vertices. The 0's are separated from the
3's and 5's by asking for the presence of a new tag, and then the 3's and 5's
are separated by asking a question about the relative angle between two existing
vertices. The particular tags associated with these vertices are not indicated.
Figure 8: Samples from data sets. (Top) Spot noise. (Middle) Duplication.
(Bottom) Severe perturbations.
rangements with only two tags. However, the chance of finding complex
arrangements utilizing noise tags or background tags is much smaller. Put
differently, a structural description is more robust than a list of attributes.
The situation is the same for more complex shapes; see, for example, the
middle panel of Figure 8, where the shapes were created by duplicating
each symbol four times with some shifts. Again, a random choice among
minimal extensions carries much more information than a random choice
in B.
At each nonterminal node t of each tree, the average value of H(Q_A | Y, B_t)
was calculated over twenty randomly sampled minimal extensions. Over all
nodes, the mean entropy was μ = .33; this is the entropy of the distribution
(.06, .94). The standard deviation over all nodes and queries was σ = .08.
Moreover, there was a clear decrease in average entropy (i.e., increase in the
degree of invariance) as the depth of the node increases.
We also estimated the entropy for more severe deformations. On a more
variable data set with approximately double the range of rotations, log scale,
and log skew (relative to the values in section 2), and the same nonlinear de-
formations, the corresponding numbers were μ = .38, σ = .09. Finally, for
rotations sampled from (−30, 30) degrees, log scale from (−.5, .5), log skew
from (−1, 1), and doubling the variance of the random nonlinear deformation
(see the bottom panel of Figure 8), the corresponding mean entropy
was μ = .44 (σ = .11), corresponding to a (.1, .9) split. In other words, on
average, 90 percent of the images in the same shape class still answer the
same way to a new query.
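The reported entropy values can be checked directly against the binary entropy function: h(.06) ≈ .33 matches the quoted (.06, .94) split, and h(p) = .44 at p just under .1, consistent with the rough (.1, .9) reading:

```python
import math

def h2(p):
    """Binary entropy in bits."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Two-point splits corresponding to the mean entropies quoted in the text:
print(round(h2(0.06), 2))  # -> 0.33
print(round(h2(0.09), 2))  # -> 0.44
```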
Notice that the invariance property is independent of the discriminating
power of the query, that is, the extent to which the distribution P(Y =
c | B_t, Q_A) is more peaked than the distribution P(Y = c | B_t). Due to the sym-
metry of mutual information,
\[
H(Y \mid B_t) - H(Y \mid Q_A, B_t) = H(Q_A \mid B_t) - H(Q_A \mid Y, B_t).
\]
This means that if we seek a question that maximizes the reduction in the
conditional entropy of Y and assume the second term on the right is small
due to semi-invariance, then we need only find a query that maximizes
H(Q_A | B_t). This, however, does not involve the class variable and hence
points to the possibility of unsupervised learning, which is discussed in the
following section.
Since the first term is independent of m, the query of choice will again be
the one maximizing H(Q_m | B_t). Recall that the entropy values are estimated
from training data and that Q_m is binary. It follows that growing a tree aimed
at reducing uncertainty about Q is equivalent to finding at each node that
query which best splits the data at the node into two equal parts. This results
from the fact that maximizing H(p) = −p log₂(p) − (1 − p) log₂(1 − p) reduces
to minimizing |p − .5|.
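A minimal sketch of this unsupervised splitting rule, with toy data and threshold queries standing in for images and tag arrangements:

```python
def best_balancing_query(data, queries):
    """Unsupervised splitting rule: among candidate binary queries,
    pick the one whose positive-answer fraction p at this node is
    closest to 1/2.  Maximizing the entropy H(p) is equivalent to
    minimizing |p - 0.5|."""
    def imbalance(q):
        p = sum(q(x) for x in data) / len(data)
        return abs(p - 0.5)
    return min(queries, key=imbalance)

# Toy node: integers stand in for images; queries are thresholds.
data = list(range(100))
queries = [lambda x, t=t: x >= t for t in (10, 50, 90)]
q = best_balancing_query(data, queries)
print(sum(q(x) for x in data))  # the chosen query splits the node 50/50 -> 50
```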
In this way we generate shape quantiles or clusters ignoring the class labels.
Still, the tree variable T is highly correlated with the class variable Y. This
would be the case even if the tree were grown from samples representing
only some of the shape classes. In other words, these clustering trees produce
a generic quantization of shape space. In fact, the same trees can be used to
classify new shapes (see section 9).
We have experimented with such trees, using the splitting criterion de-
scribed above as well as another unsupervised one based on the question
metric,
\[
d_Q(x, x') = \frac{1}{M}\sum_{m=1}^{M} \mathbf{1}\{Q_m(x) \neq Q_m(x')\}, \qquad x, x' \in X,
\]
where 1{·} = 1 if the statement is true and 1{·} = 0 otherwise. Since Q
leads to Y, it makes sense to divide the data so that each child is as homo-
geneous as possible with respect to dQ ; we omit the details. Both clustering
methods lead to classification rates that are inferior to those obtained with
splits determined by separating classes but still surprisingly high; one such
experiment is reported in section 6.1.
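The question metric itself is just a normalized Hamming distance between answer vectors:

```python
def d_Q(x_answers, y_answers):
    """Question metric: the fraction of the M binary queries on
    which two images disagree (a normalized Hamming distance)."""
    assert len(x_answers) == len(y_answers)
    M = len(x_answers)
    return sum(a != b for a, b in zip(x_answers, y_answers)) / M

print(d_Q([1, 0, 1, 1], [1, 1, 1, 0]))  # disagree on 2 of 4 queries -> 0.5
```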
6 Multiple Trees
We have seen that small, random subsets of the admissible queries at any
node invariably contain at least one query that is informative about the
shape class. What happens if many such trees are constructed using the
same training set L? Because the family Q of queries is so large and because
different queries (tag arrangements) address different aspects of shape,
separate trees should provide separate structural descriptions, characteriz-
ing the shapes from different points of view. This is illustrated in Figure 9,
where the same image is shown with an instance of the pending graph at
the terminal node in five different trees. Hence, aggregating the information
provided by a family of trees (see section 6.1) should yield more accurate
and more robust classification. This will be demonstrated in experiments
throughout the remainder of the article.
but this is not feasible (see section 6.4). Another option would be to re-
gard the trees as high-dimensional inputs to standard classifiers. We tried
that with classification trees, linear and nonlinear discriminant analysis,
K-means clustering, and nearest neighbors, all without improvement over
simple averaging for the amount of training data we used.
By averaging, we mean the following. Let μ_{n,τ}(c) denote the posterior
distribution P(Y = c | T_n = τ), n = 1, . . . , N, c = 1, . . . , K, where τ denotes a
terminal node. We write μ_{T_n} for the random variable μ_{n,T_n}. These probabilities
are the parameters of the system, and the problem of estimating them
will be discussed in section 6.4. Define
\[
\bar{\mu}(x) = \frac{1}{N}\sum_{n=1}^{N} \mu_{T_n(x)}, \qquad
\hat{Y}_S = \arg\max_c \bar{\mu}(c).
\]
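A minimal sketch of this aggregation rule in Python; the terminal-node posteriors are illustrative:

```python
def aggregate(posteriors):
    """Average the terminal-node posteriors mu_{T_n} reached by an
    image in each of the N trees, then classify by the mode of the
    averaged distribution (arg max over classes c)."""
    N, K = len(posteriors), len(posteriors[0])
    mu_bar = [sum(p[c] for p in posteriors) / N for c in range(K)]
    return max(range(K), key=mu_bar.__getitem__), mu_bar

# Three trees, three classes: no single tree is confident,
# but the average favors class 1.
leaves = [[0.5, 0.4, 0.1],
          [0.1, 0.5, 0.4],
          [0.2, 0.6, 0.2]]
y_hat, mu_bar = aggregate(leaves)
print(y_hat)  # -> 1
```

Note that averaging uses each tree's full posterior, not just its modal class, so trees that are individually weak still contribute calibrated evidence.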
6.3 Relative Error Rates. Due to estimation error, we favor many trees
of modest depth over a few deep ones, even at the expense of theoretically
higher error rates where perfect estimation is possible. In this section, we
analyze those error rates for some of the alternative classifiers discussed
above in the asymptotic case of infinite data and assuming the total number
of features examined is held fixed, presumably large enough to guarantee
low approximation error. The implications for finite data are outlined in
section 6.4.
Instead of making N trees T1 , . . . , TN of depth D, suppose we made just
one tree T of depth ND; in both cases we are asking ND questions. Of course
this is not practical for the values of D and N mentioned above (e.g., D = 10,
N = 20), but it is still illuminating to compare the hypothetical performance
of the two methods. Suppose further that the criterion for selecting T* is to
minimize the error rate over all trees of depth ND:
\[
T^* = \arg\max_T E\left[\max_c P(Y = c \mid T)\right],
\]
where the maximum is over all trees of depth ND. The error rate of the
corresponding classifier Ŷ* = arg max_c P(Y = c | T*) is then e(Ŷ*) = 1 −
E[max_c P(Y = c | T*)]. Notice that finding T* would require the solution of
a global optimization problem that is generally intractable, accounting for
the nearly universal adoption of greedy tree-growing algorithms based on
entropy reduction, such as the one we are using. Notice also that minimizing
the entropy H(Y|T) or the error rate P(Y ≠ Ŷ(T)) amounts to basically the
same thing.
Let e(Ŷ_A) and e(Ŷ_S) be the error rates of Ŷ_A and Ŷ_S (defined in section 6.1),
respectively. Then it is easy to show that
\[
e(\hat{Y}^*) \le e(\hat{Y}_A) \le e(\hat{Y}_S).
\]
The first inequality results from the observation that the N trees of depth
D could be combined into one tree of depth ND simply by grafting T2 onto
each terminal node of T1 , then grafting T3 onto each terminal node of the
new tree, and so forth. The error rate of the tree so constructed is just e(YA ).
However, the error rate of T* is minimal among all trees of depth ND, and
hence is lower than e(Ŷ_A). Since Ŷ_S is a function of T_1, . . . , T_N, the second
inequality follows from a standard argument:
\[
e(\hat{Y}_S) = E\left[P(Y \ne \hat{Y}_S \mid T_1, \ldots, T_N)\right]
\ge E\left[\min_c P(Y \ne c \mid T_1, \ldots, T_N)\right] = e(\hat{Y}_A).
\]
7 Performance Bounds
We divide this into two cases: individual trees and multiple trees. Most of the
analysis for individual trees concerns a rather ideal case (twenty questions)
in which the shape classes are atomic; there is then a natural metric on
shape classes, and one can obtain bounds on the expected uncertainty after
a given number of queries in terms of this metric and an initial distribution
over classes. The key issue for multiple trees is weak dependence, and the
analysis there is focused on the dependence structure among the trees.
7.1 Individual Trees: Twenty Questions. Suppose first that each shape
class or hypothesis c is atomic, that is, it consists of a single atom of Q
(as defined in section 4). In other words, each hypothesis c has a unique
code word, which we denote by Q(c) = (Q_1(c), . . . , Q_M(c)), so that Q is
determined by Y. This setting corresponds exactly to a mathematical version
of the twenty questions game. There is also an initial distribution π(c) =
P(Y = c). For each c = 1, . . . , K, the binary sequence (Q_m(1), . . . , Q_m(K))
determines a subset of hypotheses: those that answer yes to query Q_m.
Since the code words are distinct, asking enough questions will eventually
determine Y. The mathematical problem is to find the ordering of the queries
that minimizes the mean number of queries needed to determine Y or the
mean uncertainty about Y after a fixed number of queries. The best-known
example is when there is a query for every subset of {1, . . . , K}, so that
M = 2^K. The optimal strategy is given by the Huffman code, in which case
the mean number of queries required to determine Y lies in the interval
[H(Y), H(Y) + 1) (see Cover & Thomas, 1991).
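The Huffman bound is easy to verify numerically: in the standard heap construction, the sum of the merged masses equals the mean code length, that is, the mean number of yes/no queries:

```python
import heapq, math

def huffman_mean_queries(pi):
    """Mean number of yes/no queries to determine Y under the optimal
    (Huffman) strategy, when a query exists for every subset of classes.
    Each merge contributes its combined mass to the mean depth."""
    heap = list(pi)
    heapq.heapify(heap)
    total = 0.0
    while len(heap) > 1:
        a, b = heapq.heappop(heap), heapq.heappop(heap)
        total += a + b
        heapq.heappush(heap, a + b)
    return total

pi = [0.5, 0.25, 0.125, 0.125]          # prior pi(c) on K = 4 classes
H = -sum(p * math.log2(p) for p in pi)  # H(Y) = 1.75 bits
L = huffman_mean_queries(pi)            # dyadic prior: mean length = H exactly
print(H, L)  # -> 1.75 1.75
```

For non-dyadic priors the mean length exceeds H(Y) but, per the cited interval, stays below H(Y) + 1.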
Suppose η_1, . . . , η_k represent the indices of the first k queries. The mean
residual uncertainty about Y after k queries is then H(Y | Q_{η_1}, . . . , Q_{η_k}).
Consequently, if at each stage there is a query that divides the active hypotheses
into two groups such that the mass of the smaller group is at least
δ (0 < δ ≤ .5), then H(Y | Q_{η_1}, . . . , Q_{η_k}) ≤ H(Y) − kH(δ). The mean decision
time is roughly H(Y)/H(δ). In all unsupervised trees we produced, we
found H(Q_{η_k} | Q_{η_1}, . . . , Q_{η_{k−1}}) to be greater than .99 (corresponding to δ ≈ .5)
at 95 percent of the nodes.
If assumptions are made about the degree of separation among the code
words, one can obtain bounds on mean decision times and the expected
uncertainty after a fixed number of queries, in terms of the prior distribution
π. For these types of calculations, it is easier to work with a Hellinger-type
measure of uncertainty than with Shannon entropy; define G(Y), G(Y|B_t), and G(Y|B_t, Q_m) the same way as with the entropy
function H. (G and H have similar properties; for example, G is minimized
on a point mass, maximized on the uniform distribution, and it follows
from Jensen's inequality that H(p) ≤ log₂[G(p) + 1].) The initial amount of
uncertainty is
\[
G(Y) = \sum_{c \ne c'} \pi^{1/2}(c)\, \pi^{1/2}(c').
\]
For any subset {m_1, . . . , m_k} ⊂ {1, . . . , M}, using Bayes rule and the fact
that P(Q|Y) is either 0 or 1, we obtain
\[
G(Y \mid Q_{m_1}, \ldots, Q_{m_k}) = \sum_{c \ne c'} \prod_{i=1}^{k}
\mathbf{1}\{Q_{m_i}(c) = Q_{m_i}(c')\}\, \pi^{1/2}(c)\, \pi^{1/2}(c').
\]
Averaging over all k-tuples of queries and rearranging the sums,
\[
M^{-k} \sum_{(m_1,\ldots,m_k)} G(Y \mid Q_{m_1}, \ldots, Q_{m_k})
= \sum_{c \ne c'} \bigl(1 - d_Q(c, c')\bigr)^{k}\, \pi^{1/2}(c)\, \pi^{1/2}(c'),
\]
where d_Q(c, c') is the fraction of queries on which the code words of c and c' disagree.
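Because P(Q|Y) is zero or one, averaging G over all k-tuples of queries factorizes pair by pair into powers of the agreement fraction. A small numerical check, with made-up code words and prior:

```python
import math
from itertools import product

def G_given(pi, code, idxs):
    """G(Y | Q_{m_1},...,Q_{m_k}) for atomic classes: a pair c != c'
    contributes only if their code words agree on every queried index."""
    K = len(pi)
    return sum(math.sqrt(pi[c] * pi[d])
               for c in range(K) for d in range(K)
               if c != d and all(code[c][m] == code[d][m] for m in idxs))

pi = [0.4, 0.3, 0.2, 0.1]                      # prior pi(c), K = 4
code = [(0, 0, 1), (0, 1, 1), (1, 0, 0), (1, 1, 0)]  # code words Q_m(c), M = 3
M, k = 3, 2

# Left side: average G over all ordered k-tuples of query indices.
avg = sum(G_given(pi, code, t) for t in product(range(M), repeat=k)) / M**k
# Right side: per-pair powers of the agreement fraction 1 - d_Q(c, c').
rhs = sum((1 - sum(code[c][m] != code[d][m] for m in range(M)) / M) ** k
          * math.sqrt(pi[c] * pi[d])
          for c in range(4) for d in range(4) if c != d)
print(abs(avg - rhs) < 1e-12)  # -> True
```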
Class 0 1 2 3 4 5 6 7 8 9
M_c 0.66 0.86 0.80 0.74 0.74 0.64 0.56 0.86 0.49 0.68
γ_c 0.03 0.01 0.01 0.01 0.03 0.02 0.04 0.01 0.02 0.01
e_c 0.14 0.04 0.03 0.04 0.11 0.13 0.32 0.02 0.23 0.05
Let σ_c denote the sum of the conditional variances, and let γ_c denote the
sum of the conditional covariances, both averaged over the trees:
\[
\sigma_c = \frac{1}{N}\sum_{n=1}^{N}\sum_{d=1}^{K} \mathrm{Var}(\mu_{T_n}(d) \mid Y = c),
\]
\[
\gamma_c = \frac{1}{N^2}\sum_{n \ne m}\sum_{d=1}^{K} \mathrm{Cov}(\mu_{T_n}(d), \mu_{T_m}(d) \mid Y = c).
\]
We see that
\[
P(\hat{Y}_S \ne c \mid Y = c) \le
\frac{2\,(\gamma_c + \sigma_c/N)\,(K - 1)^2}{(M_c K - 1)^2}.
\]
Since σ_c/N will be small compared with γ_c, the key parameters are γ_c and M_c.
This inequality yields only coarse bounds. However, it is clear that under
the assumptions above, high classification rates are feasible as long as γ_c is
sufficiently small and M_c is sufficiently large, even if the estimates μ_{T_n} are
poor.
Observe that the N trees form a simple random sample from some large
population T of trees under a suitable distribution on T . This is due to the
randomization aspect of tree construction. (Recall that at each node, the splitting
rule is chosen from a small random sample of queries.) Both M_c and
the sum of variances are sample means of functionals on T. The sum of the
covariances has the form of a U-statistic. Since the trees are drawn independently
and the range of the corresponding variables is very small (typically
less than 1), standard statistical arguments imply that these sample means
are close to the corresponding population means for a moderate number N
of trees, say, tens or hundreds. In other words, M_c ≈ E_T E_X(μ_T(c) | Y = c) and
γ_c ≈ E_{T×T} Σ_{d=1}^K Cov_X(μ_{T_1}(d), μ_{T_2}(d) | Y = c). Thus the conditions on M_c and
γ_c translate into conditions on the corresponding expectations over T, and
the performance variability among the trees can be ignored.
Table 3 shows some estimates of M_c and γ_c and the resulting bound e_c on
the misclassification rate P(Ŷ_S ≠ c | Y = c). Ten pairs of random trees were
made on ten classes to estimate M_c and γ_c. Again, the bounds are crude; they
could be refined by considering higher-order joint moments of the trees.
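Taking the first two table rows as the mean posterior M_c and the covariance sum γ_c (our reading of the garbled row labels), and assuming the bound has the form 2(γ_c + σ_c/N)(K − 1)²/(M_cK − 1)² with σ_c/N neglected, the tabulated e_c values are approximately reproduced:

```python
def error_bound(M_c, gamma_c, K, sigma_over_N=0.0):
    """Bound on P(Y_S != c | Y = c) in terms of the tree means M_c and
    covariance sums gamma_c; sigma_c/N is treated as negligible here."""
    return 2 * (gamma_c + sigma_over_N) * (K - 1) ** 2 / (M_c * K - 1) ** 2

# Entries for classes 0 and 6 from Table 3 (K = 10 classes):
print(round(error_bound(0.66, 0.03, 10), 2))  # -> 0.15 (Table 3 reports 0.14)
print(round(error_bound(0.56, 0.04, 10), 2))  # -> 0.31 (Table 3 reports 0.32)
```

The small discrepancies are consistent with the omitted σ_c/N term.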
8 Generalization
Table 4: Classification Rates for Various Training Sample Sizes Compared with
Nearest-Neighbor Methods.
raw image (see Figure 10). With thirty-two samples per symbol, NN(raw)
climbs to 76 percent, whereas the trees reach 98.5 percent.
8.2 Extrapolation. We also grew trees using only the original prototypes
x_c, c = 1, . . . , 293, recursively dividing this group until pure leaves were
obtained. Of course, the trees are relatively shallow. In this case, only about
half the symbols in X could then be recognized (see Table 4).
The 100 trees grown with thirty-two samples per symbol were tested on
samples that exhibit a greater level of distortion or variability than described
up to this point. The results appear in Table 5. Upscaling (resp. down-
scaling) refers to uniform sampling between the original scale and twice
(resp. half) the original scale, as in the top (resp. middle) panel of Figure 11;
spot noise refers to adding correlated noise (see the top panel of Figure 8).
Clutter (see the bottom panel of Figure 11) refers to the addition of pieces
of other symbols in the image. All of these distortions came in addition
to the random nonlinear deformations, skew, and rotations. Downscaling
creates more confusions due to extreme thinning of the stroke. Notice that
the NN(B) classifier falls apart with spot noise. The reason is the number of
false positives: tags due to the noise induce random occurrences of simple
arrangements. In contrast, complex arrangements A are far less likely to
be found in the image by pure chance; therefore, chance occurrences are
weeded out deeper in the tree.
8.3 Note. The purpose of all the experiments in this article is to illustrate
various attributes of the recognition strategy. No effort was made to opti-
mize the classification rates. In particular, the same tags and tree-making
output of these trees was combined with the output of the original trees
on the test data. No change in the classification rate was observed for the
original test set. For the test set with spot noise, the two sets of trees each
had a classification rate of about 72 percent. Combined, however, they yield
a rate of 86 percent. Clearly there is a significant potential for improvement
in this direction.
10 Fast Indexing
The optical character recognition (OCR) problem has many variations, and
the literature is immense; one survey is Mori, Suen, and Yamamoto (1992).
In the area of handwritten character recognition, perhaps the most difficult
problem is the recognition of unconstrained script; zip codes and hand-
drawn checks also present a formidable challenge. The problem we consider
is off-line recognition of isolated binary digits. Even this special case has at-
tracted enormous attention, including a competition sponsored by the NIST
(Wilkinson et al., 1992), and there is still no solution that matches human
performance, or even one that is commercially viable except in restricted
situations. (For comparisons among methods, see Bottou et al., 1994, and
the lucid discussion in Brown et al., 1993.) The best reported rates seem to be
those obtained by the AT&T Bell Laboratories: up to 99.3 percent by training
and testing on composites of the NIST training and test sets (Bottou et al.,
1994).
We present a brief summary of the results of experiments applying the tree-based shape quantization method to the NIST database. (For a more detailed description, see Geman et al., 1996.) Our experiments were based on portions of the NIST database, which consists of approximately 223,000 binary images of isolated digits written by more than 2000 writers. The images vary widely in dimensions, ranging from about twenty to one hundred rows, and they also vary in stroke thickness and other attributes. We used 100,000 images for training and 50,000 for testing. A random sample from the test set is shown in Figure 12.

Figure 12: Random sample of test images before (top) and after (bottom) preprocessing.
All results reported in the literature utilize rather sophisticated methods
of preprocessing, such as thinning, slant correction, and size normalization.
For the sake of comparison, we did several experiments using a crude form
of slant correction and scaling, and no thinning. Twenty-five trees were
made. We stopped splitting when the number of data points in the second
largest class fell below ten. The depth of the terminal nodes (i.e., number
of questions asked per tree) varied widely, the average over trees being 8.8.
The average number of terminal nodes was about 600, and the average clas-
sification rate (determined by taking the mode of the terminal distribution)
was about 91 percent. The best error rate we achieved with a single tree was
about 7 percent.
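The single-tree rule just described (drop the image down the tree by answering one binary query per internal node, then output the mode of the terminal distribution) can be sketched as follows. This is a minimal illustration, not the authors' implementation; the tree structure and class labels are hypothetical:

```python
# Minimal sketch of classification by a single tree: internal nodes hold
# the index of a binary query, terminal nodes hold an estimated class
# distribution; the prediction is the mode of the terminal distribution.

from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class Node:
    query: Optional[int] = None            # index into the binary feature vector
    yes: Optional["Node"] = None           # child followed when the query answers 1
    no: Optional["Node"] = None            # child followed when the query answers 0
    dist: Optional[Dict[str, float]] = None  # terminal class distribution

def classify(tree: Node, features: List[int]) -> str:
    """Descend from the root to a terminal node and return the mode."""
    node = tree
    while node.dist is None:
        node = node.yes if features[node.query] else node.no
    return max(node.dist, key=node.dist.get)
```

For example, a depth-one tree whose single query separates two digit classes would return the majority class of whichever terminal node the feature vector reaches.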
The classifier was tested in two ways. First, we preprocessed (scaled and slant corrected) the test set in the same manner as the training set. The resulting classification rate is 99.2 percent (with no rejection). Figure 13 shows how the classification rate grows with the number of trees.

Figure 13: Classification rate as a function of the number of trees (from five to twenty-five trees; rates from 0.90 to 0.99).

Recall from section 6.1 that the estimated class of an image x is the mode of the aggregate distribution. A good measure of the confidence in this estimate is the value of the aggregate distribution at the mode; call it M(x). It provides a natural mechanism for rejection by classifying only those images x for which M(x) > m; no rejection corresponds to m = 0. For example, the classification rate is 99.5 percent with 1 percent rejection and 99.8 percent with 3 percent rejection. Finally, doubling the number of trees makes the classification rates 99.3 percent, 99.6 percent, and 99.8 percent at 0, 1, and 2 percent rejection, respectively.
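The aggregation-and-rejection rule above can be sketched in a few lines. This is a hedged illustration of the rule as described in the text, not the authors' code; the per-tree distributions would come from the terminal nodes reached by the image:

```python
# Sketch of the multiple-tree decision rule: average the terminal class
# distributions over all trees, classify by the mode of the aggregate,
# and reject when the mass at the mode, M(x), does not exceed m.

def aggregate_and_classify(tree_dists, m=0.0):
    """tree_dists: one dict per tree mapping class label -> probability.
    Returns (label, M) on acceptance, (None, M) on rejection."""
    classes = set().union(*tree_dists)
    agg = {c: sum(d.get(c, 0.0) for d in tree_dists) / len(tree_dists)
           for c in classes}
    label = max(agg, key=agg.get)   # mode of the aggregate distribution
    M = agg[label]                  # confidence measure M(x)
    return (label, M) if M > m else (None, M)
```

Raising m trades coverage for accuracy, which is exactly the rejection curve reported above (99.5 percent at 1 percent rejection, and so on).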
We performed a second experiment in which the test data were not pre-
processed in the manner of the training data; in fact, the test images were
classified without utilizing the size of the bounding box. This is especially
important in the presence of noise and clutter when it is essentially impossible to determine the size of the bounding box. Instead, each test image was
classified with the same set of trees at two resolutions (original and halved)
and three (fixed) slants. The highest of the resulting six modes determines
the classification. The classification rate was 98.9 percent.
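The six-pose procedure (two resolutions times three slants, keeping the answer with the highest mode) amounts to a small search over fixed transformations. A schematic version, with the classifier and the transforms left as placeholders for the tree-based machinery:

```python
# Sketch of pose search at test time: classify the same image under
# several fixed transformations and keep the label whose aggregate mode
# is highest. "classifier" stands in for the forest of trees and returns
# (label, mode_value); "transforms" would be the 2 resolutions x 3 slants.

def classify_over_poses(image, transforms, classifier):
    """transforms: iterable of callables image -> transformed image.
    classifier: callable image -> (label, mode_value).
    Returns the label attaining the highest mode over all poses."""
    best_label, best_mode = None, -1.0
    for t in transforms:
        label, mode = classifier(t(image))
        if mode > best_mode:
            best_label, best_mode = label, mode
    return best_label
```

The design choice here is that confidence (the mode) rather than any estimate of the bounding box selects the pose, which is why no size normalization of the test image is needed.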
We classify approximately fifteen digits per second on a single proces-
sor SUN Sparcstation 20 (without special efforts to optimize the code); the
time is approximately equally divided between transforming to tags and
answering questions. Test data can be dropped down the trees in parallel,
in which case classification would become approximately twenty-five times
faster.
capacity, there is still the problem of estimating the weights from limited
data (Niyogi & Girosi, 1996).
Finally, classifiers based on ANNs address classification in a less non-
parametric fashion. In contrast to our approach, the classifier has an explicit
functional (input-output) representation from the feature vector to Y. Of
course, a great many parameters may need to be estimated in order to
achieve low approximation error, at least for very complex classification
problems such as shape recognition. This leads to a relatively large variance
component in the bias/variance decomposition. In view of these difficul-
ties, the small error rates achieved by ANNs on handwritten digits and
other shape recognition problems are noteworthy.
12.2 Invariance and the Visual System. The structure of ANNs for
shape classification is partially motivated by observations about process-
ing in the visual cortex. Thus, the emphasis has been on parallel processing
and the local integration (funneling) of information from one layer to the
next. Since this local integration is done in a spatially stationary manner,
translation invariance is achieved. The most successful example in the context of handwritten digit recognition is the convolution neural net LeNET (LeCun et al., 1990). Scale, deformation, and other forms of invariance are achieved
to a lesser degree, partially from using gray-level data and soft thresholds,
but mainly by utilizing very large training sets and sophisticated image
normalization procedures.
The neocognitron (Fukushima & Miyake, 1982; Fukushima & Wake, 1991)
is an interesting mechanism to achieve a degree of robustness to shape de-
formations and scale changes in addition to translation invariance. This is
done using hidden layers, which carry out a local ORing operation mimick-
ing the operation of the complex neurons in V1. These layers are referred to
as C-type layers in Fukushima and Miyake (1982). A dual-resolution version
(Gochin, 1994) is aimed at achieving scale invariance over a larger range.
Spatial stationarity of the connection weights, the C-layers, and the use of
multiple resolutions are all forms of ORing aimed at achieving invariance.
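On binary detector maps, the local ORing performed by a C-type layer is simply a spatial maximum over a small window: a response anywhere in the window yields a positive pooled response, giving tolerance to small translations. A pure-Python sketch (window size and shapes are illustrative, not taken from the cited models):

```python
# Local ORing over spatial neighborhoods, as in C-type layers: a binary
# max-pool. Input is a 2-D 0/1 list-of-lists; output is downsampled by
# ORing over non-overlapping window x window blocks.

def local_or_pool(binary_map, window=2):
    rows, cols = len(binary_map), len(binary_map[0])
    return [[max(binary_map[r + dr][c + dc]
                 for dr in range(window) for dc in range(window))
             for c in range(0, cols - window + 1, window)]
            for r in range(0, rows - window + 1, window)]
```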
The complex cell in V1 is indeed the most basic evidence of such an operation
in the visual system. There is additional evidence for disjunction at a very
global scale in the cells of the inferotemporal cortex with very large receptive
fields. These respond to certain shapes at all locations in the receptive field,
at a variety of scales but also for a variety of nonlinear deformations such as
changes in proportions of certain parts (see Ito, Fujita, Tamura, & Tanaka,
1994; Ito, Tamura, Fujita, & Tanaka, 1995).
It would be extremely inefficient to obtain such invariance through ORing
of responses to each of the variations of these shapes. It would be prefer-
able to carry out the global ORing at the level of primitive generic features,
which can serve to identify and discriminate between a large class of shapes.
We have demonstrated that global features consisting of geometric arrangements of local responses (tags) can do just that. The detection of any of the
tag arrangements described in the preceding sections can be viewed as an
extensive and global ORing operation. The ideal arrangement, such as a ver-
tical edge northwest of a horizontal edge, is tested at all locations, all scales,
and a large range of angles around the northwest vector. A positive response
to any of these produces a positive answer to the associated query. This leads
to a property we have called semi-invariance and allows our method to per-
form well without preprocessing and without very large training sets. Good
performance extends to transformations or perturbations not encountered
during training (e.g., scale changes, spot noise, and clutter).
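A single such query can be sketched as an explicit global ORing. The angular tolerance and coordinate conventions below are illustrative choices of ours, not the paper's parameters:

```python
# Sketch of one binary query as a global OR: given the positions of tags
# of two types, answer True if ANY tag of type A lies roughly northwest
# of ANY tag of type B, within a coarse angular tolerance and at any
# distance (no absolute location or scale constraint).

import math

def northwest_query(tags_a, tags_b, tol_deg=30.0):
    """tags_a, tags_b: lists of (x, y) positions, y increasing upward.
    Northwest is taken as the 135-degree direction from b to a."""
    target = 135.0
    for bx, by in tags_b:
        for ax, ay in tags_a:
            ang = math.degrees(math.atan2(ay - by, ax - bx))
            if abs(ang - target) <= tol_deg:
                return True   # one consistent pair anywhere suffices
    return False
```

Because a single consistent pair anywhere in the image produces a positive answer, the query is invariant to translation and scale and tolerant to a range of deformations, which is the semi-invariance property discussed above.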
The local responses we employ are very similar to ones employed at the
first level of the convolution neural nets, for example, in Fukushima and
Wake (1991). However, at the next level, our geometric arrangements are
defined explicitly and designed to accommodate a priori assumptions about
shape regularity and variability. In contrast, the features in convolution
neural nets are all implicitly derived from local inputs to each layer and
from the slack obtained by local ORing, or soft thresholding, carried out
layer by layer. It is not clear that this is sufficient to obtain the required
level of invariance, nor is it clear how portable they are from one problem
to another.
Evidence for correlated activities of neurons with distant receptive fields
is accumulating, not only in V1 but in the lateral geniculate nucleus and
in the retina (Neuenschwander & Singer, 1996). Gilbert, Das, Ito, Kapadia, and Westheimer (1996) report increasing evidence for more extensive
horizontal connections. Large optical point spreads are observed, where
subthreshold neural activation appears in an area covering the size of sev-
eral receptive fields. When an orientation-sensitive cell fires in response to
a stimulus in its receptive field, subthreshold activation is observed in cells
with the same orientation preference in a wide area outside the receptive
field. It appears therefore that the integration mechanisms in V1 are more
global and definitely more complex than previously thought. Perhaps the
features described in this article can contribute to bridging the gap between
existing computational models and new experimental findings regarding
the visual system.
13 Conclusion
Acknowledgments
We thank Ken Wilder for many valuable suggestions and for carrying out
the experiments on handwritten digit recognition. Yali Amit has been sup-
ported in part by the ARO under grant DAAL-03-92-0322. Donald Geman
has been supported in part by the NSF under grant DMS-9217655, ONR
under contract N00014-91-J-1021, and ARPA under contract MDA972-93-1-
0012.
References
Baum, E. B., & Haussler, D. (1989). What size net gives valid generalization?
Neural Comp., 1, 151–160.
Niyogi, P., & Girosi, F. (1996). On the relationship between generalization error,
hypothesis complexity, and sample complexity for radial basis functions.
Neural Comp., 8, 819–842.
Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1, 81–106.
Raudys, S., & Jain, A. K. (1991). Small sample size problems in designing artificial
neural networks. In I. K. Sethi & A. K. Jain (Eds.), Artificial Neural Networks
and Statistical Pattern Recognition. Amsterdam: North-Holland.
Reiss, T. H. (1993). Recognizing planar objects using invariant image features. Lecture
Notes in Computer Science no. 676. Berlin: Springer-Verlag.
Sabourin, M., & Mitiche, A. (1992). Optical character recognition by a neural
network. Neural Networks, 5, 843–852.
Sethi, I. K. (1991). Decision tree performance enhancement using an artificial
neural network implementation. In I. K. Sethi & A. K. Jain (Eds.), Artifi-
cial Neural Networks and Statistical Pattern Recognition. Amsterdam: North-
Holland.
Shlien, S. (1990). Multiple binary decision tree classifiers. Pattern Recognition, 23,
757–763.
Simard, P. Y., LeCun, Y. L., & Denker, J. S. (1994). Memory-based character recog-
nition using a transformation invariant metric. In Proc. 12th Inter. Conf. on
Pattern Recognition (Vol. 2, pp. 262–267). Los Alamitos, CA: IEEE Computer
Society Press.
Werbos, P. (1991). Links between artificial neural networks (ANN) and statistical
pattern recognition. In I. K. Sethi & A. K. Jain (Eds.), Artificial Neural Networks
and Statistical Pattern Recognition. Amsterdam: North-Holland.
Wilkinson, R. A., Geist, J., Janet, S., Grother, P., Gurges, C., Creecy, R., Hammond,
B., Hull, J., Larsen, N., Vogl, T., & Wilson, C. (1992). The first census
optical character recognition system conference (Tech. Rep. No. NISTIR 4912).
Gaithersburg, MD: National Institute of Standards and Technology.