Sketch-R2CNN: An Attentive Network for Vector Sketch Recognition
Lei Li (HKUST)
Changqing Zou (University of Maryland, College Park)
Youyi Zheng (Zhejiang University)
Hongbo Fu (City University of Hong Kong)
Qingkun Su (Alibaba A.I. Labs)
Chiew-Lan Tai (HKUST)
Abstract
Freehand sketching is a dynamic process in which points are sequentially sampled and grouped as strokes for sketch acquisition on electronic devices. To recognize a sketched object, most existing methods discard such important temporal ordering and grouping information from humans and simply rasterize sketches into binary images for classification. In this paper, we propose a novel single-branch attentive network architecture, RNN-Rasterization-CNN (Sketch-R2CNN for short), to fully leverage the dynamics in sketches for recognition. Sketch-R2CNN takes as input only a vector sketch with grouped sequences of points, and uses an RNN for stroke attention estimation in the vector space and a CNN for 2D feature extraction in the pixel space. To bridge the gap between these two spaces in neural networks, we propose a neural line rasterization module that converts the vector sketch, along with the attention estimated by the RNN, into a bitmap image, which is subsequently consumed by the CNN. The neural line rasterization module is designed in a differentiable way to yield a unified pipeline for end-to-end learning. We perform experiments on existing large-scale sketch recognition benchmarks and show that, by exploiting the sketch dynamics with the attention mechanism, our method is more robust and achieves better performance than the state-of-the-art methods.
1. Introduction
Freehand sketching is an easy and quick means of communication because of its simplicity and expressiveness.
While a human has the innate ability to interpret drawing
semantics, the vast capacity of expressiveness in sketches
poses great perception challenges to machines. For better
human-computer interactions, sketch analysis has been an
active research topic in the computer vision and graphics
fields, spanning a wide spectrum including sketch recognition [3, 44, 47], sketch segmentation [35, 11, 17, 18],
sketch-based retrieval [4, 38, 30, 42] and modeling [26], etc.
In this paper, we focus on developing a novel learning-based
method for freehand sketch recognition.
The goal of sketch classification or recognition is to identify the object category of an input sketch, which is more
challenging than image classification due to the lack of rich
texture details, inherent ambiguities, and large shape variations in the input. Traditional studies [3, 31, 19] commonly
cast sketch recognition as an image classification task by
converting sketches into binary images and then extracting
local image features. With the quantified feature descriptors, a typical classifier such as Support Vector Machine
(SVM) is trained for object category prediction. Recent
years have witnessed the success of deep learning in image classification [14]. Similar neural network designs have
also been used to address the recognition problem of sketch
images [44, 30]. Although these deep learning-based methods outperform the traditional ones, the unique properties
of sketches, as discussed in the following, are often overlooked, leaving room for further improving the performance
of sketch recognition.
In general, a sketch has two widely used representations for processing: the raster pixel sketch and the vector sketch. Raster pixel sketches are binary images in which pixels covered by strokes take the value one and the remaining pixels the value zero, resulting in a large portion of void
pixels and thus a sparse representation. This representation
does not allow state-of-the-art convolutional neural networks (CNNs) to easily distinguish which strokes are more
important or which strokes can be ignored for better recognition [31]. Following the definition in [42], a vector sketch
in our work refers to a sequence of strokes containing the
points in the drawing order (Fig. 1). A vector sketch can
be easily converted into a bitmap image through rasterization but not vice versa. Notably, vector sketches contain
rich temporal ordering and grouping (i.e., strokes) information, which has been shown to be useful for learning more
descriptive features [42]. However, these information cues
are all discarded during rasterization into pixel images and are thus inaccessible to subsequent recognition algorithms.
Motivated by the above discussions, to address the inability of existing CNN-based methods to interpret stroke importance, we propose a novel single-branch attentive network architecture, RNN-Rasterization-CNN (Sketch-R2CNN for short), for vector sketch recognition. Sketch-R2CNN takes advantage of both the vector and
raster representations of sketches during the learning process and is able to focus on adaptively learned important
strokes, with an attention mechanism, for better recognition (Fig. 1). It takes only a vector sketch (i.e., grouped sequences of points) as input, and employs a recurrent neural
network (RNN) in the first stage for analyzing the temporal ordering and grouping information in the input and producing attention estimations for the stroke points. We then
develop a novel neural line rasterization (NLR) module, capable of converting the vector sketch with the computed attentions into an attention map in a differentiable way. Subsequently, Sketch-R2CNN uses a CNN to consume the obtained attention map for guided hierarchical understanding
and feature extraction on critical strokes to identify the target object category. Our proposed NLR module is the key
to connecting the vector sketch space and the raster sketch
space in neural networks and allows gradient information
to backpropagate from the CNN to the RNN for end-to-end learning. Experiments on existing large-scale sketch recognition
benchmarks [3, 8] show that our method, leveraging more
human factors in the input, performs better than the state-of-the-art methods, and that our RNN-Rasterization-CNN design
consistently improves the performance of CNN-only methods.
In summary, our contributions in this work are: (1)
the first single-branch attentive network with an RNN-Rasterization-CNN design for vector sketch recognition; (2)
a novel differentiable neural line rasterization module that
unifies the vector sketch space and raster sketch space in
neural networks, allowing end-to-end learning. We will
make our code publicly available.
2. Related Work
To recognize sketched objects, traditional methods generally take preprocessed raster sketches as input. To quantify a sketch image, existing studies have tried to adapt several types of local features originally intended for photos
(e.g., bag-of-features [3], Fisher Vectors with SIFT features [31], HOG features [19]) to line drawing images.
With the extracted features, classifiers (e.g., SVMs) are
then trained to recognize unseen sketches [3, 31]. Different learning schemes, such as multiple kernel learning [19]
or active learning [43], may be employed for performance
improvement. Another line of traditional methods has also
attempted to utilize additional cues for recognition, such as
prior knowledge for domain-specific sketches [1, 15, 27, 23,
32, 2] or object context for sketched scenes [47, 48]. While
progress has been made in sketch recognition, these methods still cannot robustly handle freehand sketches with large
shape or style variations, especially those hastily drawn within tens of seconds [8], and they struggle to achieve performance on par with humans on existing benchmarks like the TU-Berlin
benchmark [3].
Recently, deep learning has revolutionized many research fields, including sketch recognition, with state-of-the-art performance. Research efforts [30, 46, 39, 44]
have been made to employ deep neural networks, such as
AlexNet [14] or GoogLeNet [36], to learn more discriminative image features in the sketch domain to replace hand-engineered ones. Yu et al. [44] proposed Sketch-a-Net,
an AlexNet-like architecture specifically adapted for sketch
images by using large kernels in convolutions to accommodate the sparsity of stroke pixels. Their method achieved
superior classification accuracy (77.95%) on the TU-Berlin
benchmark [3], surpassing human performance (73.1%) for
the first time. Their method still follows the existing learning process of image classification, i.e., using the raster image representation of sketches as CNN inputs, and thus cannot easily learn the awareness of stroke importance in an
end-to-end manner for further improvement. In contrast,
our network directly consumes vector sketches as input for
learning stroke importance effectively and adaptively by exploiting the temporal ordering and grouping information
therein with RNNs.
Vector representation of sketches has been considered
for certain tasks such as sketch generation [7, 8, 33] or
sketch hashing [42] with deep learning. For example,
SketchRNN [8], which has received much attention recently, is built upon RNNs to process vector sketches. It
is composed of an RNN encoder followed by an RNN
decoder, and is able to model the underlying distribution
of points in vector sketches for a specific object category.
To learn to hash sketches for retrieval, Xu et al. [42] have
demonstrated that an RNN branch, exploiting temporal ordering in vector sketches, can complement the other CNN
branch for extracting more descriptive features. They fuse
two types of features, produced by RNN and CNN respectively, via a late-fusion layer by concatenation. Our work
shares a similar spirit with [42], advocating that the temporal and grouping information in vector sketches also offer
additional cues for more accurate sketch recognition. In
contrast to their two-branch network with simple concatenation, our RNN-Rasterization-CNN design seeks to boost
the synergy between the two networks in a single branch
during the learning process. To this end, inspired by [12],
which proposed an approximate gradient for in-network
mesh rendering and rasterization, we design a novel neural line rasterization module, allowing gradients to backpropagate from CNN (raster sketch space) to RNN (vector
Figure 1. Illustration of our single-branch attentive network architecture for vector sketch recognition. (Neural Line Raster stands for our neural line rasterization (NLR) module.)
sketch space) for end-to-end learning.
For a sketch, its constituent strokes may contribute differently to its recognition. With a trained SVM, Schneider et al. [31] qualitatively analyzed how stroke importance
affects classification scores by iteratively removing each
stroke from the corresponding raster sketch image. To automatically capture stroke importance during the learning process, researchers have attempted to adapt attention mechanisms in network design [34]. The attention mechanism has
been widely used in many visual tasks, such as image classification [24, 40, 37, 10], image captioning [41, 22], and Visual
Question Answering (VQA) [25]. A simple attention module generally works by computing soft masks over the spatial image grid [37, 41], or even over feature channels [10], to obtain a weighted combination of features. Song et al. [34]
incorporated a spatial attention module for raster sketches in
their network for fine-grained sketch-based image retrieval.
Differently, Riaz Muhammad et al. [28] tackled the sketch
abstraction task with reinforcement learning, aiming to learn
a stroke removal policy by considering each stroke's
influence on recognizability. As discussed in existing studies [44, 42, 6, 5], CNNs may suffer from the sparsity of
inputs (e.g., raster sketches), though they excel at building
hierarchical representations of 2D inputs. Instead of struggling to estimate attention from binary images that contain limited information [34], we argue that additional cues,
such as the temporal ordering and grouping information in
vector sketches, are essential to learn reliable attention for
strokes. In our method, we resort to RNNs for computing attention for each point in a vector sketch, and use our
NLR module for in-network vector-to-raster conversion. To
the best of our knowledge, no existing work has tried to derive an
attention map from vector sketches with RNNs for CNN-based sketch recognition.
3. Method
Our network architecture, as illustrated in Fig. 1, is composed of two cascaded sub-networks: an RNN for stroke attention estimation in the vector sketch space and a CNN for
2D feature extraction in the raster sketch space (Sec. 3.2).
The key enabler for linking the two sub-networks that operate in completely different spaces is a novel neural line
rasterization (NLR) module, which converts a vector sketch
with the estimated attention to a raster pixel sketch in a differentiable way (Sec. 3.3). More specifically, during the
forward inference pass, given a vector sketch as input, the
RNN takes in a point at each time step and computes a
corresponding attention value for the point. Our proposed
NLR module then rasterizes the vector sketch, together with
the estimated per-point attention, into an attention map and
computes the corresponding gradients for the backward optimization pass. A subsequent CNN consumes the attention
map as input for hierarchical understanding and produces
category predictions as the final output.
3.1. Input Representation
The input to our network is a vector sketch, formed by
a sequence of strokes, each stroke being represented by a
sequence of points. This storing format is widely adopted
for sketches in existing crowdsourced datasets [8, 30, 3].
Following [7], we denote a vector sketch as an ordered
point sequence S = {p_i = (x_i, y_i, s_i)}_{i=1···n}, where n is
the total number of points in all strokes. For each point pi ,
xi and yi are the 2D coordinates, and si is a binary stroke
state. Specifically, state si = 0 indicates that the current
stroke has not ended and that the stroke connects pi to pi+1 ;
si = 1 indicates that pi is the last point of the current stroke
and pi+1 will be the starting point of another stroke. Our
network takes only the vector sketch S as input for end-to-end learning.
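For illustration, the following is a minimal Python sketch of this point sequence representation (the helper name and the example sketch are illustrative, not taken from our datasets):

```python
from typing import List, Tuple

# A sketch as drawn: a list of strokes, each stroke a list of absolute (x, y) points.
Stroke = List[Tuple[float, float]]

def strokes_to_point_sequence(strokes: List[Stroke]) -> List[Tuple[float, float, int]]:
    """Flatten strokes into S = {(x_i, y_i, s_i)}, where s_i = 1 marks the last
    point of a stroke and s_i = 0 means the stroke continues to p_{i+1}."""
    sequence = []
    for stroke in strokes:
        for j, (x, y) in enumerate(stroke):
            state = 1 if j == len(stroke) - 1 else 0
            sequence.append((x, y, state))
    return sequence

# Example: a two-stroke sketch (e.g., a plus sign drawn as two line segments).
sketch = [[(0.0, 5.0), (10.0, 5.0)], [(5.0, 0.0), (5.0, 10.0)]]
print(strokes_to_point_sequence(sketch))
```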
3.2. Network Architecture
Our network architecture is formed by two sequentially arranged sub-networks, which are linked with a differentiable NLR module. The first sub-network is an RNN,
which analyzes the temporal ordering and grouping information in the input. The RNN consumes a vector sketch
S and estimates per-point attention as output at each iteration step. Specifically, we use a bidirectional Long Short-Term Memory (LSTM) unit with two layers as the first sub-network. We set the size of the hidden state to be 512 and
adopt dropout with probability 0.5. For the hidden state
at step i, after the LSTM cell takes in pi , we pass it through
a fully-connected layer followed by a sigmoid function to
produce per-point attention, denoted as ai . That is, for each
point pi , we obtain a corresponding scalar ai , signifying the
point importance in the subsequent 2D visual understanding
by CNN. Similar to [8], instead of using absolute coordinates, for each pi fed into the RNN, we compute the offsets
from its previous point pi−1 as its coordinates.
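For illustration, a minimal PyTorch sketch of this attention-estimation sub-network is given below (the hyperparameters follow the description above; the module name, input layout, and offset helper are illustrative assumptions rather than the actual implementation):

```python
import torch
import torch.nn as nn

class StrokeAttentionRNN(nn.Module):
    """Per-point attention from a vector sketch, as described in Sec. 3.2."""
    def __init__(self, input_dim: int = 3, hidden_size: int = 512):
        super().__init__()
        # Bidirectional two-layer LSTM, hidden size 512, dropout 0.5.
        self.lstm = nn.LSTM(input_dim, hidden_size, num_layers=2,
                            dropout=0.5, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden_size, 1)  # 2x for the two directions

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (batch, n, 3) with (dx, dy, s) per point, i.e., offset coordinates.
        hidden, _ = self.lstm(points)                # (batch, n, 2 * hidden_size)
        attention = torch.sigmoid(self.fc(hidden))   # (batch, n, 1), one scalar a_i per point
        return attention.squeeze(-1)

def to_offsets(points: torch.Tensor) -> torch.Tensor:
    """Replace absolute (x, y) with offsets from the previous point, as in Sec. 3.2."""
    offsets = points.clone()
    offsets[:, 1:, :2] = points[:, 1:, :2] - points[:, :-1, :2]
    return offsets
```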
Next, we pass the point sequence along with the estimated attention, i.e., (pi , ai )i=1···n , through our NLR module, as detailed in Sec. 3.3. The output of the module is a
raster sketch image I, which can also be viewed as an attention map with the intensity of each stroke pixel as the
corresponding attention. A deep CNN then takes the image
I as input for hierarchical 2D feature extraction. Sketch-a-Net [44] or ResNet50 [9] can be used as the backbone network, which is then connected to a fully-connected layer to
produce estimations over all the possible object categories.
We use the cross entropy loss for optimizing the whole network.
Our network architecture for sketch recognition differs
from the one proposed by Xu et al. [42] for sketch retrieval
in several aspects. First, their network has two branches for
feature extraction, one branch with an RNN and the other
branch with a CNN. During learning, their RNN and CNN
individually work on two different sketch spaces with little
interaction, except at the last concatenation layer for feature
fusion. In contrast, our single-branch design allows more
information flow between RNN and CNN owing to our
NLR module, that is, the RNN can complement the CNN
by producing a more informative input whereas the CNN
provides guidance on attention estimation with learned hierarchical representations during back propagation. In addition, our network only uses vector sketches as input and
performs in-network vector-to-raster conversion, while the
two-branch late-fusion network [42] requires both vector
and raster sketches as input, thus a preprocessing stage for
rasterization is needed.
3.3. Neural Line Rasterization with Attention
To convert a point sequence with attention (p_i, a_i)_{i=1···n} to a pixel image I, the basic operation is to draw each valid line segment p_i p_{i+1} (Sec. 3.1) onto the canvas image. As
illustrated in Fig. 2, to determine whether or not a pixel Ik
is on the target line segment, we simply compute the distance from its center to the line segment pi pi+1 and check
whether it is smaller than a predefined threshold ε (we set ε = 1 in our experiments). If I_k is a stroke pixel, we compute its attention by linear interpolation [12]; otherwise its attention is set to zero. More specifically, let p_k be the projection point of I_k's center onto p_i p_{i+1}. The intensity or attention of I_k is then defined as

I_k = (1 − α_k) · a_i + α_k · a_{i+1},    (1)
Figure 2. Rasterization of line segment p_i p_{i+1} and linear interpolation of the attention value for stroke pixel I_k.
where α_k = ‖p_k − p_i‖_2 / ‖p_{i+1} − p_i‖_2, and p_k, p_i and p_{i+1} are in absolute coordinates. This rasterization process
for line segments can be efficiently done in parallel on GPU
with a CUDA kernel. Note that in the implementation we
need to record the relevant information, such as line segment index and αk at each pixel Ik , for subsequent gradient
computation.
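For illustration only, the following minimal NumPy sketch mimics this forward rasterization. It scans pixels in each segment's bounding box rather than using a CUDA kernel, and the bounding-box scan, the handling of overlapping strokes, and the helper names are illustrative simplifications rather than the actual implementation.

```python
import numpy as np

def rasterize_with_attention(points, states, attn, size=224, eps=1.0):
    """Rasterize a point sequence with per-point attention into an attention map.

    points: (n, 2) absolute coordinates in [0, size); states: (n,) stroke-end flags s_i;
    attn: (n,) per-point attention a_i. Returns the image and per-pixel records
    (row, col, segment index i, alpha_k) needed for the backward pass.
    """
    img = np.zeros((size, size), dtype=np.float32)
    records = []
    for i in range(len(points) - 1):
        if states[i] == 1:        # p_i ends a stroke, so no segment from p_i to p_{i+1}
            continue
        p, q = points[i], points[i + 1]
        d = q - p
        length2 = float(d @ d) + 1e-12
        # Only examine pixels in the segment's bounding box, padded by eps.
        r0, r1 = int(min(p[1], q[1]) - eps - 1), int(max(p[1], q[1]) + eps + 1)
        c0, c1 = int(min(p[0], q[0]) - eps - 1), int(max(p[0], q[0]) + eps + 1)
        for r in range(max(r0, 0), min(r1 + 1, size)):
            for c in range(max(c0, 0), min(c1 + 1, size)):
                center = np.array([c + 0.5, r + 0.5])
                alpha = float(np.clip(((center - p) @ d) / length2, 0.0, 1.0))
                proj = p + alpha * d                      # projection of the pixel center
                if np.linalg.norm(center - proj) <= eps:  # pixel lies on the segment
                    value = (1 - alpha) * attn[i] + alpha * attn[i + 1]  # Eq. (1)
                    # Overlapping strokes: keep the larger value (a simplification;
                    # overlap handling is not specified in the text above).
                    img[r, c] = max(img[r, c], value)
                    records.append((r, c, i, alpha))
    return img, records
```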
Through the above process, a vector sketch can be easily converted into a raster image in the forward inference
pass. In order to propagate gradients w.r.t the loss function
from CNN to RNN in the backward optimization pass, we
need to derive gradients for the above rasterization process.
Thanks to the simplicity of the used linear interpolation, the
gradients can be computed as follows:
∂I_k/∂a_i = 1 − α_k,    ∂I_k/∂a_{i+1} = α_k.    (2)
Let L be the loss function and δ_k^I be the gradient backpropagated into I_k w.r.t. L through the CNN. By the chain rule, we have

∂L/∂a_i = Σ_k δ_k^I · (1 − α_k),    ∂L/∂a_{i+1} = Σ_k δ_k^I · α_k,    (3)
where k iterates over all the stroke pixels covered by the line segment p_i p_{i+1}. If p_i is adjacent to another line segment p_{i−1} p_i, we accumulate the gradients.
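For illustration, a matching sketch of the backward pass is given below; it consumes the per-pixel records produced by the forward sketch above, and the names are illustrative.

```python
import numpy as np

def nlr_backward(grad_img, records, num_points):
    """Accumulate dL/da_i from per-pixel gradients, following Eq. (3).

    grad_img: (H, W) gradients w.r.t. the attention map; records: list of
    (row, col, segment index i, alpha_k) saved during forward rasterization.
    """
    grad_attn = np.zeros(num_points, dtype=np.float32)
    for r, c, i, alpha in records:
        g = grad_img[r, c]                 # delta_k^I for stroke pixel I_k
        grad_attn[i] += g * (1.0 - alpha)  # contribution to dL/da_i
        grad_attn[i + 1] += g * alpha      # contribution to dL/da_{i+1}
    return grad_attn
```

In a PyTorch implementation, the forward and backward passes would typically be wrapped in a torch.autograd.Function so that the attention map can sit between the RNN and the CNN in a single differentiable graph.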
Our NLR module is simple and easy to implement, but it
is crucial to bridge the gap between the vector sketch space
and the raster sketch space in neural networks for end-to-end learning. Unlike existing methods [37, 34] that derive attention from feature maps produced by CNNs, with
our NLR module, we can take advantage of additional cues
(i.e., temporal ordering and grouping information) in vector sketches for better attention map estimation, as shown
in experiments (Sec. 4.2). These additional cues, however,
are not accessible to methods that take raster inputs.
4. Experiments
4.1. Datasets and Settings
We have performed various experiments on two existing large-scale sketch recognition benchmarks, i.e., the TU-Berlin benchmark [3] and the QuickDraw benchmark [8],
to validate the performance of our Sketch-R2CNN. These
two benchmarks differ in several aspects, such as sketching
style, acquisition procedure, and sketch quantity per category. Notably, sketches in the TU-Berlin benchmark tend
to be more realistic while the ones in QuickDraw are more
iconic and abstract (Fig. 4). The TU-Berlin benchmark [3]
contains 250 object categories with 80 sketches per category. Each sketch was created within 30 minutes by a participant from Amazon Mechanical Turk (AMT). The QuickDraw benchmark [8] contains 345 object categories with
75K sketches per category. During acquisition, the participants were given only 20 seconds to sketch an object.
Similar to [8], to simplify sketches in the TU-Berlin
benchmark, we applied the Ramer-Douglas-Peucker (RDP)
algorithm, resulting in a maximum point sequence length of
448 for the RNN. Following [44], we used three-fold cross-validation on this benchmark (i.e., two folds for training, one
fold for testing). Sketches in the QuickDraw benchmark
have already been preprocessed with the RDP simplification
algorithm and the maximum number of points in a sketch is
321. In each QuickDraw category, the 75K sketches have
already been divided into training, validation and testing
sets with sizes of 70K, 2.5K and 2.5K, respectively.
We implemented our Sketch-R2CNN and NLR module
with PyTorch. We adopted Adam [13] for stochastic gradient descent update with a mini-batch size of 48. We
used a learning rate of 0.0001 for training on QuickDraw
and 0.00005 for training or fine-tuning on TU-Berlin (see
Sec. 4.2 for the pre-training and training procedures). Due
to the limited training data in the TU-Berlin benchmark, we
followed [44] to perform data augmentation, including horizontal reflection, stroke removal and sketch deformation.
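For reference, a minimal sketch of the optimization settings listed above is shown below (the constants come from this section; the helper name and structure are illustrative, not the actual training script):

```python
import torch
import torch.nn as nn

BATCH_SIZE = 48
LEARNING_RATES = {"quickdraw": 1e-4, "tu-berlin": 5e-5}  # per-dataset learning rates

def make_optimizer(model: nn.Module, dataset: str) -> torch.optim.Optimizer:
    # Adam for stochastic gradient descent updates, as in Sec. 4.1.
    return torch.optim.Adam(model.parameters(), lr=LEARNING_RATES[dataset])

criterion = nn.CrossEntropyLoss()  # cross-entropy loss over object categories
```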
4.2. Results and Discussions
Results on TU-Berlin Benchmark. We have compared
our method with a variety of existing methods on the TU-Berlin benchmark. Table 1 includes the results of some
methods reported in [44]. These methods can be generally categorized into two groups. The first group follows the conventional pipeline using hand-crafted features
+ classifier, including the HOG-SVM method [3], structured ensemble matching [20], multi-kernel SVM [19], and
the Fisher Vector based method [31]. The second group
uses deep learning, including the state-of-the-art network
Sketch-a-Net (the earlier version Sketch-a-Net v1 [45] and
the later improved version Sketch-a-Net v2 [44]) and those
networks that have been evaluated in [44]: LeNet [16],
AlexNet-SVM [14] and AlexNet-Sketch [14].
Model                             Accuracy
Humans [3]                        73.1%
HOG-SVM [3]                       56.0%
Ensemble [20]                     61.5%
MKL-SVM [19]                      65.8%
Fisher-Vectors [31]               68.9%
LeNet [16]                        55.2%
AlexNet-SVM [14]                  67.1%
AlexNet-Sketch [14]               68.6%
Sketch-a-Net v1 [45]              74.9%
Sketch-a-Net v2 [44]              77.95%
Sketch-a-Net v2 (ours) [44]       77.54%
ResNet50 [9]                      82.08%
Sketch-R2CNN (Sketch-a-Net v2)    78.49%
Sketch-R2CNN (ResNet50)           83.25%

Table 1. Evaluations on the TU-Berlin benchmark. Our method with ResNet50 working as the CNN backbone achieves the highest recognition accuracy. Sketch-a-Net v2 (ours) is our PyTorch-based implementation.
We reimplemented Sketch-a-Net v2 with PyTorch since
the original model [44], implemented with Caffe, is not
compatible with our framework (i.e., the NLR module). We
pre-trained the Sketch-a-Net v2 on QuickDraw [8] instead
of preprocessed edge maps from photos [44] for ease of
preparation and reproduction. Our best reproduced recognition accuracy of Sketch-a-Net v2 on the TU-Berlin benchmark is 77.54%, close to the accuracy of 77.95% reported
with the original Caffe-based implementation [44]. In addition to Sketch-a-Net v2, we also evaluated ResNet50 [9],
a more advanced CNN architecture that has been widely
used for various visual tasks such as image classification [9] or object detection [21]. Specifically, before training on raster sketches of the TU-Berlin benchmark, we sequentially pre-trained the ResNet50 on ImageNet [29] and
QuickDraw. The ResNet50 achieves a recognition accuracy
of 82.08%, significantly outperforming the state-of-the-art approach Sketch-a-Net v2.
Since both Sketch-a-Net v2 and ResNet50 are CNN variants, they can be incorporated into our network architecture (Fig. 1) as the CNN backbone. By inserting one of
these CNN alternatives into the proposed architecture, we
can study how helpful the attention learned by RNN can
be for vector sketch recognition. The comparison results
are summarized in Table 1. Our method incorporated with
Sketch-a-Net v2, named Sketch-R2CNN (Sketch-a-Net v2)
in Table 1, achieves a recognition accuracy of 78.49%, improving Sketch-a-Net v2 (ours) by about 1%. Another variant of our method with ResNet50, named Sketch-R2CNN
(ResNet50) in Table 1, achieves an accuracy of 83.25%, improving the ResNet50-only model by about 1.2%, and surpassing all the existing approaches and human performance.

Figure 3. Visualization of attention maps, in grayscale and color coded, produced by our Sketch-R2CNN (ResNet50) and Attentive-ResNet50. Recognition failures are in red and successes are in green. Attention maps of Attentive-ResNet50 are estimated from feature maps of the last layer of the C2 residual block, which are of size 56 × 56, while attention maps by our method are of size 224 × 224. (Best viewed in the electronic version.)
Alternatives Study on TU-Berlin Benchmark. To validate our proposed architecture, we have studied several
network design alternatives on the TU-Berlin benchmark
(Table 2). First, as mentioned in Sec. 2, attention modules have been used in existing CNN architectures for image classification [37] and sketch retrieval [34]. To compare against our RNN-based attention module, we modified ResNet50 and inserted the spatial attention module proposed by Song et al. [34] after the C2 residual block [9, 21].
This modified version of ResNet50 still takes binary sketch
images as input and tries to compute attention maps from
feature maps of previous convolutional layers. This model,
named Attentive-ResNet50 in Table 2, achieves a recognition accuracy of 82.42%, slightly higher than 82.08% by the
ResNet50-only model, while lower than 83.25% attained by
our method, showing the comparatively higher effectiveness
of additional cues in vector sketches used by our method
for attention estimation. Attention maps produced by our
RNN-based attention module and Attentive-ResNet50 are
visualized in Fig. 3. Note that our method only predicts attention for stroke pixels and sets non-stroke pixels to have
an attention value of zero, while Attentive-ResNet50 computes attention for every pixel of the attention map.
To study the influence of the temporal ordering information provided by humans on the RNN's attention estimation,
we trained Sketch-R2CNN (ResNet50) with randomized
stroke orders. That is, instead of keeping the human drawing order in the vector sketch, the stroke sequence is randomly shuffled. This scheme, named Random-Stroke-Order, achieves a recognition accuracy of 82.78% on the TU-Berlin benchmark, slightly lower than Sketch-R2CNN (ResNet50) but still superior to the ResNet50-only model. This indicates that the temporal information (i.e., stroke order) provided by humans can help the RNN learn more descriptive sequential features, confirming a similar conclusion made from the sketch retrieval experiments in [42].

Model                              Accuracy
Attentive-ResNet50 [34]            82.42%
Random-Stroke-Order                82.78%
Attention-using-Sketching-Order    81.74%
Two-Branch-Late-Fusion [42]        81.43%
Two-Branch-Early-Fusion            81.84%
Sketch-R2CNN (ResNet50)            83.25%

Table 2. Alternative design choice studies on the TU-Berlin benchmark.
In addition to our RNN-based encoding method for vector sketches, we also explored a straightforward approach
to allow CNNs to gain access to the sketching order information for feature extraction. Specifically, in a preprocessing step, for a sketch in the point sequence representation, we encode its ordering information into an image
through rasterization by assigning an intensity value of one
to the first point and zero to the last point and linearly interpolating the intensities of the points in-between. Fig. 5
shows some examples of the resulting images. This encoding scheme is based on a hypothesis that users tend
to draw more “important” strokes first, and the resulting
raster sketches can be considered as temporal-encoding attention maps. We trained a ResNet50 with such handcrafted attention maps as input, but found that this encoding
scheme, with a recognition accuracy of 81.74% (Attention-using-Sketching-Order in Table 2), is not effective and even
slightly worse than the baseline with binary image inputs
(ResNet50 in Table 1). This indicates that, for CNN-based
recognition networks, stroke importance may not always be
properly aligned with stroke order under such a straightforward encoding scheme, due to different drawing styles used
by different users, and this encoding scheme may even pose
challenges to CNNs for learning effective patterns. Thus,
instead of “hard-coding” temporal information into images,
a more adaptive and robust encoder (e.g., RNN) is needed
to accommodate sequential variations in vector sketches.
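For illustration, a minimal sketch of this sketching-order encoding is given below (a hedged example: the dense per-segment sampling and image size are arbitrary choices for illustration, not the actual preprocessing code):

```python
import numpy as np

def sketching_order_attention(points, states, size=224):
    """Encode drawing order as pixel intensities: 1 for the first point, 0 for the
    last, linearly interpolated in between (the Attention-using-Sketching-Order baseline)."""
    n = len(points)
    order_value = np.linspace(1.0, 0.0, n)  # intensity per point by drawing order
    img = np.zeros((size, size), dtype=np.float32)
    for i in range(n - 1):
        if states[i] == 1:   # do not connect across stroke boundaries
            continue
        # Sample densely along the segment and splat interpolated intensities.
        for t in np.linspace(0.0, 1.0, 32):
            x, y = (1 - t) * points[i] + t * points[i + 1]
            v = (1 - t) * order_value[i] + t * order_value[i + 1]
            r, c = int(round(y)), int(round(x))
            if 0 <= r < size and 0 <= c < size:
                img[r, c] = max(img[r, c], v)
    return img
```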
Next, we discuss arrangements of RNN and CNN in the
network architecture design. As mentioned before, Xu et
al. [42] use a two-branch late-fusion architecture, which
fuses the features extracted from a CNN branch and a parallel RNN branch, for sketch retrieval. In contrast, our
design combines an RNN encoder and a CNN feature extractor sequentially in a single branch for sketch classification. We therefore set up another experiment to investigate which of the above two types of architecture is a better scheme to incorporate the additional temporal ordering
and grouping information existing in vector sketches. Following [42], we built a similar model, named Two-Branch-Late-Fusion in Table 2, which uses the same RNN cell and
CNN backbone as Sketch-R2CNN (ResNet50) for fairness
and consistency. The training procedure is the same as
Sketch-R2CNN (ResNet50), with the softmax cross entropy
loss [42]. The Two-Branch-Late-Fusion achieves a recognition accuracy of 81.43% on the TU-Berlin benchmark,
which is about 2% lower than Sketch-R2CNN (ResNet50).
This result reveals that our proposed single-branch architecture can make the CNN, which works as an abstract
visual concept extractor, and the RNN, which models human sketching orders, complement each other better than
the two-branch architecture. Surprisingly, another observation is that the recognition accuracy of Two-Branch-Late-Fusion, adapted to the sketch classification task from the
original sketch retrieval task, is even slightly inferior to that
of the single CNN branch (ResNet50 in Table 1). This is
also observed from results on the QuickDraw benchmark, as
presented in the following section. Due to the lack of implementation details in [42], we postulate that differences
in training strategies ([42]: multi-stage training for CNN
and RNN; Ours: joint training of CNN and RNN), CNN
backbones ([42]: AlexNet; Ours: ResNet50) and datasets
([42]: pruned QuickDraw dataset; Ours: original TU-Berlin
and QuickDraw datasets) may affect the learning of the late-fusion layer and cause the performance degradation.
Complementary to the above experiments on attention estimation with the RNN as well as arrangements of RNN and
CNN, we extended the design choice exploration to an alternative way of injecting the learned attention from
RNN into CNN. In our proposed architecture, the CNN
directly takes the attention maps produced by the RNN
as input. An alternative architecture is to weigh feature
maps of a certain intermediate layer in CNN (which still
takes binary sketch images as input) with the attention map
by RNN that leverages vector sketches as input. In our
implementation, we inject the attention map produced by
RNN, which is of size 56 × 56 with stroke width threshold
ε = 0.5, into the output of the C2 residual block [9, 21] of
ResNet50. Following the same training procedures as those
in Table 2, this alternative architecture, named Two-Branch-Early-Fusion, achieves a recognition accuracy of 81.84% on
the TU-Berlin benchmark, performing slightly better than
Two-Branch-Late-Fusion. However, the recognition accuracy of Two-Branch-Early-Fusion is still slightly inferior
to that of the ResNet50-only model. This may be due to
non-stroke pixels in the attention map from RNN having an
attention value of zero, which, during the injection, would
make convolution features at those corresponding locations
vanish, reducing the feature information learned by previous convolutional layers from the input.
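For illustration, a minimal PyTorch sketch of this Two-Branch-Early-Fusion variant is given below. This is a hedged example: splitting ResNet50 around its C2 block (layer1 in torchvision naming) and the element-wise weighting are our interpretation of the description above, not the actual implementation.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class TwoBranchEarlyFusion(nn.Module):
    def __init__(self, num_categories: int):
        super().__init__()
        backbone = resnet50(num_classes=num_categories)
        backbone.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
        # Split the backbone around the C2 residual block (layer1 in torchvision naming).
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool)
        self.c2 = backbone.layer1
        self.rest = nn.Sequential(backbone.layer2, backbone.layer3, backbone.layer4,
                                  backbone.avgpool, nn.Flatten(), backbone.fc)

    def forward(self, binary_sketch: torch.Tensor, attention_map: torch.Tensor) -> torch.Tensor:
        # binary_sketch: (B, 1, 224, 224); attention_map: (B, 1, 56, 56) from the RNN + NLR.
        features = self.c2(self.stem(binary_sketch))   # (B, 256, 56, 56)
        features = features * attention_map            # inject attention by weighting features
        return self.rest(features)
```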
Results on QuickDraw Benchmark. We further compared the proposed Sketch-R2CNN with Sketch-a-Net v2 [44], the ResNet50-only model, and Two-Branch-Late-Fusion [42] on the QuickDraw benchmark.

Model                              Accuracy
Sketch-a-Net v2 [44]               74.84%
ResNet50 [9]                       82.48%
Two-Branch-Late-Fusion [42]        82.11%
Sketch-R2CNN (Sketch-a-Net v2)     77.29%
Sketch-R2CNN (ResNet50)            84.41%

Table 3. Evaluations on the QuickDraw benchmark.

Note that the
ResNet50 is pre-trained on ImageNet [29] and served as the
CNN backbone in Sketch-R2CNN and Two-Branch-Late-Fusion. Quantitative results are summarized in Table 3, and
the performance of each competing method on the QuickDraw benchmark agrees well with those on the TU-Berlin
benchmark. Compared to the competitors, Sketch-R2CNN
(ResNet50) achieves the highest recognition accuracy on
the QuickDraw benchmark, echoing its performance on the
TU-Berlin benchmark. It is a similar case for the ResNet50-only model, which still achieves better recognition performance than both Sketch-a-Net v2 and Two-Branch-Late-Fusion. Sketch-R2CNN (ResNet50) and Sketch-R2CNN
(Sketch-a-Net v2) improve ResNet50 and Sketch-a-Net v2
respectively by about 2%. Although the sketch quality of
QuickDraw may not be as good as that of TU-Berlin, thanks
to the voluminous data of QuickDraw (24.15M sketches for
training, 862.5K sketches for validation or testing), we still
have seen consistent performance improvement of Sketch-R2CNN over CNN-only models, showing the generality of
our proposed architecture.
Qualitative Results. Fig. 4 shows some qualitative
recognition comparisons between the CNN-only method
(ResNet50) and our Sketch-R2CNN (ResNet50). Through
visualization, it is observed that the attention maps produced by the RNN in Sketch-R2CNN can help the CNN
to focus on more effective stroke parts of the inputs and ignore the interference of irrelevant strokes (e.g., the circle
around the crab in Fig. 4) to make better classifications. In
contrast, the CNN-only model cannot access the additional
ordering and grouping cues existing in vector sketches and
thus tends to struggle with sketches that have similar shapes
but different category labels. Fig. 5 visualizes the attention
maps by our method and the ones encoding sketching order
(used in Attention-using-Sketching-Order in Table 2). It is
observed that our attention maps estimated by RNN share
a certain degree of similarity with the ones using sketching
order, but the attention magnitudes by RNN are more adaptively biased.
Limitation. As shown in Fig. 6, in some cases, the RNN
in Sketch-R2CNN may fail to produce correct attention
guidance for the subsequent CNN, leading to recognition failures (e.g., the pumpkin), possibly due to the inability to extract effective sequential features from inputs whose temporal ordering and grouping cues are similar to those of training sketches in different categories. Some sketches with seemingly ambiguous categories (e.g., the toaster) may also pose challenges to our method. It is expected that humans would make similar mistakes in such cases. One possible solution to address the ambiguity is to put the sketched objects in context (i.e., scenes), and integrate our method with context-based recognition methods [47, 48].

Figure 4. Recognition comparisons between the CNN-only method (ResNet50) and our Sketch-R2CNN (ResNet50 as the CNN backbone). Failures are in red and successes are in green. Attention maps produced by our RNN are shown in the second row and are color coded. Note that our RNN only predicts attention for stroke pixels; non-stroke pixels are set to have an attention value of zero and are not color-coded.

Figure 5. The first row shows color-coded attention maps produced by our Sketch-R2CNN (ResNet50) for specific object categories. Correspondingly, in the second row, we directly encode the sketching order as attention maps, with higher attention values for strokes drawn earlier. Note that non-stroke pixels are set to have an attention value of zero and are not color-coded.

Figure 6. Recognition failures of our Sketch-R2CNN (ResNet50).

5. Conclusion
In this work, we have proposed a novel single-branch
attentive network architecture named Sketch-R2CNN for
vector sketch recognition. Our RNN-Rasterization-CNN
design consistently improves the recognition accuracy of
CNN-only models by 1-2% on two existing large-scale
sketch recognition benchmarks. The key enabler for joining RNN and CNN together is a novel differentiable neural
line rasterization module that performs in-network vector-to-raster sketch conversion. Applying Sketch-R2CNN to
other tasks like sketch retrieval or sketch synthesis that need
descriptive line-drawing features could be interesting to explore in the future.
References
[1] C. Alvarado and R. Davis. SketchREAD: A multi-domain
sketch recognition engine. In Proc. ACM UIST. ACM, 2004.
2
[2] R. Arandjelović and T. M. Sezgin. Sketch recognition by fusion of temporal and image-based features. Pattern Recogn.,
44(6):1225–1234, 2011. 2
[3] M. Eitz, J. Hays, and M. Alexa. How do humans sketch
objects? ACM TOG, 31(4):44:1–44:10, July 2012. 1, 2, 3, 5
[4] M. Eitz, R. Richter, T. Boubekeur, K. Hildebrand, and
M. Alexa. Sketch-based shape retrieval. ACM TOG,
31(4):31:1–31:10, July 2012. 1
[5] B. Graham, M. Engelcke, and L. van der Maaten. 3d semantic segmentation with submanifold sparse convolutional
networks. In Proc. IEEE CVPR, 2018. 3
[6] B. Graham and L. van der Maaten. Submanifold sparse convolutional networks. CoRR, abs/1706.01307, 2017. 3
[7] A. Graves. Generating sequences with recurrent neural networks. CoRR, abs/1308.0850, 2013. 2, 3
[8] D. Ha and D. Eck. A neural representation of sketch drawings. In Proc. ICLR, 2018. 2, 3, 4, 5
[9] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning
for image recognition. In Proc. IEEE CVPR, June 2016. 4,
5, 6, 7
[10] J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. In Proc. IEEE CVPR, 2018. 3
[11] Z. Huang, H. Fu, and R. W. H. Lau. Data-driven segmentation and labeling of freehand sketches. ACM TOG,
33(6):175:1–175:10, Nov. 2014. 1
[12] H. Kato, Y. Ushiku, and T. Harada. Neural 3d mesh renderer.
In Proc. IEEE CVPR, 2018. 2, 4
[13] D. P. Kingma and J. Ba. Adam: A method for stochastic
optimization. CoRR, abs/1412.6980, 2014. 5
[14] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet
classification with deep convolutional neural networks. In
NIPS, pages 1097–1105. 2012. 1, 2, 5
[15] J. J. LaViola, Jr. and R. C. Zeleznik. MathPad2: A system for
the creation and exploration of mathematical sketches. ACM
TOG, 23(3):432–440, Aug. 2004. 2
[16] Y. A. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller. Neural
Networks: Tricks of the Trade: Second Edition, pages 9–48.
2012. 5
[17] K. Li, K. Pang, J. Song, Y.-Z. Song, T. Xiang, T. M.
Hospedales, and H. Zhang. Universal sketch perceptual
grouping. In Proc. ECCV, 2018. 1
[18] L. Li, H. Fu, and C.-L. Tai. Fast sketch segmentation and
labeling with deep learning. CoRR, abs/1807.11847, 2018. 1
[19] Y. Li, T. M. Hospedales, Y.-Z. Song, and S. Gong. Free-hand
sketch recognition by multi-kernel feature learning. CVIU,
137:1 – 11, 2015. 1, 2, 5
[20] Y. Li, Y.-Z. Song, and S. Gong. Sketch recognition by ensemble matching of structured features. In Proc. BMVC,
2013. 5
[21] T.-Y. Lin, P. Dollar, R. Girshick, K. He, B. Hariharan, and
S. Belongie. Feature pyramid networks for object detection.
In Proc. IEEE CVPR, July 2017. 5, 6, 7
[22] J. Lu, C. Xiong, D. Parikh, and R. Socher. Knowing when
to look: Adaptive attention via a visual sentinel for image
captioning. In Proc. IEEE CVPR, 2017. 3
[23] T. Lu, C.-L. Tai, F. Su, and S. Cai. A new recognition model
for electronic architectural drawings. CAD, 37(10):1053 –
1069, 2005. 2
[24] V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu. Recurrent models of visual attention. In NIPS, pages 2204–2212.
2014. 3
[25] H. Nam, J.-W. Ha, and J. Kim. Dual attention networks for
multimodal reasoning and matching. In Proc. IEEE CVPR,
2017. 3
[26] L. Olsen, F. F. Samavati, M. C. Sousa, and J. A. Jorge.
Sketch-based modeling: A survey. Comput. & Graph.,
33(1):85 – 103, 2009. 1
[27] T. Y. Ouyang and R. Davis. ChemInk: A natural real-time
recognition system for chemical drawings. In Proc. ACM
IUI. ACM, 2011. 2
[28] U. Riaz Muhammad, Y. Yang, Y.-Z. Song, T. Xiang, and
T. M. Hospedales. Learning deep sketch abstraction. In Proc.
IEEE CVPR, June 2018. 3
[29] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh,
S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein,
A. C. Berg, and L. Fei-Fei. ImageNet large scale visual
recognition challenge. IJCV, 115(3):211–252, Dec 2015. 5,
7
[30] P. Sangkloy, N. Burnell, C. Ham, and J. Hays. The Sketchy
Database: Learning to retrieve badly drawn bunnies. ACM
TOG, 35(4):119:1–119:12, July 2016. 1, 2, 3
[31] R. G. Schneider and T. Tuytelaars. Sketch classification
and classification-driven analysis using Fisher Vectors. ACM
TOG, 33(6):174:1–174:9, Nov. 2014. 1, 2, 3, 5
[32] T. M. Sezgin and R. Davis. Sketch recognition in interspersed drawings using time-based graphical models. Comput. & Graph., 32(5):500–510, 2008. 2
[33] J. Song, K. Pang, Y.-Z. Song, T. Xiang, and T. M.
Hospedales. Learning to sketch with shortcut cycle consistency. In Proc. IEEE CVPR, June 2018. 2
[34] J. Song, Q. Yu, Y.-Z. Song, T. Xiang, and T. M. Hospedales.
Deep spatial-semantic attention for fine-grained sketch-based image retrieval. In Proc. IEEE ICCV, 2017. 3, 4,
6
[35] Z. Sun, C. Wang, L. Zhang, and L. Zhang. Free hand-drawn sketch segmentation. In Proc. ECCV, pages 626–639.
Springer, 2012. 1
[36] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed,
D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich.
Going deeper with convolutions. In Proc. IEEE CVPR, 2015.
2
[37] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang,
X. Wang, and X. Tang. Residual attention network for image
classification. In Proc. IEEE CVPR, 2017. 3, 4, 6
[38] F. Wang, L. Kang, and Y. Li. Sketch-based 3d shape retrieval
using convolutional neural networks. In Proc. IEEE CVPR,
2015. 1
[39] X. Wang, X. Chen, and Z. Zha. SketchPointNet: A compact
network for robust sketch recognition. In Proc. ICIP, pages
2994–2998, 2018. 2
[40] T. Xiao, Y. Xu, K. Yang, J. Zhang, Y. Peng, and Z. Zhang.
The application of two-level attention models in deep convolutional neural network for fine-grained image classification. In Proc. IEEE CVPR, 2015. 3
[41] K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. CoRR, abs/1502.03044, 2015. 3
[42] P. Xu, Y. Huang, T. Yuan, K. Pang, Y.-Z. Song, T. Xiang, T. M. Hospedales, Z. Ma, and J. Guo. SketchMate: Deep hashing for million-scale human sketch retrieval. In Proc. IEEE CVPR, June 2018. 1, 2, 3, 4, 6, 7
[43] E. Yanık and T. M. Sezgin. Active learning for sketch recognition. Comput. & Graph., 52:93 – 105, 2015. 2
[44] Q. Yu, Y. Yang, F. Liu, Y.-Z. Song, T. Xiang, and T. M. Hospedales. Sketch-a-Net: A deep neural network that beats humans. IJCV, 122(3):411–425, May 2017. 1, 2, 3, 4, 5, 7
[45] Q. Yu, Y. Yang, Y.-Z. Song, T. Xiang, and T. M. Hospedales. Sketch-a-Net that beats humans. In Proc. BMVC, pages 7.1–7.12, 2015. 5
[46] H. Zhang, S. Liu, C. Zhang, W. Ren, R. Wang, and X. Cao. SketchNet: Sketch classification with web images. In Proc. IEEE CVPR, 2016. 2
[47] J. Zhang, Y. Chen, L. Li, H. Fu, and C.-L. Tai. Context-based sketch classification. In Proc. Expressive, pages 3:1–3:10. ACM, 2018. 1, 2, 8
[48] C. Zou, Q. Yu, R. Du, H. Mo, Y.-Z. Song, T. Xiang, C. Gao, B. Chen, and H. Zhang. SketchyScene: Richly-annotated scene sketches. In Proc. ECCV, September 2018. 2, 8