
Sketch-R2CNN: An Attentive Network for Vector Sketch Recognition

Lei Li (HKUST), Changqing Zou (University of Maryland, College Park), Youyi Zheng (Zhejiang University), Qingkun Su (Alibaba A.I. Labs), Chiew-Lan Tai (HKUST), Hongbo Fu (City University of Hong Kong)

arXiv:1811.08170v1 [cs.CV] 20 Nov 2018

Abstract

Freehand sketching is a dynamic process in which points are sequentially sampled and grouped as strokes for sketch acquisition on electronic devices. To recognize a sketched object, most existing methods discard such important temporal ordering and grouping information from humans and simply rasterize sketches into binary images for classification. In this paper, we propose a novel single-branch attentive network architecture, RNN-Rasterization-CNN (Sketch-R2CNN for short), to fully leverage the dynamics in sketches for recognition. Sketch-R2CNN takes as input only a vector sketch with grouped sequences of points, and uses an RNN for stroke attention estimation in the vector space and a CNN for 2D feature extraction in the pixel space, respectively. To bridge the gap between these two spaces in neural networks, we propose a neural line rasterization module that converts the vector sketch, along with the attention estimated by the RNN, into a bitmap image, which is subsequently consumed by the CNN. The neural line rasterization module is designed in a differentiable way to yield a unified pipeline for end-to-end learning. We perform experiments on existing large-scale sketch recognition benchmarks and show that, by exploiting the sketch dynamics with the attention mechanism, our method is more robust and achieves better performance than the state-of-the-art methods.

1. Introduction

Freehand sketching is an easy and quick means of communication because of its simplicity and expressiveness. While humans have the innate ability to interpret drawing semantics, the vast expressiveness of sketches poses great perception challenges to machines. For better human-computer interaction, sketch analysis has been an active research topic in the computer vision and graphics fields, spanning a wide spectrum including sketch recognition [3, 44, 47], sketch segmentation [35, 11, 17, 18], sketch-based retrieval [4, 38, 30, 42] and modeling [26], etc. In this paper, we focus on developing a novel learning-based method for freehand sketch recognition.

The goal of sketch classification or recognition is to identify the object category of an input sketch, which is more challenging than image classification due to the lack of rich texture details, inherent ambiguities, and large shape variations in the input. Traditional studies [3, 31, 19] commonly cast sketch recognition as an image classification task by converting sketches into binary images and then extracting local image features. With the quantified feature descriptors, a typical classifier such as a Support Vector Machine (SVM) is trained for object category prediction. Recent years have witnessed the success of deep learning in image classification [14]. Similar neural network designs have also been used to address the recognition problem of sketch images [44, 30]. Although these deep learning-based methods outperform the traditional ones, the unique properties of sketches, as discussed in the following, are often overlooked, leaving room for further improving the performance of sketch recognition. In general, sketches have two widely used representations for processing: the raster pixel sketch and the vector sketch.
Raster pixel sketches are binary images in which pixels covered by strokes have the value one and the remaining pixels the value zero, resulting in a large portion of void pixels and thus a sparse representation. This representation does not allow state-of-the-art convolutional neural networks (CNNs) to easily distinguish which strokes are more important or which strokes can be ignored for better recognition [31]. Following the definition in [42], a vector sketch in our work refers to a sequence of strokes containing the points in drawing order (Fig. 1). A vector sketch can be easily converted into a bitmap image through rasterization, but not vice versa. Notably, vector sketches contain rich temporal ordering and grouping (i.e., stroke) information, which has been shown to be useful for learning more descriptive features [42]. However, these information cues are all discarded during the rasterization process for pixel images and are thus inaccessible to subsequent recognition algorithms.

Motivated by the above discussion, and to address the incapacity of existing CNN-based methods for stroke importance interpretation, we propose a novel single-branch attentive network architecture, RNN-Rasterization-CNN (Sketch-R2CNN for short), for vector sketch recognition. Sketch-R2CNN takes advantage of both the vector and raster representations of sketches during the learning process and is able to focus on adaptively learned important strokes, with an attention mechanism, for better recognition (Fig. 1). It takes only a vector sketch (i.e., grouped sequences of points) as input, and employs a recurrent neural network (RNN) in the first stage for analyzing the temporal ordering and grouping information in the input and producing attention estimations for the stroke points. We then develop a novel neural line rasterization (NLR) module, capable of converting the vector sketch with the computed attentions into an attention map in a differentiable way. Subsequently, Sketch-R2CNN uses a CNN to consume the obtained attention map for guided hierarchical understanding and feature extraction on critical strokes to identify the target object category. Our proposed NLR module is the key to connecting the vector sketch space and the raster sketch space in neural networks, and it allows gradient information to back-propagate from the CNN to the RNN for end-to-end learning.

Experiments on existing large-scale sketch recognition benchmarks [3, 8] show that our method, leveraging more human factors in the input, performs better than the state-of-the-art methods, and that our RNN-Rasterization-CNN design consistently improves the performance of CNN-only methods. In summary, our contributions in this work are: (1) the first single-branch attentive network with an RNN-Rasterization-CNN design for vector sketch recognition; (2) a novel differentiable neural line rasterization module that unifies the vector sketch space and the raster sketch space in neural networks, allowing end-to-end learning. We will make our code publicly available.

2. Related Work

To recognize sketched objects, traditional methods generally take preprocessed raster sketches as input. To quantify a sketch image, existing studies have tried to adapt several types of local features originally intended for photos (e.g., bag-of-features [3], Fisher Vectors with SIFT features [31], HOG features [19]) to line drawing images. With the extracted features, classifiers (e.g., SVMs) are then trained to recognize unseen sketches [3, 31].
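For reference, this classic hand-crafted-feature pipeline can be sketched as follows. This is only an illustrative baseline using standard HOG features and a linear SVM, not the exact feature configurations of [3] or [31], and it assumes rasterized sketch images with labels are already available:

```python
# Illustrative hand-crafted-feature baseline (not the exact setup of [3] or [31]):
# rasterized sketches -> HOG descriptors -> linear SVM classifier.
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def hog_features(images):
    # images: (N, H, W) binary sketch rasters with stroke pixels set to 1
    return np.stack([
        hog(img, orientations=9, pixels_per_cell=(16, 16), cells_per_block=(2, 2))
        for img in images
    ])

def train_baseline(train_images, train_labels):
    clf = LinearSVC(C=1.0)
    clf.fit(hog_features(train_images), train_labels)
    return clf

def predict(clf, test_images):
    return clf.predict(hog_features(test_images))
```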
Different learning schemes, such as multiple kernel learning [19] or active learning [43], may be employed for performance improvement. Another line of traditional methods has attempted to utilize additional cues for recognition, such as prior knowledge for domain-specific sketches [1, 15, 27, 23, 32, 2] or object context for sketched scenes [47, 48]. While progress has been made in sketch recognition, these methods still cannot robustly handle freehand sketches with large shape or style variations, especially those hastily drawn in dozens of seconds [8], and they struggle to achieve performance on par with humans on existing benchmarks such as the TU-Berlin benchmark [3].

Recently, deep learning has revolutionized many research fields, including sketch recognition, with state-of-the-art performance. Research efforts [30, 46, 39, 44] have been made to employ deep neural networks, such as AlexNet [14] or GoogLeNet [36], to learn more discriminative image features in the sketch domain to replace hand-engineered ones. Yu et al. [44] proposed Sketch-a-Net, an AlexNet-like architecture specifically adapted for sketch images by using large kernels in convolutions to accommodate the sparsity of stroke pixels. Their method achieved superior classification accuracy (77.95%) on the TU-Berlin benchmark [3], surpassing human performance (73.1%) for the first time. Their method still follows the existing learning process of image classification, i.e., using the raster image representation of sketches as CNN input, and thus cannot easily learn awareness of stroke importance in an end-to-end manner for further improvement. In contrast, our network directly consumes vector sketches as input and learns stroke importance effectively and adaptively by exploiting the temporal ordering and grouping information therein with RNNs.

The vector representation of sketches has been considered for certain tasks, such as sketch generation [7, 8, 33] or sketch hashing [42], with deep learning. For example, SketchRNN [8], which has received much attention recently, is built upon RNNs to process vector sketches. It is composed of an RNN encoder followed by an RNN decoder, and is able to model the underlying distribution of points in vector sketches for a specific object category. To learn to hash sketches for retrieval, Xu et al. [42] demonstrated that an RNN branch, exploiting the temporal ordering in vector sketches, can complement a CNN branch for extracting more descriptive features. They fuse the two types of features, produced by the RNN and the CNN respectively, via a late-fusion layer by concatenation. Our work shares a similar spirit with [42], advocating that the temporal and grouping information in vector sketches offers additional cues for more accurate sketch recognition. In contrast to their two-branch network with simple concatenation, our RNN-Rasterization-CNN design seeks to boost the synergy between the two networks in a single branch during the learning process. To this end, inspired by [12], which proposed an approximate gradient for in-network mesh rendering and rasterization, we design a novel neural line rasterization module, allowing gradients to back-propagate from the CNN (raster sketch space) to the RNN (vector sketch space) for end-to-end learning.

[Figure 1. Illustration of our single-branch attentive network architecture for vector sketch recognition: a vector sketch point sequence (x_i, y_i, s_i) is fed to an RNN that estimates per-point attention a_i; the Neural Line Raster module converts the points and attention into an attention map, which a CNN consumes for classification. (Neural Line Raster stands for our neural line rasterization (NLR) module.)]
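To make the pipeline of Fig. 1 concrete, the following is a minimal PyTorch-style sketch of the single-branch forward pass; all module and argument names here are illustrative assumptions, and the attention RNN and the NLR rasterizer are detailed later in Sec. 3.2 and Sec. 3.3:

```python
import torch
import torch.nn as nn

class SketchR2CNN(nn.Module):
    # Hypothetical top-level wiring of the pipeline in Fig. 1:
    # vector sketch -> RNN per-point attention -> NLR rasterization -> CNN -> category logits.
    def __init__(self, attention_rnn, nlr_rasterize, cnn_backbone, cnn_feat_dim, num_classes):
        super().__init__()
        self.attention_rnn = attention_rnn      # e.g., the LSTM module of Sec. 3.2
        self.nlr_rasterize = nlr_rasterize      # differentiable rasterizer of Sec. 3.3
        self.cnn = cnn_backbone                 # e.g., Sketch-a-Net v2 or ResNet50 features
        self.fc = nn.Linear(cnn_feat_dim, num_classes)

    def forward(self, points, states):
        # points: (B, N, 2) absolute coordinates; states: (B, N) binary stroke-end flags
        attention = self.attention_rnn(points, states)             # (B, N) per-point attention
        attn_map = self.nlr_rasterize(points, states, attention)   # (B, 1, H, W) attention map
        features = self.cnn(attn_map)                              # 2D feature extraction
        return self.fc(features)                                   # category logits
```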
For a sketch, its constituent strokes may contribute differently to its recognition. With a trained SVM, Schneider et al. [31] qualitatively analyzed how stroke importance affects classification scores by iteratively removing each stroke from the corresponding raster sketch image. To automatically capture stroke importance during the learning process, researchers have attempted to adapt attention mechanisms in network design [34]. Attention mechanisms have been widely used in many visual tasks, such as image classification [24, 40, 37, 10], image captioning [41, 22] and Visual Question Answering (VQA) [25]. A simple attention module generally works by computing soft masks over the spatial image grid [37, 41], or even over feature channels [10], to obtain a weighted combination of features. Song et al. [34] incorporated a spatial attention module for raster sketches in their network for fine-grained sketch-based image retrieval. Differently, Riaz Muhammad et al. [28] tackled the sketch abstraction task with reinforcement learning, aiming to develop a stroke removal policy by considering each stroke's influence on recognizability. As discussed in existing studies [44, 42, 6, 5], CNNs may suffer from the sparsity of inputs (e.g., raster sketches), though they excel at building hierarchical representations of 2D inputs. Instead of struggling to estimate attention from binary images that contain limited information [34], we argue that additional cues, such as the temporal ordering and grouping information in vector sketches, are essential for learning reliable attention for strokes. In our method, we resort to RNNs for computing attention for each point in a vector sketch, and use our NLR module for in-network vector-to-raster conversion. To the best of our knowledge, no existing work has tried to derive an attention map from vector sketches with RNNs for CNN-based sketch recognition.

3. Method

Our network architecture, as illustrated in Fig. 1, is composed of two cascaded sub-networks: an RNN for stroke attention estimation in the vector sketch space and a CNN for 2D feature extraction in the raster sketch space (Sec. 3.2). The key enabler for linking the two sub-networks, which operate in completely different spaces, is a novel neural line rasterization (NLR) module, which converts a vector sketch with the estimated attention into a raster pixel sketch in a differentiable way (Sec. 3.3). More specifically, during the forward inference pass, given a vector sketch as input, the RNN takes in a point at each time step and computes a corresponding attention value for the point. Our proposed NLR module then rasterizes the vector sketch, together with the estimated per-point attention, into an attention map and computes the corresponding gradients for the backward optimization pass. A subsequent CNN consumes the attention map as input for hierarchical understanding and produces category predictions as the final output.

3.1. Input Representation

The input to our network is a vector sketch, formed by a sequence of strokes, with each stroke represented by a sequence of points. This storage format is widely adopted for sketches in existing crowdsourced datasets [8, 30, 3]. Following [7], we denote a vector sketch as an ordered point sequence S = {p_i = (x_i, y_i, s_i)}_{i=1..n}, where n is the total number of points in all strokes.
For each point p_i, x_i and y_i are its 2D coordinates, and s_i is a binary stroke state. Specifically, s_i = 0 indicates that the current stroke has not ended and that the stroke connects p_i to p_{i+1}; s_i = 1 indicates that p_i is the last point of the current stroke and that p_{i+1} will be the starting point of another stroke. Our network takes only the vector sketch S as input for end-to-end learning.

3.2. Network Architecture

Our network architecture is formed by two sequentially arranged sub-networks, which are linked by a differentiable NLR module. The first sub-network is an RNN, which analyzes the temporal ordering and grouping information in the input. The RNN consumes a vector sketch S and estimates per-point attention as output at each time step. Specifically, we use a bidirectional Long Short-Term Memory (LSTM) unit with two layers as the first sub-network. We set the size of the hidden state to 512 and adopt dropout with probability 0.5. For the hidden state at step i, after the LSTM cell takes in p_i, we pass it through a fully-connected layer followed by a sigmoid function to produce the per-point attention, denoted as a_i. That is, for each point p_i, we obtain a corresponding scalar a_i, signifying the point's importance in the subsequent 2D visual understanding by the CNN. Similar to [8], instead of using absolute coordinates, for each p_i fed into the RNN we compute the offsets from its previous point p_{i-1} as its coordinates.

Next, we pass the point sequence along with the estimated attention, i.e., (p_i, a_i)_{i=1..n}, through our NLR module, as detailed in Sec. 3.3. The output of the module is a raster sketch image I, which can also be viewed as an attention map whose stroke-pixel intensities are the corresponding attention values. A deep CNN then takes the image I as input for hierarchical 2D feature extraction. Sketch-a-Net [44] or ResNet50 [9] can be used as the backbone network, which is then connected to a fully-connected layer to produce estimations over all the possible object categories. We use the cross-entropy loss for optimizing the whole network.

Our network architecture for sketch recognition differs from the one proposed by Xu et al. [42] for sketch retrieval in several aspects. First, their network has two branches for feature extraction, one branch with an RNN and the other with a CNN. During learning, their RNN and CNN individually work on two different sketch spaces with little interaction, except at the last concatenation layer for feature fusion. In contrast, our single-branch design allows more information flow between the RNN and the CNN owing to our NLR module: the RNN can complement the CNN by producing a more informative input, whereas the CNN provides guidance on attention estimation with learned hierarchical representations during back-propagation. In addition, our network only uses vector sketches as input and performs in-network vector-to-raster conversion, while the two-branch late-fusion network [42] requires both vector and raster sketches as input, and thus a preprocessing stage for rasterization is needed.
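A minimal PyTorch sketch of the attention RNN just described (our own naming; padding and packing of variable-length point sequences are omitted for brevity):

```python
import torch
import torch.nn as nn

class StrokeAttentionRNN(nn.Module):
    # Bidirectional 2-layer LSTM (hidden size 512, dropout 0.5) followed by a
    # fully-connected layer and a sigmoid, yielding one attention value per point.
    def __init__(self, hidden_size=512):
        super().__init__()
        self.lstm = nn.LSTM(input_size=3, hidden_size=hidden_size, num_layers=2,
                            batch_first=True, dropout=0.5, bidirectional=True)
        self.fc = nn.Linear(2 * hidden_size, 1)

    def forward(self, points, states):
        # points: (B, N, 2) absolute coordinates; states: (B, N) stroke-end flags
        offsets = points.clone()
        offsets[:, 1:] = points[:, 1:] - points[:, :-1]   # offsets from the previous point
        # (the first point keeps its absolute position in this simplified sketch)
        x = torch.cat([offsets, states.unsqueeze(-1).float()], dim=-1)   # (B, N, 3)
        h, _ = self.lstm(x)                               # (B, N, 2 * hidden_size)
        return torch.sigmoid(self.fc(h)).squeeze(-1)      # (B, N) per-point attention
```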
3.3. Neural Line Rasterization with Attention

To convert a point sequence with attention (p_i, a_i)_{i=1..n} into a pixel image I, the basic operation is to draw each valid line segment p_i p_{i+1} (Sec. 3.1) onto the canvas image. As illustrated in Fig. 2, to determine whether or not a pixel I_k is on the target line segment, we simply compute the distance from its center to the line segment p_i p_{i+1} and check whether it is smaller than a predefined threshold ε (we set ε = 1 in our experiments). If I_k is a stroke pixel, we compute its attention by linear interpolation [12]; otherwise its attention is set to zero. More specifically, let p_k be the projection of I_k's center onto p_i p_{i+1}. The intensity or attention of I_k is then defined as

    I_k = (1 − α_k) · a_i + α_k · a_{i+1},    (1)

where α_k = ‖p_k − p_i‖_2 / ‖p_{i+1} − p_i‖_2, and p_k, p_i and p_{i+1} are in absolute coordinates.

[Figure 2. Rasterization of line segment p_i p_{i+1} and linear interpolation of the attention value for stroke pixel I_k.]

This rasterization process for line segments can be efficiently done in parallel on the GPU with a CUDA kernel. Note that in the implementation we need to record the relevant information, such as the line segment index and α_k at each pixel I_k, for subsequent gradient computation. Through the above process, a vector sketch can be easily converted into a raster image in the forward inference pass. In order to propagate gradients w.r.t. the loss function from the CNN to the RNN in the backward optimization pass, we need to derive gradients for the above rasterization process. Thanks to the simplicity of the linear interpolation used, the gradients can be computed as

    ∂I_k / ∂a_i = 1 − α_k,    ∂I_k / ∂a_{i+1} = α_k.    (2)

Let L be the loss function and δ_k^I be the gradient back-propagated into I_k w.r.t. L through the CNN. By the chain rule, we have

    ∂L / ∂a_i = Σ_k δ_k^I · (1 − α_k),    ∂L / ∂a_{i+1} = Σ_k δ_k^I · α_k,    (3)

where k iterates over all the stroke pixels covered by the line segment p_i p_{i+1}. If p_i is adjacent to another line segment p_{i−1} p_i, we accumulate the gradients.

Our NLR module is simple and easy to implement, but it is crucial for bridging the gap between the vector sketch space and the raster sketch space in neural networks for end-to-end learning. Unlike existing methods [37, 34] that derive attention from feature maps produced by CNNs, with our NLR module we can take advantage of additional cues (i.e., temporal ordering and grouping information) in vector sketches for better attention map estimation, as shown in experiments (Sec. 4.2). These additional cues, however, are not accessible to methods with raster inputs.
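For illustration, below is a dense, single-sketch PyTorch sketch of the NLR logic in Eqs. (1)-(3). The actual module runs as a parallel CUDA kernel; the names, the square canvas, and the "last segment wins" rule for pixels covered by multiple segments are our own simplifications:

```python
import torch

class NeuralLineRaster(torch.autograd.Function):
    # Dense single-sketch approximation of the NLR module (Sec. 3.3).

    @staticmethod
    def forward(ctx, points, states, attention, size, eps):
        # points: (N, 2) absolute coordinates in [0, size); states: (N,) stroke-end
        # flags s_i; attention: (N,) per-point attention a_i; eps: stroke half-width.
        device = points.device
        ys, xs = torch.meshgrid(torch.arange(size, device=device),
                                torch.arange(size, device=device), indexing='ij')
        centers = torch.stack([xs, ys], dim=-1).float() + 0.5       # pixel centers (H, W, 2)
        image = torch.zeros(size, size, device=device)
        seg_idx = torch.full((size, size), -1, dtype=torch.long, device=device)
        alphas = torch.zeros(size, size, device=device)
        for i in range(points.shape[0] - 1):
            if states[i] == 1:            # p_i ends a stroke, so p_i p_{i+1} is not drawn
                continue
            p, q = points[i], points[i + 1]
            d = q - p
            t = ((centers - p) * d).sum(-1) / d.dot(d).clamp(min=1e-8)
            t = t.clamp(0.0, 1.0)         # projection parameter alpha_k along the segment
            proj = p + t.unsqueeze(-1) * d
            dist = (centers - proj).norm(dim=-1)
            mask = dist <= eps            # stroke pixels of this segment
            image[mask] = (1 - t[mask]) * attention[i] + t[mask] * attention[i + 1]  # Eq. (1)
            seg_idx[mask] = i
            alphas[mask] = t[mask]
        ctx.save_for_backward(seg_idx, alphas)
        ctx.num_points = points.shape[0]
        return image.unsqueeze(0)         # (1, H, W) attention map

    @staticmethod
    def backward(ctx, grad_output):
        # Eqs. (2)-(3): route each stroke pixel's gradient to the two endpoints of
        # its segment, weighted by (1 - alpha_k) and alpha_k, accumulating over pixels.
        seg_idx, alphas = ctx.saved_tensors
        grad = grad_output.squeeze(0)
        grad_attention = torch.zeros(ctx.num_points, device=grad.device)
        valid = seg_idx >= 0
        idx = seg_idx[valid]
        grad_attention.index_add_(0, idx, grad[valid] * (1 - alphas[valid]))
        grad_attention.index_add_(0, idx + 1, grad[valid] * alphas[valid])
        return None, None, grad_attention, None, None
```

It could be invoked as NeuralLineRaster.apply(points, states, attention, 224, 1.0), with the points already scaled to the canvas resolution.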
4. Experiments

4.1. Datasets and Settings

We have performed various experiments on two existing large-scale sketch recognition benchmarks, the TU-Berlin benchmark [3] and the QuickDraw benchmark [8], to validate the performance of our Sketch-R2CNN. These two benchmarks differ in several aspects, such as sketching style, acquisition procedure, and sketch quantity per category. Notably, sketches in the TU-Berlin benchmark tend to be more realistic, while the ones in QuickDraw are more iconic and abstract (Fig. 4). The TU-Berlin benchmark [3] contains 250 object categories with 80 sketches per category. Each sketch was created within 30 minutes by a participant from Amazon Mechanical Turk (AMT). The QuickDraw benchmark [8] contains 345 object categories with 75K sketches per category. During acquisition, the participants were given only 20 seconds to sketch an object. Similar to [8], to simplify sketches in the TU-Berlin benchmark, we applied the Ramer-Douglas-Peucker (RDP) algorithm, resulting in a maximum point sequence length of 448 for the RNN. Following [44], we used three-fold cross validation on this benchmark (i.e., two folds for training, one fold for testing). Sketches in the QuickDraw benchmark have already been preprocessed with the RDP simplification algorithm, and the maximum number of points in a sketch is 321. In each QuickDraw category, the 75K sketches have already been divided into training, validation and testing sets with sizes of 70K, 2.5K and 2.5K, respectively.

We implemented our Sketch-R2CNN and NLR module with PyTorch. We adopted Adam [13] for stochastic gradient descent updates with a mini-batch size of 48. We used a learning rate of 0.0001 for training on QuickDraw and 0.00005 for training or fine-tuning on TU-Berlin (see Sec. 4.2 for the pre-training and training procedures). Due to the limited training data in the TU-Berlin benchmark, we followed [44] to perform data augmentation, including horizontal reflection, stroke removal and sketch deformation.

4.2. Results and Discussions

Results on TU-Berlin Benchmark. We have compared our method with a variety of existing methods on the TU-Berlin benchmark. Table 1 includes the results of some methods reported in [44]. These methods can be generally categorized into two groups. The first group follows the conventional pipeline of hand-crafted features + classifier, including the HOG-SVM method [3], structured ensemble matching [20], multi-kernel SVM [19], and the Fisher Vector based method [31]. The second group uses deep learning, including the state-of-the-art network Sketch-a-Net (the earlier version Sketch-a-Net v1 [45] and the later improved version Sketch-a-Net v2 [44]) and the networks that have been evaluated in [44]: LeNet [16], AlexNet-SVM [14] and AlexNet-Sketch [14].

Table 1. Evaluations on the TU-Berlin benchmark. Our method with ResNet50 as the CNN backbone achieves the highest recognition accuracy. Sketch-a-Net v2 (ours) is our PyTorch-based reimplementation.

    Model                             Accuracy
    Humans [3]                        73.1%
    HOG-SVM [3]                       56.0%
    Ensemble [20]                     61.5%
    MKL-SVM [19]                      65.8%
    Fisher-Vectors [31]               68.9%
    LeNet [16]                        55.2%
    AlexNet-SVM [14]                  67.1%
    AlexNet-Sketch [14]               68.6%
    Sketch-a-Net v1 [45]              74.9%
    Sketch-a-Net v2 [44]              77.95%
    Sketch-a-Net v2 (ours) [44]       77.54%
    ResNet50 [9]                      82.08%
    Sketch-R2CNN (Sketch-a-Net v2)    78.49%
    Sketch-R2CNN (ResNet50)           83.25%

We reimplemented Sketch-a-Net v2 with PyTorch since the original model [44], implemented in Caffe, is not compatible with our framework (i.e., the NLR module). We pre-trained Sketch-a-Net v2 on QuickDraw [8] instead of on preprocessed edge maps from photos [44] for ease of preparation and reproduction. Our best reproduced recognition accuracy of Sketch-a-Net v2 on the TU-Berlin benchmark is 77.54%, close to the accuracy of 77.95% reported with the original Caffe-based implementation [44]. In addition to Sketch-a-Net v2, we also evaluated ResNet50 [9], a more advanced CNN architecture that has been widely used for various visual tasks such as image classification [9] and object detection [21]. Specifically, before training on raster sketches of the TU-Berlin benchmark, we sequentially pre-trained the ResNet50 on ImageNet [29] and QuickDraw. The ResNet50 achieves a recognition accuracy of 82.08%, significantly outperforming the state-of-the-art approach Sketch-a-Net v2.
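For concreteness, the optimization settings of Sec. 4.1 and the backbone pre-training just described could be wired up roughly as follows (an illustrative sketch with our own names; data loaders and the RNN/NLR stages are assumed to exist elsewhere):

```python
import torch
import torch.nn as nn
import torchvision

# Illustrative training setup reflecting the settings stated in Sec. 4.1.
backbone = torchvision.models.resnet50(
    weights=torchvision.models.ResNet50_Weights.IMAGENET1K_V1)   # ImageNet pre-training
backbone.fc = nn.Linear(backbone.fc.in_features, 345)            # e.g., 345 QuickDraw classes
model = backbone
# Note: ResNet50 expects 3-channel input; a 1-channel attention map would be
# replicated to 3 channels (or the first conv adapted) in the full pipeline.

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # 1e-4 on QuickDraw, 5e-5 on TU-Berlin
criterion = nn.CrossEntropyLoss()

def train_epoch(loader):
    # loader is assumed to yield mini-batches of 48 (image, label) pairs.
    model.train()
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```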
Since both Sketch-a-Net v2 and ResNet50 are CNN variants, they can be incorporated into our network architecture (Fig. 1) as the CNN backbone. By inserting one of these CNN alternatives into the proposed architecture, we can study how helpful the attention learned by the RNN is for vector sketch recognition. The comparison results are summarized in Table 1. Our method incorporated with Sketch-a-Net v2, named Sketch-R2CNN (Sketch-a-Net v2) in Table 1, achieves a recognition accuracy of 78.49%, improving Sketch-a-Net v2 (ours) by about 1%. Another variant of our method with ResNet50, named Sketch-R2CNN (ResNet50) in Table 1, achieves an accuracy of 83.25%, improving the ResNet50-only model by about 1.2% and surpassing all the existing approaches as well as human performance.

[Figure 3. Visualization of attention maps, in grayscale and color coded, produced by our Sketch-R2CNN (ResNet50) and Attentive-ResNet50 (examples include spider, helmet, cabinet and present). Recognition failures are in red and successes are in green. Attention maps of Attentive-ResNet50 are estimated from feature maps of the last layer of the C2 residual block, which are of size 56 × 56, while attention maps by our method are of size 224 × 224. (Best viewed in the electronic version.)]

Alternatives Study on TU-Berlin Benchmark. To validate our proposed architecture, we have studied several network design alternatives on the TU-Berlin benchmark (Table 2). First, as mentioned in Sec. 2, attention modules have been used in existing CNN architectures for image classification [37] and sketch retrieval [34]. To compare against our RNN-based attention module, we modified ResNet50 and inserted the spatial attention module proposed by Song et al. [34] after the C2 residual block [9, 21]. This modified version of ResNet50 still takes binary sketch images as input and tries to compute attention maps from the feature maps of previous convolutional layers. This model, named Attentive-ResNet50 in Table 2, achieves a recognition accuracy of 82.42%, slightly higher than the 82.08% of the ResNet50-only model but lower than the 83.25% attained by our method, showing the comparatively higher effectiveness of the additional cues in vector sketches used by our method for attention estimation. Attention maps produced by our RNN-based attention module and by Attentive-ResNet50 are visualized in Fig. 3. Note that our method only predicts attention for stroke pixels and sets non-stroke pixels to an attention value of zero, while Attentive-ResNet50 computes attention for every pixel of the attention map.

To study the influence of the temporal ordering information provided by humans on the RNN's attention estimation, we trained Sketch-R2CNN (ResNet50) with randomized stroke orders. That is, instead of keeping the human drawing order in the vector sketch, the stroke sequence is randomly shuffled. This scheme, named Random-Stroke-Order, achieves a recognition accuracy of 82.78%, slightly lower than Sketch-R2CNN (ResNet50) on the TU-Berlin benchmark but still superior to the ResNet50-only model. This indicates that the temporal information (i.e., stroke order) provided by humans can help the RNN learn more descriptive sequential features, confirming a similar conclusion drawn from the sketch retrieval experiments in [42].

Table 2. Alternative design choice studies on the TU-Berlin benchmark.

    Model                              Accuracy
    Attentive-ResNet50 [34]            82.42%
    Random-Stroke-Order                82.78%
    Attention-using-Sketching-Order    81.74%
    Two-Branch-Late-Fusion [42]        81.43%
    Two-Branch-Early-Fusion            81.84%
    Sketch-R2CNN (ResNet50)            83.25%
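The Random-Stroke-Order ablation only changes how the input sequence is assembled; a minimal sketch of such stroke shuffling (a hypothetical helper operating on the stroke list before it is flattened into the point sequence of Sec. 3.1):

```python
import random

def shuffle_strokes(strokes, seed=None):
    # strokes: list of strokes, each an (m_i, 3) array-like of (x, y, s) points.
    # Returns the same strokes in a random order; points inside each stroke keep
    # their original drawing order, so only the inter-stroke ordering is destroyed.
    rng = random.Random(seed)
    shuffled = list(strokes)
    rng.shuffle(shuffled)
    return shuffled
```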
In addition to our RNN-based encoding method for vector sketches, we also explored a straightforward approach to give CNNs access to the sketching order information for feature extraction. Specifically, in a preprocessing step, for a sketch in the point-sequence representation, we encode its ordering information into an image through rasterization by assigning an intensity value of one to the first point and zero to the last point, and linearly interpolating the intensities of the points in between. Fig. 5 shows some examples of the resulting images. This encoding scheme is based on the hypothesis that users tend to draw more "important" strokes first, so the resulting raster sketches can be considered temporal-encoding attention maps. We trained a ResNet50 with such handcrafted attention maps as input, but found that this encoding scheme, with a recognition accuracy of 81.74% (Attention-using-Sketching-Order in Table 2), is not effective and is even slightly worse than the baseline with binary image inputs (ResNet50 in Table 1). This indicates that, for CNN-based recognition networks, stroke importance may not always be properly aligned with stroke order under such a straightforward encoding scheme, due to the different drawing styles of different users, and this encoding scheme may even pose challenges to CNNs for learning effective patterns. Thus, instead of "hard-coding" temporal information into images, a more adaptive and robust encoder (e.g., an RNN) is needed to accommodate sequential variations in vector sketches.

Next, we discuss arrangements of the RNN and CNN in the network architecture design. As mentioned before, Xu et al. [42] use a two-branch late-fusion architecture, which fuses the features extracted from a CNN branch and a parallel RNN branch, for sketch retrieval. In contrast, our design combines an RNN encoder and a CNN feature extractor sequentially in a single branch for sketch classification. We therefore set up another experiment to investigate which of these two types of architecture is the better scheme for incorporating the additional temporal ordering and grouping information in vector sketches. Following [42], we built a similar model, named Two-Branch-Late-Fusion in Table 2, which uses the same RNN cell and CNN backbone as Sketch-R2CNN (ResNet50) for fairness and consistency. The training procedure is the same as for Sketch-R2CNN (ResNet50), with the softmax cross-entropy loss [42]. Two-Branch-Late-Fusion achieves a recognition accuracy of 81.43% on the TU-Berlin benchmark, which is about 2% lower than Sketch-R2CNN (ResNet50). This result reveals that our proposed single-branch architecture can make the CNN, which works as an abstract visual concept extractor, and the RNN, which models human sketching orders, complement each other better than the two-branch architecture. Surprisingly, another observation is that the recognition accuracy of Two-Branch-Late-Fusion, adapted to the sketch classification task from the original sketch retrieval task, is even slightly inferior to that of the single CNN branch (ResNet50 in Table 1). This is also observed in the results on the QuickDraw benchmark, as presented in the following section. Due to the lack of implementation details in [42], we postulate that the differences in training strategies ([42]: multi-stage training of CNN and RNN; ours: joint training of CNN and RNN), CNN backbones ([42]: AlexNet; ours: ResNet50) and datasets ([42]: pruned QuickDraw dataset; ours: original TU-Berlin and QuickDraw datasets) may affect the learning of the late-fusion layer and cause the performance degradation.
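For reference, the two-branch late-fusion baseline discussed above can be sketched roughly as below. This reflects the structure of our reimplementation rather than the exact design of [42], and the module names and feature dimensions are assumptions:

```python
import torch
import torch.nn as nn

class TwoBranchLateFusion(nn.Module):
    # Rough sketch of the late-fusion baseline: an RNN branch on the vector sketch
    # and a CNN branch on the raster sketch, fused by feature concatenation.
    def __init__(self, rnn_branch, rnn_feat_dim, cnn_branch, cnn_feat_dim, num_classes):
        super().__init__()
        self.rnn_branch = rnn_branch    # sequence encoder returning a (B, rnn_feat_dim) vector
        self.cnn_branch = cnn_branch    # image encoder returning a (B, cnn_feat_dim) vector
        self.classifier = nn.Linear(rnn_feat_dim + cnn_feat_dim, num_classes)

    def forward(self, points, states, raster_images):
        f_rnn = self.rnn_branch(points, states)     # vector-sketch features
        f_cnn = self.cnn_branch(raster_images)      # raster-sketch features
        return self.classifier(torch.cat([f_rnn, f_cnn], dim=1))
```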
To complement the above experiments on attention estimation with the RNN and on arrangements of the RNN and CNN, we extended the design-choice exploration to an alternative way of injecting the attention learned by the RNN into the CNN. In our proposed architecture, the CNN directly takes the attention map produced by the RNN as input. An alternative architecture is to weigh the feature maps of a certain intermediate layer of the CNN (which still takes binary sketch images as input) with the attention map produced by the RNN from the vector sketch. In our implementation, we inject the attention map produced by the RNN, which is of size 56 × 56 with stroke width threshold ε = 0.5, into the output of the C2 residual block [9, 21] of ResNet50. Following the same training procedures as those in Table 2, this alternative architecture, named Two-Branch-Early-Fusion, achieves a recognition accuracy of 81.84% on the TU-Berlin benchmark, performing slightly better than Two-Branch-Late-Fusion. However, the recognition accuracy of Two-Branch-Early-Fusion is still slightly inferior to that of the ResNet50-only model. This may be because non-stroke pixels in the attention map from the RNN have an attention value of zero, which, during the injection, makes the convolutional features at those locations vanish, reducing the feature information learned by previous convolutional layers from the input.

Results on QuickDraw Benchmark. We further compared the proposed Sketch-R2CNN with Sketch-a-Net v2 [44], the ResNet50-only model, and Two-Branch-Late-Fusion [42] on the QuickDraw benchmark. Note that the ResNet50 is pre-trained on ImageNet [29] and serves as the CNN backbone in Sketch-R2CNN and Two-Branch-Late-Fusion. Quantitative results are summarized in Table 3, and the performance of each competing method on the QuickDraw benchmark agrees well with its performance on the TU-Berlin benchmark. Compared to the competitors, Sketch-R2CNN (ResNet50) achieves the highest recognition accuracy on the QuickDraw benchmark, echoing its performance on the TU-Berlin benchmark. The same holds for the ResNet50-only model, which still achieves better recognition performance than both Sketch-a-Net v2 and Two-Branch-Late-Fusion. Sketch-R2CNN (ResNet50) and Sketch-R2CNN (Sketch-a-Net v2) improve ResNet50 and Sketch-a-Net v2, respectively, by about 2%. Although the sketch quality of QuickDraw may not be as good as that of TU-Berlin, thanks to the voluminous data of QuickDraw (24.15M sketches for training, 862.5K sketches for validation or testing), we still see consistent performance improvements of Sketch-R2CNN over the CNN-only models, showing the generality of our proposed architecture.

Table 3. Evaluations on the QuickDraw benchmark.

    Model                             Accuracy
    Sketch-a-Net v2 [44]              74.84%
    ResNet50 [9]                      82.48%
    Two-Branch-Late-Fusion [42]       82.11%
    Sketch-R2CNN (Sketch-a-Net v2)    77.29%
    Sketch-R2CNN (ResNet50)           84.41%
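A minimal sketch of the attention injection used in the Two-Branch-Early-Fusion variant described above: the 56 × 56 RNN attention map multiplicatively weighs the C2 feature maps of a ResNet50 that sees the binary sketch image. The stage splitting and the multiplicative weighting are our reading of that design, and the names are ours:

```python
import torch
import torch.nn as nn

class EarlyFusionResNet(nn.Module):
    # Sketch of Two-Branch-Early-Fusion: ResNet50 stages up to C2 run on the binary
    # sketch image, the 56x56 RNN attention map rescales the C2 features, and the
    # remaining stages plus the classifier run as usual.
    def __init__(self, resnet50, num_classes):
        super().__init__()
        self.stem_and_c2 = nn.Sequential(resnet50.conv1, resnet50.bn1, resnet50.relu,
                                         resnet50.maxpool, resnet50.layer1)  # -> (B, 256, 56, 56)
        self.rest = nn.Sequential(resnet50.layer2, resnet50.layer3, resnet50.layer4,
                                  resnet50.avgpool)
        self.fc = nn.Linear(resnet50.fc.in_features, num_classes)

    def forward(self, binary_images, attention_map):
        # binary_images: (B, 3, 224, 224); attention_map: (B, 1, 56, 56) from the RNN + NLR
        feats = self.stem_and_c2(binary_images)
        feats = feats * attention_map            # broadcast over the 256 channels
        feats = torch.flatten(self.rest(feats), 1)
        return self.fc(feats)
```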
Qualitative Results. Fig. 4 shows some qualitative recognition comparisons between the CNN-only method (ResNet50) and our Sketch-R2CNN (ResNet50). Through visualization, it can be observed that the attention maps produced by the RNN in Sketch-R2CNN help the CNN focus on the more effective stroke parts of the inputs and ignore the interference of irrelevant strokes (e.g., the circle around the crab in Fig. 4) to make better classifications. In contrast, the CNN-only model cannot access the additional ordering and grouping cues in vector sketches and thus tends to struggle with sketches that have similar shapes but different category labels. Fig. 5 visualizes the attention maps produced by our method and the ones encoding sketching order (used in Attention-using-Sketching-Order in Table 2). It can be observed that our attention maps estimated by the RNN share a certain degree of similarity with the ones using sketching order, but the attention magnitudes produced by the RNN are more adaptively biased.

[Figure 4. Recognition comparisons between the CNN-only method (ResNet50) and our Sketch-R2CNN (ResNet50 as the CNN backbone) on TU-Berlin and QuickDraw examples (categories include lobster, scorpion, panda, teddy bear, palm tree, windmill, spider and crab). Failures are in red and successes are in green. Attention maps produced by our RNN are shown in the second row and are color coded. Note that our RNN only predicts attention for stroke pixels; non-stroke pixels are set to an attention value of zero and are not color-coded.]

[Figure 5. The first row shows color-coded attention maps produced by our Sketch-R2CNN (ResNet50) for specific object categories (e.g., church, castle, kangaroo, squirrel). Correspondingly, in the second row we directly encode the sketching order as attention maps, with higher attention values for strokes drawn earlier. Note that non-stroke pixels are set to an attention value of zero and are not color-coded.]

Limitation. As shown in Fig. 6, in some cases the RNN in Sketch-R2CNN may fail to produce correct attention guidance for the subsequent CNN, leading to recognition failures (e.g., the pumpkin), possibly due to an inability to extract effective sequential features from inputs whose temporal ordering and grouping cues resemble those of training sketches in other categories. Some sketches with seemingly ambiguous categories (e.g., the toaster) may also pose challenges to our method; it is expected that humans would make similar mistakes on such cases. One possible solution to address the ambiguity is to put the sketched objects in context (i.e., scenes) and to integrate our method with context-based recognition methods [47, 48].

[Figure 6. Recognition failures of our Sketch-R2CNN (ResNet50) on TU-Berlin and QuickDraw examples (categories include armchair, pig, toaster, pumpkin, banana, backpack, cow and suitcase).]

5. Conclusion

In this work, we have proposed a novel single-branch attentive network architecture named Sketch-R2CNN for vector sketch recognition. Our RNN-Rasterization-CNN design consistently improves the recognition accuracy of CNN-only models by 1-2% on two existing large-scale sketch recognition benchmarks. The key enabler for joining the RNN and CNN together is a novel differentiable neural line rasterization module that performs in-network vector-to-raster sketch conversion. Applying Sketch-R2CNN to other tasks, such as sketch retrieval or sketch synthesis, that need descriptive line-drawing features would be interesting to explore in the future.

References

[1] C. Alvarado and R. Davis. SketchREAD: A multi-domain sketch recognition engine. In Proc. ACM UIST. ACM, 2004.
[2] R. Arandjelović and T. M. Sezgin. Sketch recognition by fusion of temporal and image-based features. Pattern Recogn., 44(6):1225–1234, 2011.
[3] M. Eitz, J. Hays, and M. Alexa. How do humans sketch objects? ACM TOG, 31(4):44:1–44:10, July 2012.
[4] M. Eitz, R. Richter, T. Boubekeur, K. Hildebrand, and M. Alexa. Sketch-based shape retrieval. ACM TOG, 31(4):31:1–31:10, July 2012.
[5] B. Graham, M. Engelcke, and L. van der Maaten. 3d semantic segmentation with submanifold sparse convolutional networks. In Proc. IEEE CVPR, 2018.
[6] B. Graham and L. van der Maaten. Submanifold sparse convolutional networks. CoRR, abs/1706.01307, 2017.
[7] A. Graves. Generating sequences with recurrent neural networks. CoRR, abs/1308.0850, 2013.
[8] D. Ha and D. Eck. A neural representation of sketch drawings. In Proc. ICLR, 2018.
[9] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proc. IEEE CVPR, June 2016.
[10] J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. In Proc. IEEE CVPR, 2018.
[11] Z. Huang, H. Fu, and R. W. H. Lau. Data-driven segmentation and labeling of freehand sketches. ACM TOG, 33(6):175:1–175:10, Nov. 2014.
[12] H. Kato, Y. Ushiku, and T. Harada. Neural 3d mesh renderer. In Proc. IEEE CVPR, 2018.
[13] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
[14] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.
[15] J. J. LaViola, Jr. and R. C. Zeleznik. MathPad2: A system for the creation and exploration of mathematical sketches. ACM TOG, 23(3):432–440, Aug. 2004.
[16] Y. A. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller. Neural Networks: Tricks of the Trade: Second Edition, pages 9–48, 2012.
[17] K. Li, K. Pang, J. Song, Y.-Z. Song, T. Xiang, T. M. Hospedales, and H. Zhang. Universal sketch perceptual grouping. In Proc. ECCV, 2018.
[18] L. Li, H. Fu, and C.-L. Tai. Fast sketch segmentation and labeling with deep learning. CoRR, abs/1807.11847, 2018.
[19] Y. Li, T. M. Hospedales, Y.-Z. Song, and S. Gong. Free-hand sketch recognition by multi-kernel feature learning. CVIU, 137:1–11, 2015.
[20] Y. Li, Y.-Z. Song, and S. Gong. Sketch recognition by ensemble matching of structured features. In Proc. BMVC, 2013.
[21] T.-Y. Lin, P. Dollar, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In Proc. IEEE CVPR, July 2017.
[22] J. Lu, C. Xiong, D. Parikh, and R. Socher. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In Proc. IEEE CVPR, 2017.
[23] T. Lu, C.-L. Tai, F. Su, and S. Cai. A new recognition model for electronic architectural drawings. CAD, 37(10):1053–1069, 2005.
[24] V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu. Recurrent models of visual attention. In NIPS, pages 2204–2212, 2014.
[25] H. Nam, J.-W. Ha, and J. Kim. Dual attention networks for multimodal reasoning and matching. In Proc. IEEE CVPR, 2017.
[26] L. Olsen, F. F. Samavati, M. C. Sousa, and J. A. Jorge. Sketch-based modeling: A survey. Comput. & Graph., 33(1):85–103, 2009.
[27] T. Y. Ouyang and R. Davis. ChemInk: A natural real-time recognition system for chemical drawings. In Proc. ACM IUI. ACM, 2011.
[28] U. Riaz Muhammad, Y. Yang, Y.-Z. Song, T. Xiang, and T. M. Hospedales. Learning deep sketch abstraction. In Proc. IEEE CVPR, June 2018.
[29] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. IJCV, 115(3):211–252, Dec 2015.
[30] P. Sangkloy, N. Burnell, C. Ham, and J. Hays. The Sketchy Database: Learning to retrieve badly drawn bunnies. ACM TOG, 35(4):119:1–119:12, July 2016.
[31] R. G. Schneider and T. Tuytelaars. Sketch classification and classification-driven analysis using Fisher Vectors. ACM TOG, 33(6):174:1–174:9, Nov. 2014.
[32] T. M. Sezgin and R. Davis. Sketch recognition in interspersed drawings using time-based graphical models. Comput. & Graph., 32(5):500–510, 2008.
[33] J. Song, K. Pang, Y.-Z. Song, T. Xiang, and T. M. Hospedales. Learning to sketch with shortcut cycle consistency. In Proc. IEEE CVPR, June 2018.
[34] J. Song, Q. Yu, Y.-Z. Song, T. Xiang, and T. M. Hospedales. Deep spatial-semantic attention for fine-grained sketch-based image retrieval. In Proc. IEEE ICCV, 2017.
[35] Z. Sun, C. Wang, L. Zhang, and L. Zhang. Free hand-drawn sketch segmentation. In Proc. ECCV, pages 626–639. Springer, 2012.
[36] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proc. IEEE CVPR, 2015.
[37] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang. Residual attention network for image classification. In Proc. IEEE CVPR, 2017.
[38] F. Wang, L. Kang, and Y. Li. Sketch-based 3d shape retrieval using convolutional neural networks. In Proc. IEEE CVPR, 2015.
[39] X. Wang, X. Chen, and Z. Zha. SketchPointNet: A compact network for robust sketch recognition. In Proc. ICIP, pages 2994–2998, 2018.
[40] T. Xiao, Y. Xu, K. Yang, J. Zhang, Y. Peng, and Z. Zhang. The application of two-level attention models in deep convolutional neural network for fine-grained image classification. In Proc. IEEE CVPR, 2015.
[41] K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. CoRR, abs/1502.03044, 2015.
[42] P. Xu, Y. Huang, T. Yuan, K. Pang, Y.-Z. Song, T. Xiang, T. M. Hospedales, Z. Ma, and J. Guo. SketchMate: Deep hashing for million-scale human sketch retrieval. In Proc. IEEE CVPR, June 2018.
[43] E. Yanık and T. M. Sezgin. Active learning for sketch recognition. Comput. & Graph., 52:93–105, 2015.
[44] Q. Yu, Y. Yang, F. Liu, Y.-Z. Song, T. Xiang, and T. M. Hospedales. Sketch-a-Net: A deep neural network that beats humans. IJCV, 122(3):411–425, May 2017.
[45] Q. Yu, Y. Yang, Y.-Z. Song, T. Xiang, and T. M. Hospedales. Sketch-a-net that beats humans. In Proc. BMVC, pages 7.1–7.12, 2015.
[46] H. Zhang, S. Liu, C. Zhang, W. Ren, R. Wang, and X. Cao. SketchNet: Sketch classification with web images. In Proc. IEEE CVPR, 2016.
[47] J. Zhang, Y. Chen, L. Li, H. Fu, and C.-L. Tai. Context-based sketch classification. In Proc. Expressive, pages 3:1–3:10. ACM, 2018.
[48] C. Zou, Q. Yu, R. Du, H. Mo, Y.-Z. Song, T. Xiang, C. Gao, B. Chen, and H. Zhang. SketchyScene: Richly-annotated scene sketches. In Proc. ECCV, September 2018.