SpatialFlow: Bridging All Tasks for Panoptic Segmentation

2019, arXiv (Cornell University)

Qiang Chen, Anda Cheng, Xiangyu He, Peisong Wang, and Jian Cheng

arXiv:1910.08787v3 [cs.CV] 27 Aug 2020

The authors are with the National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences (CASIA), and the School of Artificial Intelligence, University of Chinese Academy of Sciences (UCAS), Beijing, China (e-mail: qiang.chen@nlpr.ia.ac.cn; chenganda2017@ia.ac.cn; xiangyu.he@nlpr.ia.ac.cn; peisong.wang@nlpr.ia.ac.cn; jcheng@nlpr.ia.ac.cn). Corresponding author: Jian Cheng (jcheng@nlpr.ia.ac.cn).

Abstract—Object location is fundamental to panoptic segmentation, as it relates to all things and stuff in the image scene. Knowing the locations of objects in the image provides clues for segmentation and helps the network better understand the scene. How to integrate object location into both thing and stuff segmentation is a crucial problem. In this paper, we propose spatial information flows to achieve this objective. The flows can bridge all sub-tasks in panoptic segmentation by delivering the object's spatial context from the box regression task to the others. More importantly, we design four parallel sub-networks to obtain a preferable adaptation of object spatial information in the sub-tasks. Upon the sub-networks and the flows, we present a location-aware and unified framework for panoptic segmentation, denoted as SpatialFlow. We perform a detailed ablation study on each component and conduct extensive experiments to prove the effectiveness of SpatialFlow. Furthermore, we achieve state-of-the-art results of 47.9 PQ and 62.5 PQ on the MS-COCO and Cityscapes panoptic benchmarks, respectively. Code will be available at https://github.com/chensnathan/SpatialFlow.

Index Terms—Panoptic segmentation, Scene understanding, Location-aware

I. INTRODUCTION

Real-world vision systems, such as autonomous driving or augmented reality, require a rich and complete understanding of the image scene. However, neither detecting and segmenting the objects in the image nor segmenting the image semantically can provide a global view of the image scene. Considering the tasks as a whole is a step towards real-world vision systems. In the pre-deep learning era, classical vision tasks such as scene understanding [1], [2] considered object detection and semantic segmentation jointly. With the development of deep learning, instance and semantic segmentation have been widely studied and improved, while studies of the joint task have been left behind. Recently, [3] proposed the panoptic segmentation task to unify the two segmentation tasks. In this task, countable objects such as persons, animals, and tools are considered as things, while amorphous regions of similar texture or material such as grass, sky, and road are referred to as stuff. It draws the attention of the vision community and pushes deep vision systems a step forward towards applications in real-world scenarios.

Fig. 1. An illustration of the panoptic segmentation task (box locations are fed to both thing and stuff segmentation, and the two results are merged into the panoptic output).
We also provide the bounding box for each object in the image and add a process to integrate box locations into both thing and stuff segmentation.

Panoptic segmentation aims to assign every pixel in an image a semantic label and an instance id, which is a challenging task as it requires a global view of segmentation. In [3], the authors tried to solve the task by adopting two independent models, Mask R-CNN [4] and PSPNet [5], for thing and stuff segmentation, respectively (elsewhere referred to as instance and semantic segmentation; in this paper, we use thing and stuff to emphasize the sub-tasks of panoptic segmentation). Then, they applied a heuristic post-processing method to merge the segmentation outputs of the two tasks, as illustrated on the right side of Figure 1. These two independent models ignore the underlying relationship between things and stuff and bring an extra computational burden into the framework. Recently, several works [6], [7], [8], [9], [10], [11] have followed [3] and tried to build a unified pipeline for panoptic segmentation by sharing the backbone between the two segmentation tasks. However, most of them focus on how to combine the outputs of the segmentation tasks properly, failing to highlight the significance of object location when training the networks.

As demonstrated in the literature, the spatial information of objects can boost the performance of algorithms in object detection [12], [13], instance segmentation [14], [15], and semantic segmentation [16], [17]. Our key insight is that, as a combination of these tasks, panoptic segmentation can benefit from delivering the spatial information of objects among its sub-tasks. We illustrate the process of performing panoptic segmentation with box locations in Figure 1. A crucial question then arises: how to integrate spatial information into the segmentation tasks seamlessly? To fulfill this goal, we propose to integrate object location by explicitly delivering the spatial context from the box regression task to the others. Based on this, we introduce a new unified framework for panoptic segmentation by fully leveraging the reciprocal relationship among detection, thing segmentation, and stuff segmentation. Two primary principles are considered as follows.

First, keep the spatial context at pixel level before segmenting things and stuff. Although thing and stuff segmentation can complement one another, the format of the dominant features in these two tasks may be inconsistent: instance-level features drive thing segmentation, while pixel-level features guide stuff segmentation. Instance-level spatial context may therefore not suit stuff segmentation, given the format of its dominant feature. Besides, instances can overlap, which makes it hard to map them back to pixel level. Based on this principle, we resort to the one-stage detector RetinaNet [18] instead of the two-stage detector Faster R-CNN [19]. It prevents the spatial context of objects from being cast into an instance-level format before the segmentation tasks are performed. We then extend RetinaNet with task-specific heads - a thing head [20] and a stuff head [8] - to perform thing and stuff segmentation. In the task-specific heads, the spatial context can be instance-level for things and pixel-level for stuff.

Second, integrate the spatial context into segmentation by fully leveraging feature interweaving among tasks. The spatial context plays a significant role in improving the quality of segmentation.
It is plentiful in the box regression sub-task but insufficient in the others. To make the other sub-tasks location-aware, we propose information flows that deliver the spatial context from the box regression task to the others and integrate it by feature interweaving. However, the absence of multi-stage features in thing and stuff segmentation makes it inconvenient to absorb the spatial context. To solve this dilemma, we design four parallel sub-networks for the four sub-tasks in the framework, enabling the model to leverage feature interweaving among tasks. The overall design fully leverages the spatial context, bridges all the tasks in panoptic segmentation by integrating features among them, and builds a global view of the image scene, leading to better refinement of features, more robust representations for image segmentation, and higher prediction results.

Our contributions are three-fold:
• We present a new unified framework for panoptic segmentation. Our framework is built on the one-stage detector RetinaNet, which facilitates feature interweaving at pixel level.
• Based on the proposed framework, we design four parallel sub-networks to refine the sub-task features. Among the sub-networks, we propose spatial information flows to bridge all sub-tasks by making them location-aware. Our framework is denoted as SpatialFlow.
• We perform a detailed ablation study on the various components of SpatialFlow. Extensive experimental results show that SpatialFlow achieves state-of-the-art results of 47.9 PQ and 62.5 PQ on the COCO [21] and Cityscapes [22] panoptic benchmarks.

The rest of the paper is organized as follows. In Section II, we briefly revisit recent progress related to this paper. In Section III, we first present the proposed unified framework for panoptic segmentation based on RetinaNet, and then illustrate all the details of the designed parallel sub-networks and the spatial information flows. In Sections IV, V, and VI, we present all details and results of the experiments, analyze the effect of each component, and make further discussions. Finally, we conclude the paper in Section VII.

II. RELATED WORKS

After the pioneering application of AlexNet [23] on the ImageNet dataset [24], deep learning methods have come to dominate computer vision. These methods have dramatically improved the state of the art in many vision tasks, including image recognition [23], [25], [26], [27], [28], [29], [30], image retrieval [31], [32], metric learning [33], [34], [35], object detection [36], [37], [19], image segmentation [38], [39], [4], human pose estimation [40], [41], and many other tasks. Our work builds on prior work in object detection and image segmentation. We apply multi-task learning [42], [43] in our model, which lets the thing and stuff segmentation tasks benefit each other and builds a global view of the image scene. Next, we review the works that are closest to ours.

A. Object Detection

Our community has witnessed remarkable progress in object detection. Works such as [37], [19], [44] tackled the detection problem with a two-stage approach: they first generate a number of object proposals as candidates, followed by a classification head and a regression head on each RoI. Numerous recent breakthroughs have been made, such as adjusting network structures [45], [46] and searching for better training strategies [47], [48], [49]. Another type of detector follows the single-stage pipeline, such as [50], [51], [18].
They directly predict object categories and regress bounding box locations based on pre-defined anchors. Recently, researchers have focused on improving the localization quality of one-stage detectors and proposed anchor-free algorithms [13], [52], [53], [54], [55]. In [18], the authors designed two parallel sub-networks for classification and regression, respectively. In this paper, SpatialFlow extends RetinaNet by adopting the design of parallel sub-networks.

B. Instance Segmentation

Instance segmentation is a task that requires a pixel-level mask for each instance. Existing methods can be divided into two main categories, segmentation-based and region-based methods. Segmentation-based approaches, such as [56], [57], first generate a pixel-level segmentation map over the image and then perform grouping to identify the instance mask of each object. Region-based methods, such as [4], [58], [14], are closely related to object detection algorithms: they predict the instance masks within the bounding boxes generated by detectors. Region-based methods can achieve higher performance than their segmentation-based counterparts, which motivates us to resort to region-based methods. In SpatialFlow, we adopt a thing branch upon RetinaNet for thing segmentation.

C. Semantic Segmentation

Fully convolutional networks are essential to semantic segmentation [59], and their variants achieve state-of-the-art results on various segmentation benchmarks. It has been proven that contextual information plays a vital role in segmentation [60]. A number of works followed this idea: dilated convolution [38] was invented to keep feature resolution and maintain contextual details; the Deeplab series [61], [62] proposed Atrous Spatial Pyramid Pooling (ASPP) to capture global and multi-scale contextual information; PSPNet [5] used spatial pyramid pooling to collect contextual priors; and encoder-decoder networks [39], [63] are designed to capture contextual information in the encoder and gradually recover the details in the decoder. Our SpatialFlow, built upon FPN [45], uses an encoder-decoder architecture for stuff segmentation to capture the contextual information. We take the spatial context of object detection into consideration and build a connection between thing and stuff segmentation.

D. Panoptic Segmentation

The panoptic segmentation task was proposed in [3], where the authors provided a baseline method with two separate networks and then used a heuristic post-processing method to merge the two outputs. Later, Li et al. [64] followed this task and introduced a weakly- and semi-supervised panoptic segmentation method.
Recently, several unified frameworks have been proposed. De Geus et al. [6] used a shared backbone for both thing and stuff segmentation, while Li et al. [7] took a step further by considering thing-and-stuff consistency and proposed a unified network named TASCNet. Kirillov et al. [8] introduced Panoptic FPN by endowing Mask R-CNN [4] with a stuff branch, which ignores the connection between things and stuff. Li et al. [9] aimed to capture this connection by utilizing an attention module. To solve the conflicts in the result merging process, Liu et al. [11] designed a spatial ranking module, and Xiong et al. [10] proposed a parameter-free panoptic head to resolve the conflicts. Thinking differently, Yang et al. [69] presented a single-shot approach for panoptic segmentation. However, most of these methods fail to highlight the significance of spatial features. Our SpatialFlow proposes information flows to make all tasks location-aware, which helps build a panoptic view for image segmentation.

III. SPATIALFLOW

Object location is one of the key factors when building a global view for panoptic segmentation. Recent works [6], [7], [8], [10], [11] for panoptic segmentation focus on how to combine the outputs of the segmentation tasks properly but do not highlight the significance of object location in the training phase. In this work, we propose a new unified framework, SpatialFlow, which makes all sub-tasks location-aware. Our SpatialFlow is conceptually simple: RetinaNet [18] with two added sub-networks and two extra heads for thing and stuff segmentation. More importantly, we add multi-stage spatial information flows among the sub-networks.

We begin by reviewing the RetinaNet detector. RetinaNet is one of the most successful fully convolutional one-stage detectors. It consists of three parts: a backbone with FPN [45], two parallel sub-networks, and two task-specific heads for box classification and regression. In SpatialFlow, we adopt the main network structure of RetinaNet. We illustrate the sketch of our framework in Figure 2.

Fig. 2. An illustration of the overall architecture. SpatialFlow consists of three parts: (a) the backbone with FPN; (b) four parallel sub-networks, in which we propose the spatial information flows and feature fusion among tasks (the spatial flows are illustrated as orange dashed arrows, and the feature fusion is not shown in this figure for an elegant presentation); (c) four heads for specific tasks, where the classification head and the regression head together predict the detection boxes for the thing head. The final result of SpatialFlow is a combination of the detected boxes and the outputs of the thing head and the stuff head.

A. Naive Implementation

As we discussed in Section I, RetinaNet shows its merits in pixel-level feature integration, which is beneficial for segmentation tasks. To develop a unified framework for panoptic segmentation based on RetinaNet, the most naive way is to add one thing head and one stuff head upon the FPN features to enable thing and stuff segmentation. In this section, we introduce this naive implementation of the unified framework, which, like previous methods [8], [11], [10], ignores task feature refinement and the integration of box locations, but is built on RetinaNet. Next, we introduce the detailed design of each element in the naive implementation.

Fig. 3. The designs for each part in SpatialFlow. In the dashed rectangle (a), we show the output features of FPN, namely {P3, P4, P5, P6, P7}. In the dashed rectangle (b), we present the architecture of the stuff head. Most importantly, all the information flows in the sub-networks are illustrated in the dashed box (c).
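Before walking through each component in detail, the following is a minimal PyTorch-style skeleton of how such a naive unified framework could be wired together. Everything here (module names, call signatures, injecting the heads as constructor arguments) is an illustrative assumption for exposition, not the authors' released code.

```python
# Illustrative skeleton of the naive unified framework: a RetinaNet-style
# backbone with FPN, shared cls/reg sub-networks, and added thing/stuff heads.
import torch.nn as nn

def conv_relu_stack(num_convs, channels=256):
    # A stack of 3x3 conv + ReLU blocks, shared across FPN levels as in RetinaNet.
    layers = []
    for _ in range(num_convs):
        layers += [nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

class NaivePanopticNet(nn.Module):
    def __init__(self, backbone_fpn, cls_head, reg_head, thing_head, stuff_head):
        super().__init__()
        self.backbone_fpn = backbone_fpn      # assumed to return [P3, P4, P5, P6, P7], 256 channels each
        self.cls_subnet = conv_relu_stack(4)
        self.reg_subnet = conv_relu_stack(4)
        self.cls_head, self.reg_head = cls_head, reg_head
        self.thing_head, self.stuff_head = thing_head, stuff_head

    def forward(self, images):
        feats = self.backbone_fpn(images)                            # P3..P7
        cls_out = [self.cls_head(self.cls_subnet(p)) for p in feats]
        reg_out = [self.reg_head(self.reg_subnet(p)) for p in feats]
        # Only P3-P5 are sent to thing and stuff segmentation.
        stuff_out = self.stuff_head(feats[:3])
        thing_out = self.thing_head(feats[:3], cls_out, reg_out)     # RoI decoding assumed inside the head
        return cls_out, reg_out, thing_out, stuff_out
```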
1) Backbone: We adopt the same backbone structure as RetinaNet. The backbone contains FPN, whose outputs are five levels of features, {P3, P4, P5, P6, P7}, with downsampling rates of 8, 16, 32, 64, and 128, respectively. In FPN, all features have 256 channels. We show the details in Figure 3 (a). Following [20], we treat these features differently for the various tasks: we use all five levels to predict the bounding boxes in detection but only send {P3, P4, P5} to thing and stuff segmentation.

2) RetinaNet-based sub-networks: RetinaNet contains two parallel sub-networks, the classification sub-network (cls sub-net for short) and the regression sub-network (reg sub-net for short). The operations in these sub-networks, which transform the output features of FPN into the inputs of the downstream heads, can be formulated as follows:

    P_{reg}^{i,j} = φ(P_{reg}^{i,j-1}),    P_{cls}^{i,j} = φ(P_{cls}^{i,j-1}).        (1)

Here, i is the FPN level index, j is the stage index in the sub-networks, and φ denotes a network block that contains a 3×3 convolution layer and a ReLU layer. In the cls and reg sub-networks, i ∈ {3, 4, 5, 6, 7}, j ∈ {1, 2, 3, 4}, and P_{cls}^{i,0} = P_{reg}^{i,0} = P_i, while i ∈ {3, 4, 5} for thing and stuff segmentation.

3) Task-specific heads: As illustrated in Figure 2 (c), we apply four heads for box classification, box regression, thing segmentation, and stuff segmentation, respectively. In the classification and regression heads, the final detection outputs are obtained by O_{cls}^{i} = ψ_{cls}(P_{cls}^{i,4}) and O_{reg}^{i} = ψ_{reg}(P_{reg}^{i,4}), where O_{cls}^{i} and O_{reg}^{i} are the outputs of the classification head and the regression head at FPN level i, and ψ_{cls} and ψ_{reg} are each a single 3×3 convolution layer applied to the outputs of the classification and regression sub-nets.

For the thing head, we apply it to each predicted box and adopt the same design as Mask R-CNN [4]. For each RoI feature, O_{RoI_k} = ψ(ζ(φ(P_{RoI_k}))), where O_{RoI_k} is the output for the k-th predicted box, φ here represents four 3×3 convolution layers with ReLU, ζ is one 2×2 stride-2 deconvolution layer with ReLU, and ψ is a 1×1 output convolution layer.

For the stuff head, after the stuff sub-net we obtain three levels of feature maps with scales of 1/8, 1/16, and 1/32 of the original image. We upsample each feature map gradually by blocks, each of which contains a 3×3 convolution layer, a group norm [65] layer, a ReLU layer, and a 2× bilinear upsampling operation. All the features are upsampled to the 1/4 scale and then element-wise summed. A final 1×1 convolution layer, a 4× bilinear upsampling operation, and a softmax are applied to get the segmentation result. The stuff head is shown in detail in Figure 3 (b).

To generate the final output of SpatialFlow, we first perform a heuristic post-processing method [8] to merge the results of thing and stuff segmentation, and then fill the unassigned areas in the merged map with the predicted boxes' locations and categories.
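As a concrete reference for the stuff head just described, a minimal PyTorch sketch is given below. It follows the stated Conv-GN-ReLU-2× design with 128-channel intermediate features (as labeled in Fig. 3 (b)) and a 54-way output (COCO's 53 stuff classes plus the special 'other' class). The class and argument names, as well as the number of group-norm groups, are our own assumptions and may differ from the released implementation.

```python
import torch.nn as nn
import torch.nn.functional as F

class StuffHead(nn.Module):
    """Upsample P3 (1/8), P4 (1/16), P5 (1/32) to 1/4 scale, sum, and predict."""
    def __init__(self, in_channels=256, mid_channels=128, num_classes=54):
        super().__init__()
        def block(cin):
            # Conv-GN-ReLU followed by 2x bilinear upsampling (32 GN groups assumed).
            return nn.Sequential(
                nn.Conv2d(cin, mid_channels, 3, padding=1),
                nn.GroupNorm(32, mid_channels),
                nn.ReLU(inplace=True),
                nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            )
        # P3 needs 1 block, P4 needs 2, P5 needs 3 to reach the 1/4 resolution.
        self.paths = nn.ModuleList([
            nn.Sequential(*([block(in_channels)] + [block(mid_channels) for _ in range(n)]))
            for n in range(3)
        ])
        self.predictor = nn.Conv2d(mid_channels, num_classes, 1)

    def forward(self, feats):            # feats = [P3, P4, P5]
        fused = sum(path(f) for path, f in zip(self.paths, feats))   # element-wise add at 1/4 scale
        out = self.predictor(fused)                                   # 1x1 conv -> class logits
        out = F.interpolate(out, scale_factor=4, mode="bilinear", align_corners=False)
        return out.softmax(dim=1)                                     # per-pixel stuff probabilities
```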
These are the key components of the proposed unified framework. The adaptation of RetinaNet [18] keeps the features at pixel level before the segmentation tasks are performed. However, there remain obstacles that prevent the unified framework from building a global view of the image scene, e.g., the lack of feature interaction between things and stuff. The naive implementation also has practical problems regarding the refinement of the FPN features for thing and stuff segmentation.

To further improve the quality of the learned features for thing and stuff segmentation and to strengthen the interaction between things and stuff in the image, we propose two techniques: adding parallel thing and stuff sub-networks, and introducing spatial information flows.

B. Thing and stuff parallel sub-networks

In RetinaNet, the parallel sub-networks refine the FPN features with multi-stage convolution layers, which transforms the FPN features into task-specific features and leads to better performance. In the naive implementation, however, there is no such refinement of the input features for thing and stuff segmentation. In this section, we apply the same mechanism to these two segmentation tasks. Moreover, the resulting multi-stage features facilitate the delivery of the spatial context from the box regression task to the others. We show the details of this part in Figure 3 (c).

We propose to add two additional sub-networks - a thing sub-network and a stuff sub-network - with a structure similar to the cls and reg sub-networks. With these additions, there are four parallel sub-networks between the FPN and the task-specific heads. The thing and stuff sub-networks are formulated as:

    P_{thing}^{i,j} = φ(P_{thing}^{i,j-1});    P_{stuff}^{i,j} = φ(P_{stuff}^{i,j-1}),        (2)

where P_{thing}^{i,0} = P_{stuff}^{i,0} = P_i. As the dominant features in thing and stuff segmentation are different, the number of stages required by the sub-networks depends on the task: more stages are needed in stuff segmentation than in thing segmentation for feature refinement. We conjecture that the reason is that pixel-level features are more sensitive to details than instance-level features. In the final version, we adopt four stages in the stuff sub-network and keep only one in the thing sub-network, which gives the best segmentation performance. In each stage, we implement a 3×3 convolution layer and a ReLU layer. We illustrate the overall structure of the sub-networks in Figure 3 (c). The experimental results for the number of stages in the thing and stuff sub-networks can be found in Table VII and Table VIII.

C. Spatial information flows

As illustrated in Figure 1, all sub-tasks in our proposed panoptic segmentation framework are related to the locations of objects. The box location information is implied in the multi-stage feature representations of the box regression sub-network. We propose spatial information flows to support feature refinement in the sub-networks and make the other sub-tasks aware of box locations. Furthermore, adding semantic features to thing segmentation has been proved effective in HTC [14]. We therefore also add a semantic flow, which adopts a 3×3 convolution layer to transform the stuff feature into the thing feature; it brings slight improvements in our SpatialFlow, as shown in Table IX and Table X. The detailed structure of the spatial flows is displayed in Figure 3 (c). They can be implemented as follows:

    P_{reg}^{i,j}   = φ(P_{reg}^{i,j-1});
    P_{cls}^{i,j}   = φ(P_{cls}^{i,j-1} + ψ(P_{reg}^{i,j}));
    P_{stuff}^{i,j} = φ(P_{stuff}^{i,j-1} + ψ(P_{reg}^{i,j}));
    P_{thing}^{i,1} = φ(P_i + ζ(P_{stuff}^{i,4}) + ψ(P_{reg}^{i,4}));
    P_{reg}^{i,0}   = P_{cls}^{i,0} = P_{thing}^{i,0} = P_{stuff}^{i,0} = P_i.        (3)

Here, ψ denotes an adaptation convolution from the box regression task to the others, and ζ denotes an adaptation convolution from the stuff sub-net to the thing sub-net. We use a 3×3 convolution layer for both ψ and ζ. All features have 256 channels in this part.
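To make Eq. (3) concrete, here is a minimal PyTorch sketch of one FPN level passing through the four parallel sub-networks with the spatial and semantic flows. It assumes a single shared 3×3 adaptation convolution for ψ and one for ζ, four stages for the cls, reg, and stuff sub-networks, and one thing stage, as described above; class and variable names are illustrative, not the released code.

```python
import torch.nn as nn

class FlowSubNetworks(nn.Module):
    def __init__(self, channels=256, num_stages=4):
        super().__init__()
        def stage():
            return nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                 nn.ReLU(inplace=True))
        self.reg_stages = nn.ModuleList([stage() for _ in range(num_stages)])
        self.cls_stages = nn.ModuleList([stage() for _ in range(num_stages)])
        self.stuff_stages = nn.ModuleList([stage() for _ in range(num_stages)])
        self.thing_stage = stage()                               # only one stage for things
        self.psi = nn.Conv2d(channels, channels, 3, padding=1)   # spatial flow: reg -> others
        self.zeta = nn.Conv2d(channels, channels, 3, padding=1)  # semantic flow: stuff -> thing

    def forward(self, p):                # p: one FPN level P_i, shape (N, 256, H, W)
        reg, cls, stuff = p, p, p
        for j in range(len(self.reg_stages)):
            reg = self.reg_stages[j](reg)                        # P_reg^{i,j}
            cls = self.cls_stages[j](cls + self.psi(reg))        # spatial flow into cls
            stuff = self.stuff_stages[j](stuff + self.psi(reg))  # spatial flow into stuff
        # Thing stage: FPN feature plus semantic flow and spatial flow (Eq. (3), last line).
        thing = self.thing_stage(p + self.zeta(stuff) + self.psi(reg))
        return cls, reg, thing, stuff
```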
Moreover, to make a fair comparison with UPSNet [10] on COCO, we introduce deformable convolution [66] layers into the sub-networks. We further adopt a method to incorporate the spatial context into the deformable convolutions more appropriately: we first combine the spatial information flow and the task-specific feature, and then use the combined feature to generate the offsets for the deformable convolution in the task-specific sub-networks. The process can be formulated as follows:

    P_{reg}^{i,j}   = φ(P_{reg}^{i,j-1});
    P_{cls}^{i,j}   = φ_dcn(P_{cls}^{i,j-1}, ψ_offset(P_{reg}^{i,j} + P_{cls}^{i,j-1}));
    P_{stuff}^{i,j} = φ_dcn(P_{stuff}^{i,j-1}, ψ_offset(P_{reg}^{i,j} + P_{stuff}^{i,j-1}));
    P_{thing}^{i,1} = φ_dcn(P_i + ζ(P_{stuff}^{i,4}), ψ_offset(P_{reg}^{i,4} + P_i));
    P_{reg}^{i,0}   = P_{cls}^{i,0} = P_{thing}^{i,0} = P_{stuff}^{i,0} = P_i.        (4)

In the equation, φ_dcn represents a deformable convolution layer, and ψ_offset is an adaptation convolution layer that generates the offsets for the deformable convolution. Unless specified otherwise, we do not adopt the setting with deformable convolution.

IV. EXPERIMENTS

A. Dataset and Evaluation Metric

1) Dataset: We evaluate our model on both COCO [21] and Cityscapes [22]. COCO consists of 80 thing and 53 stuff classes. We use the 2017 data splits with 118k/5k/20k train/val/test images. We use the train split for training and report lesion and sensitivity studies by evaluating on the val split. For our main results, we report panoptic performance on the test-dev split. Cityscapes has 5k high-resolution images with fine pixel-accurate annotations: 2975 train, 500 val, and 1525 test images. There are 19 classes on Cityscapes, 8 of which have instance-level masks. For all experiments on Cityscapes, we report performance on the val split with 11 stuff classes and 8 thing classes.

2) Evaluation metric: We adopt panoptic quality (PQ) as the metric. As proposed in [3], PQ can be formulated as follows:

    PQ = \frac{\sum_{(p,g) \in TP} IoU(p,g)}{|TP|} × \frac{|TP|}{|TP| + \frac{1}{2}|FP| + \frac{1}{2}|FN|},        (5)

where p and g are predicted and ground-truth segments, and TP (true positives), FP (false positives), and FN (false negatives) represent matched pairs of segments (IoU(p,g) > 0.5), unmatched predicted segments, and unmatched ground-truth segments, respectively. The first factor is the segmentation quality (SQ) and the second is the recognition quality (RQ), so PQ can be interpreted as their product. We also use SQ and RQ to measure performance in our experiments.
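As a small worked example of Eq. (5), the snippet below computes PQ, SQ, and RQ from the IoUs of matched segment pairs and the counts of unmatched predictions and ground truths. It only illustrates the formula; it is not the official panoptic evaluation code.

```python
def panoptic_quality(matched_ious, num_fp, num_fn):
    """matched_ious: IoU(p, g) for every matched pair (IoU > 0.5)."""
    tp = len(matched_ious)
    if tp == 0:
        return 0.0, 0.0, 0.0                            # degenerate case, no matches
    sq = sum(matched_ious) / tp                         # segmentation quality
    rq = tp / (tp + 0.5 * num_fp + 0.5 * num_fn)        # recognition quality
    return sq * rq, sq, rq                              # PQ = SQ * RQ

# Example: three matches with IoUs 0.9, 0.8, 0.7, one false positive, one false negative.
pq, sq, rq = panoptic_quality([0.9, 0.8, 0.7], num_fp=1, num_fn=1)
print(round(pq, 3), round(sq, 3), round(rq, 3))         # 0.6 0.8 0.75
```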
B. Implementation Details

1) Training: As a unified framework for panoptic segmentation, SpatialFlow has four different losses to optimize during the training stage. The loss function can be formulated as follows:

    L = (L_{cls} + L_{reg} + L_{thing}) + λ · L_{stuff},        (6)

where L_{cls}, L_{reg}, and L_{thing} belong to the thing segmentation task, and L_{stuff} is the loss of the stuff segmentation. We add a hyper-parameter λ to balance the losses between thing and stuff segmentation.

We implement SpatialFlow with a toolbox [67] based on PyTorch [68]. We inherit all the hyper-parameters from RetinaNet, except that we set the NMS threshold to 0.4 when generating proposals during training. For thing prediction, we add the ground-truth boxes to the proposal set and run the thing head on all proposals. For the training strategy, we fix the batch norm layers in the backbone and train all models over 4 GPUs with a total of 8 images per mini-batch. On MS-COCO [21], we use the longer training schedules adopted by RetinaNet (1.5×) [18] and RetinaMask (2×) [20]: all models are trained for 20 epochs with an initial learning rate of 5×10⁻³, which is decreased by a factor of 10 after 16 and 19 epochs. On Cityscapes [22], we set the initial learning rate to 1.25×10⁻² and borrow the number of iterations from [8]. Unless specified otherwise, we resize the shorter edge of the image to 800 pixels on COCO, while on Cityscapes we adopt 512×1024 image crops after scaling each image by 0.5 to 2.0×. As Kirillov et al. [8] did, we also predict a special 'other' class for all thing categories in the stuff head on the COCO benchmark.

2) Inference: Our model follows a four-step pipeline in the inference stage: (1) generate the detection results; (2) obtain the thing and stuff segmentation maps; (3) merge the two outputs to form a panoptic segmentation map; (4) fill the unassigned areas in the result with the detected boxes and their categories. In detection, we set the NMS threshold to 0.4 for each class separately and choose the top-100 scoring bounding boxes to send to the thing head. During merging, we first ignore the stuff regions labeled 'other'; then we resolve the overlaps between instances based on their scores and merge the thing and stuff maps in favor of things; at last, we fill the unassigned areas of the resulting segmentation map with detection boxes to form the final output. For the hyper-parameters of SpatialFlow in the inference stage, we fix the confidence score threshold for the instance masks to 0.37, set the overlap threshold of the instance masks to 0.37, and set the area limit threshold of the stuff regions to 4900. When integrating the detection box results, we introduce a new hyper-parameter, the overlap between a detection box and the unassigned area in the segmentation map, and fix this box overlap threshold to 0.6. For the hyper-parameters on Cityscapes, we modify the overlap threshold of the instance masks to 0.25 and change the area limit threshold of the stuff regions to 2048.
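To make the merging heuristic above easier to follow, here is a simplified NumPy sketch of steps (3) and (4). It assumes binary instance masks, a per-pixel stuff prediction map, and a shared label space, and it uses the COCO thresholds quoted above; the released implementation may differ in details such as label encoding, the handling of the 'other' class, and the exact definition of the box-overlap ratio.

```python
import numpy as np

def merge_panoptic(inst_masks, inst_scores, inst_labels, stuff_map,
                   boxes, box_labels, overlap_thr=0.37, stuff_area=4900,
                   box_overlap_thr=0.6, void=255):
    h, w = stuff_map.shape
    pan_seg = np.full((h, w), void, dtype=np.int32)
    # 1) Paste instance masks from high to low score; skip a mask if too much
    #    of it is already covered by higher-scoring instances.
    for idx in np.argsort(-np.asarray(inst_scores)):
        mask = inst_masks[idx].astype(bool)
        covered = mask & (pan_seg != void)
        if mask.sum() == 0 or covered.sum() / mask.sum() > overlap_thr:
            continue
        pan_seg[mask & (pan_seg == void)] = inst_labels[idx]
    # 2) Fill remaining pixels with stuff predictions, dropping small regions.
    for cls in np.unique(stuff_map):
        region = (stuff_map == cls) & (pan_seg == void)
        if region.sum() >= stuff_area:
            pan_seg[region] = cls
    # 3) Fill still-unassigned areas covered by a detected box (one reading of
    #    the 0.6 box-overlap threshold: most of the box is still unassigned).
    for (x1, y1, x2, y2), label in zip(boxes, box_labels):
        box_region = np.zeros((h, w), dtype=bool)
        box_region[int(y1):int(y2), int(x1):int(x2)] = True
        unassigned = box_region & (pan_seg == void)
        if box_region.sum() and unassigned.sum() / box_region.sum() > box_overlap_thr:
            pan_seg[unassigned] = label
    return pan_seg
```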
C. Main Results

In this section, we compare SpatialFlow with the state-of-the-art methods in panoptic segmentation. All main results are shown in Table I, Table II, and Table III. SpatialFlow achieves state-of-the-art results on both the COCO [21] and Cityscapes [22] panoptic benchmarks.

1) MS COCO: To make fair comparisons, we report results under two different experimental settings in Table I and Table II. In Table I, we present the prediction results without bells and whistles. With a single ResNet-101-FPN backbone, SpatialFlow achieves 42.9 PQ on the COCO test-dev split, which outperforms Panoptic FPN [8] by 2.0 PQ and OANet [11] by 1.6 PQ. More importantly, SpatialFlow sets a new state of the art on PQSt with 33.0 PQ, outperforming these models by a large margin (3.3 PQ and 5.3 PQ, respectively). The results demonstrate the effectiveness of integrating the spatial features at pixel level, which significantly impacts stuff segmentation. However, SpatialFlow lags behind OANet [11] in PQTh. In OANet, the authors focus on solving the overlapping problem of instances when rendering the thing results into the final panoptic result, whereas SpatialFlow applies only a simple method to this problem, causing the inferior PQTh.

Then we apply deformable convolution [66] to both the backbone and the sub-networks and report the results with the multi-scale strategy in Table II. During training, the scale of the short edge is randomly sampled from [400, 1400], and the long edge is fixed at 1600. For inference, we feed multi-scale images to SpatialFlow with scales of (1500, 1000), (1800, 1200), and (2100, 1400) and horizontal flip. We achieve 47.9 PQ, which is the state-of-the-art result on the COCO panoptic benchmark. As shown in Table II, our method outperforms the runner-up of the COCO 2018 challenge by 1.1 PQ with a single model, demonstrating the effectiveness of SpatialFlow. Although AUNet outperforms our method in PQTh, it uses ResNeXt-152-DCN as its backbone. In fact, with a stronger backbone (ResNeXt-101-DCN) and model ensemble, our method can achieve 50.2 PQ on the COCO test-dev split.

TABLE I
Comparison with state-of-the-art methods on the COCO 2017 test-dev split. We only compare with state-of-the-art methods that do not use deformable convolutions here.

model            backbone          PQ    PQTh  PQSt  SQ    RQ
JSIS-Net [6]     ResNet-50         27.2  29.6  23.4  71.9  35.9
DeeperLab [69]   Xception-71       34.3  37.5  29.6  77.1  43.1
PanopticFPN [8]  ResNet-101-FPN    40.9  48.3  29.7  -     -
OANet [11]       ResNet-101-FPN    41.3  50.4  27.7  -     -
SSAP [70]        ResNet-101-FPN    36.9  40.1  32.0  -     -
SpatialFlow      ResNet-101-FPN    42.9  49.5  33.0  78.8  52.3

TABLE II
Comparison with state-of-the-art methods on the COCO 2017 test-dev split. In this table, we report our results with deformable convolution and the multi-scale strategy. The top 3 rows contain the results of the top 3 models taken from the official leaderboard of the COCO 2018 Panoptic Segmentation Challenge.

model            backbone          PQ    PQTh  PQSt  SQ    RQ
Megvii (Face++)  ensemble model    53.2  62.2  39.5  83.2  62.9
Caribbean        ensemble model    46.8  54.3  35.5  80.5  57.1
PKU_360          ResNeXt-152-DCN   46.3  58.6  27.6  79.6  56.1
AdaptIS [71]     ResNeXt-101       42.8  50.1  31.8  -     -
AUNet [9]        ResNeXt-152-DCN   46.5  55.9  32.5  81.0  56.1
UPSNet [10]      ResNet-101-DCN    46.6  53.2  36.7  80.5  56.9
SOGNet [72]      ResNet-101-DCN    47.8  -     -     80.7  57.6
SpatialFlow      ResNet-101-DCN    47.9  54.5  38.0  81.7  57.6

2) Cityscapes: We also report results under different experimental settings in Table III. Without a COCO pre-trained model, SpatialFlow achieves 59.6 PQ on the Cityscapes val split, which is 1.5 PQ and 0.6 PQ higher than Panoptic FPN [8] and AUNet [9], respectively. With a COCO pre-trained model, SpatialFlow achieves 62.5 PQ on Cityscapes with multi-scale testing, which is 0.7 PQ higher than UPSNet [10] under the same setting. SpatialFlow outperforms all other methods in terms of PQSt while getting inferior performance on PQTh compared to UPSNet and AdaptIS. We conjecture that this is caused by the inferior detection performance of RetinaNet on Cityscapes.

TABLE III
Comparison with state-of-the-art methods on the Cityscapes val split. In this table, '-R101' indicates a ResNet-101 backbone and '-X101' a ResNeXt-101 [73] backbone; '-COCO' means using a COCO pre-trained model; '-M' denotes multi-scale testing.

model                        PQ    PQTh  PQSt
PanopticFPN-R101 [8]         58.1  52.0  62.5
AUNet-R101 [9]               59.0  54.8  62.1
TASCNet-R101-COCO [7]        59.2  56.0  61.5
UPSNet-R101-COCO-M [10]      61.8  57.6  64.8
SSAP-R101-M [70]             61.1  55.0  -
AdaptIS-X101-M [71]          62.0  58.7  64.4
SpatialFlow-R101             59.6  55.0  63.1
SpatialFlow-R101-COCO-M      62.5  56.6  66.8

To obtain the result of 62.5 PQ on the Cityscapes val split, we first replace the convolution layers in the stuff head with deformable convolutions, as UPSNet does, and then follow the steps below:
(1) Finetune the COCO pre-trained model.
As the number of thing and stuff classes in Cityscapes is smaller than in COCO (8 thing and 11 stuff classes vs. 80 and 53), we have to finetune the layers that are related to the number of classes. We freeze the remaining layers and use a learning rate of 2.5×10⁻³ to train for 2 epochs.
(2) Train the finetuned model as the standard SpatialFlow does.
(3) Apply the multi-scale testing trick. The scales we use on Cityscapes are (2304, 1152), (2432, 1216), (2560, 1280), and (2688, 1344), with horizontal flip.

D. Ablation Experiments

We run a number of ablations to analyze SpatialFlow. Unless specified otherwise, we use the naive implementation of SpatialFlow presented in Section III-A as the baseline model for all experiments in this section. We discuss the details below.

Loss Balance. We first investigate the best value of the hyper-parameter λ, using the baseline model. Table IV shows the results for various values of λ on COCO. The proper λ brings a large gain: the best value to balance the losses on COCO is 0.25, with which the baseline model achieves 39.3 PQ with an image size of 600px, a 1.8 PQ gain compared with λ = 1.0. For Cityscapes, we set λ = 1.0, following [8].

TABLE IV
Loss balance: results of the baseline model on COCO val for different values of λ, based on ResNet-50 with an image size of 600px. The proper λ brings a large gain.

λ      1.0   0.75  0.5   0.3   0.25  0.2
PQ     37.5  38.2  38.8  39.1  39.3  39.0
PQTh   41.8  43.0  44.0  44.5  45.1  44.9
PQSt   30.9  31.1  31.0  30.9  30.5  30.0

Contribution of Components. In this section, we evaluate the sub-networks and the spatial flows of SpatialFlow on both COCO and Cityscapes. The results are shown in Table V and Table VI, respectively. From the experimental results, we can see that both the sub-networks and the spatial flows contribute. The sub-networks improve PQ by 0.6 points on COCO and 0.2 points on Cityscapes. In particular, we obtain a significant gain on stuff (1.2 PQ on COCO) with the sub-networks, since they refine the pixel-level features before sending them to the stuff head. The spatial flows improve the performance of things and stuff simultaneously, with gains of 0.5 PQ on COCO and 0.7 PQ on Cityscapes. Moreover, the spatial flows bring further gains on top of the sub-networks compared with the benefits obtained on the baseline model alone. The results indicate that the integration of the spatial context can benefit from the feature refinement in the sub-networks.

TABLE V
Contribution of components: ablation results on the COCO val split with ResNet-50. Both the sub-networks and the spatial flows bring significant gains over the baseline model.

sub-nets  spatial-flows  PQ    PQTh  PQSt
                         39.7  46.0  30.2
   ✓                     40.3  46.2  31.4
             ✓           40.2  46.5  30.7
   ✓         ✓           40.9  46.8  31.9

TABLE VI
Contribution of components: ablation results on the Cityscapes val split with ResNet-50. Similar gains are obtained on Cityscapes.

sub-nets  spatial-flows  PQ    PQTh  PQSt
                         57.3  53.5  60.0
   ✓                     57.5  53.6  60.3
             ✓           58.0  54.3  60.8
   ✓         ✓           58.6  54.9  61.4
Design of Sub-networks. We search for the best number of stages for the thing and stuff sub-networks. We conduct experiments on COCO with ResNet-50 based on the baseline model; the results are shown in Table VII and Table VIII. According to the results, we add only one stage in the thing sub-network and four stages in the stuff sub-network, which yield 0.2 PQ and 0.6 PQ improvements, respectively. The different numbers of stages in the sub-networks are related to the difference of the dominant features in thing and stuff segmentation.

TABLE VII
Design of sub-networks: ablation results on the number of stages in the thing sub-network. Only one stage is needed to refine the FPN feature for the input of the thing head.

num stages  PQ    PQTh  PQSt
0           39.7  46.0  30.2
1           39.9  46.3  30.3
2           39.7  46.0  30.2
3           39.9  46.2  30.3
4           39.7  45.9  30.3

TABLE VIII
Design of sub-networks: results on the number of stages in the stuff sub-network. More stages bring more gains; the input feature of the stuff head needs to be fully refined by the sub-network.

num stages  PQ    PQTh  PQSt
0           39.7  46.0  30.2
1           39.9  45.9  30.9
2           40.1  46.0  31.1
3           40.2  46.1  31.4
4           40.3  46.2  31.5

Spatial Flows. We conduct experiments to highlight the significance of the proposed spatial information flows between tasks. The baseline model, marked with '-' in Table IX and Table X, is the one with all sub-networks. There are three paths that deliver the spatial context from the box regression task to the others: the path from the reg sub-net to the cls sub-net (reg-cls flow), the path to the stuff sub-net (reg-stuff flow), and the path to the thing sub-net (reg-thing flow). The results are reported in Table IX and Table X. First, we add the reg-cls path and obtain a 0.4 PQTh improvement on COCO and a 1.5 PQTh gain on Cityscapes, which come from better detection results: adding the spatial context helps the cls sub-net extract discriminative features, which is essential for detection. Then we build a spatial path to the stuff sub-net; as shown in the fourth row of Table IX and Table X, we earn a 0.6 PQSt gain on COCO and a 0.8 PQSt gain on Cityscapes compared with the former model, which indicates that the spatial context has a positive effect on stuff segmentation. The reg-thing path and the semantic path also show their effectiveness on both thing and stuff segmentation. Compared with the original model, SpatialFlow achieves consistent gains in both thing and stuff segmentation. The results prove the significance of the spatial context in panoptic segmentation to some extent. It is worth noting that we only apply an element-wise sum operation to integrate the spatial context in this work; we believe further improvement could be achieved with a more deliberate design such as attention modules.

TABLE IX
Spatial flows: results of the spatial flows on COCO. Each row adds an extra component to the one above.

flows          PQ    PQTh  PQSt
-              40.3  46.2  31.4
+ reg-cls      40.5  46.6  31.4
+ reg-stuff    40.7  46.3  32.0
+ reg-thing    40.7  46.4  31.8
+ stuff-thing  40.9  46.8  31.9

TABLE X
Spatial flows: results of the spatial flows on Cityscapes. Each row adds an extra component to the one above.

flows          PQ    PQTh  PQSt
-              57.5  53.6  60.3
+ reg-cls      58.0  55.1  60.1
+ reg-stuff    58.3  54.6  60.9
+ reg-thing    58.5  54.7  61.3
+ stuff-thing  58.6  54.9  61.4

V. FURTHER DISCUSSION

In this section, we provide further discussions about the spatial flows and give an overview of how the spatial flows work, how fast SpatialFlow is, and how to apply the spatial flows to other vision tasks.

Spatial Flows vs. Trivial Feature Fusion: Our main idea is to integrate spatial information into all sub-tasks and make them aware of the locations of objects, which is different from trivial feature integration among sub-networks. To prove the effectiveness of the spatial flows, we design an experiment on COCO that delivers the features of the cls sub-network to the other three sub-tasks, denoted as the cls flows model in Figure 4, with an image input size of 600px. As shown in Figure 4, our method outperforms this cls-based feature integration by 0.7 PQ, 0.8 PQTh, and 0.5 PQSt, respectively. The results suggest that trivial feature integration cannot bring consistent improvements to the baseline model as our method does.

Fig. 4.
The PQ, PQTh, and PQSt results of the base model, the cls flows model, and the spatial flows model with an image size of 600px on the COCO val split.

Accuracy vs. Speed: In Table XI, we compare our method with the state-of-the-art methods in terms of the accuracy-speed balance on the COCO val split. The FPS is measured on a single Tesla V100 GPU. We show results for different image sizes and the corresponding inference speeds. Although SpatialFlow is not the fastest among all the methods, the results show a good accuracy-speed balance: larger image sizes yield higher accuracy at the cost of slower inference. We also find that thing segmentation benefits from large image sizes, while stuff segmentation is robust to the image size. Thanks to this, SpatialFlow can reach 19.6 FPS while retaining 37.4 PQ when we set the image size to 400px.

TABLE XI
Accuracy vs. speed: comparison with the state-of-the-art methods on the accuracy-speed balance. We report SpatialFlow performance with different image scales. * indicates that UPSNet applies deformable convolution in its stuff head.

Model            Backbone     Scale  PQ    PQTh  PQSt  FPS
PanopticFPN [8]  ResNet-50    800    39.0  45.9  27.9  18.9
UPSNet* [10]     ResNet-50    800    42.5  48.5  33.4  9.1
DeeperLab [69]   Xception-71  641    34.3  37.5  29.6  10.6
SpatialFlow      ResNet-50    800    40.9  46.8  31.9  10.3
SpatialFlow      ResNet-50    600    40.3  45.6  32.2  13.0
SpatialFlow      ResNet-50    400    37.4  41.5  31.4  19.6

Effects of Spatial Flows: We study the effects of the spatial flows using two models, with and without the spatial flows. We visualize the last feature maps in the cls-head and the stuff-head of both models via CAM [74] in Figure 5. The visualized heatmaps illustrate that the spatial flows help the thing branch focus on objects and make the stuff branch aware of the precise boundary between things and stuff. The spatial flows bridge all tasks and help build a global view of the image in panoptic segmentation.

Fig. 5. An illustration of the cls-head heatmaps (a, for things) and the stuff-head heatmaps (b, for stuff). We provide a comparison between the models with and without spatial flows.

Detection Results: We also conduct experiments on RetinaNet [18] to investigate the generalization of the spatial flows. We deliver the spatial context from the reg sub-network to the cls sub-network. The detection results are shown in Table XII. With the help of the spatial context, the multi-stage features in the sub-networks become more discriminative, which boosts the performance.

TABLE XII
Detection results: RetinaNet with and without spatial flows on the COCO val split with ResNet-50 as the backbone. The shorter edges of the images are 800px.

Detectors            mAP   AP50  AP75
RetinaNet            35.6  55.5  37.7
RetinaNet w/ flows   36.7  57.1  39.4
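As a side note on the accuracy-speed comparison in Table XI, the snippet below shows one common way such single-GPU FPS numbers can be measured: warm up, then time synchronized forward passes on a fixed-size input. This is a generic benchmarking sketch, not the authors' timing script, and the input resolution here is only an example.

```python
import time
import torch

@torch.no_grad()
def measure_fps(model, image_size=(800, 1333), runs=100, warmup=10):
    model.eval().cuda()
    dummy = torch.randn(1, 3, *image_size, device="cuda")
    for _ in range(warmup):          # warm-up iterations are excluded from timing
        model(dummy)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(runs):
        model(dummy)
    torch.cuda.synchronize()         # wait for all GPU work before stopping the clock
    return runs / (time.perf_counter() - start)
```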
VI. VISUALIZATION

We show some visualization examples of SpatialFlow on COCO and Cityscapes in Figure 6 and Figure 7, respectively.

Fig. 6. Visualization examples of SpatialFlow on the COCO val split using a single ResNet-101 network (columns: image, ground truth, SpatialFlow).

Fig. 7. Visualization examples of SpatialFlow on the Cityscapes val split using a single ResNet-101 network (columns: image, ground truth, SpatialFlow).

VII. CONCLUSION

In this work, we focus on box locations in panoptic segmentation and propose a new location-aware and unified framework, denoted as SpatialFlow. We emphasize the importance of the spatial context and bridge all the tasks by building spatial information flows, achieving state-of-the-art performance on both the COCO test-dev split and the Cityscapes val split, which proves the effectiveness of our model. Moreover, we find that the spatial flows can improve the performance of detection models, indicating the importance of spatial information. We expect that SpatialFlow can provide valuable insights on how to integrate spatial information in vision tasks.

ACKNOWLEDGMENT

This work was supported in part by the National Natural Science Foundation of China (No. 61972396, 61876182, 61906193), the National Key Research and Development Program of China (No. 2019AAA0103402), the Strategic Priority Research Program of the Chinese Academy of Sciences (No. XDB32050200), the Advance Research Program (No. 31511130301), and the Jiangsu Frontier Technology Basic Research Project (No. BK20192004). Moreover, the authors would like to thank Jiaying Guo at the Nanjing Institute of Geography and Limnology, Chinese Academy of Sciences, for valuable discussions about the writing.

REFERENCES

[1] L. Ladickỳ, P. Sturgess, K. Alahari, C. Russell, and P. H. Torr, “What, where and how many? combining object detectors and crfs,” in European conference on computer vision. Springer, 2010, pp. 424–437.
[2] J. Yao, S. Fidler, and R. Urtasun, “Describing the scene as a whole: joint object detection, scene classification and semantic segmentation,” in Proceedings of CVPR, 2012.
[3] A. Kirillov, K. He, R. Girshick, C. Rother, and P. Dollár, “Panoptic segmentation,” arXiv preprint arXiv:1801.00868, 2018.
[4] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2961–2969.
[5] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2881–2890.
[6] D. de Geus, P. Meletis, and G. Dubbelman, “Panoptic segmentation with a joint semantic and instance segmentation network,” arXiv preprint arXiv:1809.02110, 2018.
[7] J. Li, A. Raventos, A. Bhargava, T. Tagawa, and A. Gaidon, “Learning to fuse things and stuff,” arXiv preprint arXiv:1812.01192, 2018.
[8] A. Kirillov, R. Girshick, K. He, and P. Dollár, “Panoptic feature pyramid networks,” arXiv preprint arXiv:1901.02446, 2019.
[9] Y. Li, X. Chen, Z. Zhu, L. Xie, G. Huang, D. Du, and X. Wang, “Attention-guided unified network for panoptic segmentation,” arXiv preprint arXiv:1812.03904, 2018.
[10] Y. Xiong, R. Liao, H. Zhao, R. Hu, M. Bai, E. Yumer, and R. Urtasun, “Upsnet: A unified panoptic segmentation network,” arXiv preprint arXiv:1901.03784, 2019.
[11] H. Liu, C. Peng, C. Yu, J. Wang, X. Liu, G. Yu, and W. Jiang, “An end-to-end network for panoptic segmentation,” arXiv preprint arXiv:1903.05027, 2019. [12] Z. Zhang, S. Qiao, C. Xie, W. Shen, B. Wang, and A. L. Yuille, “Singleshot object detection with enriched semantics,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5813–5821. [13] J. Wang, K. Chen, S. Yang, C. C. Loy, and D. Lin, “Region proposal by guided anchoring,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 2965–2974. [14] K. Chen, J. Pang, J. Wang, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Shi, W. Ouyang et al., “Hybrid task cascade for instance segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 4974–4983. [15] D. Bolya, C. Zhou, F. Xiao, and Y. J. Lee, “Yolact: Real-time instance segmentation,” in ICCV, 2019. [16] J. Dai, K. He, and J. Sun, “Boxsup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1635–1643. [17] C. Song, Y. Huang, W. Ouyang, and L. Wang, “Box-driven classwise region masking and filling rate guided loss for weakly supervised semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 3136–3145. [18] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2980–2988. [19] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in neural information processing systems, 2015, pp. 91–99. [20] C.-Y. Fu, M. Shvets, and A. C. Berg, “Retinamask: Learning to predict masks improves state-of-the-art single-shot detection for free,” arXiv preprint arXiv:1901.03353, 2019. [21] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European conference on computer vision. Springer, 2014, pp. 740–755. [22] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 3213– 3223. [23] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105. [24] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE conference on computer vision and pattern recognition. Ieee, 2009, pp. 248–255. [25] J. Yu, Y. Rui, Y. Y. Tang, and D. Tao, “High-order distance-based multiview stochastic learning in image classification,” IEEE transactions on cybernetics, vol. 44, no. 12, pp. 2431–2442, 2014. [26] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014. [27] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9. [28] K. He, X. Zhang, S. Ren, and J. 
Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778. [29] F. Zhou and Y. Lin, “Fine-grained image classification by exploring bipartite-graph labels,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1124–1133. 12 [30] J. Yu, M. Tan, H. Zhang, D. Tao, and Y. Rui, “Hierarchical deep click feature prediction for fine-grained image recognition,” IEEE transactions on pattern analysis and machine intelligence, 2019. [31] J. Yu, D. Tao, M. Wang, and Y. Rui, “Learning to rank using user clicks and visual features for image retrieval,” IEEE transactions on cybernetics, vol. 45, no. 4, pp. 767–779, 2014. [32] H. Wang, Y. Cai, Y. Zhang, H. Pan, W. Lv, and H. Han, “Deep learning for image retrieval: What works and what doesn’t,” in 2015 IEEE International Conference on Data Mining Workshop (ICDMW). IEEE, 2015, pp. 1576–1583. [33] E. Hoffer and N. Ailon, “Deep metric learning using triplet network,” in International Workshop on Similarity-Based Pattern Recognition. Springer, 2015, pp. 84–92. [34] J. Yu, X. Yang, F. Gao, and D. Tao, “Deep multimodal distance metric learning using click constraints for image ranking,” IEEE transactions on cybernetics, vol. 47, no. 12, pp. 4014–4024, 2016. [35] H. Oh Song, Y. Xiang, S. Jegelka, and S. Savarese, “Deep metric learning via lifted structured feature embedding,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 4004–4012. [36] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Region-based convolutional networks for accurate object detection and segmentation,” IEEE transactions on pattern analysis and machine intelligence, vol. 38, no. 1, pp. 142–158, 2015. [37] R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1440–1448. [38] F. Yu and V. Koltun, “Multi-scale context aggregation by dilated convolutions,” arXiv preprint arXiv:1511.07122, 2015. [39] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention. Springer, 2015, pp. 234–241. [40] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, “Realtime multi-person 2d pose estimation using part affinity fields,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 7291– 7299. [41] C. Hong, J. Yu, J. Wan, D. Tao, and M. Wang, “Multimodal deep autoencoder for human pose recovery,” IEEE Transactions on Image Processing, vol. 24, no. 12, pp. 5659–5670, 2015. [42] Y. Zhang and Q. Yang, “An overview of multi-task learning,” National Science Review. [43] A. Kendall, Y. Gal, and R. Cipolla, “Multi-task learning using uncertainty to weigh losses for scene geometry and semantics,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7482–7491. [44] J. Dai, Y. Li, K. He, and J. Sun, “R-fcn: Object detection via regionbased fully convolutional networks,” in Advances in neural information processing systems, 2016, pp. 379–387. [45] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature pyramid networks for object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2117–2125. [46] Z. Cai and N. 
Vasconcelos, “Cascade r-cnn: Delving into high quality object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 6154–6162. [47] B. Singh and L. S. Davis, “An analysis of scale invariance in object detection snip,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 3578–3587. [48] B. Singh, M. Najibi, and L. S. Davis, “Sniper: Efficient multi-scale training,” in Advances in neural information processing systems, 2018, pp. 9310–9320. [49] Y. Li, Y. Chen, N. Wang, and Z. Zhang, “Scale-aware trident networks for object detection,” in Proceedings of the IEEE international conference on computer vision, 2019, pp. 6054–6063. [50] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 779– 788. [51] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “Ssd: Single shot multibox detector,” in European conference on computer vision. Springer, 2016, pp. 21–37. [52] C. Zhu, Y. He, and M. Savvides, “Feature selective anchor-free module for single-shot object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 840–849. [53] Z. Tian, C. Shen, H. Chen, and T. He, “Fcos: Fully convolutional one-stage object detection,” in Proceedings of the IEEE international conference on computer vision, 2019, pp. 9627–9636. JOURNAL OF IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY [54] H. Law and J. Deng, “Cornernet: Detecting objects as paired keypoints,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 734–750. [55] X. Zhou, D. Wang, and P. Krähenbühl, “Objects as points,” arXiv preprint arXiv:1904.07850, 2019. [56] Z. Zhang, S. Fidler, and R. Urtasun, “Instance-level segmentation for autonomous driving with deep densely connected mrfs,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 669–677. [57] Z. Wu, C. Shen, and A. v. d. Hengel, “Bridging category-level and instance-level semantic image segmentation,” arXiv preprint arXiv:1605.06885, 2016. [58] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia, “Path aggregation network for instance segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8759–8768. [59] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3431–3440. [60] R. Mottaghi, X. Chen, X. Liu, N.-G. Cho, S.-W. Lee, S. Fidler, R. Urtasun, and A. Yuille, “The role of context for object detection and semantic segmentation in the wild,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 891– 898. [61] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 4, pp. 834–848, 2018. [62] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam, “Rethinking atrous convolution for semantic image segmentation,” arXiv preprint arXiv:1706.05587, 2017. [63] H. Noh, S. Hong, and B. 
Han, “Learning deconvolution network for semantic segmentation,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1520–1528. [64] Q. Li, A. Arnab, and P. H. Torr, “Weakly-and semi-supervised panoptic segmentation,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 102–118. [65] Y. Wu and K. He, “Group normalization,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 3–19. [66] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei, “Deformable convolutional networks,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 764–773. [67] K. Chen, J. Wang, J. Pang, Y. Cao, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Xu et al., “Mmdetection: Open mmlab detection toolbox and benchmark,” arXiv preprint arXiv:1906.07155, 2019. [68] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in pytorch,” 2017. [69] T.-J. Yang, M. D. Collins, Y. Zhu, J.-J. Hwang, T. Liu, X. Zhang, V. Sze, G. Papandreou, and L.-C. Chen, “Deeperlab: Single-shot image parser,” arXiv preprint arXiv:1902.05093, 2019. [70] N. Gao, Y. Shan, Y. Wang, X. Zhao, Y. Yu, M. Yang, and K. Huang, “Ssap: Single-shot instance segmentation with affinity pyramid,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 642–651. [71] K. Sofiiuk, O. Barinova, and A. Konushin, “Adaptis: Adaptive instance selection network,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 7355–7363. [72] Y. Yang, H. Li, X. Li, Q. Zhao, J. Wu, and Z. Lin, “Sognet: Scene overlap graph network for panoptic segmentation,” arXiv preprint arXiv:1911.07527, 2019. [73] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, “Aggregated residual transformations for deep neural networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1492– 1500. [74] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Learning deep features for discriminative localization,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2921– 2929. 13