SpatialFlow: Bridging All Tasks for Panoptic
Segmentation
Qiang Chen, Anda Cheng, Xiangyu He, Peisong Wang, and Jian Cheng
Abstract—Object location is fundamental to panoptic segmentation as it is related to all things and stuff in the image scene.
Knowing the locations of objects in the image provides clues
for segmenting and helps the network better understand the
scene. How to integrate object location in both thing and stuff
segmentation is a crucial problem. In this paper, we propose
spatial information flows to achieve this objective. The flows
can bridge all sub-tasks in panoptic segmentation by delivering
the object’s spatial context from the box regression task to
others. More importantly, we design four parallel sub-networks to better adapt the object spatial information to each sub-task. Upon the sub-networks and the flows, we present a
location-aware and unified framework for panoptic segmentation,
denoted as SpatialFlow. We perform a detailed ablation study on
each component and conduct extensive experiments to prove the
effectiveness of SpatialFlow. Furthermore, we achieve state-of-the-art results of 47.9 PQ and 62.5 PQ on the MS-COCO and Cityscapes panoptic benchmarks, respectively. Code will be
available at https://github.com/chensnathan/SpatialFlow.
Index Terms—Panoptic segmentation, Scene understanding,
Location-aware
I. INTRODUCTION

REAL-WORLD vision systems, such as autonomous driving or augmented reality, require a rich and complete understanding of the image scene. However, neither detecting and segmenting the objects in an image nor segmenting the image semantically can provide a global view of the image scene. Considering these tasks as a whole is a step toward real-world vision systems. In the pre-deep learning era, classical vision tasks such as scene understanding [1], [2] considered object detection and semantic segmentation jointly. With the development of deep learning, instance and semantic segmentation have been widely studied and improved, while studies of the joint task have been left behind. Recently, [3] proposed the panoptic segmentation task to unify the two segmentation tasks. In this task, countable objects such as persons, animals, and tools are considered as things, while amorphous regions of similar texture or material such as grass, sky, and road are referred to as stuff. It has drawn the attention of the vision community and pushes deep vision systems a step forward towards applications in real-world scenarios.
Qiang Chen, Anda Cheng, Xiangyu He, Peisong Wang, and Jian Cheng
are with the National Laboratory of Pattern Recognition (NLPR), Institute of
Automation Chinese Academy of Sciences (CASIA) and School of Artificial
Intelligence, University of Chinese Academy of Sciences (UCAS), Beijing, China. (e-mail: qiang.chen@nlpr.ia.ac.cn; chenganda2017@ia.ac.cn; xiangyu.he@nlpr.ia.ac.cn; peisong.wang@nlpr.ia.ac.cn; jcheng@nlpr.ia.ac.cn).
Corresponding author: Jian Cheng (jcheng@nlpr.ia.ac.cn)
Copyright © 20xx IEEE. Personal use of this material is permitted.
However, permission to use this material for any other purposes must be
obtained from the IEEE by sending an email to pubs-permissions@ieee.org.
Fig. 1. An illustration of the panoptic segmentation task. We also provide the bounding box for each object in the image and add a process to integrate box locations into both thing and stuff segmentation.
Panoptic segmentation aims to assign every pixel in an image a semantic label and an instance id, which is a challenging task as it requires a global view of segmentation. In [3], the
authors tried to solve the task by adopting two independent
models, Mask R-CNN [4] and PSPNet [5], for thing and
stuff segmentation1 respectively. Then, they applied a heuristic
post-processing method to merge the segmentation outputs of
two tasks, as illustrated on the right side of Figure 1. These
two independent models ignore the underlying relationship between things and stuff and bring extra computation cost into the framework. Recently, several works [6], [7], [8], [9], [10], [11] have followed [3] and tried to build a unified pipeline for panoptic segmentation by sharing the backbone between the two segmentation tasks.
However, most of the recent works focus on how to combine
the outputs of segmentation tasks properly, failing to highlight
the significance of object location when training networks.
As demonstrated in the literature, the spatial information of
objects can boost the performance of algorithms in object
detection [12], [13], instance segmentation [14], [15], and
semantic segmentation [16], [17]. Our key insight is that,
as a combination of these tasks, panoptic segmentation can
benefit from delivering spatial information of objects among
its sub-tasks. We illustrate the process of performing panoptic
segmentation with box locations in Figure 1.
A crucial question then arises: how to integrate spatial
information into the segmentation tasks seamlessly? To fulfill this goal, we propose to incorporate object location by explicitly delivering the spatial context from the box regression task to
others. Based on this, we introduce a new unified framework
1 Also referred to as instance and semantic segmentation; in this paper, we use thing and stuff to emphasize the tasks in panoptic segmentation.
for panoptic segmentation by fully leveraging the reciprocal
relationship among detection, thing segmentation, and stuff
segmentation. Two primary principles are considered as follows.
First, keep the spatial context at the pixel level before segmenting things and stuff. Although thing and stuff segmentation can complement one another, the format of the dominant features in these two segmentation tasks may be inconsistent: instance-level features control thing segmentation, while pixel-level features guide stuff segmentation. The instance-level spatial context may not be suitable for stuff segmentation, given the format of its dominant feature. Besides, instances can overlap, which makes it hard to map them back to the pixel level. Based on this principle, we resort to the one-stage detector RetinaNet [18] instead of the two-stage detector Faster R-CNN [19]. This keeps the spatial context of objects from being converted to the instance level before the segmentation tasks are performed. Then, we extend RetinaNet with task-specific heads, a thing head [20] and a stuff head [8], to perform thing and stuff segmentation. In the task-specific heads, the spatial context can be instance-level for things and pixel-level for stuff.
Second, integrate the spatial context into segmentation by fully leveraging feature interweaving among tasks. The spatial context plays a significant role in improving the quality of segmentation. It is plentiful in the box regression sub-task but insufficient in the others. To make the other sub-tasks location-aware, we propose information flows to deliver the spatial context from the box regression task to the others and integrate it by feature interweaving. However, the absence of multi-stage features in thing and stuff segmentation makes it inconvenient to absorb the spatial context. To solve this dilemma, we design four parallel sub-networks for the four sub-tasks in the framework, enabling the model to leverage feature interweaving among tasks.
The overall design fully leverages the spatial context,
bridges all the tasks in panoptic segmentation by integrating
features among them, and builds a global view for the image
scene, leading to better refinement of features, more robust
representations for image segmentation, and higher prediction
results.
Our contributions are three-fold:
• In this paper, we present a new unified framework for
panoptic segmentation. Our framework is built on the
one-stage detector RetinaNet, which facilitates feature interweaving at the pixel level.
• Based on the proposed framework, we design four parallel sub-networks to refine sub-task features. Among the
sub-networks, we propose the spatial information flows
to bridge all sub-tasks by making them location-aware.
Our framework is denoted as SpatialFlow.
• We perform a detailed ablation study on various components of SpatialFlow. Extensive experimental results
show that SpatialFlow achieves state-of-the-art results,
which are 47.9 PQ and 62.5 PQ on COCO [21] and
Cityscapes [22] panoptic benchmarks.
The rest of our paper is organized as follows: in Section II,
we briefly revisit recent progress related to this paper; in
Section III, we first present the proposed unified framework for
panoptic segmentation based on RetinaNet, then we illustrate
all the details of the designed parallel sub-networks and the
spatial information flows; in Section IV, V, VI, we present
all details and results of the experiments, analyze the effect
of each component, and make further discussions; finally, we
conclude the paper in Section VII.
II. RELATED WORKS
After the pioneering application of AlexNet [23] on the
ImageNet datasets [24], deep learning methods have come to
dominate computer vision. These methods have dramatically
improved the state-of-the-art in many vision tasks, including
image recognition [23], [25], [26], [27], [28], [29], [30], image
retrieval [31], [32], metric learning [33], [34], [35], object
detection [36], [37], [19], image segmentation [38], [39], [4],
human pose estimation [40], [41], and many other tasks.
Our work builds on prior works in object detection and
image segmentation. We apply multi-task learning [42], [43]
in our model, which makes the thing and stuff segmentation tasks benefit each other and builds a global view of the image scene.
Next, we review some works that are closest to our work as
follows.
A. Object Detection
Our community has witnessed remarkable progress in object
detection. Works, such as [37], [19], [44], tackled the detection
problem by a two-stage approach. They first generated a
number of object proposals as candidates, followed by a
classification head and a regression head on each RoI. Numerous recent breakthroughs have been made, such as adjusting
network structures [45], [46] and searching for better training
strategies [47], [48], [49]. Another type of detector followed
the single-stage pipeline, such as [50], [51], [18]. They directly
predict object categories and regress bounding box locations
based on pre-defined anchors. Recently, researchers have focused on improving the localization quality of one-stage detectors and proposed anchor-free algorithms [13], [52], [53], [54], [55].
In [18], the authors designed two parallel sub-networks
for classification and regression, respectively. In this paper,
SpatialFlow extends RetinaNet by adopting the design of
parallel sub-networks.
B. Instance Segmentation
Instance segmentation is a task that requires a pixel-level mask for each instance. Existing methods can be divided into two main categories: segmentation-based and region-based methods. Segmentation-based approaches, such as [56], [57], first generate a pixel-level segmentation map over the image and then perform grouping to identify the instance mask of each object. Region-based methods, such as [4], [58], [14], are closely related to object detection algorithms: they predict the instance masks within the bounding boxes generated by detectors. Region-based methods can achieve higher performance than their segmentation-based counterparts, which motivates us to resort to region-based methods. In SpatialFlow, we adopt a thing branch upon RetinaNet for thing segmentation.
Fig. 2. An illustration of the overall architecture. SpatialFlow consists of three parts: (a) Backbone with FPN; (b) Four parallel sub-networks: we propose the spatial information flows and feature fusion among tasks in this part. The spatial flows are illustrated as orange dashed arrows, and the feature fusion is not shown in this figure for clarity; (c) Four heads for specific tasks: the classification head and the regression head together predict the detection boxes for the thing head. The final result of SpatialFlow is a combination of the detected boxes and the outputs of the thing head and the stuff head.
C. Semantic Segmentation
Fully convolutional networks are essential to semantic segmentation [59], and their variants achieve state-of-the-art results on various segmentation benchmarks. It has been proven that contextual information plays a vital role in segmentation [60]. A number of works followed this idea: dilated convolution [38] was invented to keep feature resolution and maintain contextual details; the Deeplab series [61], [62] proposed Atrous Spatial Pyramid Pooling (ASPP) to capture global and multi-scale contextual information; PSPNet [5] used spatial pyramid pooling to collect contextual priors; encoder-decoder networks [39], [63] were designed to capture contextual information in the encoder and gradually recover the details in the decoder. Our SpatialFlow, built upon FPN [45], uses an encoder-decoder architecture for stuff segmentation to capture the contextual information. We take the spatial context of object detection into consideration and build a connection between thing and stuff segmentation.
D. Panoptic Segmentation
The panoptic segmentation task was proposed in [3], where
the authors provided a baseline method with two separate
networks, then used a heuristic post-processing method to
merge two outputs. Later, Li et al. [64] followed this task
and introduced a weakly- and semi-supervised panoptic segmentation method. Recently, several unified frameworks have
been proposed. De Geus et al. [6] used a shared backbone for
both things and stuff segmentation, while Li et al. [7] took
a step further by considering things and stuff consistency and
proposed a unified network named TASCNet. Kirillov et al. [8]
introduced PanopticFPN by endowing Mask R-CNN [4] with a
stuff branch, which ignores the connection between things and
stuff. Li et al. [9] aimed to capture this connection by utilizing attention modules. To solve the conflicts in the result merging process, Liu et al. [11] designed a spatial ranking module. Also, Xiong et al. [10] proposed a parameter-free panoptic head to resolve the conflicts. Thinking differently, Yang et al. [69] presented a single-shot approach for panoptic segmentation. However, most of these methods fail to highlight the significance of the spatial features. Our SpatialFlow proposes information flows to make all tasks location-aware, which helps build a panoptic view for image segmentation.
III. SPATIALFLOW
Object location is one of the key factors when building a
global view for panoptic segmentation. However, recent works [6], [7], [8], [10], [11] for panoptic segmentation focus on how to combine the outputs of the segmentation tasks properly but ignore the significance of object location in the training phase. In this work, we propose a new unified framework, SpatialFlow, which makes all sub-tasks location-aware. SpatialFlow is conceptually simple: RetinaNet [18] with two added sub-networks and two extra heads for thing and stuff segmentation. More importantly, we add multi-stage spatial information flows among the sub-networks.
We begin by reviewing the RetinaNet detector. RetinaNet
is one of the most successful fully convolutional one-stage
detectors. It consists of three parts: backbone with FPN [45],
two parallel sub-networks, and two task-specific heads for
box classification and regression. In SpatialFlow, we adopt the
main network structure of RetinaNet. We illustrate the sketch
of our framework in Figure 2.
A. Naive Implementation
As we discussed in Section I, RetinaNet shows its merits
in pixel-level feature integration, which is beneficial for segmentation tasks. To develop a unified framework for panoptic
segmentation based on RetinaNet, the most naive way is to add
one thing head and one stuff head upon FPN features to enable
thing and stuff segmentation. In this section, we introduce the
naive implementation of the unified framework, which, like previous methods [8], [11], [10], ignores task feature refinement and the integration of box locations, but is built on RetinaNet.

Fig. 3. The designs for each part in SpatialFlow. In the dashed rectangle (a), we show the output features of FPN, named {P3, P4, P5, P6, P7}. In the dashed rectangle (b), we present the architecture of the stuff head. All the information flows in the sub-networks are illustrated in the dashed box (c).
Next, we will introduce the detailed design of each element
in the naive implementation.
1) Backbone: We adopt the same backbone structure
as RetinaNet. The backbone contains FPN, whose outputs
are five levels of features named {P3, P4, P5, P6, P7} with downsample rates of 8, 16, 32, 64, and 128, respectively. In FPN, all features have 256 channels. We show the details in Figure 3 (a). Following [20], we treat these features differently for different tasks: we use all five levels to predict the bounding boxes in detection but only send {P3, P4, P5} to thing and stuff segmentation.
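For concreteness, a minimal sketch of this routing (our own illustration with made-up tensor shapes, not the released code) is shown below.

```python
# A toy illustration of the FPN feature routing described above: detection uses
# all five levels, while thing and stuff segmentation only receive P3-P5.
import torch

fpn_feats = {f"P{i}": torch.randn(1, 256, 512 // 2 ** i, 512 // 2 ** i)
             for i in range(3, 8)}                       # P3..P7, 256 channels each

det_feats = [fpn_feats[f"P{i}"] for i in range(3, 8)]    # box classification/regression
seg_feats = [fpn_feats[f"P{i}"] for i in range(3, 6)]    # thing and stuff segmentation
```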
2) RetinaNet-based sub-networks: We present the parallel
sub-networks in RetinaNet - classification sub-network (cls
sub-net for short) and regression sub-network (reg sub-net for
short). The operations in these sub-networks, which transform
the output features of FPN to the inputs of downstream heads,
can be formulated as follows:
$$P_{\text{reg}_{i,j}} = \phi(P_{\text{reg}_{i,j-1}}), \qquad P_{\text{cls}_{i,j}} = \phi(P_{\text{cls}_{i,j-1}}). \quad (1)$$

Here, $i$ is the FPN level index, $j$ is the stage index in the sub-networks, and $\phi$ denotes a network block that contains a 3 × 3 convolution layer and a ReLU layer. In the cls and reg sub-networks, $i \in \{3,4,5,6,7\}$, $j \in \{1,2,3,4\}$, and $P_{\text{cls}_{i,0}} = P_{\text{reg}_{i,0}} = P_i$, while $i \in \{3,4,5\}$ for thing and stuff segmentation.
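For concreteness, the following is a minimal PyTorch sketch of one such sub-network (our own illustration, not the authors' code); as in RetinaNet, the same tower weights are shared across FPN levels.

```python
import torch.nn as nn

class ConvTower(nn.Module):
    """A sketch of Eq. (1): a stack of 3x3 Conv + ReLU blocks applied to one FPN level."""
    def __init__(self, channels: int = 256, num_stages: int = 4):
        super().__init__()
        self.tower = nn.Sequential(*[
            nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                          nn.ReLU(inplace=True))
            for _ in range(num_stages)
        ])

    def forward(self, p_i):      # p_i: FPN feature P_i (= P_{i,0})
        return self.tower(p_i)   # P_{i,num_stages}, fed to the downstream head

cls_subnet = ConvTower()         # classification sub-network (cls sub-net)
reg_subnet = ConvTower()         # regression sub-network (reg sub-net)
```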
3) Task-specific heads: As illustrated in Figure 2 (c), we
apply four heads for box classification, box regression, thing
segmentation, and stuff segmentation, respectively. In the
classification and the regression head, the final outputs of the detection are obtained as $O_{\text{cls}_i} = \psi(P_{\text{cls}_{i,4}})$ and $O_{\text{reg}_i} = \phi(P_{\text{reg}_{i,4}})$, where $O_{\text{cls}_i}$ and $O_{\text{reg}_i}$ represent the outputs of the classification head and the regression head at FPN level $i$; both $\psi$ and $\phi$ here are single 3 × 3 convolution layers applied to the outputs of the classification and regression sub-nets. For the thing head, we apply it to each predicted box and adopt the same design as Mask R-CNN [4]. For each RoI feature, $O_{\text{RoI}_k} = \psi(\zeta(\phi(P_{\text{RoI}_k})))$, where $O_{\text{RoI}_k}$ is the output of the $k$-th predicted box, $\phi$ represents four 3 × 3 convolution layers with ReLU, $\zeta$ denotes one 2 × 2 stride-2 deconvolution layer with ReLU, and $\psi$ is a 1 × 1 output convolution layer. After the stuff sub-net, we obtain three levels of feature
maps with scales of 1/8, 1/16, 1/32 of the original image.
We perform upsampling on each feature map gradually by
blocks, each of which contains a 3 × 3 convolution layer,
a group norm [65] layer, a ReLU layer, and a 2× bilinear
upsampling operation. All the features are upsampled to the
scale of 1/4, which are then element-wise summed. A final
1 × 1 convolution layer, a 4× bilinear upsampling operation,
and a softmax are applied to get the segmentation result. The
stuff head is shown in Figure 3 (b) with details. To generate
the final output of SpatialFlow, we first perform a heuristic
post-processing method [8] to merge the results of thing and
stuff segmentation, then fill the unassigned area in the merged
map with the predicted boxes’ locations and categories.
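The stuff head described above can be sketched as follows; this is our reading of Fig. 3 (b) (128-d intermediate features, 54 output logits), and details such as the number of groups in the group norm are assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

def up_block(cin, cout):
    # One "Conv-GN-ReLU-2x" block: 3x3 conv, group norm, ReLU, 2x bilinear upsampling.
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, padding=1),
        nn.GroupNorm(32, cout),
        nn.ReLU(inplace=True),
        nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
    )

class StuffHead(nn.Module):
    """Upsample P3 (1/8), P4 (1/16), P5 (1/32) to 1/4 scale, sum them element-wise,
    and predict per-pixel stuff logits (53 stuff classes + the special 'other' class)."""
    def __init__(self, in_ch=256, mid_ch=128, num_classes=54):
        super().__init__()
        self.towers = nn.ModuleList(
            nn.Sequential(up_block(in_ch, mid_ch),
                          *[up_block(mid_ch, mid_ch) for _ in range(n - 1)])
            for n in (1, 2, 3))          # 1, 2, 3 blocks for P3, P4, P5 respectively
        self.predictor = nn.Conv2d(mid_ch, num_classes, 1)

    def forward(self, p3, p4, p5):
        fused = sum(t(p) for t, p in zip(self.towers, (p3, p4, p5)))   # 1/4 scale, 128-d
        logits = F.interpolate(self.predictor(fused), scale_factor=4,
                               mode="bilinear", align_corners=False)   # 4x back to 1x
        return logits.softmax(dim=1)
```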
We have shown the key components of the proposed unified framework. The adoption of RetinaNet [18] keeps the features at the pixel level before the segmentation tasks are performed.
There remain obstacles preventing the unified framework from building a global view of the image scene, e.g., the lack of feature interaction between things and stuff. The naive implementation also has practical problems regarding the refinement of the FPN features for thing and stuff segmentation. To further improve the quality of the learned features for thing and stuff segmentation and strengthen the interaction between things and stuff in the image, we propose two techniques: adding thing and stuff parallel sub-networks and introducing spatial information flows.
B. Thing and stuff parallel sub-networks
In RetinaNet, the parallel sub-networks refine the FPN
features with multi-stage convolution layers, which transform
the FPN features to task-specific features and lead to better
performance. However, there is no such refinement for the input features of thing and stuff segmentation in the naive implementation. In this section, we apply the same mechanism to these two segmentation tasks. Moreover, the created multi-stage features facilitate the delivery of the spatial context from the box regression task to the others. We show the details of this part in Figure 3 (c).
In this section, we propose to add two additional sub-networks: a thing sub-network and a stuff sub-network. We adopt a similar structure to the cls and reg sub-networks, so that there are now four parallel sub-networks between the FPN and the task-specific heads. We present the modifications in the thing and stuff sub-networks below:

$$P_{\text{thing}_{i,j}} = \phi(P_{\text{thing}_{i,j-1}}); \qquad P_{\text{stuff}_{i,j}} = \phi(P_{\text{stuff}_{i,j-1}}), \quad (2)$$

where $P_{\text{thing}_{i,0}} = P_{\text{stuff}_{i,0}} = P_i$. As the dominant features in thing and stuff segmentation are different, the number of stages required by the sub-networks depends on the task: more stages are needed in stuff segmentation than in thing segmentation for feature refinement. We conjecture that the reason is that pixel-level features are more sensitive to details than instance-level features. In the final version, we adopt four stages in the stuff sub-network and keep only one in the thing sub-network, which gives the best segmentation performance. In each stage, we implement a 3 × 3 convolutional layer and a ReLU layer. We illustrate the overall structure of the sub-networks in Figure 3 (c). The experimental results for the number of stages in the thing and stuff sub-networks can be found in Table VII and Table VIII.
C. Spatial information flows
As illustrated in Figure 1, all sub-tasks in our proposed
panoptic segmentation framework are related to the locations
of objects. The box location information is implied in the
multi-stage feature representations of the box regression sub-network. We propose the spatial information flows to support feature refinement in the sub-networks. The spatial information flows make the other sub-tasks aware of box locations. Furthermore, adding the semantic feature to thing segmentation has been proven effective in HTC [14]. We also add a semantic flow that adopts a 3 × 3 convolution layer to transform the stuff feature into the thing feature. It brings slight improvements to SpatialFlow, as shown in Table IX and Table X. We display the detailed structure of the spatial flows in Figure 3 (c). They can be implemented as follows:
$$\begin{aligned}
P_{\text{reg}_{i,j}} &= \phi(P_{\text{reg}_{i,j-1}});\\
P_{\text{cls}_{i,j}} &= \phi(P_{\text{cls}_{i,j-1}} + \psi(P_{\text{reg}_{i,j}}));\\
P_{\text{stuff}_{i,j}} &= \phi(P_{\text{stuff}_{i,j-1}} + \psi(P_{\text{reg}_{i,j}}));\\
P_{\text{thing}_{i,1}} &= \phi(P_i + \zeta(P_{\text{stuff}_{i,4}}),\ \psi_{\text{offset}}(P_{\text{reg}_{i,4}} + P_i));\\
P_{\text{reg}_{i,0}} &= P_{\text{cls}_{i,0}} = P_{\text{thing}_{i,0}} = P_{\text{stuff}_{i,0}} = P_i.
\end{aligned} \quad (3)$$

Here, $\psi$ denotes an adaptation convolution from the box regression task to the others, and $\zeta$ denotes an adaptation convolution from the stuff sub-net to the thing sub-net. We use a 3 × 3 convolution layer for both $\psi$ and $\zeta$. All features have 256 channels in this part.
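To make the flows concrete, the following is a minimal PyTorch sketch of Eq. (3) for a single FPN level. It is our own illustration rather than the released code: using separate adaptation convolutions per stage and per branch, and coupling the regression feature into the thing branch with a plain additive term, are simplifying assumptions (the last line of Eq. (3) feeds it through $\psi_{\text{offset}}$).

```python
import torch.nn as nn

def conv_relu(ch):
    return nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True))

class SpatialFlowTowers(nn.Module):
    """A sketch of the additive flows in Eq. (3) for one FPN level: each stage of the
    cls/stuff towers receives the current regression feature through a 3x3 adaptation
    conv (psi) before its own conv; the single-stage thing tower receives the final
    stuff feature through zeta (semantic flow) and the final regression feature."""
    def __init__(self, ch=256, num_stages=4):
        super().__init__()
        self.reg_stages   = nn.ModuleList(conv_relu(ch) for _ in range(num_stages))
        self.cls_stages   = nn.ModuleList(conv_relu(ch) for _ in range(num_stages))
        self.stuff_stages = nn.ModuleList(conv_relu(ch) for _ in range(num_stages))
        self.psi_cls   = nn.ModuleList(nn.Conv2d(ch, ch, 3, padding=1) for _ in range(num_stages))
        self.psi_stuff = nn.ModuleList(nn.Conv2d(ch, ch, 3, padding=1) for _ in range(num_stages))
        self.psi_thing = nn.Conv2d(ch, ch, 3, padding=1)   # reg -> thing flow (simplified, see text)
        self.zeta      = nn.Conv2d(ch, ch, 3, padding=1)   # semantic flow: stuff -> thing
        self.thing_stage = conv_relu(ch)                   # the thing sub-network has one stage

    def forward(self, p_i):
        reg = cls = stuff = p_i                            # P_{i,0} for every tower
        for j in range(len(self.reg_stages)):
            reg   = self.reg_stages[j](reg)                               # P_reg_{i,j}
            cls   = self.cls_stages[j](cls + self.psi_cls[j](reg))        # spatial flow into cls
            stuff = self.stuff_stages[j](stuff + self.psi_stuff[j](reg))  # spatial flow into stuff
        thing = self.thing_stage(p_i + self.zeta(stuff) + self.psi_thing(reg))
        return cls, reg, thing, stuff
```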
Moreover, to make a fair comparison with UPSNet [10] on COCO, we introduce deformable convolution [66] layers into the sub-networks. We further adopt a method to incorporate the spatial context into the deformable convolutions more appropriately: we first combine the spatial information flow and the task-specific feature, then use the combined feature to generate the offsets for the deformable convolution in the task-specific sub-networks. The process can be formulated as follows:

$$\begin{aligned}
P_{\text{reg}_{i,j}} &= \phi(P_{\text{reg}_{i,j-1}});\\
P_{\text{cls}_{i,j}} &= \phi_{\text{dcn}}(P_{\text{cls}_{i,j-1}},\ \psi_{\text{offset}}(P_{\text{reg}_{i,j}} + P_{\text{cls}_{i,j-1}}));\\
P_{\text{stuff}_{i,j}} &= \phi_{\text{dcn}}(P_{\text{stuff}_{i,j-1}},\ \psi_{\text{offset}}(P_{\text{reg}_{i,j}} + P_{\text{stuff}_{i,j-1}}));\\
P_{\text{thing}_{i,1}} &= \phi(P_i + \zeta(P_{\text{stuff}_{i,4}}),\ \psi_{\text{offset}}(P_{\text{reg}_{i,4}} + P_i));\\
P_{\text{reg}_{i,0}} &= P_{\text{cls}_{i,0}} = P_{\text{thing}_{i,0}} = P_{\text{stuff}_{i,0}} = P_i.
\end{aligned} \quad (4)$$

In the equation, $\phi_{\text{dcn}}$ represents a deformable convolution layer, and $\psi_{\text{offset}}$ denotes an adaptation convolution layer that generates the offsets for the deformable convolution. Unless specified, we do not adopt the setting with deformable convolution.
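A single deformable stage of Eq. (4) might look as follows, assuming torchvision's DeformConv2d; treating $\phi_{\text{dcn}}$ as a deformable convolution followed by a ReLU is our assumption.

```python
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformFlowStage(nn.Module):
    """One stage of Eq. (4): the offsets of the deformable conv on the cls (or stuff)
    tower are generated from the sum of the regression feature and the tower's own
    feature, so the spatial context steers the sampling locations."""
    def __init__(self, ch=256):
        super().__init__()
        self.phi_dcn = DeformConv2d(ch, ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)                  # assumed activation
        # psi_offset: predicts 2 offsets per kernel sample, i.e. 2 * 3 * 3 = 18 channels.
        self.psi_offset = nn.Conv2d(ch, 2 * 3 * 3, kernel_size=3, padding=1)

    def forward(self, task_feat, reg_feat):
        offsets = self.psi_offset(reg_feat + task_feat)    # psi_offset(P_reg + P_task)
        return self.relu(self.phi_dcn(task_feat, offsets)) # P_task at the next stage
```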
IV. EXPERIMENTS
A. Dataset and Evaluation metric
1) Dataset: We evaluate our model on both COCO [21]
and Cityscapes [22]. COCO consists of 80 things and 53
stuff classes. We use the 2017 data splits with 118k/5k/20k
train/val/test images. We use the train split for training, and report lesion and sensitivity studies by evaluating on the val split. For our main results, we report our panoptic performance on
the test-dev split. Cityscapes has 5k high-resolution images
with fine pixel-accurate annotations: 2975 train, 500 val,
and 1525 test. There are 19 classes in Cityscapes, 8 of which have instance-level masks. For all experiments on Cityscapes, we report our performance on the val split with 11 stuff classes and 8 thing classes.
2) Evaluation metric: We adopt the panoptic quality (PQ)
as the metric. As proposed in [3], PQ can be formulated as
follows:

$$\mathrm{PQ} = \underbrace{\frac{\sum_{(p,g)\in \mathit{TP}} \mathrm{IoU}(p,g)}{|\mathit{TP}|}}_{\text{segmentation quality (SQ)}} \times \underbrace{\frac{|\mathit{TP}|}{|\mathit{TP}| + \frac{1}{2}|\mathit{FP}| + \frac{1}{2}|\mathit{FN}|}}_{\text{recognition quality (RQ)}} \quad (5)$$
where p and g are predicted and ground truth segments, TP
(true positives), FP (false positives), and FN (false negatives)
represent matched pairs of segments (IoU (p, g) > 0.5),
unmatched predicted segments, and unmatched ground truth
segments, respectively. Besides, PQ can be interpreted as the product of a segmentation quality (SQ) term and a recognition quality (RQ) term. We also use SQ and RQ to measure the performance in our experiments.
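As a toy illustration of Eq. (5) for a single class (the dataset-level metric averages the per-class values), PQ can be computed from the matched IoUs and the FP/FN counts:

```python
def panoptic_quality(matched_ious, num_fp, num_fn):
    """Single-class PQ from Eq. (5): `matched_ious` holds IoU(p, g) for every matched
    pair with IoU > 0.5 (the true positives)."""
    tp = len(matched_ious)
    if tp == 0:
        return 0.0, 0.0, 0.0
    sq = sum(matched_ious) / tp                        # segmentation quality
    rq = tp / (tp + 0.5 * num_fp + 0.5 * num_fn)       # recognition quality
    return sq * rq, sq, rq                             # PQ = SQ x RQ

# Two matches with IoU 0.9 and 0.7, one false positive, no false negatives:
# panoptic_quality([0.9, 0.7], 1, 0) -> (0.64, 0.8, 0.8)
```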
B. Implementation Details
1) Training: As a unified framework for panoptic segmentation, there are four different losses for SpatialFlow to
optimize during the training stage. The loss function can be
formulated as follows:

$$L = (L_{\text{cls}} + L_{\text{reg}} + L_{\text{thing}}) + \lambda \cdot L_{\text{stuff}}, \quad (6)$$

where $L_{\text{cls}}$, $L_{\text{reg}}$, and $L_{\text{thing}}$ belong to the thing segmentation task, and $L_{\text{stuff}}$ is the loss of the stuff segmentation. We
add a hyper-parameter λ to balance the losses between thing
and stuff segmentation. We implement our SpatialFlow with a
toolbox [67] based on PyTorch [68].
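In code, the joint objective of Eq. (6) is simply a weighted sum; the sketch below is illustrative, with the individual losses standing in for the focal/box/mask losses of the detection and thing branches and the per-pixel cross-entropy of the stuff branch.

```python
def total_loss(l_cls, l_reg, l_thing, l_stuff, lam=0.25):
    # Eq. (6); lambda = 0.25 is the value selected on COCO (Table IV), Cityscapes uses 1.0.
    return (l_cls + l_reg + l_thing) + lam * l_stuff
```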
We inherit all the hyper-parameters from RetinaNet except
that we set the threshold of NMS to 0.4 when generating
proposals during training. For thing prediction, we add the
ground truth boxes to the proposals set and run the thing head
for all proposals. For training strategies, we fix the batch
norm layer in the backbone and train all models over 4 GPUs
with a total of 8 images per minibatch. On MS-COCO [21],
we use the longer training strategy adopted by RetinaNet (1.5×) [18] and RetinaMask (2×) [20]. All models are trained for 20 epochs with an initial learning rate of $5 \times 10^{-3}$, which is decreased by a factor of 10 after 16 and 19 epochs; on Cityscapes [22], we set the initial learning rate to $1.25 \times 10^{-2}$ and borrow the number of iterations from [8].
Unless specified, we resize the shorter edge of the image
to 800 pixels on COCO, while on Cityscapes, we adopt
512 × 1024 image crops after scaling each image by 0.5 to
2.0×. As Kirillov et al. [8] did, we also predict a special 'other' class for all thing categories in the stuff head on the COCO benchmark.
2) Inference: Our model follows a pipeline in the inference
stage: (1) generate the detection results; (2) obtain the maps
of thing and stuff segmentation; (3) merge the two outputs
to form a panoptic segmentation map; (4) fill the unassigned
area in the result with the detected boxes and their categories. In detection, we set the threshold of NMS to 0.4 for each class separately, and choose the top-100 scoring bounding boxes to
send to thing head. During merging, we first ignore the stuff
regions labeled ‘other’; then we resolve the overlap problem
between instances based on their scores, and merge the thing
and stuff map in favor of things; at last, we fill the unassigned
area with detection boxes in the result segmentation map to
form the final output. For the hyper-parameters of SpatialFlow in the inference stage, we fix the confidence score threshold for the instance masks to 0.37, set the overlap threshold of instance masks to 0.37, and set the area limit threshold of stuff regions to 4900. When performing integration with the detection box results, we propose a new hyper-parameter, the overlap between a detection box and the unassigned area in the segmentation map, and fix its threshold to 0.6. For the hyper-parameters on Cityscapes, we modify the overlap threshold of the instance masks to 0.25 and change the area limit threshold of stuff regions to 2048.
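A simplified NumPy sketch of this merging procedure is given below, using the COCO thresholds quoted above; the actual implementation follows the heuristic of [8], and details such as the id encoding and the handling of the 'other' label are our own simplifications.

```python
import numpy as np

def merge_panoptic(inst_masks, inst_scores, inst_boxes, stuff_seg,
                   conf_thr=0.37, overlap_thr=0.37, stuff_area=4900, box_overlap_thr=0.6):
    """Merge instance masks, a stuff segmentation map, and detection boxes into one
    panoptic map (-1 = unassigned; instances are encoded as 1000 + index)."""
    pan = np.full(stuff_seg.shape, -1, dtype=np.int64)
    # 1) Paste instances in decreasing score order, resolving overlaps in favor of
    #    higher-scoring things and skipping low-confidence or mostly-occluded masks.
    order = np.argsort(-inst_scores)
    for idx in order:
        if inst_scores[idx] < conf_thr:
            continue
        mask = inst_masks[idx].astype(bool)
        free = mask & (pan == -1)
        if mask.sum() == 0 or 1.0 - free.sum() / mask.sum() > overlap_thr:
            continue
        pan[free] = 1000 + idx
    # 2) Paste stuff regions that are large enough (the 'other' class is assumed to
    #    be removed from `stuff_seg` beforehand).
    for c in np.unique(stuff_seg):
        region = (stuff_seg == c) & (pan == -1)
        if region.sum() >= stuff_area:
            pan[region] = c
    # 3) Fill remaining unassigned pixels with detection boxes that overlap the
    #    unassigned area sufficiently.
    for idx in order:
        x0, y0, x1, y1 = inst_boxes[idx].astype(int)
        box = np.zeros_like(pan, dtype=bool)
        box[y0:y1, x0:x1] = True
        free = box & (pan == -1)
        if box.sum() > 0 and free.sum() / box.sum() > box_overlap_thr:
            pan[free] = 1000 + idx
    return pan
```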
C. Main Results

In this section, we compare SpatialFlow with the state-of-the-art methods in panoptic segmentation. We show all
the main results in Table I, Table II, and Table III. SpatialFlow achieves state-of-the-art results on both COCO [21]
and Cityscapes [22] panoptic benchmark.
1) MS COCO: To make a fair comparison, we report the
results in Table I and Table II, where the experiment settings
are different. In Table I, we present the prediction results
without bells and whistles. With a single ResNet-101-FPN
backbone, our SpatialFlow can achieve 42.9 PQ on COCO
test-dev split, which outperforms PanopticFPN [8] by 2.0 PQ
and OANet [11] by 1.6 PQ. More importantly, SpatialFlow achieves a new state-of-the-art PQ$^{St}$ of 33.0, which outperforms the other models by a large margin (3.3 PQ and 5.3 PQ, respectively). The results demonstrate the effectiveness of integrating the spatial features at the pixel level, which significantly impacts stuff segmentation. However, SpatialFlow lags behind OANet [11] in PQ$^{Th}$. In OANet, the authors focus on solving the overlapping problem of instances when rendering the thing results into the final panoptic result. SpatialFlow applies a simple method to this problem, causing the inferior performance in PQ$^{Th}$.
Then we apply deformable convolution [66] to both the backbone and the sub-networks and report the results with the multi-scale strategy in Table II. When training, the scales of
short edges are randomly sampled from [400, 1400], and the
scales of long edges are fixed as 1600. For inference, we
feed multi-scale images to SpatialFlow, and the scales are
(1500, 1000), (1800, 1200), and (2100, 1400) with horizontal
flip. We achieve 47.9 PQ, which is the state-of-the-art result on
COCO panoptic benchmark. As shown in Table II, our method
outperforms the runner-up of the COCO 2018 challenge by
1.1 PQ with a single model, demonstrating the effectiveness of SpatialFlow. Although AUNet outperforms our method in PQ$^{Th}$, it uses ResNeXt-152-DCN as the backbone. In fact, with a stronger backbone (ResNeXt-101-DCN) and model ensemble, our method can achieve 50.2 PQ on the COCO test-dev split.
2) Cityscapes: We also report the results under different experiment settings in Table III. Without the COCO pre-trained model, SpatialFlow achieves 59.6 PQ on the Cityscapes val split, which is 1.5 PQ and 0.6 PQ higher than PanopticFPN [8] and AUNet [9], respectively. With the COCO pre-trained model, SpatialFlow achieves 62.5 PQ on Cityscapes with multi-scale testing, which is 0.7 PQ higher than UPSNet [10] under the same setting. SpatialFlow outperforms all other methods in terms of PQ$^{St}$ while getting inferior performance on PQ$^{Th}$ compared to UPSNet and AdaptIS.
TABLE I
Comparison with the state-of-the-art methods on the COCO 2017 test-dev split. We only compare with state-of-the-art methods without deformable convolutions here.

model           | backbone       | PQ   | PQ^Th | PQ^St | SQ   | RQ
JSIS-Net [6]    | ResNet-50      | 27.2 | 29.6  | 23.4  | 71.9 | 35.9
DeeperLab [69]  | Xception-71    | 34.3 | 37.5  | 29.6  | 77.1 | 43.1
PanopticFPN [8] | ResNet-101-FPN | 40.9 | 48.3  | 29.7  | -    | -
OANet [11]      | ResNet-101-FPN | 41.3 | 50.4  | 27.7  | -    | -
SSAP [70]       | ResNet-101-FPN | 36.9 | 40.1  | 32.0  | -    | -
SpatialFlow     | ResNet-101-FPN | 42.9 | 49.5  | 33.0  | 78.8 | 52.3
TABLE II
Comparison with the state-of-the-art methods on the COCO 2017 test-dev split. In this table, we report our results with deformable convolution and the multi-scale strategy. The top 3 rows contain results of the top 3 models taken from the official leaderboard of the COCO 2018 Panoptic Segmentation Challenge.

model           | backbone        | PQ   | PQ^Th | PQ^St | SQ   | RQ
Megvii (Face++) | ensemble model  | 53.2 | 62.2  | 39.5  | 83.2 | 62.9
Caribbean       | ensemble model  | 46.8 | 54.3  | 35.5  | 80.5 | 57.1
PKU 360         | ResNeXt-152-DCN | 46.3 | 58.6  | 27.6  | 79.6 | 56.1
AdaptIS [71]    | ResNeXt-101     | 42.8 | 50.1  | 31.8  | -    | -
AUNet [9]       | ResNeXt-152-DCN | 46.5 | 55.9  | 32.5  | 81.0 | 56.1
UPSNet [10]     | ResNet-101-DCN  | 46.6 | 53.2  | 36.7  | 80.5 | 56.9
SOGNet [72]     | ResNet-101-DCN  | 47.8 | -     | -     | 80.7 | 57.6
SpatialFlow     | ResNet-101-DCN  | 47.9 | 54.5  | 38.0  | 81.7 | 57.6
TABLE III
Comparison with the state-of-the-art methods on the Cityscapes val split. In this table, '-R101' represents that the backbone is ResNet-101 and '-X101' ResNeXt-101 [73]; '-COCO' means using a COCO pre-trained model; '-M' is multi-scale testing.

model                   | PQ   | PQ^Th | PQ^St
PanopticFPN-R101 [8]    | 58.1 | 52.0  | 62.5
AUNet-R101 [9]          | 59.0 | 54.8  | 62.1
TASCNet-R101-COCO [7]   | 59.2 | 56.0  | 61.5
UPSNet-R101-COCO-M [10] | 61.8 | 57.6  | 64.8
SSAP-R101-M [70]        | 61.1 | 55.0  | -
AdaptIS-X101-M [71]     | 62.0 | 58.7  | 64.4
SpatialFlow-R101        | 59.6 | 55.0  | 63.1
SpatialFlow-R101-COCO-M | 62.5 | 56.6  | 66.8
We conjecture that this phenomenon is caused by the inferior detection performance of RetinaNet on Cityscapes. To obtain the result of 62.5 PQ on the Cityscapes val split, we first replace the convolution layers in the stuff branch with deformable convolutions as UPSNet does, and then follow the steps below: (1) Finetune the COCO pre-trained model. As the number of thing and stuff classes in Cityscapes (8/11) is smaller than in COCO (80/53), we have to finetune the layers that are related to the number of classes. We freeze the remaining layers and use a learning rate of $2.5 \times 10^{-3}$ to train for 2 epochs. (2) Train the finetuned model as the standard SpatialFlow does. (3) Apply the multi-scale testing trick. The scales that we use on Cityscapes are (2304, 1152), (2432, 1216), (2560, 1280), and (2688, 1344) with horizontal flip.
D. Ablation Experiments
We run a number of ablations to analyze SpatialFlow. Unless specified, we use the naive implementation of SpatialFlow presented in Section III-A as our baseline model for all experiments in this section. We discuss the details below.

TABLE IV
Loss Balance: The results of the baseline model on COCO val for different values of λ based on ResNet-50 with an image size of 600px. A proper λ brings a large gain.

       | λ = 1.0 | 0.75 | 0.5  | 0.3  | 0.25 | 0.2
PQ     | 37.5    | 38.2 | 38.8 | 39.1 | 39.3 | 39.0
PQ^Th  | 41.8    | 43.0 | 44.0 | 44.5 | 45.1 | 44.9
PQ^St  | 30.9    | 31.1 | 31.0 | 30.9 | 30.5 | 30.0
Loss Balance. We first investigate the best value of the hyper-parameter λ. We adopt the baseline model in this section. Table IV shows the results of using various values of λ on COCO. We demonstrate the power of λ and find that the best value to balance the losses on COCO is 0.25, with which the baseline model achieves 39.3 PQ with an image size of 600px and earns a 1.8 PQ gain compared with λ = 1.0. For Cityscapes, we set λ = 1.0 following [8].
Contribution of Components. In this section, we evaluate
the sub-networks and the spatial flows of SpatialFlow on both
COCO and Cityscapes. The results are shown in Table V and
Table VI respectively. From the experiment results, we can see
that both the sub-networks and the spatial flows demonstrate
their contribution. The sub-networks improve PQ by 0.6 points and 0.2 points on COCO and Cityscapes, respectively. In particular, we obtain a significant gain on stuff (1.2 PQ on COCO) with the sub-networks, in which we refine the pixel-level features before sending them to the stuff head. The spatial flows improve the performance of things and stuff simultaneously,
TABLE V
Contribution of Components: Ablation results on the COCO val split with ResNet-50. Both sub-networks and spatial flows bring significant gains based on the baseline model.

sub-nets | spatial-flows | PQ   | PQ^Th | PQ^St
         |               | 39.7 | 46.0  | 30.2
   ✓     |               | 40.3 | 46.2  | 31.4
         |      ✓        | 40.2 | 46.5  | 30.7
   ✓     |      ✓        | 40.9 | 46.8  | 31.9

TABLE VI
Contribution of Components: Ablation results on the Cityscapes val split with ResNet-50. Similar gains can be obtained on Cityscapes.

sub-nets | spatial-flows | PQ   | PQ^Th | PQ^St
         |               | 57.3 | 53.5  | 60.0
   ✓     |               | 57.5 | 53.6  | 60.3
         |      ✓        | 58.0 | 54.3  | 60.8
   ✓     |      ✓        | 58.6 | 54.9  | 61.4

TABLE VII
Design of Sub-networks: Ablation results on the number of stages in the thing sub-network. Only one stage is needed to refine the FPN feature for the input of the thing head.

num stages | PQ   | PQ^Th | PQ^St
0          | 39.7 | 46.0  | 30.2
1          | 39.9 | 46.3  | 30.3
2          | 39.7 | 46.0  | 30.2
3          | 39.9 | 46.2  | 30.3
4          | 39.7 | 45.9  | 30.3

TABLE VIII
Design of Sub-networks: Results on the number of stages in the stuff sub-network. More stages bring more gains; the input feature of the stuff head needs to be fully refined by the sub-network.

num stages | PQ   | PQ^Th | PQ^St
0          | 39.7 | 46.0  | 30.2
1          | 39.9 | 45.9  | 30.9
2          | 40.1 | 46.0  | 31.1
3          | 40.2 | 46.1  | 31.4
4          | 40.3 | 46.2  | 31.5
TABLE IX
Spatial Flows: The results of the spatial flows on COCO. Each row adds an extra component to the above.

flows         | PQ   | PQ^Th | PQ^St
-             | 40.3 | 46.2  | 31.4
+ reg-cls     | 40.5 | 46.6  | 31.4
+ reg-stuff   | 40.7 | 46.3  | 32.0
+ reg-thing   | 40.7 | 46.4  | 31.8
+ stuff-thing | 40.9 | 46.8  | 31.9

TABLE X
Spatial Flows: The results of the spatial flows on Cityscapes. Each row adds an extra component to the above.

flows         | PQ   | PQ^Th | PQ^St
-             | 57.5 | 53.6  | 60.3
+ reg-cls     | 58.0 | 55.1  | 60.1
+ reg-stuff   | 58.3 | 54.6  | 60.9
+ reg-thing   | 58.5 | 54.7  | 61.3
+ stuff-thing | 58.6 | 54.9  | 61.4
by 0.5 PQ and 0.7 PQ on COCO and Cityscapes, respectively. Moreover, the spatial flows bring further gains when combined with the sub-networks, compared with the benefits obtained on the baseline model alone. The results indicate that the integration of the spatial context benefits from the feature refinement in the sub-networks.
Design of Sub-networks. We search for the best number of stages for the thing and stuff sub-networks. We conduct experiments on the COCO dataset with ResNet-50 based on the baseline model. The results are shown in Table VII and Table VIII. According to the results, we choose to add only one stage in the thing sub-network and four stages in the stuff sub-network, obtaining 0.2 PQ and 0.6 PQ improvements, respectively. The different numbers of stages in the sub-networks are related to the difference of the dominant features in thing and stuff segmentation.
Spatial Flows. We conduct experiments to highlight the
significance of the proposed spatial information flows between
tasks. The baseline model marked with ‘-’ in Table IX and
Table X is the one with all sub-networks. There are three
paths to deliver the spatial context from the box regression
task to others: the path from the reg sub-net to the cls sub-net
(reg-cls flow), the path to the stuff sub-net (reg-stuff flow), and
the path to the thing sub-net (reg-thing flow). The results are
reported in Table IX and Table X. At first, we add the reg-cls
path, and we obtain a 0.4 PQ$^{Th}$ improvement on COCO and a 1.5 PQ$^{Th}$ gain on Cityscapes, which are brought by better detection results. Adding the spatial context helps the cls sub-net extract discriminative features, which is essential for detection. Then we build a spatial path for the stuff sub-net; as shown in the 4th row of Table IX and Table X, we earn a 0.6 PQ$^{St}$ gain on COCO and a 0.8 PQ$^{St}$ gain on Cityscapes compared with the former model, which indicates that the spatial context has a positive effect on stuff segmentation. The reg-thing path and the semantic path also show their effectiveness on both thing and stuff segmentation. Compared with the original model, SpatialFlow achieves a consistent gain in both thing and stuff segmentation. The results prove the significance of the spatial context in panoptic segmentation to some extent. It is worth noting that we only apply the element-wise sum operation to integrate the spatial context in this work. We believe further improvement could be achieved with a more
V. FURTHER DISCUSSION

In this section, we provide further discussion of the spatial flows and give an overview of how the spatial flows work, how fast SpatialFlow is, and how to apply the spatial flows to other vision tasks.
Fig. 4. The PQ, PQ^Th, and PQ^St results of the base model, the cls flows model, and the spatial flows model with an image size of 600px on the COCO val split: (a) PQ score; (b) PQ score for things; (c) PQ score for stuff.

TABLE XI
Accuracy vs. Speed: Comparison with the state-of-the-art methods on the accuracy-speed balance. We illustrate SpatialFlow performance with different image scales. * indicates that UPSNet applies deformable convolution on the stuff head.

Model           | Backbone    | Scale | PQ   | PQ^Th | PQ^St | FPS
PanopticFPN [8] | ResNet-50   | 800   | 39.0 | 45.9  | 27.9  | 18.9
UPSNet* [10]    | ResNet-50   | 800   | 42.5 | 48.5  | 33.4  | 9.1
DeeperLab [69]  | Xception-71 | 641   | 34.3 | 37.5  | 29.6  | 10.6
SpatialFlow     | ResNet-50   | 800   | 40.9 | 46.8  | 31.9  | 10.3
SpatialFlow     | ResNet-50   | 600   | 40.3 | 45.6  | 32.2  | 13.0
SpatialFlow     | ResNet-50   | 400   | 37.4 | 41.5  | 31.4  | 19.6

Fig. 5. An illustration of the cls-head heatmaps ((a), heatmaps for things) and the stuff-head heatmaps ((b), heatmaps for stuff); rows: image, w/o flows, w/ flows. We provide a comparison between the model with and without spatial flows.

TABLE XII
Detection Results: The results of RetinaNet with or without spatial flows on the COCO val split with ResNet-50 as the backbone. The shorter edges of images are 800px.

Detectors          | mAP  | AP50 | AP75
RetinaNet          | 35.6 | 55.5 | 37.7
RetinaNet w/ flows | 36.7 | 57.1 | 39.4

Spatial Flows vs. Trivial Feature Fusion: Our main idea is to integrate spatial information into all sub-tasks and make them
aware of the locations of the objects, which is different from
trivial feature integration among sub-networks. To prove the
effectiveness of the spatial flows, we design an experiment on
COCO by delivering the feature of the cls sub-network to the other three sub-tasks, denoted as the cls flows model in Figure 4.
We conduct an experiment on it with an image input size of 600px. As shown in Figure 4, our method outperforms the cls-based feature integration method by 0.7 PQ, 0.8 PQ$^{Th}$, and 0.5 PQ$^{St}$, respectively. The results suggest that trivial feature integration cannot bring consistent improvements to the baseline model as our method does.
Accuracy vs. Speed: In Table XI, we compare our method
with the state-of-the-art methods in terms of accuracy and
speed balance on COCO val split. The FPS is measured on a
single Tesla V100 GPU. We show the results at different image sizes and the corresponding inference speeds. Although SpatialFlow is not the fastest among all the methods, the results show a good accuracy-speed balance for SpatialFlow. Larger image sizes yield higher accuracy at slower inference speeds. Also, we find that thing segmentation benefits from large image sizes, while stuff segmentation is robust to the image size. Thanks to this, SpatialFlow can achieve 19.6 FPS while retaining 37.4 PQ when we set the image size to 400px.
Effects of Spatial Flows: We study the effects of the spatial flows using two models, one with and one without spatial flows. We visualize the last feature map in the
cls-head and the stuff-head of both models via CAM [74] in
Figure 5. The visualized heatmaps illustrate that the spatial
flows can help the thing branch focus on objects and make
the stuff branch aware of the precise boundary of things and stuff. The spatial flows bridge all tasks and help build a global view of the image in panoptic segmentation.

Detection Results: We also conduct experiments on RetinaNet [18] to investigate the generalization of the spatial flows. We deliver the spatial context from the reg sub-network to the cls sub-network. The detection results are shown in Table XII. With the help of the spatial context, the multi-stage features in the sub-networks become more discriminative, which boosts the performance.
Fig. 6. An illustration of visualization examples of SpatialFlow on COCO val split using a single ResNet-101 network.

VI. VISUALIZATION
We show some visualization examples of SpatialFlow on
COCO and Cityscapes in Figure 6 and Figure 7 respectively.
VII. CONCLUSION
In this work, we focus on the box locations in panoptic
segmentation and propose a new location-aware and unified
framework, denoted as SpatialFlow. We emphasize the importance of the spatial context and bridge all the tasks by building spatial information flows, achieving state-of-the-art performance on both the COCO test-dev split and the Cityscapes val split, which proves the effectiveness of our model. Moreover,
we find that the spatial flows can improve the performance
of detection models, indicating the importance of spatial
information. We expect that SpatialFlow can provide valuable
insights on how to integrate spatial information in vision tasks.
ACKNOWLEDGMENT
This work was supported in part by the National Natural Science Foundation of China (No. 61972396, 61876182, 61906193), the National Key Research and Development Program of China (No. 2019AAA0103402), the Strategic Priority Research Program of the Chinese Academy of Sciences (No. XDB32050200), the Advance Research Program (No. 31511130301), and the Jiangsu Frontier Technology Basic Research Project (No. BK20192004). Moreover, the authors
would like to thank Jiaying Guo at Nanjing Institute of
Geography and Limnology, Chinese Academy of Sciences for
valuable discussions about the writing.
REFERENCES
[1] L. Ladickỳ, P. Sturgess, K. Alahari, C. Russell, and P. H. Torr, “What,
where and how many? combining object detectors and crfs,” in European
conference on computer vision. Springer, 2010, pp. 424–437.
[2] J. Yao, S. Fidler, and R. Urtasun, “Describing the scene as a whole:
joint object detection,” in Proceedings of CVPR. Citeseer, 2012.
[3] A. Kirillov, K. He, R. Girshick, C. Rother, and P. Dollár, “Panoptic
segmentation,” arXiv preprint arXiv:1801.00868, 2018.
[4] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in
Proceedings of the IEEE international conference on computer vision,
2017, pp. 2961–2969.
[5] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing
network,” in Proceedings of the IEEE conference on computer vision
and pattern recognition, 2017, pp. 2881–2890.
Fig. 7. An illustration of visualization examples of SpatialFlow on Cityscapes val split using a single ResNet-101 network.
[6] D. de Geus, P. Meletis, and G. Dubbelman, “Panoptic segmentation with
a joint semantic and instance segmentation network,” arXiv preprint
arXiv:1809.02110, 2018.
[7] J. Li, A. Raventos, A. Bhargava, T. Tagawa, and A. Gaidon, “Learning
to fuse things and stuff,” arXiv preprint arXiv:1812.01192, 2018.
[8] A. Kirillov, R. Girshick, K. He, and P. Dollár, “Panoptic feature pyramid
networks,” arXiv preprint arXiv:1901.02446, 2019.
[9] Y. Li, X. Chen, Z. Zhu, L. Xie, G. Huang, D. Du, and X. Wang,
“Attention-guided unified network for panoptic segmentation,” arXiv
preprint arXiv:1812.03904, 2018.
[10] Y. Xiong, R. Liao, H. Zhao, R. Hu, M. Bai, E. Yumer, and R. Urtasun,
“Upsnet: A unified panoptic segmentation network,” arXiv preprint
arXiv:1901.03784, 2019.
[11] H. Liu, C. Peng, C. Yu, J. Wang, X. Liu, G. Yu, and W. Jiang,
“An end-to-end network for panoptic segmentation,” arXiv preprint
arXiv:1903.05027, 2019.
[12] Z. Zhang, S. Qiao, C. Xie, W. Shen, B. Wang, and A. L. Yuille, “Single-shot object detection with enriched semantics,” in Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, 2018,
pp. 5813–5821.
[13] J. Wang, K. Chen, S. Yang, C. C. Loy, and D. Lin, “Region proposal by
guided anchoring,” in Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, 2019, pp. 2965–2974.
[14] K. Chen, J. Pang, J. Wang, Y. Xiong, X. Li, S. Sun, W. Feng,
Z. Liu, J. Shi, W. Ouyang et al., “Hybrid task cascade for instance
segmentation,” in Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, 2019, pp. 4974–4983.
[15] D. Bolya, C. Zhou, F. Xiao, and Y. J. Lee, “Yolact: Real-time instance
segmentation,” in ICCV, 2019.
[16] J. Dai, K. He, and J. Sun, “Boxsup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation,” in Proceedings
of the IEEE International Conference on Computer Vision, 2015, pp.
1635–1643.
[17] C. Song, Y. Huang, W. Ouyang, and L. Wang, “Box-driven class-wise region masking and filling rate guided loss for weakly supervised
semantic segmentation,” in Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, 2019, pp. 3136–3145.
[18] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss
for dense object detection,” in Proceedings of the IEEE international
conference on computer vision, 2017, pp. 2980–2988.
[19] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time
object detection with region proposal networks,” in Advances in neural
information processing systems, 2015, pp. 91–99.
[20] C.-Y. Fu, M. Shvets, and A. C. Berg, “Retinamask: Learning to predict
masks improves state-of-the-art single-shot detection for free,” arXiv
preprint arXiv:1901.03353, 2019.
[21] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan,
P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in
context,” in European conference on computer vision. Springer, 2014,
pp. 740–755.
[22] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset
for semantic urban scene understanding,” in Proceedings of the IEEE
conference on computer vision and pattern recognition, 2016, pp. 3213–
3223.
[23] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification
with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
[24] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet:
A large-scale hierarchical image database,” in 2009 IEEE conference on
computer vision and pattern recognition. Ieee, 2009, pp. 248–255.
[25] J. Yu, Y. Rui, Y. Y. Tang, and D. Tao, “High-order distance-based
multiview stochastic learning in image classification,” IEEE transactions
on cybernetics, vol. 44, no. 12, pp. 2431–2442, 2014.
[26] K. Simonyan and A. Zisserman, “Very deep convolutional networks for
large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[27] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan,
V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,”
in Proceedings of the IEEE conference on computer vision and pattern
recognition, 2015, pp. 1–9.
[28] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” in Proceedings of the IEEE conference on computer vision
and pattern recognition, 2016, pp. 770–778.
[29] F. Zhou and Y. Lin, “Fine-grained image classification by exploring
bipartite-graph labels,” in Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, 2016, pp. 1124–1133.
[30] J. Yu, M. Tan, H. Zhang, D. Tao, and Y. Rui, “Hierarchical deep click
feature prediction for fine-grained image recognition,” IEEE transactions
on pattern analysis and machine intelligence, 2019.
[31] J. Yu, D. Tao, M. Wang, and Y. Rui, “Learning to rank using user
clicks and visual features for image retrieval,” IEEE transactions on
cybernetics, vol. 45, no. 4, pp. 767–779, 2014.
[32] H. Wang, Y. Cai, Y. Zhang, H. Pan, W. Lv, and H. Han, “Deep learning
for image retrieval: What works and what doesn’t,” in 2015 IEEE
International Conference on Data Mining Workshop (ICDMW). IEEE,
2015, pp. 1576–1583.
[33] E. Hoffer and N. Ailon, “Deep metric learning using triplet network,”
in International Workshop on Similarity-Based Pattern Recognition.
Springer, 2015, pp. 84–92.
[34] J. Yu, X. Yang, F. Gao, and D. Tao, “Deep multimodal distance metric
learning using click constraints for image ranking,” IEEE transactions
on cybernetics, vol. 47, no. 12, pp. 4014–4024, 2016.
[35] H. Oh Song, Y. Xiang, S. Jegelka, and S. Savarese, “Deep metric
learning via lifted structured feature embedding,” in Proceedings of the
IEEE conference on computer vision and pattern recognition, 2016, pp.
4004–4012.
[36] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Region-based
convolutional networks for accurate object detection and segmentation,”
IEEE transactions on pattern analysis and machine intelligence, vol. 38,
no. 1, pp. 142–158, 2015.
[37] R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE international
conference on computer vision, 2015, pp. 1440–1448.
[38] F. Yu and V. Koltun, “Multi-scale context aggregation by dilated
convolutions,” arXiv preprint arXiv:1511.07122, 2015.
[39] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks
for biomedical image segmentation,” in International Conference on
Medical image computing and computer-assisted intervention. Springer,
2015, pp. 234–241.
[40] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, “Realtime multi-person 2d
pose estimation using part affinity fields,” in Proceedings of the IEEE
conference on computer vision and pattern recognition, 2017, pp. 7291–
7299.
[41] C. Hong, J. Yu, J. Wan, D. Tao, and M. Wang, “Multimodal deep
autoencoder for human pose recovery,” IEEE Transactions on Image
Processing, vol. 24, no. 12, pp. 5659–5670, 2015.
[42] Y. Zhang and Q. Yang, “An overview of multi-task learning,” National
Science Review.
[43] A. Kendall, Y. Gal, and R. Cipolla, “Multi-task learning using uncertainty to weigh losses for scene geometry and semantics,” in Proceedings
of the IEEE conference on computer vision and pattern recognition,
2018, pp. 7482–7491.
[44] J. Dai, Y. Li, K. He, and J. Sun, “R-fcn: Object detection via regionbased fully convolutional networks,” in Advances in neural information
processing systems, 2016, pp. 379–387.
[45] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie,
“Feature pyramid networks for object detection,” in Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, 2017,
pp. 2117–2125.
[46] Z. Cai and N. Vasconcelos, “Cascade r-cnn: Delving into high quality
object detection,” in Proceedings of the IEEE conference on computer
vision and pattern recognition, 2018, pp. 6154–6162.
[47] B. Singh and L. S. Davis, “An analysis of scale invariance in object
detection snip,” in Proceedings of the IEEE conference on computer
vision and pattern recognition, 2018, pp. 3578–3587.
[48] B. Singh, M. Najibi, and L. S. Davis, “Sniper: Efficient multi-scale
training,” in Advances in neural information processing systems, 2018,
pp. 9310–9320.
[49] Y. Li, Y. Chen, N. Wang, and Z. Zhang, “Scale-aware trident networks
for object detection,” in Proceedings of the IEEE international conference on computer vision, 2019, pp. 6054–6063.
[50] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look
once: Unified, real-time object detection,” in Proceedings of the IEEE
conference on computer vision and pattern recognition, 2016, pp. 779–
788.
[51] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C.
Berg, “Ssd: Single shot multibox detector,” in European conference on
computer vision. Springer, 2016, pp. 21–37.
[52] C. Zhu, Y. He, and M. Savvides, “Feature selective anchor-free module
for single-shot object detection,” in Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, 2019, pp. 840–849.
[53] Z. Tian, C. Shen, H. Chen, and T. He, “Fcos: Fully convolutional
one-stage object detection,” in Proceedings of the IEEE international
conference on computer vision, 2019, pp. 9627–9636.
[54] H. Law and J. Deng, “Cornernet: Detecting objects as paired keypoints,” in Proceedings of the European Conference on Computer Vision
(ECCV), 2018, pp. 734–750.
[55] X. Zhou, D. Wang, and P. Krähenbühl, “Objects as points,” arXiv
preprint arXiv:1904.07850, 2019.
[56] Z. Zhang, S. Fidler, and R. Urtasun, “Instance-level segmentation for
autonomous driving with deep densely connected mrfs,” in Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition,
2016, pp. 669–677.
[57] Z. Wu, C. Shen, and A. v. d. Hengel, “Bridging category-level
and instance-level semantic image segmentation,” arXiv preprint
arXiv:1605.06885, 2016.
[58] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia, “Path aggregation network
for instance segmentation,” in Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, 2018, pp. 8759–8768.
[59] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks
for semantic segmentation,” in Proceedings of the IEEE conference on
computer vision and pattern recognition, 2015, pp. 3431–3440.
[60] R. Mottaghi, X. Chen, X. Liu, N.-G. Cho, S.-W. Lee, S. Fidler,
R. Urtasun, and A. Yuille, “The role of context for object detection
and semantic segmentation in the wild,” in Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, 2014, pp. 891–
898.
[61] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille,
“Deeplab: Semantic image segmentation with deep convolutional nets,
atrous convolution, and fully connected crfs,” IEEE transactions on
pattern analysis and machine intelligence, vol. 40, no. 4, pp. 834–848,
2018.
[62] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam, “Rethinking
atrous convolution for semantic image segmentation,” arXiv preprint
arXiv:1706.05587, 2017.
[63] H. Noh, S. Hong, and B. Han, “Learning deconvolution network
for semantic segmentation,” in Proceedings of the IEEE international
conference on computer vision, 2015, pp. 1520–1528.
[64] Q. Li, A. Arnab, and P. H. Torr, “Weakly- and semi-supervised panoptic
segmentation,” in Proceedings of the European Conference on Computer
Vision (ECCV), 2018, pp. 102–118.
[65] Y. Wu and K. He, “Group normalization,” in Proceedings of the
European Conference on Computer Vision (ECCV), 2018, pp. 3–19.
[66] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei, “Deformable
convolutional networks,” in Proceedings of the IEEE international
conference on computer vision, 2017, pp. 764–773.
[67] K. Chen, J. Wang, J. Pang, Y. Cao, Y. Xiong, X. Li, S. Sun, W. Feng,
Z. Liu, J. Xu et al., “Mmdetection: Open mmlab detection toolbox and
benchmark,” arXiv preprint arXiv:1906.07155, 2019.
[68] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin,
A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in
pytorch,” 2017.
[69] T.-J. Yang, M. D. Collins, Y. Zhu, J.-J. Hwang, T. Liu, X. Zhang, V. Sze,
G. Papandreou, and L.-C. Chen, “Deeperlab: Single-shot image parser,”
arXiv preprint arXiv:1902.05093, 2019.
[70] N. Gao, Y. Shan, Y. Wang, X. Zhao, Y. Yu, M. Yang, and K. Huang,
“Ssap: Single-shot instance segmentation with affinity pyramid,” in
Proceedings of the IEEE International Conference on Computer Vision,
2019, pp. 642–651.
[71] K. Sofiiuk, O. Barinova, and A. Konushin, “Adaptis: Adaptive instance
selection network,” in Proceedings of the IEEE International Conference
on Computer Vision, 2019, pp. 7355–7363.
[72] Y. Yang, H. Li, X. Li, Q. Zhao, J. Wu, and Z. Lin, “Sognet: Scene
overlap graph network for panoptic segmentation,” arXiv preprint
arXiv:1911.07527, 2019.
[73] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, “Aggregated residual
transformations for deep neural networks,” in Proceedings of the IEEE
conference on computer vision and pattern recognition, 2017, pp. 1492–
1500.
[74] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Learning
deep features for discriminative localization,” in Proceedings of the IEEE
conference on computer vision and pattern recognition, 2016, pp. 2921–
2929.