
Computer Vision and Image Understanding 221 (2022) 103450


Camouflaged object detection via Neighbor Connection and Hierarchical Information Transfer

Cong Zhang 1, Kang Wang 1, Hongbo Bi ∗, Ziqi Liu, Lina Yang
Department of Electrical Information Engineering, Northeast Petroleum University, Daqing, 163318, China

ARTICLE INFO

Communicated by Nikos Paragios

Keywords: Deep learning, Camouflaged Object Detection, Salient Object Detection

ABSTRACT

Camouflaged Object Detection (COD) aims to detect objects with high similarity to the background. Unlike general object detection, COD is a more challenging task because the target boundaries are vague and the location is difficult to determine. In this paper, we propose a novel COD framework, which consists of two main components, namely, Neighbor Connection Mode (NCM) and Hierarchical Information Transfer (HIT). NCM aggregates the features from the neighboring layers of the encoder network to enhance the complementation of various level information. Our NCM not only reduces the burden of dense connection, which consumes a lot of computing memory and produces redundant features, but also weakens the phenomenon of long-term transmission of context. We also propose a HIT module to transfer the features of different dilated rates inside each level hierarchically, which expands the receptive field of each branch and enhances the relationship between different features. Our method accurately detects camouflaged objects by considering full level information and a large receptive field. Experiments on three COD datasets show that our model achieves state-of-the-art performance.

1. Introduction

In the field of computer vision, object detection has always been an active research topic. As illustrated in Fig. 1, the objects relevant to the object detection task (Fan et al., 2020a) involve salient objects, generic objects and camouflaged objects, and the corresponding tasks are salient object detection (SOD), generic object detection (GOD) and camouflaged object detection (COD), respectively. With the development of deep learning in recent years, the performance of the first two tasks (GOD and SOD) has been greatly improved. GOD is a popular computer vision task, and the objects to be detected involve salient objects, generic objects, and even camouflaged objects. GOD not only needs to find the location of the target object accurately in the given scene but also needs to assign the correct label and score. SOD (Itti et al., 1998; Zhang et al., 2017; Fan et al., 2018a; Zhao et al., 2019; Borji et al., 2019; Cheng et al., 2014; Wang et al., 2019b; Pang et al., 2020) aims to identify the most distinguishing objects from given images. Benefiting from powerful computing capability, SOD plays an important role in computer vision tasks. In contrast, relatively few works focus on COD tasks due to the lack of sufficient training and testing datasets. Camouflage (Chu et al., 2010) is a widespread biological phenomenon in nature. In order to thrive, some organisms disguise their appearances and physical structures as objects similar to their surroundings, which helps reduce the risk of harm. COD relies on a large amount of visual recognition knowledge to distinguish hidden foreground objects. The camouflaged objects may be animals concealed in the natural environment by their skin color or body pattern, humans camouflaged in the background with the help of modern equipment, or pieces of information embedded in artworks (Ge et al., 2018). COD benefits various applications (Fan et al., 2020b,c; Wu et al., 2020), such as catching prey, medical image segmentation, smokescreen camouflage, etc.

With the powerful feature extraction capabilities of convolutional neural networks, they have been utilized in camouflaged object detection. At present, camouflaged object detection models based on deep learning have achieved relatively satisfactory results, including DPCP (Zheng et al., 2019), ANet (Le et al., 2019), and SINet (Fan et al., 2020a). In Zheng et al. (2019), Zheng et al. proposed a dense deconvolutional network to distinguish people with camouflage patterns in messy natural environments. The network extracts high-level semantic features and introduces a few short connections into the deconvolution phase to address the corruption of low-level features caused by the discrimination of camouflage patterns and messy background. Le et al. (2019) proposed the Anabranch network (ANet) to study the camouflaged object segmentation problem. ANet contains two branches, a main branch and a second branch. The second branch forecasts the probability that a given image contains camouflaged objects and fuses this probability into the main branch to improve the segmentation accuracy of the network.

∗ Corresponding author.
E-mail address: bihongbo@nepu.edu.cn (H. Bi).
1 Cong Zhang and Kang Wang contributed equally to this article.

https://doi.org/10.1016/j.cviu.2022.103450
Received 3 July 2021; Received in revised form 6 February 2022; Accepted 11 May 2022
Available online 23 May 2022
1077-3142/© 2022 Elsevier Inc. All rights reserved.

Although the existing models have achieved satisfactory performances, there are still some challenges to be addressed. In the field of COD, how to extract features and how to use the extracted features/information to distinguish the camouflaged object from the background is a crucial problem. From the perspective of information reuse, most methods feed the information extracted at the current level to the successive level or achieve a forward cross-layer transfer through varied connections (dense connection or short connection). Previous methods lacked the use of information from the latter level to refine or strengthen the current level prediction (equivalent to using the generated camouflage map to guide the detection). In addition, because of the diversity of information provided by different receptive fields, satisfactory results may not be achieved if the features of different receptive field branches are directly concatenated in the channel dimension. To address these issues, we propose a novel framework for COD, which consists of two main components, namely, Neighbor Connection Mode (NCM) and Hierarchical Information Transfer (HIT). NCM connects neighbor layer features to enhance feature representation, while HIT uses dilated convolutions with a variety of dilation rates to extract more abundant information. Our main contributions are threefold:

1. We propose a novel framework for COD with a double branch, which utilizes large stride dilated convolution and an efficient neighbor layer connection mode. We evaluate the performance of the proposed model on three benchmark datasets. Compared with 12 typical deep learning based models, our model achieves state-of-the-art performance.
2. A Neighbor Connection Mode (NCM) is proposed to aggregate information from the adjacent levels effectively, which provides more boundary and location information to locate camouflaged objects precisely.
3. An effective module, Hierarchical Information Transfer (HIT), is presented, which not only employs large scale dilated rate convolutions, but also considers the relationship between each branch of our proposed module. As a result, the receptive fields are enlarged, and the location of the camouflaged object is refined.

Fig. 1. Visual examples of object detection tasks, which can be roughly divided into three categories: salient object detection (SOD), generic object detection (GOD) and camouflaged object detection (COD). It becomes increasingly challenging to distinguish the foreground from the background following the arrow from left to right. We use the red dotted line to separate each category. The left column of each category represents the input images, and the right column denotes the corresponding label.

2. Related works

Because of the similarities among GOD, SOD and COD, and because some SOD algorithms can be applied to COD tasks, this section first reviews the existing works on object detection. Then, we briefly introduce related connection modes, attention mechanism models and dilated convolution models.

2.1. Generic object detection

According to the way of localizing the target object, GOD can be roughly divided into two categories: the anchor based approach and the anchor free approach. The anchor based approach places a large number of anchor boxes in the input images, and then trains a classifier to determine the object and the category corresponding to each given anchor box. In order to improve the performance of the model, researchers have made great efforts in many aspects, for example, networks with stronger feature representation ability (He et al., 2016; Huang et al., 2017), multilevel features used to optimize the target to be detected (Lin et al., 2017; Ren et al., 2015), and the alignment relationship (Chen et al., 2019a; Dai et al., 2017) between the features and the selected anchor. The anchor free approach does not use anchors to extract candidate proposals. Law and Deng (2018), Tian et al. (2019) and Zhou et al. (2019) are several typical models without anchors, which employ keypoints to lock the object position.

2.2. Salient object detection

The early networks of salient object detection mainly relied on handcrafted local features (Itti et al., 1998; Klein and Frintrop, 2011; Xie et al., 2012), global features (Liu et al., 2010; Perazzi et al., 2012; Wang and Shen, 2018; Cheng et al., 2014) or both (Borji and Itti, 2012) to generate saliency maps. In Xie et al. (2012), Xie et al. proposed a subspace clustering method based on Laplacian sparsity to detect salient object regions in given images by dividing superpixels with local features into groups. Cheng et al. (2014) proposed a region comparison algorithm to detect salient objects by dividing images into clips with significance values and then calculating saliency maps using a global contrast value. However, the shortcoming of these traditional methods is that they can hardly obtain complete information accurately.

Benefiting from the ability of convolutional neural networks (CNNs) to capture high-level and global semantic feature information, they are able to supplement and enhance the performance of traditional features. As a result, CNNs have been widely used in salient object detection (Hou et al., 2017; Wang et al., 2019a, 2018, 2019b). Qin et al. (2019) proposed a predict-refine architecture to exactly segment the salient object regions while predicting the detailed structures with clear boundaries. Zhao and Wu (2019) adopted a context-aware pyramid feature extraction module and a spatial attention module to capture high-level features and low-level features, respectively. To address the problem of blurred boundaries, Zhang et al. (2019) introduced a novel weighted structural loss function into a symmetrical fully convolutional network to make sure the object has clear boundaries. Chen et al. (2020) proposed a network which can capture high-level semantic features, low-level appearance features and global context features.

2.3. Camouflaged object detection

Reviewing the traditional methods of camouflaged object detection, Tankus and Yeshurun (2001) realized a breakthrough in the area of visual camouflage with the Drag operator, which is able to detect the camouflaged object effectively in images of concave or smooth convex 3D objects.


Galun et al. (2003) tried to use texture segmentation technology to combine the structural features of texture elements with the corresponding filter, and use various statistical data to distinguish different textures. However, how to combine various statistical data into a single weight is an unsolved problem. Bhajantri and Nagabhushan (2007) proposed a method to identify camouflaged defects by extracting a co-occurrence matrix from image blocks. However, when there are many kinds of defects in the input image, this method loses its effectiveness. Song and Geng (2010) presented a method that evaluated the surroundings by weighted structural similarity to create the camouflaged image, and this method can also be used to distinguish the camouflage texture. Rao et al. (2020) proposed a constructive method for entity texture characterization and statistical modeling of camouflaged images under the condition of texture smoothing, which identifies one or more camouflaged targets in camouflaged images. However, the detection accuracy is affected by different kinds of images, atmospheric turbulence, and target size.

To effectively detect the camouflaged objects or regions in given images, more research attention has been put on introducing convolutional neural networks into COD. Inspired by the observation that humans can hardly guarantee whether camouflaged objects always exist in given images, Le et al. (2019) presented an end-to-end network to segment the camouflaged object accurately according to whether the images contain a camouflaged target or not. To fuse multi-scale semantic features effectively, Zheng et al. (2019) presented a dense deconvolution network based on the visual attention mechanism. The network adopted short connections in the deconvolution phase to detect humans with camouflage patterns in messy natural circumstances. Recently, Fan et al. (2020a) proposed a novel framework that consists of a search module (SM) and an identification module (IM). The search module saves information from various layers by a densely linked strategy, and the identification module leverages the information to detect the camouflaged object accurately.

2.4. Dilated convolution and connection mode

Dilated convolution can increase the receptive field (Liu et al., 2018b) of the model by injecting holes into the standard convolution kernel, while ensuring that the size of the output feature map remains unchanged. Benefiting from this advantage, Yu and Koltun (2016) proposed a network based on dilated convolution to achieve semantic segmentation. This network can expand the receptive field without losing resolution or coverage. In Wu et al. (2019a), Wu et al. added one more branch to expand the receptive field and obtain global contrast information.

In order to make full use of all levels of features extracted from the network and improve the representation ability of the model, researchers have proposed a variety of connection methods, such as ResNet (He et al., 2016) and DenseNet (Huang et al., 2017). In DenseNet, the current level information in the backbone comes from all the previous layers; although the performance of the model is improved, this method consumes a lot of computation. In ResNet, to solve the problem of network degradation, researchers introduced short connections, which also increased the degree of feature reuse.

3. Proposed method

3.1. Our proposed framework

Our proposed network is based on ResNet50 (He et al., 2016) (including five levels, corresponding to conv1, ..., conv5, respectively). Given an input image X ∈ R^{C×H×W} (C, H and W represent the number of channels, height and width of the input image, respectively), we can obtain the coarse feature maps of all levels of the backbone. Because of the high similarity between camouflaged objects and the background, if the high-level features and low-level features are combined in a way similar to DenseNet (Huang et al., 2017), the performance may not be improved effectively.

In this paper, we try to use a new adjacency method to combine features. Meanwhile, we expand the receptive field by using dilated convolution to extract a clearer target structure. Specifically, the features of the current level, the latter level (further fine features relative to the current level) and the intermediate feature of the previous level are fused as a new intermediate feature. As shown in the nattier blue rectangle in Fig. 2, we call this fusion method Neighbor Connection Mode (NCM). Considering that different receptive fields can highlight the area to be detected and that richer information can be extracted, we propose a dilated convolution group module with hierarchical information transfer (HIT). A more discriminative feature is obtained by further optimizing the intermediate feature. Next, the decoder module (DC) (Wu et al., 2019a) is used to generate the initial camouflage maps. Since camouflaged scenes are complicated and contain noise interference, it is difficult to locate targets from the inputs accurately. Therefore, inspired by Wu et al. (2019a), we introduce a Gaussian convolution function to filter out environment interference and enlarge the detection coverage, improving the effectiveness of the initial camouflage maps. The output of the Gaussian convolution is denoted as F_in.

In order to make use of the generated camouflage maps to refine the feature maps, which can also be understood as using the posterior feature to optimize the detection, we reuse conv4 and conv5 to build a new optimization branch; the camouflage maps generated by this branch are named enhanced camouflage maps F_e (e = 4, 5). In addition, instead of directly upsampling and outputting the enhanced camouflage map of the optimization branch, we comprehensively consider the global information (from the enhanced camouflage maps) and the local information (from all levels of rough features), and adopt an aggregation method similar to GCPANet (Chen et al., 2020) to generate the final camouflage maps.

3.2. Neighbor connection mode

Recently, PFANet (Zhao and Wu, 2019) and CPD (Wu et al., 2019a) have proved that the lower level features extracted in the network contain more details, such as target boundary, texture, etc., which can refine the target to be detected. The higher level features contain more and deeper semantic information, which is beneficial to locating the object accurately. Because camouflaged objects are hidden in the background environment, the boundary and position information is not clearly distinguishable from the background in the feature maps of each level, so it is necessary to combine high-level features with low-level features in a suitable way. In the field of SOD, dense connection (Huang et al., 2017) and cross-layer connection are common feature combination methods. In this paper, we attempt to aggregate the feature maps of neighbor levels. This not only reduces the burden of dense connection, which consumes a lot of computing memory and produces redundant features, but also weakens the phenomenon of long-term transmission of context.

Firstly, the features of the current level, the next level and the previous level are fused, and the generated feature maps are denoted by F_j^c (j = 1, ..., 4). Theoretically, this feature not only contains the prediction information of the current level, but also contains low-level information such as the texture and boundary of the shallow layers, as well as high-level semantic information. The high integration of these three kinds of information is more conducive to the accurate prediction of the network.

When given the input X, a series of coarse features can be output from the sides of the backbone. The above process can be expressed as follows:

F_j^c = Concat(F_{j-1}^c, F_i^b, F_{i+1}^b)    (1)

where F_{j-1}^c represents the intermediate feature generated at the previous level, F_i^b represents the feature of the current level in the backbone network, and F_{i+1}^b represents the feature of the latter level in the backbone network relative to the current level. F_j^c represents the intermediate feature after fusion. When j - 1 = 0, F_{j-1}^c = 0. Concat(·) indicates a concatenation operation on the channel dimension.


Fig. 2. The overall architecture of our proposed network, which consists of two main modules, namely, Neighbor Connection Mode (NCM) and Hierarchical Information Transfer (HIT). NCM aggregates the features from adjacent layers. The purpose of HIT is to enhance the information transmission among different convolution layers with various dilated rates.

Fig. 3. The details of our proposed HIT module. SA denotes the Spatial Attention module.

3.3. Hierarchical Information Transfer (HIT)

Many previous works, such as RFB (Liu et al., 2018b) and SINet (Fan et al., 2020a), have proved that increasing the receptive field can make the extracted features more diverse and the detection results more satisfactory. Thus, we propose a Hierarchical Information Transfer (HIT) module with dilated convolutions and large stride convolution kernels.

Because the receptive fields of the branches are different, we consider that adding bridge connections between different receptive field branches can increase the degree of feature reuse. It can also supplement the information that the large receptive field lacks relative to the small receptive field. In this way, the characteristics of the larger receptive field contain not only the information of the current branch, but also the information of the small receptive field branches. The structure of the module is shown in Fig. 3.

Our HIT module consists of five branches, four of which constitute the main part of the module, as shown in Fig. 3. In addition, we introduce the idea of residual learning and add an attention mechanism (Zhao and Wu, 2019) to suppress the interference of irrelevant features, which represents the fifth branch.

Specifically, the input of the HIT module is the intermediate feature (F_j^c, j = 1, ..., 4) of the four levels described in Section 3.2 Neighbor Connection Mode. Because the number of channels in each level is different, in order to facilitate the calculation, we first compress the channel number of the input features into 32 by using a 1 × 1 convolution for each of the five branches, and the corresponding output is denoted by b_k (k = 1, ..., 5). The first four branches employ convolution kernels of different strides and dilate rates (s = 2k - 1, where s represents the stride of the convolution kernel and k represents the level of the branch) for feature extraction. When k = 1, the stride of the convolution kernel is 1 and the dilate rate is 1, and the output of this convolution is denoted by b_1.
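For concreteness, the sketch below shows how one HIT branch (k ≥ 2) could be wired, following the per-branch processing that Eqs. (2)–(5) formalize after Table 1. It is only a sketch: we read the branch parameter s = 2k - 1 as the kernel size of the first convolution (an assumption, since the text calls it a stride), and the fixed 32-channel width follows the 1 × 1 compression described above.

```python
import torch
import torch.nn as nn

class HITBranch(nn.Module):
    """One branch (k >= 2) of the HIT module, sketching Eqs. (2)-(5).
    Reading s = 2k - 1 as a kernel size is an assumption, not the paper's code."""

    def __init__(self, k, channels=32):
        super().__init__()
        s = 2 * k - 1
        # x_k = Conv(b_k, s = 2k - 1)  -- Eq. (3)
        self.conv_in = nn.Conv2d(channels, channels, kernel_size=s, padding=s // 2)
        # x_k^out = Conv(F_k^w) with a 3x3 kernel and dilation 2k - 1  -- Eq. (5)
        self.conv_out = nn.Conv2d(channels, channels, kernel_size=3,
                                  padding=s, dilation=s)

    def forward(self, b_k, x_prev_out):
        # x_prev_out is x_{k-1}^out from the previous branch; for k = 2 it is b_1.
        x_k = self.conv_in(b_k)
        # F_k^w = mul(x_k, sigmoid(x_k + x_{k-1}^out)) + x_k  -- Eqs. (2)/(4)
        attn = torch.sigmoid(x_k + x_prev_out)
        f_w = x_k * attn + x_k
        return self.conv_out(f_w)  # x_k^out, passed on to the (k+1)-th branch
```

The concatenation of the four branch outputs, the 1 × 1 dimension reduction and the residual attention branch described below complete the module.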


Table 1
Quantitative results on the COD benchmark datasets compared with 14 typical deep learning based models. The best performance is highlighted in red, the second in green and the third in blue. ↑ indicates that a higher score is better; ↓ indicates that a lower score is better. ∗ indicates that the result is not publicly accessible. We re-evaluate the camouflage maps of the compared models (which come from https://github.com/DengPingFan/SINet) with the public evaluation indicators.
Model CAMO CHAMELEON COD10K
S↑ MAE↓ Emean↑ Fmax↑ S↑ MAE↓ Emean↑ Fmax↑ S↑ MAE↓ Emean↑ Fmax↑
PiCANet (Liu et al., 2018a) 0.609 0.156 0.584 0.529 0.769 0.085 0.749 0.698 0.649 0.090 0.643 0.485
UNet++ (Zhou et al., 2018) 0.599 0.149 0.653 0.501 0.695 0.094 0.762 0.569 0.623 0.086 0.672 0.425
ANet (Le et al., 2019) 0.682 0.126 0.685 0.566 * * * * * * * *
BASNet (Qin et al., 2019) 0.618 0.159 0.661 0.519 0.687 0.118 0.721 0.581 0.634 0.105 0.678 0.451
CPD-R (Wu et al., 2019a) 0.726 0.115 0.729 0.663 0.853 0.052 0.866 0.800 0.747 0.059 0.770 0.631
HTC (Chen et al., 2019b) 0.476 0.172 0.442 0.218 0.517 0.129 0.489 0.236 0.548 0.088 0.520 0.253
MSRCNN (Huang et al., 2019) 0.617 0.133 0.669 0.528 0.637 0.091 0.686 0.506 0.641 0.073 0.706 0.479
PFANet (Zhao and Wu, 2019) 0.659 0.172 0.622 0.564 0.679 0.144 0.648 0.552 0.636 0.128 0.618 0.466
PoolNet (Liu et al., 2019) 0.702 0.129 0.698 0.629 0.776 0.081 0.779 0.706 0.705 0.074 0.713 0.870
EGNet (Zhao et al., 2019) 0.732 0.104 0.768 0.680 0.848 0.050 0.870 0.795 0.737 0.056 0.779 0.613
SCRN (Wu et al., 2019b) 0.744 0.105 0.742 0.703 0.864 0.047 0.872 0.823 0.773 0.051 0.783 0.674
SINet (Fan et al., 2020a) 0.751 0.100 0.771 0.706 0.869 0.044 0.891 0.832 0.771 0.051 0.806 0.676
MirrorNet (ResNet-50) 0.741 0.100 * * * * * * * * * *
(Yan et al., 2021)
SINet-V2(ResNet-50) 0.725 0.103 0.754 0.718 0.770 0.059 0.871 0.808 0.719 0.053 0.766 0.620
(Fan et al., 2021)
Ours 0.780 0.088 0.803 0.730 0.874 0.041 0.891 0.834 0.790 0.046 0.817 0.695

When k = 2, x_2 is generated by convolution with a stride of 3 (the input of this convolution is b_2). Based on the above analysis, we combine b_1 and x_2 by element-wise addition to highlight the target position and suppress interference features, followed by the sigmoid function. In this way, we obtain soft attention weights. Next, we fuse the weights with x_2 by element-wise multiplication, and take the original x_2 as the bias to obtain the weighted feature map. The process can be expressed as follows:

F_2^w = mul(x_2, σ(x_2 + b_1)) + x_2    (2)

where F_2^w represents the weighted feature map of the corresponding level and σ represents the sigmoid function. Next, F_2^w is further processed by a convolution unit with a kernel size of 3 × 3 and a dilate rate of 2k - 1 to generate the final camouflage feature of this branch (here k = 2), which is denoted by x_k^out. When k = 1, x_k^out = b_1.

When k = 3 and 4, the process is similar to k = 2. The details can be expressed as follows:

x_k = Conv(b_k, s = 2k - 1)    (3)

F_k^w = mul(x_k, σ(x_k + x_{k-1}^out)) + x_k    (4)

x_k^out = Conv(F_k^w, s = 3, d = 2k - 1)    (5)

where d represents the dilate rate. Next, the x_k^out are fused in the channel dimension using a concatenation operation, followed by a convolution with a kernel size of 1 × 1 to reduce the dimension. Finally, the result is fused with the fifth branch by element-wise addition to generate the camouflage map, and the output is denoted by F_{j+1}^c.

3.4. Loss function

BCE is a common loss function used to optimize the model, which forces the network detection results to be more accurate. BCE can be calculated by the following formula:

Loss = -(1/N) Σ_{q=1}^{N} [ y_q log(p_q) + (1 - y_q) log(1 - p_q) ]    (6)

where N represents the number of all pixels in the image, y_q denotes the ground truth of pixel q, and p_q represents the probability that pixel q belongs to the camouflaged objects. N can be obtained by multiplying the width and height of the image.

In this paper, we use BCE for multistage supervision. The closer a prediction is to the output of the network, the greater the weight given to it (1, 0.8, 0.6, 0.4, respectively). Therefore, the final loss function can be calculated by the following formula:

L_all = Σ_m α_m Loss_m    (7)

where α_m (m = 1, ..., 4) represents the different stage weights mentioned above.
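As a concrete illustration, the multistage supervision of Eqs. (6)–(7) can be written as the short sketch below. The stage weights (1, 0.8, 0.6, 0.4) follow the text; the variable names, the use of the logits form of BCE, and the resizing of side outputs to the ground-truth resolution are our assumptions rather than details stated in the paper.

```python
import torch.nn.functional as F

def multistage_bce_loss(side_outputs, gt):
    """L_all = sum_m alpha_m * Loss_m (Eqs. (6)-(7)).
    side_outputs: list of four logit maps, ordered from the final output
    (largest weight) to the shallowest side output; gt: binary mask (N,1,H,W)."""
    weights = [1.0, 0.8, 0.6, 0.4]  # larger weight closer to the network output
    total = 0.0
    for alpha, pred in zip(weights, side_outputs):
        # Resize each side output to the ground-truth resolution before BCE.
        pred = F.interpolate(pred, size=gt.shape[2:], mode='bilinear',
                             align_corners=False)
        total = total + alpha * F.binary_cross_entropy_with_logits(pred, gt)
    return total
```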


Table 2
Comparison of our model and others on the three largest subclasses of the COD10K dataset in terms of four public evaluation indicators. The best performance is marked in red.
Model COD10K-Aquatic (474 images) COD10K-Terrestrial (699 images) COD10K-Flying (714 images)
S↑ MAE↓ Emean↑ Fmax↑ S↑ MAE↓ Emean↑ Fmax↑ S↑ MAE↓ Emean↑ Fmax↑
PiCANet (Liu et al., 2018a) 0.629 0.120 0.623 0.509 0.625 0.084 0.628 0.424 0.677 0.076 0.663 0.517
UNet++ (Zhou et al., 2018) 0.599 0.121 0.659 0.437 0.593 0.081 0.637 0.366 0.659 0.068 0.708 0.467
BASNet (Qin et al., 2019) 0.620 0.134 0.666 0.465 0.601 0.109 0.645 0.382 0.664 0.086 0.710 0.492
CPD-R (Wu et al., 2019a) 0.739 0.082 0.770 0.652 0.714 0.058 0.735 0.565 0.777 0.046 0.796 0.672
HTC (Chen et al., 2019b) 0.507 0.129 0.494 0.223 0.530 0.078 0.484 0.196 0.582 0.070 0.558 0.309
MSRCNN (Huang et al., 2019) 0.614 0.107 0.685 0.465 0.611 0.070 0.671 0.418 0.674 0.058 0.742 0.523
PFANet (Zhao and Wu, 2019) 0.629 0.162 0.614 0.490 0.609 0.123 0.600 0.404 0.657 0.113 0.632 0.496
PoolNet (Liu et al., 2019) 0.689 0.102 0.705 0.572 0.677 0.070 0.688 0.509 0.733 0.062 0.733 0.614
EGNet (Zhao et al., 2019) 0.725 0.080 0.775 0.630 0.704 0.054 0.748 0.550 0.768 0.044 0.803 0.651
SCRN (Wu et al., 2019b) 0.767 0.071 0.787 0.695 0.746 0.051 0.756 0.629 0.803 0.040 0.812 0.714
SINet (Fan et al., 2020a) 0.758 0.073 0.803 0.685 0.743 0.050 0.778 0.623 0.798 0.040 0.828 0.709
SINet-V2 (ResNet-50) 0.617 0.148 0.627 0.456 0.770 0.059 0.871 0.808 0.719 0.053 0.766 0.620
(Fan et al., 2021)
Ours 0.786 0.062 0.823 0.725 0.763 0.045 0.785 0.643 0.818 0.034 0.837 0.735

Fig. 4. Visual comparison of COD detection results with 12 models on the three public datasets. The first column represents the input image. The second column is the Ground Truth (GT). The results of our model are shown in the third column. As shown in the figure, our results are closest to the GT. A completely black detection result indicates that the model does not detect any camouflaged objects. Since ANet and MirrorNet did not provide predicted maps on the corresponding datasets or complete PyTorch code, we cannot reproduce these models and exhibit their camouflage maps.

4. Experiments and results

In this section, we describe the benchmark datasets in COD, the evaluation metrics and the implementation details of our proposed network. In addition, we compare our proposed model with other typical deep learning based models.

4.1. Datasets

CAMO (Le et al., 2019) is a subdataset of CAMO-COCO (Le et al., 2019) proposed by Le et al. in 2018 for camouflaged object segmentation. CAMO contains 1250 images (1000 for training, 250 for testing), where each image contains at least one camouflaged object. Furthermore, it involves a variety of challenging scenarios, such as shape complexity, background clutter, object appearance, and so on.

CHAMELEON (Shen et al., 2018) contains 76 images, all of which are natural camouflage where animals are hidden in the natural environment by their skin color or body shape. It is difficult to find these camouflaged animals in natural camouflage images, so these images are suitable for camouflaged object detection.

COD10K (Fan et al., 2020a) is the latest dataset proposed by Fan et al. for camouflaged object detection, which contains 10000 images (6000 for training, 4000 for testing). It consists of 10 superclasses and 78 subclasses which are gathered from several photography websites. Besides, COD10K contains two kinds of camouflage images: natural camouflage and artificial camouflage. Natural camouflage images involve animal camouflage hidden in the natural environment, while artificial camouflage images cover many occasions, such as operations, games, and so on.

4.2. Evaluation metrics

F-Measure (Arbelaez et al., 2011) is the weighted harmonic average of Precision and Recall. It is widely used in the field of IR (Information Retrieval) as an evaluation criterion and is mainly used to evaluate the quality of classification models. The F-measure is calculated by the following formula:

F_β = ((β² + 1) · P · R) / (β² · P + R)    (8)

where β is a hyperparameter to trade off precision and recall. According to Zhang et al. (2016), we set β² = 0.3. P and R represent precision and recall, respectively. In this paper, we use F_max to denote the maximum F-measure.

Mean absolute error (MAE) directly calculates the average absolute difference between the map output by the model and the ground truth. MAE is computed as:

MAE = (1/(H × W)) Σ_{x=1}^{H} Σ_{y=1}^{W} |S(x, y) - G(x, y)|    (9)

where W and H represent the width and height of the map, respectively, x and y denote the coordinates of each pixel, S refers to the predicted map and G denotes the ground truth.

S-Measure (Fan et al., 2017) is the structural similarity measure, which was proposed by Fan et al. to evaluate non-binary foreground maps. In this paper, we employ the S-measure to assess the similarity between the predicted map and the ground-truth map. The S-measure is computed as:

S = α · S_o + (1 - α) · S_r    (10)

where, according to Fan et al. (2017), we set α as 0.5. S_r is the region-aware structural similarity, which captures object part structure information, and S_o denotes the object-aware structural similarity.

The S-measure is mainly used to measure structural similarity in non-binary foreground maps, while the E-measure (Fan et al., 2018b) is mainly used to measure image-level statistics and local pixel matching information. The E-measure is defined as:

E = (1/(W × H)) Σ_{x=1}^{W} Σ_{y=1}^{H} φ_FM(x, y)    (11)

where H and W are the height and the width of the predicted map, respectively, and φ_FM is the enhanced alignment matrix. In this paper, we use E_mean to represent the mean E-Measure.

The PR curve captures the relationship between precision and recall, so we also employ it to measure performance. The camouflage map is segmented by a set of fixed thresholds that range from 0 to 255, and the PR curve is obtained by calculating the recall and precision scores at each threshold within the range.
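For reference, MAE (Eq. (9)) and the maximum F-measure with β² = 0.3 (Eq. (8)) can be computed roughly as in the sketch below, thresholding at 0–255 as in the PR-curve description above. The binarization and edge-case handling here are simplifying assumptions and may differ from the official evaluation toolbox.

```python
import numpy as np

def mae(pred, gt):
    """Eq. (9): mean absolute error between a [0, 1] prediction and a binary mask."""
    return np.abs(pred - gt).mean()

def max_f_measure(pred, gt, beta2=0.3):
    """Eq. (8) evaluated over 256 thresholds (0..255), returning F_max."""
    gt = gt > 0.5
    best = 0.0
    for t in range(256):
        binary = pred >= (t / 255.0)
        tp = np.logical_and(binary, gt).sum()
        precision = tp / (binary.sum() + 1e-8)
        recall = tp / (gt.sum() + 1e-8)
        f = (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)
        best = max(best, f)
    return best
```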


Fig. 5. The PR Curves on three benchmark datasets (COD).

4.3. Implementation details

We implement our proposed model with the PyTorch toolbox and an NVIDIA 1080 Ti GPU for acceleration. The backbone is ResNet50 pretrained on ImageNet. We train our proposed model on the combined CAMO + COD10K training set (4040 images).2 For training, the spatial resolution of the input images is resized to 320×320 and then randomly cropped to 288×288. Our model uses SGD as the optimizer with momentum of 0.9 and weight decay of 0.0005. The initial learning rate and batch size are set to 0.001 and 20, respectively. Inference for a 320×320 image takes only 0.045 s (about 22.3 FPS). Our model has 185.603G FLOPs and 24.765M parameters.
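A rough PyTorch setup matching the stated hyper-parameters is sketched below. The placeholder model stands in for the proposed network only so the optimizer call is runnable; the dataset loading, epoch count and learning-rate schedule are not specified here and are left out.

```python
import torch
from torchvision import transforms

# Resize to 320x320, then randomly crop to 288x288, as described above.
train_transform = transforms.Compose([
    transforms.Resize((320, 320)),
    transforms.RandomCrop((288, 288)),
    transforms.ToTensor(),
])

# Placeholder module standing in for the proposed NCM/HIT network.
model = torch.nn.Conv2d(3, 1, kernel_size=3, padding=1)

# SGD with momentum 0.9, weight decay 0.0005, initial learning rate 0.001;
# the batch size of 20 would be passed to the DataLoader.
optimizer = torch.optim.SGD(model.parameters(), lr=0.001,
                            momentum=0.9, weight_decay=0.0005)
```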

4.4. Comparison with state-of-the-arts (SOTA)


We compare the proposed model with 12 typical deep learning based models: PiCANet (Liu et al., 2018a), UNet++ (Zhou et al., 2018), ANet (Le et al., 2019), BASNet (Qin et al., 2019), CPD-R (Wu et al., 2019a), HTC (Chen et al., 2019b), MSRCNN (Huang et al., 2019), PFANet (Zhao and Wu, 2019), PoolNet (Liu et al., 2019), EGNet (Zhao et al., 2019), SINet (Fan et al., 2020a) and SCRN (Wu et al., 2019b). Among them, ANet and SINet are specially designed for COD, while the other networks are designed for SOD tasks. As for the camouflage maps of the compared models, part of them are provided by the authors, and part of them are generated by training and testing with the released source code of the relevant model.

4.4.1. Quantitative evaluation

First, as listed in Table 1, our model achieves SOTA performance on all datasets (based on four widely used evaluation indicators). Compared with the other models, the performance of the proposed model shows different degrees of improvement on the three benchmark datasets. In particular, compared with SINet on the CAMO dataset, the S score, E_mean score and F_max score of our model increase by 0.021 (2.8%), 0.017 (2.2%) and 0.024 (3.4%), and the MAE score decreases by 0.008 (8%), respectively. We have made similar evaluations on subclasses of the COD10K dataset against other models (as listed in Table 2). It can be seen that our model also achieves SOTA performance. Especially for COD10K-Aquatic, the F_max score increases by 0.038 and the MAE decreases by 0.01. In all, the superiority of our proposed model has been demonstrated.

4.4.2. Qualitative evaluation

In order to make a more intuitive comparison with other models and highlight the advantages of our model, we provide some visual contrast maps against other models. Specially, we compare with the two highest-ranked existing models on COD (SINet and SCRN). As can be seen from Fig. 4, our model achieves the best performance. For example, on the COD10K and CAMO datasets, the structure of the camouflaged object detected by SINet (except for the first row of the COD10K dataset) and SCRN is incomplete, and there are different degrees of missing parts. On the CHAMELEON dataset (the fourth row), a hole in the camouflage maps is generated by the SINet model, which may be the result of its dilated convolution unit. However, our model detects the camouflaged objects more accurately, which is due to the connections between different dilated rates in our HIT module. From the PR curves (shown in Fig. 5), we can see that our model is superior to most models under different thresholds.

Fig. 6. Visual comparison of our model, backbone with NCM and Bi-branch, backbone with Bi-branch, and backbone.

4.5. Ablation study

In this section, we carry out experiments on the benchmark datasets to verify the effectiveness of the proposed model. The results are listed in Table 3 and Fig. 6.

We first report the performance of the backbone model, as listed in the first row of Table 3. Bi-branch indicates that we combine the bifurcation structure, the DC and the FIA modules (in other words, the Bi-branch model represents our complete model with NCM and HIT removed). Next, when we add NCM on the basis of Bi-branch, the evaluation indicators show that the performance of the model is improved. Specifically, on the CAMO dataset, the S score, E_mean score and F_max score increase by 0.069, 0.101 and 0.085, and the MAE score decreases by 0.03, respectively. On the COD10K dataset, the S score, E_mean score and F_max score increase by 0.034, 0.058 and 0.048, and the MAE score decreases by 0.013. As shown in the fifth column of Fig. 6, NCM effectively integrates the information of the three neighbor levels and highlights the location of the object against the background. On the other hand, when we combine HIT with NCM, the performance of our model (the fourth column) is further improved. On the CAMO dataset, the S score, E_mean score and F_max score increase by 0.021, 0.033 and 0.015, and the MAE score decreases by 0.010. As shown in the last column of Fig. 6, the HIT module effectively enhances the interaction among various features and refines the boundaries of the targets.

In all, the experimental results verify the effectiveness of our proposed model.

2 The training set comes from https://github.com/DengPingFan/SINet.


Table 3
Ablation study of our proposed model on different architectures.

Backbone | Bi-branch | NCM | HIT | CHAMELEON (S↑ MAE↓ Emean↑ Fmax↑) | CAMO (S↑ MAE↓ Emean↑ Fmax↑) | COD10K (S↑ MAE↓ Emean↑ Fmax↑)
√ | | | | 0.526 0.272 0.432 0.335 | 0.511 0.291 0.428 0.356 | 0.529 0.219 0.457 0.294
√ | √ | | | 0.799 0.072 0.772 0.744 | 0.690 0.128 0.669 0.630 | 0.742 0.062 0.735 0.632
√ | √ | √ | | 0.858 0.048 0.860 0.811 | 0.759 0.098 0.770 0.715 | 0.776 0.049 0.793 0.680
√ | √ | √ | √ | 0.874 0.041 0.891 0.834 | 0.780 0.088 0.803 0.730 | 0.790 0.046 0.817 0.675

4.6. Failure cases

Although our model has achieved remarkable performance on the benchmark datasets, there are also some failure cases. As shown in Fig. 7, our model considers some objects in the background that are close to or highly similar to camouflaged objects as an integral part of the camouflaged objects. This may be because we only consider general complex scenarios and ignore the extremely complex ones. In future work, we will consider all scenarios and propose a more applicable model.

Fig. 7. Some examples of failure cases. Because SINet achieved the best performance, we only list the comparison with SINet.

5. Conclusion

In this paper, we propose an end-to-end network with a double branch for COD. Our network consists of two main components, NCM and HIT. NCM simulates the process of visual exploration and improves feature reuse, which is beneficial to highlighting the boundary and position information of detected objects. As for HIT, large scale convolutions with different dilated rates are used for feature extraction, and the generated features are further refined by connecting different levels. We combine the two components to improve the detection accuracy of COD tasks. We employ four metrics to compare the proposed scheme with 12 typical deep learning based models on three public benchmark datasets, and the results show that our model achieves the best performance. In addition, we have carried out a series of ablation experiments, and the experimental results also prove the effectiveness of our proposed model.

CRediT authorship contribution statement

Cong Zhang: Conceptualization, Methodology, Writing. Kang Wang: Software, Visualization. Hongbo Bi: Supervision. Ziqi Liu: Data curation, Writing – original draft. Lina Yang: Reviewing and editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

Arbelaez, P., Maire, M., Fowlkes, C.C., Malik, J., 2011. Contour detection and hierarchical image segmentation. 33, (5), pp. 898–916.
Bhajantri, N.U., Nagabhushan, P., 2007. Camouflage defect identification: A novel approach. In: International Conference on Information Technology (ICIT). pp. 145–148.
Borji, A., Cheng, M.M., Hou, Q., Jiang, H., Li, J., 2019. Salient object detection: A survey. 5, (2), pp. 117–150.
Borji, A., Itti, L., 2012. Exploiting local and global patch rarities for saliency detection. pp. 478–485.
Chen, Y., Han, C., Wang, N., Zhang, Z., 2019a. Revisiting feature alignment for one-stage object detection. arXiv preprint arXiv:1908.01570.
Chen, K., Pang, J., Wang, J., Xiong, Y., Li, X., Sun, S., Feng, W., Liu, Z., Shi, J., Ouyang, W., et al., 2019b. Hybrid task cascade for instance segmentation. pp. 4974–4983.
Chen, Z., Xu, Q., Cong, R., Huang, Q., 2020. Global context-aware progressive aggregation network for salient object detection. 34, (07), pp. 10599–10606.
Cheng, M.M., Mitra, N.J., Huang, X., Torr, P.H., Hu, S.M., 2014. Global contrast based salient region detection. 37, (3), pp. 569–582.
Chu, H.K., Hsu, W.H., Mitra, N.J., Cohen-Or, D., Lee, T.Y., 2010. Camouflage images. ACM Trans. Graph. (ACM) 29 (4).
Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., Wei, Y., 2017. Deformable convolutional networks. pp. 764–773.
Fan, D.P., Cheng, M.M., Liu, J.J., Gao, S.H., Hou, Q., Borji, A., 2018a. Salient objects in clutter: Bringing salient object detection to the foreground. pp. 186–202.
Fan, D.P., Cheng, M.M., Liu, Y., Li, T., Borji, A., 2017. Structure-measure: A new way to evaluate foreground maps. pp. 4548–4557.
Fan, D.P., Gong, C., Cao, Y., Ren, B., Cheng, M.M., Borji, A., 2018b. Enhanced-alignment measure for binary foreground map evaluation. pp. 698–704.
Fan, D.P., Ji, G.P., Cheng, M.M., Shao, L., 2021. Concealed object detection. IEEE Trans. Pattern Anal. Mach. Intell.
Fan, D.P., Ji, G.P., Sun, G., Cheng, M.M., Shen, J., Shao, L., 2020a. Camouflaged object detection. pp. 2777–2787.
Fan, D.P., Ji, G.P., Zhou, T., Chen, G., Fu, H., Shen, J., Shao, L., 2020b. PraNet: Parallel reverse attention network for polyp segmentation. Med. Image Comput. Comput.-Assisted Interv. (MICCAI) 263–273.
Fan, D.P., Zhou, T., Ji, G.P., Zhou, Y., Chen, G., Fu, H., Shen, J., Shao, L., 2020c. Inf-net: Automatic COVID-19 lung infection segmentation from ct images. 39, (8), pp. 2626–2637.
Galun, M., Sharon, E., Basri, R., Brandt, A., 2003. Texture segmentation by multiscale aggregation of filter responses and shape elements. pp. 716–723.
Ge, S., Jin, X., Ye, Q., Luo, Z., Li, Q., 2018. Image editing by object-aware optimal boundary searching and mixed-domain composition. 4, (1), pp. 71–82.
He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition. pp. 770–778.
Hou, Q., Cheng, M., Hu, X., Borji, A., Tu, Z., Torr, P.H.S., 2017. Deeply supervised salient object detection with short connections. pp. 5300–5309.
Huang, Z., Huang, L., Gong, Y., Huang, C., Wang, X., 2019. Mask scoring R-CNN. pp. 6409–6418.
Huang, G., Liu, Z., van der Maaten, L., Weinberger, K.Q., 2017. Densely connected convolutional networks. pp. 4700–4708.
Itti, L., Koch, C., Niebur, E., 1998. A model of saliency-based visual attention for rapid scene analysis. 20, (11), pp. 1254–1259.
Klein, D.A., Frintrop, S., 2011. Center-surround divergence of feature statistics for salient object detection. pp. 2214–2219.
Law, H., Deng, J., 2018. Cornernet: Detecting objects as paired keypoints. pp. 734–750.
Le, T., Nguyen, T.V., Nie, Z., Tran, M., Sugimoto, A., 2019. Anabranch network for camouflaged object segmentation. Comput. Vis. Image Underst. (CVIU) 184, 45–56.
Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S., 2017. Feature pyramid networks for object detection. pp. 2117–2125.
Liu, N., Han, J., Yang, M., 2018a. PiCANet: Learning pixel-wise contextual attention for saliency detection. pp. 3089–3098.
Liu, J., Hou, Q., Cheng, M., Feng, J., Jiang, J., 2019. A simple pooling-based design for real-time salient object detection. pp. 3917–3926.
Liu, S., Huang, D., Wang, Y., 2018b. Receptive field block net for accurate and fast object detection. pp. 404–419.
Liu, T., Yuan, Z., Sun, J., Wang, J., Zheng, N., Tang, X., Shum, H.-Y., 2010. Learning to detect a salient object. 33, (2), pp. 353–367.


Pang, Y., Zhao, X., Zhang, L., Lu, H., 2020. Multi-scale interactive network for salient object detection.
Perazzi, F., Krähenbühl, P., Pritch, Y., Hornung, A., 2012. Saliency filters: Contrast based filtering for salient region detection. pp. 733–740.
Qin, X., Zhang, Z., Huang, C., Gao, C., Dehghan, M., Jagersand, M., 2019. BASNet: Boundary-aware salient object detection. pp. 7479–7489.
Rao, C.P., Reddy, A.G., Rao, C., 2020. Camouflaged object detection for machine vision applications. Int. J. Speech Technol. 23 (11).
Ren, S., He, K., Girshick, R., Sun, J., 2015. Faster r-cnn: Towards real-time object detection with region proposal networks. pp. 91–99.
Shen, Y., Ji, R., Zhang, S., Zuo, W., Yan, W., 2018. Generative adversarial learning towards fast weakly supervised detection. pp. 5764–5773.
Song, L., Geng, W., 2010. A new camouflage texture evaluation method based on WSSIM and nature image features. pp. 1–4.
Tankus, A., Yeshurun, Y., 2001. Convexity-based visual camouflage breaking. Comput. Vis. Image Underst. (CVIU) 82 (3), 208–237.
Tian, Z., Shen, C., Chen, H., He, T., 2019. Fcos: Fully convolutional one-stage object detection. pp. 9627–9636.
Wang, W., Shen, J., 2018. Deep visual attention prediction. 27, (5), pp. 2368–2378.
Wang, W., Shen, J., Cheng, M., Shao, L., 2019a. An iterative and cooperative top-down and bottom-up inference network for salient object detection. pp. 5968–5977.
Wang, W., Shen, J., Dong, X., Borji, A., 2018. Salient object detection driven by fixation prediction. pp. 1711–1720.
Wang, W., Zhao, S., Shen, J., Hoi, S.C.H., Borji, A., 2019b. Salient object detection with pyramid attention and salient edges. pp. 1448–1457.
Wu, Y.H., Gao, S.H., Mei, J., Xu, J., Fan, D.P., Zhao, C.W., Cheng, M.M., 2020. JCS: An explainable COVID-19 diagnosis system by joint classification and segmentation. arXiv preprint arXiv:2004.07054.
Wu, Z., Su, L., Huang, Q., 2019a. Cascaded partial decoder for fast and accurate salient object detection. pp. 3907–3916.
Wu, Z., Su, L., Huang, Q., 2019b. Stacked cross refinement network for edge-aware salient object detection. pp. 7264–7273.
Xie, Y., Lu, H., Yang, M.H., 2012. Bayesian saliency via low and mid level cues. 22, (5), pp. 1689–1698.
Yan, J., Le, T.N., Nguyen, K.D., Tran, M.T., Do, T.T., Nguyen, T.V., 2021. Mirrornet: Bio-inspired camouflaged object segmentation. IEEE Access 9, 43290–43300.
Yu, F., Koltun, V., 2016. Multi-scale context aggregation by dilated convolutions.
Zhang, D., Han, J., Li, C., Wang, J., Li, X., 2016. Detection of co-salient objects by looking deep and wide. 120, (2), pp. 215–232.
Zhang, P., Liu, W., Lu, H., Shen, C., 2019. Salient object detection with lossless feature reflection and weighted structural loss. 28 (6), 3048–3060.
Zhang, P., Wang, D., Lu, H., Wang, H., Ruan, X., 2017. Amulet: Aggregating multi-level convolutional features for salient object detection. pp. 202–211.
Zhao, J., Liu, J.-J., Fan, D.P., Cao, Y., Yang, J., Cheng, M.M., 2019. EGNet: Edge guidance network for salient object detection. pp. 8779–8788.
Zhao, T., Wu, X., 2019. Pyramid feature attention network for saliency detection. pp. 3085–3094.
Zheng, Y., Zhang, X., Wang, F., Cao, T., Sun, M., Wang, X., 2019. Detection of people with camouflage pattern via dense deconvolution network. 26, (1), pp. 29–33.
Zhou, Z., Siddiquee, M.M.R., Tajbakhsh, N., Liang, J., 2018. Unet++: A nested u-net architecture for medical image segmentation. In: Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support. Springer, pp. 3–11.
Zhou, X., Zhuo, J., Krahenbuhl, P., 2019. Bottom-up object detection by grouping extreme and center points. pp. 850–859.
