to improve the segmentation accuracy of this network. Although existing models have achieved satisfactory performance, some challenges still need to be addressed. In the field of COD, how to extract features and how to use the extracted features/information to distinguish the camouflaged object from the background is a crucial problem. From the perspective of information reuse, most methods feed the information extracted at the current level to the successive level, or achieve a forward cross-layer transfer through varied connections (dense connections or short connections). Previous methods lacked the use of information from the latter level to refine or strengthen the current-level prediction (equivalent to using the generated camouflage map to guide the detection). In addition, because of the diversity of information provided by different receptive fields, directly concatenating the features of different receptive-field branches in the channel dimension may not achieve satisfactory results. To address these issues, we propose a novel framework for COD, which consists of two main components, namely, the Neighbor Connection Mode (NCM) and Hierarchical Information Transfer (HIT). NCM connects neighbor-layer features to enhance the feature representation, while HIT uses dilated convolutions with a variety of dilation rates to extract more abundant information. Our main contributions are threefold:

1. We propose a novel double-branch framework for COD, which utilizes large-stride dilated convolution and an efficient neighbor-layer connection mode. We evaluate the performance of the proposed model on three benchmark datasets. Compared with 12 typical deep learning based models, our model achieves state-of-the-art performance.

2. A Neighbor Connection Mode (NCM) is proposed to aggregate information from adjacent levels effectively, which provides more boundary and location information to locate camouflaged objects precisely.

3. An effective module, Hierarchical Information Transfer (HIT), is presented, which not only employs convolutions with large dilation rates, but also considers the relationship between the branches of the proposed module. As a result, the receptive fields are enlarged, and the location of the camouflaged object is refined.

2. Related works

Because of the similarities among GOD, SOD and COD, and because some SOD algorithms can be applied to COD tasks, this section first reviews the existing works on object detection. We then briefly introduce related connection modes, attention mechanism models and dilated convolution models.

2.2. Salient object detection

Early salient object detection methods mainly relied on handcrafted local features (Itti et al., 1998; Klein and Frintrop, 2011; Xie et al., 2012), global features (Liu et al., 2010; Perazzi et al., 2012; Wang and Shen, 2018; Cheng et al., 2014) or both (Borji and Itti, 2012) to generate saliency maps. Xie et al. (2012) proposed a subspace clustering method based on Laplacian sparsity to detect salient object regions in the given images by dividing superpixels with local features into groups. Cheng et al. (2014) proposed a region-contrast algorithm to detect salient objects by dividing images into regions with significance values and then calculating saliency maps using a global contrast value. However, the shortcoming of these traditional methods is that they can hardly obtain complete information accurately.

Benefiting from the ability of convolutional neural networks (CNNs) to capture high-level and global semantic information, which supplements and enhances traditional features, CNNs have been widely used in salient object detection (Hou et al., 2017; Wang et al., 2019a, 2018, 2019b). Qin et al. (2019) proposed a prediction architecture to segment salient object regions exactly while predicting detailed structures with clear boundaries. Zhao and Wu (2019) adopted a context-aware pyramid feature extraction module and a spatial attention module to capture high-level and low-level features, respectively. To address the problem of blurred boundaries, Zhang et al. (2019) introduced a novel weighted structural loss function into a symmetrical fully convolutional network to make sure the object has clear boundaries. Chen et al. (2020) proposed a network which can capture high-level semantic features, low-level appearance features and global context features.

2.3. Camouflaged object detection

Among the traditional methods of camouflaged object detection, Tankus and Yeshurun (2001) realized a breakthrough in the area of visual camouflage with the Drag operator, which is able to detect camouflaged objects effectively in images of concave or smoothly convex 3D objects. Galun et al. (2003) tried to use texture segmentation technology to combine the structural features of texture elements with
the corresponding filter, and used various statistical data to distinguish different textures. However, how to combine various statistical data into a single weight is an unsolved problem. Bhajantri and Nagabhushan (2007) proposed a method to identify camouflaged defects by extracting co-occurrence matrices from image blocks. However, when there are many kinds of defects in the input image, this method loses its effectiveness. Song and Geng (2010) presented a method that evaluated the surroundings by weighted structural similarity to create the camouflaged image, and this method can also be used to distinguish the camouflage texture. Rao et al. (2020) proposed a constructive method for entity texture characterization and statistical modeling of camouflaged images under the condition of texture smoothing to identify one or more camouflaged targets in camouflaged images. However, the detection accuracy is affected by different kinds of images, atmospheric turbulence, and target size.

To effectively detect camouflaged objects or regions in given images, more research attention has been put on introducing convolutional neural networks into COD. Inspired by the observation that humans can hardly guarantee whether camouflaged objects exist in given images, Le et al. (2019) presented an end-to-end network to segment the camouflaged object accurately according to whether the image contains a camouflaged target or not. To fuse multi-scale semantic features effectively, Zheng et al. (2019) presented a dense deconvolution network based on the visual attention mechanism; the network adopts short connections in the deconvolution phase to detect humans with camouflage patterns in messy natural circumstances. Recently, Fan et al. (2020a) proposed a novel framework that consists of a search module (SM) and an identification module (IM). The search module saves information from various layers by a densely linked strategy, and the identification module leverages this information to detect the camouflaged object accurately.

In this paper, we try to use a new adjacency method to combine features. Meanwhile, we expand the receptive field by using dilated convolution to extract a clearer target structure. Specifically, the features of the current level, the latter level (further fine features relative to the current level) and the intermediate feature of the previous level are fused as a new intermediate feature. As shown in the nattier blue rectangle in Fig. 2, we call this fusion method the Neighbor Connection Mode (NCM). Considering that different receptive fields can highlight the area to be detected and that richer information can thus be extracted, we propose a dilated convolution group module with hierarchical information transfer (HIT). A more discriminative feature is obtained by further optimizing the intermediate feature. Next, the decoder module (DC) (Wu et al., 2019a) is used to generate the initial camouflage maps. Since camouflaged scenes are complicated and contain noise interference, it is difficult to locate targets from the inputs accurately. Therefore, inspired by Wu et al. (2019a), we introduce the Gaussian convolution function to filter out environment interference and enlarge the detection coverage, improving the effectiveness of the initial camouflage maps. The output of the Gaussian convolution is denoted as $F_{in}$.

In order to make use of the generated camouflage maps to refine the feature maps, which can also be understood as using the posterior feature to optimize the detection, we reuse conv4 and conv5 to build a new optimization branch; the camouflage maps generated by this branch are named enhanced camouflage maps $F_e$ ($e = 4, 5$). In addition, instead of directly upsampling and outputting the enhanced camouflage map of the optimization branch, we comprehensively consider the global information (from the enhanced camouflage maps) and the local information (from all levels of rough features), and adopt an aggregation method similar to GCPANet (Chen et al., 2020) to generate the final camouflage maps.
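To make the neighbor-connection idea above concrete, the following is a minimal PyTorch-style sketch of fusing the current-level feature with a neighboring ("latter") level feature and the intermediate feature of the previous level. The 32-channel width, the bilinear resizing and the concatenate-then-convolve fusion are our own assumptions for illustration; the actual NCM wiring is the one shown in Fig. 2.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class NeighborConnection(nn.Module):
    """Sketch of the neighbor-connection fusion: the current-level feature, a neighboring
    ("latter") level feature and the intermediate feature of the previous level are merged
    into a new intermediate feature. Channel width, resizing and fusion op are assumptions."""

    def __init__(self, ch_cur: int, ch_nbr: int, ch_mid: int = 32):
        super().__init__()
        self.reduce_cur = nn.Conv2d(ch_cur, ch_mid, kernel_size=1)
        self.reduce_nbr = nn.Conv2d(ch_nbr, ch_mid, kernel_size=1)
        self.fuse = nn.Sequential(
            nn.Conv2d(3 * ch_mid, ch_mid, kernel_size=3, padding=1),
            nn.BatchNorm2d(ch_mid),
            nn.ReLU(inplace=True),
        )

    def forward(self, f_cur, f_nbr, f_prev_mid):
        # Bring the neighbor feature and the previous intermediate feature (assumed to
        # already have ch_mid channels) to the spatial size of the current level.
        size = f_cur.shape[-2:]
        f_cur = self.reduce_cur(f_cur)
        f_nbr = F.interpolate(self.reduce_nbr(f_nbr), size=size, mode="bilinear", align_corners=False)
        f_prev_mid = F.interpolate(f_prev_mid, size=size, mode="bilinear", align_corners=False)
        # Concatenate the three inputs and fuse them into the new intermediate feature.
        return self.fuse(torch.cat([f_cur, f_nbr, f_prev_mid], dim=1))
```

One such block per level would produce intermediate features analogous to the $F_j^c$ that are later refined by HIT (Section 3.3).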
3.2. Neighbor connection mode
Fig. 2. The overall architecture of our proposed network, which consists of two main modules, namely, the Neighbor Connection Mode (NCM) and Hierarchical Information Transfer (HIT). NCM aggregates the features from adjacent layers. The purpose of HIT is to enhance the information transmission among different convolution layers with various dilation rates.
Fig. 3. The details of our proposed HIT module. SA denotes the Spatial Attention module.
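Before the formal description in Section 3.3 below, the sketch here illustrates the kind of multi-branch, gated dilated-convolution block that Fig. 3 depicts: each branch compresses its input with a 1 × 1 convolution, applies a progressively larger kernel/dilation (2k − 1), and is gated by the output of the preceding branch before a dilated 3 × 3 convolution. Treating s = 2k − 1 as the kernel extent, the padding choices and a plain residual fifth branch (without the SA module) are assumptions made only for this illustration; the per-branch gating only approximates Eqs. (2)–(5).

```python
import torch
import torch.nn as nn


class HITBranchGroup(nn.Module):
    """Sketch of a multi-branch, gated dilated-convolution block in the spirit of Fig. 3.

    Assumptions for illustration only: s = 2k - 1 is treated as the kernel extent of the
    per-branch convolution, padding keeps the spatial size, and the fifth branch is a
    plain residual path (the SA module is omitted).
    """

    def __init__(self, in_ch: int, mid_ch: int = 32, num_branches: int = 4):
        super().__init__()
        # One 1x1 compression per branch (four main branches + the residual fifth branch).
        self.reduce = nn.ModuleList(
            [nn.Conv2d(in_ch, mid_ch, kernel_size=1) for _ in range(num_branches + 1)]
        )
        self.pre, self.post = nn.ModuleList(), nn.ModuleList()
        for k in range(1, num_branches + 1):
            s = 2 * k - 1  # 1, 3, 5, 7
            self.pre.append(nn.Conv2d(mid_ch, mid_ch, kernel_size=s, padding=s // 2))
            self.post.append(nn.Conv2d(mid_ch, mid_ch, kernel_size=3, padding=s, dilation=s))
        self.merge = nn.Conv2d(num_branches * mid_ch, mid_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b = [conv(x) for conv in self.reduce]  # per-branch channel compression
        outs, prev = [], None
        for k in range(len(self.pre)):
            xk = self.pre[k](b[k])
            if prev is not None:
                # Gate the current branch with the previous branch's output (cf. Eq. (4)).
                xk = xk * torch.sigmoid(xk + prev) + xk
            prev = self.post[k](xk)  # dilated 3x3 refinement (cf. Eq. (5))
            outs.append(prev)
        # Concatenate the branch outputs, reduce, and add the residual fifth branch.
        return self.merge(torch.cat(outs, dim=1)) + b[-1]
```

The published implementation may differ in details such as normalization and the attention branch; this is only a reading aid for the equations that follow.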
3.3. Hierarchical Information Transfer (HIT)

Many previous works, such as RFB (Liu et al., 2018b) and SINet (Fan et al., 2020a), have proved that increasing the receptive field makes the extracted features more diverse and the detection results more satisfactory. Thus, we propose a Hierarchical Information Transfer (HIT) module with dilated convolutions and large-stride convolution kernels.

Because the receptive fields of the branches are different, we consider that adding bridge connections between branches with different receptive fields can increase the degree of feature reuse. On the other hand, it can supplement the information that the large receptive field lacks relative to the small receptive field. In this way, the characteristics of the larger receptive field contain not only the information of the current branch, but also the information of the small-receptive-field branches. The structure of the module is shown in Fig. 3.

Our HIT module consists of five branches, four of which constitute the main part of the module, as shown in Fig. 3. In addition, we introduce the idea of residual learning and add the attention mechanism (Zhao and Wu, 2019) to suppress the interference of irrelevant features, which constitutes the fifth branch.

Specifically, the input of the HIT module is the intermediate feature ($F_j^c$, $j = 1, \dots, 4$) of the four levels mentioned in Section 3.2 Neighbor Connection Mode. Because the number of channels in each level is different, in order to facilitate the calculation, we first compress the channel number of the input features to 32 by using a 1 × 1 convolution for each of the five branches; the corresponding outputs are denoted by $b_k$ ($k = 1, \dots, 5$). The first four branches employ convolution kernels of different strides and dilation rates ($s = 2k - 1$, where $s$ represents the stride of the convolution kernel and $k$ represents the level of the branch) for feature extraction. When $k = 1$, the stride of the convolution kernel is 1 and the dilation rate is 1; the output of this convolution is denoted by $b_1$. When $k = 2$, $x_2$ is generated by convolution with a stride of 3
Table 1
Quantitative results on the COD benchmark datasets compared with 14 typical deep learning based models. The best performance is highlighted in red, the second in green and the third in blue. ↑ indicates that a higher score is better; ↓ indicates that a lower score is better. ∗ indicates that the source code is not publicly accessible. We re-evaluate the camouflage maps of the compared models (which come from https://github.com/DengPingFan/SINet) on public evaluation indicators.
Model CAMO CHAMELEON COD10K
S↑ MAE↓ Emean↑ Fmax↑ S↑ MAE↓ Emean↑ Fmax↑ S↑ MAE↓ Emean↑ Fmax↑
PiCANet (Liu et al., 2018a) 0.609 0.156 0.584 0.529 0.769 0.085 0.749 0.698 0.649 0.090 0.643 0.485
UNet++ (Zhou et al., 2018) 0.599 0.149 0.653 0.501 0.695 0.094 0.762 0.569 0.623 0.086 0.672 0.425
ANet (Le et al., 2019) 0.682 0.126 0.685 0.566 * * * * * * * *
BASNet (Qin et al., 2019) 0.618 0.159 0.661 0.519 0.687 0.118 0.721 0.581 0.634 0.105 0.678 0.451
CPD-R (Wu et al., 2019a) 0.726 0.115 0.729 0.663 0.853 0.052 0.866 0.800 0.747 0.059 0.770 0.631
HTC (Chen et al., 2019b) 0.476 0.172 0.442 0.218 0.517 0.129 0.489 0.236 0.548 0.088 0.520 0.253
MSRCNN (Huang et al., 2019) 0.617 0.133 0.669 0.528 0.637 0.091 0.686 0.506 0.641 0.073 0.706 0.479
PFANet (Zhao and Wu, 2019) 0.659 0.172 0.622 0.564 0.679 0.144 0.648 0.552 0.636 0.128 0.618 0.466
PoolNet (Liu et al., 2019) 0.702 0.129 0.698 0.629 0.776 0.081 0.779 0.706 0.705 0.074 0.713 0.870
EGNet (Zhao et al., 2019) 0.732 0.104 0.768 0.680 0.848 0.050 0.870 0.795 0.737 0.056 0.779 0.613
SCRN (Wu et al., 2019b) 0.744 0.105 0.742 0.703 0.864 0.047 0.872 0.823 0.773 0.051 0.783 0.674
SINet (Fan et al., 2020a) 0.751 0.100 0.771 0.706 0.869 0.044 0.891 0.832 0.771 0.051 0.806 0.676
MirrorNet (ResNet-50) (Yan et al., 2021) 0.741 0.100 * * * * * * * * * *
SINet-V2 (ResNet-50) (Fan et al., 2021) 0.725 0.103 0.754 0.718 0.770 0.059 0.871 0.808 0.719 0.053 0.766 0.620
Ours 0.780 0.088 0.803 0.730 0.874 0.041 0.891 0.834 0.790 0.046 0.817 0.695
(the input of the convolution is $b_2$). Based on the above analysis, we combine $b_1$ and $x_2$ by element-wise addition to highlight the target position and suppress interfering features, followed by the sigmoid function. In this way, we obtain soft attention weights. Next, we fuse the weights with $x_2$ by element-wise multiplication, and take the original $x_2$ as the bias to obtain the weighted feature map. The process can be expressed as follows:

$$F_2^{w} = mul\left(x_2,\ \sigma(x_2 + b_1)\right) + x_2 \qquad (2)$$

$F_2^{w}$ represents the weighted feature map of the corresponding level and $\sigma$ represents the sigmoid function. Next, $F_2^{w}$ is further processed by a convolution unit with kernel size 3 × 3 and dilation rate $2k - 1$ to generate the final camouflage map of this branch ($k = 2$), which is denoted by $x_2^{out}$. When $k = 1$, $x_1^{out} = b_1$.

When $k = 3$ and $4$, the process is similar to $k = 2$. The details can be expressed as follows:

$$x_k = Conv(b_k,\ s = 2k - 1) \qquad (3)$$

$$F_k^{w} = mul\left(x_k,\ \sigma(x_k + x_{k-1}^{out})\right) + x_k \qquad (4)$$

$$x_k^{out} = Conv(F_k^{w},\ s = 3,\ d = 2k - 1) \qquad (5)$$

where $d$ represents the dilation rate. Next, the $x_k^{out}$ are fused in the channel dimension using a concatenation operation, followed by a convolution with kernel size 1 × 1 to reduce the dimension. Finally, the result is fused with the fifth branch by element-wise addition to generate the camouflage map, and the output is denoted by $F_{j+1}^{c}$.

3.4. Loss function

BCE is a common loss function used to optimize the model, which forces the network detection results to be more accurate. BCE can be calculated by the following formula:

$$Loss = -\frac{1}{N}\sum_{q=1}^{N}\left[y_q \log\left(p_q\right) + \left(1 - y_q\right)\log\left(1 - p_q\right)\right] \qquad (6)$$

where $N$ represents the number of all pixels in the image, $y_q$ denotes the ground truth of pixel $q$, and $p_q$ represents the probability that pixel $q$ belongs to the camouflaged objects. $N$ can be obtained by multiplying the width and height of the image.

In this paper, we use BCE for multistage supervision. The closer a prediction is to the output of the network, the greater the weight given to it (1, 0.8, 0.6, 0.4, respectively). Therefore, the final loss function can be calculated by the following formula:

$$L_{all} = \sum_{m=1}^{4} \alpha_m\, Loss_m \qquad (7)$$

where $\alpha_m$ ($m = 1, \dots, 4$) represents the different stage weights mentioned above.

4. Experiments and results

In this section, we describe the benchmark datasets in COD, the evaluation metrics and the implementation details of our proposed network. In addition, we compare our proposed model with other typical deep learning based models.

4.1. Datasets

CAMO (Le et al., 2019) is a sub-dataset of CAMO-COCO (Le et al., 2019), proposed by Le et al. in 2018 for camouflaged object segmentation. CAMO contains 1250 images (1000 for training, 250 for testing), where each image contains at least one camouflaged object. Furthermore, it involves a variety of challenging scenarios, such as shape complexity, background clutter, object appearance, and so on.

CHAMELEON (Shen et al., 2018) contains 76 images, all of which are natural camouflage where animals are hidden in the natural environment by their skin color or body shape. It is laborious to find these camouflaged animals in natural camouflage images; as a result, these images are suitable for camouflaged object detection.

COD10K (Fan et al., 2020a) is the latest dataset proposed by Fan et al. for camouflaged object detection, which contains 10000 images (6000 for training, 4000 for testing). It consists of 10 super-classes and 78 sub-classes, gathered from several photography websites. Besides, COD10K contains two kinds of camouflage images: natural camouflage and artificial camouflage. Natural camouflage images involve animal camouflage hidden in the natural environment, while artificial camouflage images cover many occasions, such as operations, games, and so on.

4.2. Evaluation metrics

F-Measure (Arbelaez et al., 2011) is the weighted harmonic average of Precision and Recall. It is widely used in the field of information retrieval (IR) as an evaluation criterion and is mainly used to evaluate the quality of classification models. The F-measure is calculated by the following formula:

$$F_{\beta} = \frac{\left(\beta^{2} + 1\right) P R}{\beta^{2} P + R} \qquad (8)$$

where $\beta$ is a hyperparameter to trade off precision and recall. According to Zhang et al. (2016), we set $\beta^{2} = 0.3$. $P$ and $R$ represent precision
Table 2
Comparison of our model and others on the three largest subclasses of the COD10K dataset in terms of four public evaluation indicators. The best performance is marked in red.
Model COD10K-Aquatic (474 images) COD10K-Terrestrial (699 images) COD10K-Flying (714 images)
S↑ MAE↓ Emean↑ Fmax↑ S↑ MAE↓ Emean↑ Fmax↑ S↑ MAE↓ Emean↑ Fmax↑
PiCANet (Liu et al., 2018a) 0.629 0.120 0.623 0.509 0.625 0.084 0.628 0.424 0.677 0.076 0.663 0.517
UNet++ (Zhou et al., 2018) 0.599 0.121 0.659 0.437 0.593 0.081 0.637 0.366 0.659 0.068 0.708 0.467
BASNet (Qin et al., 2019) 0.620 0.134 0.666 0.465 0.601 0.109 0.645 0.382 0.664 0.086 0.710 0.492
CPD-R (Wu et al., 2019a) 0.739 0.082 0.770 0.652 0.714 0.058 0.735 0.565 0.777 0.046 0.796 0.672
HTC (Chen et al., 2019b) 0.507 0.129 0.494 0.223 0.530 0.078 0.484 0.196 0.582 0.070 0.558 0.309
MSRCNN (Huang et al., 2019) 0.614 0.107 0.685 0.465 0.611 0.070 0.671 0.418 0.674 0.058 0.742 0.523
PFANet (Zhao and Wu, 2019) 0.629 0.162 0.614 0.490 0.609 0.123 0.600 0.404 0.657 0.113 0.632 0.496
PoolNet (Liu et al., 2019) 0.689 0.102 0.705 0.572 0.677 0.070 0.688 0.509 0.733 0.062 0.733 0.614
EGNet (Zhao et al., 2019) 0.725 0.080 0.775 0.630 0.704 0.054 0.748 0.550 0.768 0.044 0.803 0.651
SCRN (Wu et al., 2019b) 0.767 0.071 0.787 0.695 0.746 0.051 0.756 0.629 0.803 0.040 0.812 0.714
SINet (Fan et al., 2020a) 0.758 0.073 0.803 0.685 0.743 0.050 0.778 0.623 0.798 0.040 0.828 0.709
SINet-V2 (ResNet-50) (Fan et al., 2021) 0.617 0.148 0.627 0.456 0.770 0.059 0.871 0.808 0.719 0.053 0.766 0.620
Ours 0.786 0.062 0.823 0.725 0.763 0.045 0.785 0.643 0.818 0.034 0.837 0.735
Fig. 4. Visual comparison of COD detection results with 12 models on the three public datasets. The first column shows the input image, the second column the Ground Truth (GT), and the third column the results of our model. As shown in the figure, our results are closest to the GT. A completely black detection result indicates that the model does not detect any camouflaged object. Since ANet and MirrorNet did not provide predicted maps on the corresponding datasets or complete PyTorch code, we cannot reproduce these models and exhibit their camouflage maps.
and recall, respectively. In this paper, we use $F_{max}$ to denote the maximum F-measure.

Mean absolute error (MAE) directly calculates the average absolute difference between the map output by the model and the ground truth. MAE is computed as:

$$MAE = \frac{1}{H \times W}\sum_{x=1}^{H}\sum_{y=1}^{W} \left|S(x, y) - G(x, y)\right| \qquad (9)$$

where $W$ and $H$ represent the width and height of the map, respectively, $x$ and $y$ denote the coordinates of each pixel, $S$ refers to the predicted map and $G$ denotes the ground truth.

S-Measure (Fan et al., 2017) is the structural similarity measure, proposed by Fan et al. to evaluate non-binary foreground maps. In this paper, we employ the S-measure to assess the similarity between the predicted map and the ground-truth map. The S-measure is computed as:

$$S = \alpha S_o + (1 - \alpha) S_r \qquad (10)$$

According to Fan et al. (2017), we set $\alpha$ to 0.5. $S_r$ is the region-aware structural similarity, which captures object-part structure information, and $S_o$ denotes the object-aware structural similarity.

The S-measure is mainly used to measure structural similarity in non-binary foreground maps, while the E-measure (Fan et al., 2018b) is mainly used to measure image-level statistics and local pixel-matching information. The E-measure is defined as:

$$E = \frac{1}{W \times H}\sum_{x=1}^{W}\sum_{y=1}^{H} \phi_{FM}(x, y) \qquad (11)$$

where $H$ and $W$ are the height and the width of the predicted map, respectively, and $\phi_{FM}$ is the enhanced alignment matrix. In this paper, we use $E_{mean}$ to represent the mean E-measure.

The PR curve captures the relationship between precision and recall, so we also employ it to measure performance. The camouflage map is segmented by a set of fixed thresholds ranging from 0 to 255, and the PR curve is obtained by calculating the recall and precision scores at each threshold within the range.
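For reference, the following is a small NumPy sketch of two of the four metrics described above: MAE (Eq. (9)) and the maximum F-measure obtained by sweeping fixed thresholds with $\beta^{2} = 0.3$ (Eq. (8) and the PR-curve protocol). It assumes prediction and ground-truth maps normalized to [0, 1]; evaluation toolboxes may differ in threshold handling and border cases.

```python
import numpy as np


def mae(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean absolute error between a predicted map and the ground truth, Eq. (9).
    Both maps are assumed to be H x W arrays scaled to [0, 1]."""
    return float(np.abs(pred.astype(np.float64) - gt.astype(np.float64)).mean())


def max_f_measure(pred: np.ndarray, gt: np.ndarray,
                  beta2: float = 0.3, num_thresholds: int = 255) -> float:
    """Maximum F-measure over a sweep of fixed thresholds (Eq. (8) with beta^2 = 0.3)."""
    gt_bin = gt > 0.5
    best = 0.0
    for t in np.linspace(0.0, 1.0, num_thresholds):
        pred_bin = pred >= t
        tp = np.logical_and(pred_bin, gt_bin).sum()
        precision = tp / max(pred_bin.sum(), 1)
        recall = tp / max(gt_bin.sum(), 1)
        f = (beta2 + 1.0) * precision * recall / max(beta2 * precision + recall, 1e-8)
        best = max(best, float(f))
    return best
```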
Table 3
Ablation study of our proposed model on different architectures.
Backbone Bi-branch NCM HIT CHAMELEON CAMO COD10K
S↑ MAE↓ Emean↑ Fmax↑ S↑ MAE↓ Emean↑ Fmax↑ S↑ MAE↓ Emean↑ Fmax↑
√                0.526 0.272 0.432 0.335   0.511 0.291 0.428 0.356   0.529 0.219 0.457 0.294
√ √              0.799 0.072 0.772 0.744   0.690 0.128 0.669 0.630   0.742 0.062 0.735 0.632
√ √ √            0.858 0.048 0.860 0.811   0.759 0.098 0.770 0.715   0.776 0.049 0.793 0.680
√ √ √ √          0.874 0.041 0.891 0.834   0.780 0.088 0.803 0.730   0.790 0.046 0.817 0.675
Fig. 7. Some examples of failure cases. Because SINet achieved the best performance, we only list the comparison examples with SINet.

4.6. Failure cases

Although our model has achieved remarkable performance on the benchmark datasets, there are also some failure cases. As shown in Fig. 7, our model considers some objects in the background that are close to or highly similar to camouflaged objects as an integral part of the camouflaged objects. This may be because we only consider general complex scenarios and ignore the extremely complex ones. In future work, we will consider all scenarios and propose a more applicable model.

5. Conclusion

In this paper, we propose an end-to-end network with a double branch for COD. Our network consists of two main components, NCM and HIT. NCM simulates the process of visual exploration and improves feature reuse, which is beneficial to highlighting the boundary and position information of detected objects. As for HIT, convolutions with large and different dilation rates are used for feature extraction, and the generated features are further refined by connecting different levels. We combine the two components to improve the detection accuracy of COD tasks. We employ four metrics to compare the proposed scheme with 12 typical deep learning based models on three public benchmark datasets, and the results show that our model achieves the best performance. In addition, we have carried out a series of ablation experiments, and the experimental results also prove the effectiveness of our proposed model.

CRediT authorship contribution statement

Cong Zhang: Conceptualization, Methodology, Writing. Kang Wang: Software, Visualization. Hongbo Bi: Supervision. Ziqi Liu: Data curation, Writing – original draft. Lina Yang: Reviewing and editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

Arbelaez, P., Maire, M., Fowlkes, C.C., Malik, J., 2011. Contour detection and hierarchical image segmentation. 33 (5), pp. 898–916.
Bhajantri, N.U., Nagabhushan, P., 2007. Camouflage defect identification: A novel approach. In: International Conference on Information Technology (ICIT). pp. 145–148.
Borji, A., Cheng, M.M., Hou, Q., Jiang, H., Li, J., 2019. Salient object detection: A survey. 5 (2), pp. 117–150.
Borji, A., Itti, L., 2012. Exploiting local and global patch rarities for saliency detection. pp. 478–485.
Chen, Y., Han, C., Wang, N., Zhang, Z., 2019a. Revisiting feature alignment for one-stage object detection. arXiv preprint arXiv:1908.01570.
Chen, K., Pang, J., Wang, J., Xiong, Y., Li, X., Sun, S., Feng, W., Liu, Z., Shi, J., Ouyang, W., et al., 2019b. Hybrid task cascade for instance segmentation. pp. 4974–4983.
Chen, Z., Xu, Q., Cong, R., Huang, Q., 2020. Global context-aware progressive aggregation network for salient object detection. 34 (07), pp. 10599–10606.
Cheng, M.M., Mitra, N.J., Huang, X., Torr, P.H., Hu, S.M., 2014. Global contrast based salient region detection. 37 (3), pp. 569–582.
Chu, H.K., Hsu, W.H., Mitra, N.J., Cohen-Or, D., Lee, T.Y., 2010. Camouflage images. ACM Trans. Graph. 29 (4).
Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., Wei, Y., 2017. Deformable convolutional networks. pp. 764–773.
Fan, D.P., Cheng, M.M., Liu, J.J., Gao, S.H., Hou, Q., Borji, A., 2018a. Salient objects in clutter: Bringing salient object detection to the foreground. pp. 186–202.
Fan, D.P., Cheng, M.M., Liu, Y., Li, T., Borji, A., 2017. Structure-measure: A new way to evaluate foreground maps. pp. 4548–4557.
Fan, D.P., Gong, C., Cao, Y., Ren, B., Cheng, M.M., Borji, A., 2018b. Enhanced-alignment measure for binary foreground map evaluation. pp. 698–704.
Fan, D.P., Ji, G.P., Cheng, M.M., Shao, L., 2021. Concealed object detection. IEEE Trans. Pattern Anal. Mach. Intell.
Fan, D.P., Ji, G.P., Sun, G., Cheng, M.M., Shen, J., Shao, L., 2020a. Camouflaged object detection. pp. 2777–2787.
Fan, D.P., Ji, G.P., Zhou, T., Chen, G., Fu, H., Shen, J., Shao, L., 2020b. PraNet: Parallel reverse attention network for polyp segmentation. Med. Image Comput. Comput.-Assisted Interv. (MICCAI), pp. 263–273.
Fan, D.P., Zhou, T., Ji, G.P., Zhou, Y., Chen, G., Fu, H., Shen, J., Shao, L., 2020c. Inf-Net: Automatic COVID-19 lung infection segmentation from CT images. 39 (8), pp. 2626–2637.
Galun, M., Sharon, E., Basri, R., Brandt, A., 2003. Texture segmentation by multiscale aggregation of filter responses and shape elements. pp. 716–723.
Ge, S., Jin, X., Ye, Q., Luo, Z., Li, Q., 2018. Image editing by object-aware optimal boundary searching and mixed-domain composition. 4 (1), pp. 71–82.
He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition. pp. 770–778.
Hou, Q., Cheng, M., Hu, X., Borji, A., Tu, Z., Torr, P.H.S., 2017. Deeply supervised salient object detection with short connections. pp. 5300–5309.
Huang, Z., Huang, L., Gong, Y., Huang, C., Wang, X., 2019. Mask scoring R-CNN. pp. 6409–6418.
Huang, G., Liu, Z., van der Maaten, L., Weinberger, K.Q., 2017. Densely connected convolutional networks. pp. 4700–4708.
Itti, L., Koch, C., Niebur, E., 1998. A model of saliency-based visual attention for rapid scene analysis. 20 (11), pp. 1254–1259.
Klein, D.A., Frintrop, S., 2011. Center-surround divergence of feature statistics for salient object detection. pp. 2214–2219.
Law, H., Deng, J., 2018. CornerNet: Detecting objects as paired keypoints. pp. 734–750.
Le, T., Nguyen, T.V., Nie, Z., Tran, M., Sugimoto, A., 2019. Anabranch network for camouflaged object segmentation. Comput. Vis. Image Underst. (CVIU) 184, 45–56.
Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S., 2017. Feature pyramid networks for object detection. pp. 2117–2125.
Liu, N., Han, J., Yang, M., 2018a. PiCANet: Learning pixel-wise contextual attention for saliency detection. pp. 3089–3098.
Liu, J., Hou, Q., Cheng, M., Feng, J., Jiang, J., 2019. A simple pooling-based design for real-time salient object detection. pp. 3917–3926.
Liu, S., Huang, D., Wang, Y., 2018b. Receptive field block net for accurate and fast object detection. pp. 404–419.
Liu, T., Yuan, Z., Sun, J., Wang, J., Zheng, N., Tang, X., Shum, H.-Y., 2010. Learning to detect a salient object. 33 (2), pp. 353–367.
Pang, Y., Zhao, X., Zhang, L., Lu, H., 2020. Multi-scale interactive network for salient object detection.
Perazzi, F., Krähenbühl, P., Pritch, Y., Hornung, A., 2012. Saliency filters: Contrast based filtering for salient region detection. pp. 733–740.
Qin, X., Zhang, Z., Huang, C., Gao, C., Dehghan, M., Jagersand, M., 2019. BASNet: Boundary-aware salient object detection. pp. 7479–7489.
Rao, C.P., Reddy, A.G., Rao, C., 2020. Camouflaged object detection for machine vision applications. Int. J. Speech Technol. 23 (11).
Ren, S., He, K., Girshick, R., Sun, J., 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. pp. 91–99.
Shen, Y., Ji, R., Zhang, S., Zuo, W., Yan, W., 2018. Generative adversarial learning towards fast weakly supervised detection. pp. 5764–5773.
Song, L., Geng, W., 2010. A new camouflage texture evaluation method based on WSSIM and nature image features. pp. 1–4.
Tankus, A., Yeshurun, Y., 2001. Convexity-based visual camouflage breaking. Comput. Vis. Image Underst. (CVIU) 82 (3), 208–237.
Tian, Z., Shen, C., Chen, H., He, T., 2019. FCOS: Fully convolutional one-stage object detection. pp. 9627–9636.
Wang, W., Shen, J., 2018. Deep visual attention prediction. 27 (5), pp. 2368–2378.
Wang, W., Shen, J., Cheng, M., Shao, L., 2019a. An iterative and cooperative top-down and bottom-up inference network for salient object detection. pp. 5968–5977.
Wang, W., Shen, J., Dong, X., Borji, A., 2018. Salient object detection driven by fixation prediction. pp. 1711–1720.
Wang, W., Zhao, S., Shen, J., Hoi, S.C.H., Borji, A., 2019b. Salient object detection with pyramid attention and salient edges. pp. 1448–1457.
Wu, Y.H., Gao, S.H., Mei, J., Xu, J., Fan, D.P., Zhao, C.W., Cheng, M.M., 2020. JCS: An explainable COVID-19 diagnosis system by joint classification and segmentation. arXiv preprint arXiv:2004.07054.
Wu, Z., Su, L., Huang, Q., 2019a. Cascaded partial decoder for fast and accurate salient object detection. pp. 3907–3916.
Wu, Z., Su, L., Huang, Q., 2019b. Stacked cross refinement network for edge-aware salient object detection. pp. 7264–7273.
Xie, Y., Lu, H., Yang, M.H., 2012. Bayesian saliency via low and mid level cues. 22 (5), pp. 1689–1698.
Yan, J., Le, T.N., Nguyen, K.D., Tran, M.T., Do, T.T., Nguyen, T.V., 2021. MirrorNet: Bio-inspired camouflaged object segmentation. IEEE Access 9, 43290–43300.
Yu, F., Koltun, V., 2016. Multi-scale context aggregation by dilated convolutions.
Zhang, D., Han, J., Li, C., Wang, J., Li, X., 2016. Detection of co-salient objects by looking deep and wide. 120 (2), pp. 215–232.
Zhang, P., Liu, W., Lu, H., Shen, C., 2019. Salient object detection with lossless feature reflection and weighted structural loss. 28 (6), 3048–3060.
Zhang, P., Wang, D., Lu, H., Wang, H., Ruan, X., 2017. Amulet: Aggregating multi-level convolutional features for salient object detection. pp. 202–211.
Zhao, J., Liu, J.-J., Fan, D.P., Cao, Y., Yang, J., Cheng, M.M., 2019. EGNet: Edge guidance network for salient object detection. pp. 8779–8788.
Zhao, T., Wu, X., 2019. Pyramid feature attention network for saliency detection. pp. 3085–3094.
Zheng, Y., Zhang, X., Wang, F., Cao, T., Sun, M., Wang, X., 2019. Detection of people with camouflage pattern via dense deconvolution network. 26 (1), pp. 29–33.
Zhou, Z., Siddiquee, M.M.R., Tajbakhsh, N., Liang, J., 2018. UNet++: A nested U-Net architecture for medical image segmentation. In: Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support. Springer, pp. 3–11.
Zhou, X., Zhuo, J., Krahenbuhl, P., 2019. Bottom-up object detection by grouping extreme and center points. pp. 850–859.