Abstract—The state-of-the-art models for medical image segmentation are variants of U-Net and fully convolutional networks (FCN). Despite their success, these models have two limitations: (1) their optimal depth is a priori unknown, requiring extensive architecture search or an inefficient ensemble of models of varying depths; and (2) their skip connections impose an unnecessarily restrictive fusion scheme, forcing aggregation only at the same-scale feature maps of the encoder and decoder sub-networks. To overcome these two limitations, we propose UNet++, a new neural architecture for semantic and instance segmentation, by (1) alleviating the unknown network depth with an efficient ensemble of U-Nets of varying depths, which partially share an encoder and co-learn simultaneously using deep supervision; (2) redesigning skip connections to aggregate features of varying semantic scales at the decoder sub-networks, leading to a highly flexible feature fusion scheme; and (3) devising a pruning scheme to accelerate the inference speed of UNet++. We have evaluated UNet++ using six different medical image segmentation datasets, covering multiple imaging modalities such as computed tomography (CT), magnetic resonance imaging (MRI), and electron microscopy (EM), and demonstrate that (1) UNet++ consistently outperforms the baseline models for the task of semantic segmentation across different datasets and backbone architectures; (2) UNet++ enhances the segmentation quality of varying-size objects, an improvement over the fixed-depth U-Net; (3) Mask RCNN++ (Mask R-CNN with the UNet++ design) outperforms the original Mask R-CNN for the task of instance segmentation; and (4) pruned UNet++ models achieve significant speedup while showing only modest performance degradation. Our implementation and pre-trained models are available at https://github.com/MrGiovanni/UNetPlusPlus.

Index Terms—Neuronal Structure Segmentation, Liver Segmentation, Cell Segmentation, Nuclei Segmentation, Brain Tumor Segmentation, Lung Nodule Segmentation, Medical Image Segmentation, Semantic Segmentation, Instance Segmentation, Deep Supervision, Model Pruning.

Z. Zhou, N. Tajbakhsh, and J. Liang are with the Department of Biomedical Informatics, Arizona State University, Scottsdale, AZ 85259 USA (zongweiz@asu.edu; ntajbakh@asu.edu; jianming.liang@asu.edu). M. M. Rahman Siddiquee is with the School of Computing, Informatics, and Decision Systems Engineering, Arizona State University, Tempe, AZ 85281 USA (mrahmans@asu.edu).

I. INTRODUCTION

Encoder-decoder networks are widely used in modern semantic and instance segmentation models [1], [2], [3], [4], [5], [6]. Their success is largely attributed to their skip connections, which combine deep, semantic, coarse-grained feature maps from the decoder sub-network with shallow, low-level, fine-grained feature maps from the encoder sub-network, and have proven effective in recovering fine-grained details of the target objects [7], [8], [9] even on complex backgrounds [10], [11]. Skip connections have also played a key role in the success of instance-level segmentation models such as [12], [13], where the idea is to segment and distinguish each instance of the desired objects.

However, these encoder-decoder architectures for image segmentation come with two limitations. First, the optimal depth of an encoder-decoder network can vary from one application to another, depending on the task difficulty and the amount of labeled data available for training. A simple approach would be to train models of varying depths separately and then ensemble the resulting models at inference time [14], [15], [16]. However, this simple approach is inefficient from a deployment perspective, because these networks do not share a common encoder. Furthermore, being trained independently, these networks do not enjoy the benefits of multi-task learning [17], [18]. Second, the design of skip connections used in an encoder-decoder network is unnecessarily restrictive, demanding the fusion of same-scale encoder and decoder feature maps. While striking as a natural design, the same-scale feature maps from the decoder and encoder networks are semantically dissimilar, and no solid theory guarantees that they are the best match for feature fusion.

In this paper, we present UNet++, a new general-purpose image segmentation architecture that aims at overcoming the above limitations. As presented in Fig. 1(g), UNet++ consists of U-Nets of varying depths whose decoders are densely connected at the same resolution via the redesigned skip pathways. The architectural changes introduced in UNet++ enable the following advantages. First, UNet++ is not prone to the choice of network depth because it embeds U-Nets of varying depths in its architecture. All these U-Nets partially share an encoder, while their decoders are intertwined. By training UNet++ with deep supervision, all the constituent U-Nets are trained simultaneously while benefiting from a shared image representation. This design not only improves the overall segmentation performance, but also enables model pruning at inference time. Second, UNet++ is not handicapped by unnecessarily restrictive skip connections, where only the same-scale feature maps from the encoder and decoder can be fused. The redesigned skip connections introduced in UNet++ present feature maps of varying scales at a decoder node, allowing the aggregation layer to decide how the various feature maps carried along the skip connections should be fused with the decoder feature maps.
Fig. 1: Evolution from U-Net to UNet++. Each node in the graph represents a convolution block, downward arrows indicate down-sampling, upward arrows indicate up-sampling, and dotted arrows indicate skip connections. (a–d) U-Nets of varying depths. (e) Ensemble architecture U-Net^e, which combines U-Nets of varying depths into one unified architecture. All U-Nets (partially) share the same encoder, but have their own decoders. (f) UNet+ is constructed by connecting the decoders of U-Net^e, enabling the deeper decoders to send supervision signals to the shallower decoders. (g) UNet++ is constructed by adding dense skip connections to UNet+, enabling dense feature propagation along skip connections and thus more flexible feature fusion at the decoder nodes. As a result, each node in the UNet++ decoders, from a horizontal perspective, combines multiscale features from all its preceding nodes at the same resolution, and from a vertical perspective, integrates multiscale features across different resolutions from its preceding node, as formulated in Eq. 1. This multiscale feature aggregation of UNet++ gradually synthesizes the segmentation, leading to increased accuracy and faster convergence, as evidenced by our empirical results in Section IV. Note that explicit deep supervision is required (bold links) to train U-Net^e but optional (pale links) for UNet+ and UNet++.
The redesigned skip connections are realized in UNet++ by densely connecting the decoders of the constituent U-Nets at the same resolution. We have extensively evaluated UNet++ across six segmentation datasets and multiple backbones of different depths. Our results demonstrate that UNet++, powered by redesigned skip connections and deep supervision, enables a significantly higher level of performance for both semantic and instance segmentation. This significant improvement of UNet++ over the classical U-Net architecture is ascribed to the advantages offered by the redesigned skip connections and the extended decoders, which together enable gradual aggregation of the image features across the network, both horizontally and vertically.

In summary, we make the following five contributions:
1) We introduce a built-in ensemble of U-Nets of varying depths in UNet++, enabling improved segmentation performance for varying-size objects, an improvement over the fixed-depth U-Net (see Section II-B).
2) We redesign skip connections in UNet++, enabling flexible feature fusion in decoders, an improvement over the restrictive skip connections in U-Net that require fusion of only same-scale feature maps (see Section II-B).
3) We devise a scheme to prune a trained UNet++, accelerating its inference speed while maintaining its performance (see Section IV-C).
4) We discover that simultaneously training multi-depth U-Nets embedded within the UNet++ architecture stimulates collaborative learning among the constituent U-Nets, leading to much better performance than individually training isolated U-Nets of the same architecture (see Section IV-D and Section V-C).
5) We demonstrate the extensibility of UNet++ to multiple backbone encoders and further its applicability to various medical imaging modalities including CT, MRI, and electron microscopy (see Section IV-A and Section IV-B).

II. PROPOSED NETWORK ARCHITECTURE: UNET++

Fig. 1 shows how UNet++ evolves from the original U-Net. In the following, we first trace this evolution, motivating the need for UNet++, and then explain its technical and implementation details.

A. Motivation behind the new architecture

We have conducted a comprehensive ablation study to investigate the performance of U-Nets of varying depths (Fig. 1(a–d)). For this purpose, we have used three relatively small datasets, namely Cell, EM, and Brain Tumor (detailed in Section III-A). Table I summarizes the results. For cell and brain tumor segmentation, a shallower network (U-Net L3) outperforms the deep U-Net. For the EM dataset, on the other hand, the deeper U-Nets consistently outperform the shallower counterparts, but the performance gain is only marginal. Our experimental results suggest two key findings: 1) deeper U-Nets are not necessarily always better; 2) the optimal depth of the architecture depends on the difficulty and size of the dataset.
Fig. 2: Training UNet++ with deep supervision makes segmentation results available at multiple nodes X^{0,j}, enabling architecture pruning at inference time. Taking the segmentation result from X^{0,4} leads to no pruning, UNet++ L4, whereas taking the segmentation result from X^{0,1} results in a maximally pruned architecture, UNet++ L1. Note that nodes removed during pruning are colored in gray.
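Because each deep-supervision head yields a full-resolution segmentation map, the output node chosen at inference determines which part of the graph must be computed. As a minimal sketch (not the authors' released implementation; the node indexing (i, j) follows Fig. 1(g)), the nodes X^{i,j} required to produce the output at X^{0,L} can be enumerated as follows:

```python
def required_nodes(prune_level):
    """Return the set of UNet++ nodes X^{i,j} needed to produce the
    segmentation output at X^{0, prune_level} (i.e., UNet++ L{prune_level}).

    A node X^{i,j} feeds X^{0, prune_level} only if i + j <= prune_level,
    so pruning to level L discards every node beyond that anti-diagonal.
    """
    return {(i, j) for i in range(prune_level + 1)
                   for j in range(prune_level + 1 - i)}

# UNet++ L1 (maximally pruned) needs only X^{0,0}, X^{1,0}, and X^{0,1};
# UNet++ L4 (no pruning) needs all fifteen nodes of the full architecture.
print(sorted(required_nodes(1)))   # [(0, 0), (0, 1), (1, 0)]
print(len(required_nodes(4)))      # 15
```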
The ensemble architecture U-Net^e, however, suffers from two drawbacks. First, the decoders are disconnected: deeper U-Nets do not offer a supervision signal to the decoders of the shallower U-Nets in the ensemble. Second, the common design of skip connections used in U-Net^e is unnecessarily restrictive, requiring the network to combine the decoder feature maps with only the same-scale feature maps from the encoder. While striking as a natural design, there is no guarantee that the same-scale feature maps are the best match for feature fusion. To overcome the above limitations, we remove long skip connections.

In Eq. 1, the function H(·) is a convolution operation followed by an activation function, D(·) and U(·) denote a down-sampling layer and an up-sampling layer respectively, and [ ] denotes the concatenation layer. Basically, as shown in Fig. 1(g), nodes at level j = 0 receive only one input, from the previous layer of the encoder; nodes at level j = 1 receive two inputs, both from the encoder sub-network but at two consecutive levels; and nodes at level j > 1 receive j + 1 inputs, of which j inputs are the outputs of the previous j nodes along the same skip connection and the (j+1)-th input is the up-sampled output from
the lower skip connection. All prior feature maps accumulate and arrive at the current node because we use a dense convolution block along each skip connection.
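Putting this description into formula form, the skip-pathway computation referred to as Eq. 1 can be sketched as follows; this is a reconstruction consistent with the notation above rather than a verbatim copy, where x^{i,j} denotes the output of node X^{i,j}, i indexes the down-sampling level along the encoder, and j indexes the convolution block along the skip pathway:

\[
x^{i,j} =
\begin{cases}
H\!\left(D\!\left(x^{i-1,j}\right)\right), & j = 0,\\[4pt]
H\!\left(\Big[\big[x^{i,k}\big]_{k=0}^{j-1},\ U\!\left(x^{i+1,j-1}\right)\Big]\right), & j > 0,
\end{cases}
\]

so each node concatenates the outputs of all earlier nodes on its skip pathway with the up-sampled output of the node below it, and then applies the convolution block H.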
2) Deep supervision: We introduce deep supervision in UNet++. For this purpose, we append a 1×1 convolution with C kernels followed by a sigmoid activation function to the outputs of nodes X^{0,1}, X^{0,2}, X^{0,3}, and X^{0,4}, where C is the number of classes observed in the given dataset. We then define a hybrid segmentation loss, consisting of a pixel-wise cross-entropy loss and a soft Dice-coefficient loss, for each semantic scale. The hybrid loss takes advantage of what both loss functions have to offer: a smooth gradient and the handling of class imbalance [28], [29]. Mathematically, the hybrid loss is defined as:
\[
\mathcal{L}(Y, P) = -\frac{1}{N}\sum_{c=1}^{C}\sum_{n=1}^{N}\left(y_{n,c}\,\log p_{n,c} + \frac{2\,y_{n,c}\,p_{n,c}}{y_{n,c}^{2} + p_{n,c}^{2}}\right) \tag{2}
\]
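As a minimal sketch of how Eq. 2 can be evaluated for the deep-supervision heads (assuming NumPy arrays of sigmoid probabilities, one output per node X^{0,1}, ..., X^{0,4}, and a simple unweighted average over heads, which is our assumption rather than the paper's exact weighting):

```python
import numpy as np

def hybrid_loss(y, p, eps=1e-7):
    """Eq. 2: pixel-wise cross-entropy plus a soft Dice term.

    y : one-hot targets of shape (N, C) for N pixels and C classes.
    p : predicted probabilities of the same shape (sigmoid outputs).
    """
    ce = y * np.log(p + eps)                     # cross-entropy term
    dice = (2.0 * y * p) / (y**2 + p**2 + eps)   # soft Dice term
    return -np.sum(ce + dice) / y.shape[0]       # -(1/N) * double sum

def deep_supervision_loss(y, outputs):
    """Average the hybrid loss over the outputs of X^{0,1}..X^{0,4}."""
    return np.mean([hybrid_loss(y, p) for p in outputs])
```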
TABLE IV: Semantic segmentation results measured by IoU (mean±s.d. %) for U-Net, wide U-Net, UNet+ (our intermediate proposal), and UNet++ (our final proposal). Both UNet+ and UNet++ are evaluated with and without deep supervision (DS). We have performed an independent two-sample t-test between U-Net [5] and the others over 20 independent trials, and highlight boxes in red when the differences are statistically significant (p < 0.05).

2D Application:
| Architecture | DS | Params | EM         | Cell       | Nuclei     | Brain Tumor† | Liver      |
| U-Net [5]    | ✗  | 7.8M   | 88.30±0.24 | 88.73±1.64 | 90.57±1.26 | 89.21±1.55   | 79.90±1.38 |
| wide U-Net   | ✗  | 9.1M   | 88.37±0.13 | 88.91±1.43 | 90.47±1.15 | 89.35±1.49   | 80.25±1.31 |
| UNet+        | ✗  | 8.7M   | 88.39±0.15 | 90.71±1.25 | 91.73±1.09 | 90.70±0.91   | 79.62±1.20 |
| UNet+        | ✓  | 8.7M   | 88.89±0.12 | 91.18±1.13 | 92.04±0.89 | 91.15±0.65   | 82.83±0.92 |
| UNet++       | ✗  | 9.0M   | 88.92±0.14 | 91.03±1.34 | 92.44±1.20 | 90.86±0.81   | 82.51±1.29 |
| UNet++       | ✓  | 9.0M   | 89.33±0.10 | 91.21±0.98 | 92.37±0.98 | 91.21±0.68   | 82.60±1.11 |

3D Application:
| Architecture | DS | Params | Lung Nodule |
| V-Net [28]   | ✗  | 22.6M  | 71.17±4.53  |
| wide V-Net   | ✗  | 27.0M  | 73.12±3.99  |
| VNet+        | ✗  | 25.3M  | 75.93±2.93  |
| VNet+        | ✓  | 25.3M  | 76.72±2.48  |
| VNet++       | ✗  | 26.2M  | 76.24±3.11  |
| VNet++       | ✓  | 26.2M  | 77.05±2.42  |

† The winner in BraTS-2013 holds a "complete" Dice of 92% vs. 90.83%±2.46% (our UNet++ with deep supervision).
Fig. 4: Comparison between U-Net, UNet+, and UNet++ when applied to state-of-the-art backbones for the tasks of neuronal structure, cell, nuclei, brain tumor, and liver segmentation. UNet++, trained with deep supervision, consistently outperforms U-Net across all backbone architectures and applications under study. By densely connecting the intermediate layers, UNet++ also yields higher segmentation performance than UNet+ in most experimental configurations. The error bars represent the 95% confidence interval, and the number of asterisks (∗) on the bracket indicates the level of significance measured by the p-value ("n.s." stands for "not statistically significant").
Fig. 8: Visualization and comparison of feature maps from early, intermediate, and late layers along the topmost skip connection for brain tumor images. Here, the dotted arrows denote the plain skip connections in U-Net and UNet+, while the dashed arrows denote the dense connections introduced in UNet++.
Although GridNet contains multiple streams with different resolutions, it lacks up-sampling layers between skip connections and thus does not represent UNet++. Full-resolution residual networks (FRRN) [46] employ a two-stream system, where full-resolution information is carried in one stream and context information in the other (pooling) stream. In [47], two improved versions of FRRN are proposed, i.e., incremental MRRN with 28.6M parameters and dense MRRN with 25.5M parameters. These 2D architectures, however, have a number of parameters similar to our 3D VNet++ and three times more parameters than 2D UNet++; thus, naively upgrading these architectures to 3D may not be practical for common 3D volumetric medical imaging applications. We would like to note that our redesigned dense skip connections are completely different from those used in MRRN, which consists of a common residual stream. Also, it is not flexible to apply the MRRN design to other backbone encoders and meta-frameworks such as Mask R-CNN [12]. DLA [48] (Deep Layer Aggregation, a simultaneous but independent work published in CVPR 2018), topologically equivalent to our intermediate architecture UNet+ (Fig. 1(f)), sequentially connects feature maps of the same resolution, without the long skip connections used in U-Net. Our experimental results demonstrate that by densely connecting the layers, UNet++ achieves higher segmentation performance than UNet+/DLA (see Table IV).

C. Deep supervision

He et al. [8] suggested that the depth of a network can act as a regularizer. Lee et al. [27] demonstrated that deeply supervised layers can improve the learning ability of the hidden layers, enforcing the intermediate layers to learn discriminative features and enabling fast convergence and regularization of the network [26]. DenseNet [9] performs a similar deep supervision in an implicit fashion. Deep supervision can be used in U-Net-like architectures as well. Dou et al. [49] introduce deep supervision by combining predictions from feature maps of varying resolutions, suggesting that it can combat potential optimization difficulties and thus reach a faster convergence rate and more powerful discrimination capability. Zhu et al. [50] used eight additional deeply supervised layers in their proposed architecture. Our nested networks are, however, more amenable to training under deep supervision: 1) multiple decoders automatically generate full-resolution segmentation maps; 2) U-Nets of various depths are embedded in the network, so that it grasps multi-resolution features; 3) densely connected feature maps help smooth the gradient flow and yield relatively consistent prediction masks; 4) the high-dimensional features affect every output through back-propagation, allowing us to prune the network in the inference phase.

D. Our previous work

We first presented UNet++ in our DLMIA 2018 paper [51]. UNet++ has since been quickly adopted by the research community, either as a strong baseline for comparison [52], [53], [54], [55], or as a source of inspiration for developing newer semantic segmentation architectures [56], [57], [58], [59], [60], [61]; it has also been utilized in multiple applications, such as segmenting objects in biomedical images [62], [63], natural images [64], and satellite images [65], [66]. Recently, Shenoy [67] has independently and systematically investigated UNet++ for contact map prediction (the PconsC4 model), demonstrating a significant improvement over the widely used U-Net.

Nevertheless, to further strengthen UNet++ on our own, the current work presents several extensions to our previous work: (1) we present a comprehensive study on network depth, motivating the need for the proposed architecture (Section II-A); (2) we compare the embedded training schemes with the isolated ones at various levels of pruned UNet++, and discover that training embedded U-Nets of multiple depths leads to better performance than training them individually
in isolation (Section IV-D); (3) we strengthen our experiments by including a new magnetic resonance imaging (MRI) dataset for brain tumor segmentation (Section IV); (4) we demonstrate the effectiveness of UNet++ in Mask R-CNN, resulting in a more powerful model, namely Mask RCNN++ (Section IV-B); (5) we investigate the extensibility of UNet++ to multiple advanced encoder backbones for semantic segmentation (Section IV-A); (6) we study the effectiveness of UNet++ in segmenting lesions of varying sizes (Section V-A); and (7) we visualize the feature propagation along the redesigned skip connections to explain the performance (Section V-B).

VII. CONCLUSION

We have presented a novel architecture, named UNet++, for more accurate image segmentation. The improved performance of UNet++ is attributed to its nested structure and redesigned skip connections, which aim to address two key challenges of the U-Net: 1) the unknown depth of the optimal architecture and 2) the unnecessarily restrictive design of its skip connections. We have evaluated UNet++ using six distinct biomedical imaging applications and demonstrated consistent performance improvement over various state-of-the-art backbones for semantic segmentation and meta-frameworks for instance segmentation.

ACKNOWLEDGMENTS

This research has been supported partially by ASU and Mayo Clinic through a Seed Grant and an Innovation Grant, and partially by the NIH under Award Number R01HL128785. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH. We thank Mohammad Reza Hosseinzadeh Taher and Fatemeh Haghighi for their verification of the liver segmentation performance and the ablation study of embedded and isolated UNet++. We also thank Michael G. Meyer for allowing us to test our ideas on the Cell-CT dataset. The content of this paper is covered by US patents pending.

REFERENCES

[1] S. K. Zhou, H. Greenspan, and D. Shen, Deep Learning for Medical Image Analysis. Academic Press, 2017.
[2] D. Shen, G. Wu, and H.-I. Suk, "Deep learning in medical image analysis," Annual Review of Biomedical Engineering, vol. 19, pp. 221–248, 2017.
[3] G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. van der Laak, B. van Ginneken, and C. I. Sánchez, "A survey on deep learning in medical image analysis," Medical Image Analysis, vol. 42, pp. 60–88, 2017.
[4] G. Chartrand, P. M. Cheng, E. Vorontsov, M. Drozdzal, S. Turcotte, C. J. Pal, S. Kadoury, and A. Tang, "Deep learning: a primer for radiologists," Radiographics, vol. 37, no. 7, pp. 2113–2131, 2017.
[5] T. Falk, D. Mai, R. Bensch, Ö. Çiçek, A. Abdulkadir, Y. Marrakchi, A. Böhm, J. Deubner, Z. Jäckel, K. Seiwald et al., "U-Net: deep learning for cell counting, detection, and morphometry," Nature Methods, p. 1, 2018.
[6] N. Tajbakhsh, L. Jeyaseelan, Q. Li, J. Chiang, Z. Wu, and X. Ding, "Embracing imperfect datasets: A review of deep learning solutions for medical image segmentation," arXiv preprint arXiv:1908.10454, 2019.
[7] M. Drozdzal, E. Vorontsov, G. Chartrand, S. Kadoury, and C. Pal, "The importance of skip connections in biomedical image segmentation," in Deep Learning and Data Labeling for Medical Applications. Springer, 2016, pp. 179–187.
[8] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[9] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten, "Densely connected convolutional networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, no. 2, 2017, p. 3.
[10] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik, "Hypercolumns for object segmentation and fine-grained localization," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 447–456.
[11] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature pyramid networks for object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, no. 2, 2017, p. 4.
[12] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in Proceedings of the IEEE International Conference on Computer Vision. IEEE, 2017, pp. 2980–2988.
[13] R. Hu, P. Dollár, K. He, T. Darrell, and R. Girshick, "Learning to segment every thing," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4233–4241.
[14] T. G. Dietterich, "Ensemble methods in machine learning," in International Workshop on Multiple Classifier Systems. Springer, 2000, pp. 1–15.
[15] S. Hoo-Chang, H. R. Roth, M. Gao, L. Lu, Z. Xu, I. Nogues, J. Yao, D. Mollura, and R. M. Summers, "Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning," IEEE Transactions on Medical Imaging, vol. 35, no. 5, p. 1285, 2016.
[16] F. Ciompi, B. de Hoop, S. J. van Riel, K. Chung, E. T. Scholten, M. Oudkerk, P. A. de Jong, M. Prokop, and B. van Ginneken, "Automatic classification of pulmonary peri-fissural nodules in computed tomography using an ensemble of 2D views and a convolutional neural network out-of-the-box," Medical Image Analysis, vol. 26, no. 1, pp. 195–202, 2015.
[17] Y. Bengio et al., "Learning deep architectures for AI," Foundations and Trends in Machine Learning, vol. 2, no. 1, pp. 1–127, 2009.
[18] Y. Zhang and Q. Yang, "A survey on multi-task learning," arXiv preprint arXiv:1707.08114, 2017.
[19] C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L.-J. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy, "Progressive neural architecture search," in Proceedings of the European Conference on Computer Vision, 2018, pp. 19–34.
[20] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, "Learning transferable architectures for scalable image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8697–8710.
[21] C. Liu, L.-C. Chen, F. Schroff, H. Adam, W. Hua, A. L. Yuille, and L. Fei-Fei, "Auto-DeepLab: Hierarchical neural architecture search for semantic image segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 82–92.
[22] Y. Zhang, Z. Qiu, J. Liu, T. Yao, D. Liu, and T. Mei, "Customizable architecture search for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 11641–11650.
[23] X. Li, Y. Zhou, Z. Pan, and J. Feng, "Partial order pruning: for best speed/accuracy trade-off in neural architecture search," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 9145–9153.
[24] S. Xie and Z. Tu, "Holistically-nested edge detection," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1395–1403.
[25] H. Chen, X. J. Qi, J. Z. Cheng, and P. A. Heng, "Deep contextual networks for neuronal structure segmentation," in Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[26] Q. Dou, L. Yu, H. Chen, Y. Jin, X. Yang, J. Qin, and P.-A. Heng, "3D deeply supervised network for automated segmentation of volumetric medical images," Medical Image Analysis, vol. 41, pp. 40–54, 2017.
[27] C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu, "Deeply-supervised nets," in Artificial Intelligence and Statistics, 2015, pp. 562–570.
[28] F. Milletari, N. Navab, and S.-A. Ahmadi, "V-Net: Fully convolutional neural networks for volumetric medical image segmentation," in 2016 Fourth International Conference on 3D Vision (3DV). IEEE, 2016, pp. 565–571.
[29] C. H. Sudre, W. Li, T. Vercauteren, S. Ourselin, and M. J. Cardoso, "Generalised Dice overlap as a deep learning loss function for highly unbalanced segmentations," in Deep Learning in Medical Image Analysis
and Multimodal Learning for Clinical Decision Support. Springer, 2017, pp. 240–248.
[30] A. Cardona, S. Saalfeld, S. Preibisch, B. Schmid, A. Cheng, J. Pulokas, P. Tomancak, and V. Hartenstein, "An integrated micro- and macroarchitectural analysis of the drosophila brain by computer-assisted serial section electron microscopy," PLoS Biology, vol. 8, no. 10, p. e1000502, 2010.
[31] M. G. Meyer, J. W. Hayenga, T. Neumann, R. Katdare, C. Presley, D. E. Steinhauer, T. M. Bell, C. A. Lancaster, and A. C. Nelson, "The Cell-CT 3-dimensional cell imaging technology platform enables the detection of lung cancer using the noninvasive LuCED sputum test," Cancer Cytopathology, vol. 123, no. 9, pp. 512–523, 2015.
[32] B. H. Menze, A. Jakab, S. Bauer, J. Kalpathy-Cramer, K. Farahani, J. Kirby, Y. Burren, N. Porz, J. Slotboom, R. Wiest et al., "The multimodal brain tumor image segmentation benchmark (BRATS)," IEEE Transactions on Medical Imaging, vol. 34, no. 10, p. 1993, 2015.
[33] S. G. Armato III, G. McLennan, L. Bidaut, M. F. McNitt-Gray, C. R. Meyer, A. P. Reeves, B. Zhao, D. R. Aberle, C. I. Henschke, E. A. Hoffman et al., "The lung image database consortium (LIDC) and image database resource initiative (IDRI): a completed reference database of lung nodules on CT scans," Medical Physics, vol. 38, no. 2, pp. 915–931, 2011.
[34] M. Kistler, S. Bonaretti, M. Pfahrer, R. Niklaus, and P. Büchler, "The virtual skeleton database: an open access repository for biomedical research and collaboration," Journal of Medical Internet Research, vol. 15, no. 11, p. e245, 2013.
[35] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234–241.
[36] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[37] G. Song and W. Chai, "Collaborative learning for deep neural networks," in Neural Information Processing Systems (NeurIPS), 2018.
[38] G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," arXiv preprint arXiv:1503.02531, 2015.
[39] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
[40] A. Chaurasia and E. Culurciello, "LinkNet: Exploiting encoder representations for efficient semantic segmentation," in 2017 IEEE Visual Communications and Image Processing (VCIP). IEEE, 2017, pp. 1–4.
[41] G. Lin, A. Milan, C. Shen, and I. D. Reid, "RefineNet: Multi-path refinement networks for high-resolution semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, no. 2, 2017, p. 5.
[42] H. Zhao, X. Qi, X. Shen, J. Shi, and J. Jia, "ICNet for real-time semantic segmentation on high-resolution images," in Proceedings of the European Conference on Computer Vision, 2018, pp. 405–420.
[43] N. Tajbakhsh, B. Lai, S. Ananth, and X. Ding, "ErrorNet: Learning error representations from limited data to improve vascular segmentation," arXiv preprint arXiv:1910.04814, 2019.
[44] K. He, X. Zhang, S. Ren, and J. Sun, "Identity mappings in deep residual networks," in Proceedings of the European Conference on Computer Vision. Springer, 2016, pp. 630–645.
[45] D. Fourure, R. Emonet, E. Fromont, D. Muselet, A. Trémeau, and C. Wolf, "Residual conv-deconv grid network for semantic segmentation," in Proceedings of the British Machine Vision Conference, 2017.
[46] T. Pohlen, A. Hermans, M. Mathias, and B. Leibe, "Full-resolution residual networks for semantic segmentation in street scenes," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4151–4160.
[47] J. Jiang, Y.-C. Hu, C.-J. Liu, D. Halpenny, M. D. Hellmann, J. O. Deasy, G. Mageras, and H. Veeraraghavan, "Multiple resolution residually connected feature streams for automatic lung tumor segmentation from CT images," IEEE Transactions on Medical Imaging, vol. 38, no. 1, pp. 134–144, 2019.
[48] F. Yu, D. Wang, E. Shelhamer, and T. Darrell, "Deep layer aggregation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2018, pp. 2403–2412.
[49] Q. Dou, H. Chen, Y. Jin, L. Yu, J. Qin, and P.-A. Heng, "3D deeply supervised network for automatic liver segmentation from CT volumes," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2016, pp. 149–157.
[50] Q. Zhu, B. Du, B. Turkbey, P. L. Choyke, and P. Yan, "Deeply-supervised CNN for prostate segmentation," in International Joint Conference on Neural Networks (IJCNN). IEEE, 2017, pp. 178–184.
[51] Z. Zhou, M. M. R. Siddiquee, N. Tajbakhsh, and J. Liang, "UNet++: A nested U-Net architecture for medical image segmentation," in Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support. Springer, 2018, pp. 3–11.
[52] K. Sun, Y. Zhao, B. Jiang, T. Cheng, B. Xiao, D. Liu, Y. Mu, X. Wang, W. Liu, and J. Wang, "High-resolution representations for labeling pixels and regions," CoRR, vol. abs/1904.04514, 2019.
[53] Y. Fang, C. Chen, Y. Yuan, and K.-y. Tong, "Selective feature aggregation network with area-boundary constraints for polyp segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2019, pp. 302–310.
[54] J. Fang, Y. Zhang, K. Xie, S. Yuan, and Q. Chen, "An improved MPB-CNN segmentation method for edema area and neurosensory retinal detachment in SD-OCT images," in International Workshop on Ophthalmic Medical Image Analysis. Springer, 2019, pp. 130–138.
[55] C. Meng, K. Sun, S. Guan, Q. Wang, R. Zong, and L. Liu, "Multiscale dense convolutional neural network for DSA cerebrovascular segmentation," Neurocomputing, vol. 373, pp. 123–134, 2020.
[56] J. Zhang, Y. Jin, J. Xu, X. Xu, and Y. Zhang, "MDU-Net: Multi-scale densely connected U-Net for biomedical image segmentation," arXiv preprint arXiv:1812.00352, 2018.
[57] F. Chen, Y. Ding, Z. Wu, D. Wu, and J. Wen, "An improved framework called DU++ applied to brain tumor segmentation," in 2018 15th International Computer Conference on Wavelet Active Media Technology and Information Processing (ICCWAMTIP). IEEE, 2018, pp. 85–88.
[58] C. Zhou, S. Chen, C. Ding, and D. Tao, "Learning contextual and attentive information for brain tumor segmentation," in International MICCAI Brainlesion Workshop. Springer, 2018, pp. 497–507.
[59] S. Wu, Z. Wang, C. Liu, C. Zhu, S. Wu, and K. Xiao, "Automatical segmentation of pelvic organs after hysterectomy by using dilated convolution U-Net++," in 2019 IEEE 19th International Conference on Software Quality, Reliability and Security Companion (QRS-C). IEEE, 2019, pp. 362–367.
[60] T. Song, F. Meng, A. Rodríguez-Patón, P. Li, P. Zheng, and X. Wang, "U-Next: A novel convolution neural network with an aggregation U-Net architecture for gallstone segmentation in CT images," IEEE Access, vol. 7, pp. 166823–166832, 2019.
[61] C. Yang and F. Gao, "EDA-Net: Dense aggregation of deep and shallow information achieves quantitative photoacoustic blood oxygenation imaging deep in human breast," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2019, pp. 246–254.
[62] V. Zyuzin and T. Chumarnaya, "Comparison of U-Net architectures for segmentation of the left ventricle endocardial border on two-dimensional ultrasound images," in 2019 Ural Symposium on Biomedical Engineering, Radioelectronics and Information Technology (USBEREIT). IEEE, 2019, pp. 110–113.
[63] H. Cui, X. Liu, and N. Huang, "Pulmonary vessel segmentation based on orthogonal fused U-Net++ of chest CT images," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2019, pp. 293–300.
[64] K. Sun, B. Xiao, D. Liu, and J. Wang, "Deep high-resolution representation learning for human pose estimation," in Proceedings of the IEEE International Conference on Computer Vision, 2019.
[65] D. Peng, Y. Zhang, and H. Guan, "End-to-end change detection for high resolution satellite images using improved UNet++," Remote Sensing, vol. 11, no. 11, p. 1382, 2019.
[66] Y. Zhang, W. Gong, J. Sun, and W. Li, "Web-Net: A novel nest networks with ultra-hierarchical sampling for building extraction from aerial imageries," Remote Sensing, vol. 11, no. 16, p. 1897, 2019.
[67] A. A. Shenoy, "Feature optimization of contact map predictions based on inter-residue distances and U-Net++ architecture."
APPENDIX A
ADDITIONAL MEASUREMENTS

APPENDIX B
LEARNING CURVES