
UNet++: Redesigning Skip Connections to Exploit Multiscale Features in Image Segmentation
Zongwei Zhou, Member, IEEE, Md Mahfuzur Rahman Siddiquee, Member, IEEE,
Nima Tajbakhsh, Member, IEEE, and Jianming Liang, Senior Member, IEEE

Z. Zhou, N. Tajbakhsh, and J. Liang are with the Department of Biomedical Informatics, Arizona State University, Scottsdale, AZ 85259 USA (zongweiz@asu.edu; ntajbakh@asu.edu; jianming.liang@asu.edu).
M. M. Rahman Siddiquee is with the School of Computing, Informatics, and Decision Systems Engineering, Arizona State University, Tempe, AZ 85281 USA (mrahmans@asu.edu).

Abstract—The state-of-the-art models for medical image segmentation are variants of U-Net and fully convolutional networks (FCN). Despite their success, these models have two limitations: (1) their optimal depth is a priori unknown, requiring extensive architecture search or an inefficient ensemble of models of varying depths; and (2) their skip connections impose an unnecessarily restrictive fusion scheme, forcing aggregation only at the same-scale feature maps of the encoder and decoder sub-networks. To overcome these two limitations, we propose UNet++, a new neural architecture for semantic and instance segmentation, by (1) alleviating the unknown network depth with an efficient ensemble of U-Nets of varying depths, which partially share an encoder and co-learn simultaneously using deep supervision; (2) redesigning skip connections to aggregate features of varying semantic scales at the decoder sub-networks, leading to a highly flexible feature fusion scheme; and (3) devising a pruning scheme to accelerate the inference speed of UNet++. We have evaluated UNet++ using six different medical image segmentation datasets, covering multiple imaging modalities such as computed tomography (CT), magnetic resonance imaging (MRI), and electron microscopy (EM), and demonstrating that (1) UNet++ consistently outperforms the baseline models for the task of semantic segmentation across different datasets and backbone architectures; (2) UNet++ enhances segmentation quality of varying-size objects—an improvement over the fixed-depth U-Net; (3) Mask RCNN++ (Mask R-CNN with UNet++ design) outperforms the original Mask R-CNN for the task of instance segmentation; and (4) pruned UNet++ models achieve significant speedup while showing only modest performance degradation. Our implementation and pre-trained models are available at https://github.com/MrGiovanni/UNetPlusPlus.

Index Terms—Neuronal Structure Segmentation, Liver Segmentation, Cell Segmentation, Nuclei Segmentation, Brain Tumor Segmentation, Lung Nodule Segmentation, Medical Image Segmentation, Semantic Segmentation, Instance Segmentation, Deep Supervision, Model Pruning.

I. INTRODUCTION

Encoder-decoder networks are widely used in modern semantic and instance segmentation models [1], [2], [3], [4], [5], [6]. Their success is largely attributed to their skip connections, which combine deep, semantic, coarse-grained feature maps from the decoder sub-network with shallow, low-level, fine-grained feature maps from the encoder sub-network, and have proven to be effective in recovering fine-grained details of the target objects [7], [8], [9] even on complex backgrounds [10], [11]. Skip connections have also played a key role in the success of instance-level segmentation models such as [12], [13], where the idea is to segment and distinguish each instance of the desired objects.

However, these encoder-decoder architectures for image segmentation come with two limitations. First, the optimal depth of an encoder-decoder network can vary from one application to another, depending on the task difficulty and the amount of labeled data available for training. A simple approach would be to train models of varying depths separately and then ensemble the resulting models at inference time [14], [15], [16]. However, this simple approach is inefficient from a deployment perspective, because these networks do not share a common encoder. Furthermore, being trained independently, these networks do not enjoy the benefits of multi-task learning [17], [18]. Second, the design of skip connections used in an encoder-decoder network is unnecessarily restrictive, demanding the fusion of same-scale encoder and decoder feature maps. While striking as a natural design, the same-scale feature maps from the decoder and encoder networks are semantically dissimilar, and no solid theory guarantees that they are the best match for feature fusion.

In this paper, we present UNet++, a new general-purpose image segmentation architecture that aims at overcoming the above limitations. As presented in Fig. 1(g), UNet++ consists of U-Nets of varying depths whose decoders are densely connected at the same resolution via the redesigned skip pathways. The architectural changes introduced in UNet++ enable the following advantages. First, UNet++ is not prone to the choice of network depth because it embeds U-Nets of varying depths in its architecture. All these U-Nets partially share an encoder, while their decoders are intertwined. By training UNet++ with deep supervision, all the constituent U-Nets are trained simultaneously while benefiting from a shared image representation. This design not only improves the overall segmentation performance, but also enables model pruning at inference time. Second, UNet++ is not handicapped by unnecessarily restrictive skip connections where only the same-scale feature maps from the encoder and decoder can be fused. The redesigned skip connections introduced in UNet++ present feature maps of varying scales at a decoder node, allowing the aggregation layer to decide how the various feature maps carried along the skip connections should be fused with the decoder feature maps. The redesigned skip connections are realized in UNet++ by densely connecting the decoders of the constituent U-Nets at the same resolution.


Fig. 1: Evolution from U-Net to UNet++. Each node in the graph represents a convolution block, downward arrows indicate down-sampling, upward arrows indicate up-sampling, and dotted arrows indicate skip connections. (a-d) U-Nets of varying depths. (e) Ensemble architecture, U-Net^e, which combines U-Nets of varying depths into one unified architecture. All U-Nets (partially) share the same encoder, but have their own decoders. (f) UNet+ is constructed by connecting the decoders of U-Net^e, enabling the deeper decoders to send supervision signals to the shallower decoders. (g) UNet++ is constructed by adding dense skip connections to UNet+, enabling dense feature propagation along skip connections and thus more flexible feature fusion at the decoder nodes. As a result, each node in the UNet++ decoders, from a horizontal perspective, combines multiscale features from all its preceding nodes at the same resolution and, from a vertical perspective, integrates multiscale features across different resolutions from its preceding node, as formulated in Eq. 1. This multiscale feature aggregation of UNet++ gradually synthesizes the segmentation, leading to increased accuracy and faster convergence, as evidenced by our empirical results in Section IV. Note that explicit deep supervision is required (bold links) to train U-Net^e but is optional (pale links) for UNet+ and UNet++.

We have extensively evaluated UNet++ across six segmentation datasets and multiple backbones of different depths. Our results demonstrate that UNet++, powered by the redesigned skip connections and deep supervision, enables a significantly higher level of performance for both semantic and instance segmentation. This significant improvement of UNet++ over the classical U-Net architecture is ascribed to the advantages offered by the redesigned skip connections and the extended decoders, which together enable gradual aggregation of the image features across the network, both horizontally and vertically.

In summary, we make the following five contributions:

1) We introduce a built-in ensemble of U-Nets of varying depths in UNet++, enabling improved segmentation performance for varying-size objects—an improvement over the fixed-depth U-Net (see Section II-B).
2) We redesign skip connections in UNet++, enabling flexible feature fusion in decoders—an improvement over the restrictive skip connections in U-Net that require fusion of only same-scale feature maps (see Section II-B).
3) We devise a scheme to prune a trained UNet++, accelerating its inference speed while maintaining its performance (see Section IV-C).
4) We discover that simultaneously training multi-depth U-Nets embedded within the UNet++ architecture stimulates collaborative learning among the constituent U-Nets, leading to much better performance than individually training isolated U-Nets of the same architecture (see Section IV-D and Section V-C).
5) We demonstrate the extensibility of UNet++ to multiple backbone encoders and further its applicability to various medical imaging modalities including CT, MRI, and electron microscopy (see Section IV-A and Section IV-B).

II. PROPOSED NETWORK ARCHITECTURE: UNET++

Fig. 1 shows how UNet++ evolves from the original U-Net. In the following, we first trace this evolution, motivating the need for UNet++, and then explain its technical and implementation details.

A. Motivation behind the new architecture

We have done a comprehensive ablation study to investigate the performance of U-Nets of varying depths (Fig. 1(a-d)). For this purpose, we have used three relatively small datasets, namely Cell, EM, and Brain Tumor (detailed in Section III-A). Table I summarizes the results. For cell and brain tumor segmentation, a shallower network (U-Net L^3) outperforms the deep U-Net. For the EM dataset, on the other hand, the deeper U-Nets consistently outperform the shallower counterparts, but the performance gain is only marginal. Our experimental results suggest two key findings: 1) deeper U-Nets are not necessarily always better, and 2) the optimal depth of the architecture depends on the difficulty and size of the dataset at hand.


Fig. 2: Training UNet++ with deep supervision makes segmentation results available at multiple nodes X^{0,j}, enabling architecture pruning at inference time. Taking the segmentation result from X^{0,4} leads to no pruning, UNet++ L^4, whereas taking the segmentation result from X^{0,1} results in a maximally pruned architecture, UNet++ L^1. Note that nodes removed during pruning are colored in gray.

TABLE I: Ablation study on U-Nets of varying depths alongside the new variants of U-Net proposed in this work. U-Net L^d refers to a U-Net with a depth of d (Fig. 1(a-d)). U-Net^e, UNet+, and UNet++ are the new variants of U-Net, which are depicted in Fig. 1(e-g). "DS" denotes deeply supervised training followed by average voting. Intersection over union (IoU) is used as the metric for comparison (mean±s.d. %).

Architecture   DS   Params   EM           Cell         Brain Tumor
U-Net L^1      no   0.1M     86.83±0.43   88.58±1.68   86.90±2.25
U-Net L^2      no   0.5M     87.59±0.34   89.39±1.64   88.71±1.45
U-Net L^3      no   1.9M     88.16±0.29   90.14±1.57   89.62±1.41
U-Net (L^4)    no   7.8M     88.30±0.24   88.73±1.64   89.21±1.55
U-Net^e        yes  8.7M     88.33±0.23   90.72±1.51   90.19±0.83
UNet+          no   8.7M     88.39±0.15   90.71±1.25   90.70±0.91
UNet+          yes  8.7M     88.89±0.12   91.18±1.13   91.15±0.65
UNet++         no   9.0M     88.92±0.14   91.03±1.34   90.86±0.81
UNet++         yes  9.0M     89.33±0.10   91.21±0.98   91.21±0.68

While these findings may encourage an automated neural architecture search, such an approach is hindered by the limited computational resources [19], [20], [21], [22], [23]. Alternatively, we propose an ensemble architecture, which combines U-Nets of varying depths into one unified structure. We refer to this architecture as U-Net^e (Fig. 1(e)). We train U-Net^e by defining a separate loss function for each U-Net in the ensemble, i.e., at X^{0,j}, j ∈ {1, 2, 3, 4}. Our deep supervision scheme differs from the deep supervision commonly used in image classification and image segmentation networks; in [24], [25], [26], [27] the auxiliary loss functions are added to the nodes along the decoder network, i.e., X^{4-j,j}, j ∈ {0, 1, 2, 3, 4}, whereas we apply them on X^{0,j}, j ∈ {1, 2, 3, 4}. At inference time, the output from each U-Net in the ensemble is averaged.

The ensemble architecture (U-Net^e) outlined above benefits from knowledge sharing, because all U-Nets within the ensemble partially share the same encoder even though they have their own decoders. However, this architecture still suffers from two drawbacks. First, the decoders are disconnected—deeper U-Nets do not offer a supervision signal to the decoders of the shallower U-Nets in the ensemble. Second, the common design of skip connections used in U-Net^e is unnecessarily restrictive, requiring the network to combine the decoder feature maps with only the same-scale feature maps from the encoder. While striking as a natural design, there is no guarantee that the same-scale feature maps are the best match for feature fusion.

To overcome the above limitations, we remove the long skip connections from U-Net^e and connect every two adjacent nodes in the ensemble, resulting in a new architecture, which we refer to as UNet+ (Fig. 1(f)). Owing to the new connectivity scheme, UNet+ connects the disjoint decoders, enabling gradient back-propagation from the deeper decoders to the shallower counterparts. UNet+ further relaxes the unnecessarily restrictive behaviour of skip connections by presenting each node in the decoders with the aggregation of all feature maps computed in the shallower stream. While using aggregated feature maps at a decoder node is far less restrictive than having only the same-scale feature map from the encoder, there is still room for improvement. We further propose to use dense connectivity in UNet+, resulting in our final architecture proposal, which we refer to as UNet++ (Fig. 1(g)). With dense connectivity, each node in a decoder is presented with not only the final aggregated feature maps but also the intermediate aggregated feature maps and the original same-scale feature maps from the encoder. As such, the aggregation layer in the decoder node may learn to use only the same-scale encoder feature maps or to use all collected feature maps available at the gate. Unlike U-Net^e, deep supervision is not required for UNet+ and UNet++; however, as we will describe later, deep supervision enables model pruning at inference time, leading to a significant speedup with only a modest drop in performance.

B. Technical details

1) Network connectivity: Let x^{i,j} denote the output of node X^{i,j}, where i indexes the down-sampling layer along the encoder and j indexes the convolution layer of the dense block along the skip connection. The stack of feature maps represented by x^{i,j} is computed as

x^{i,j} =
\begin{cases}
H\big(D(x^{i-1,j})\big), & j = 0 \\
H\Big(\big[\,[x^{i,k}]_{k=0}^{j-1},\; U(x^{i+1,j-1})\,\big]\Big), & j > 0
\end{cases}
\quad (1)

where the function H(·) is a convolution operation followed by an activation function, D(·) and U(·) denote a down-sampling layer and an up-sampling layer respectively, and [ ] denotes the concatenation layer. Basically, as shown in Fig. 1(g), nodes at level j = 0 receive only one input from the previous layer of the encoder; nodes at level j = 1 receive two inputs, both from the encoder sub-network but at two consecutive levels; and nodes at level j > 1 receive j + 1 inputs, of which j inputs are the outputs of the previous j nodes in the same skip connection and the (j+1)-th input is the up-sampled output from
the lower skip connection. The reason that all prior feature maps accumulate and arrive at the current node is because we make use of a dense convolution block along each skip connection.
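To make the recurrence in Eq. 1 concrete, the sketch below builds the grid of UNet++ nodes with tf.keras layers. It is a minimal illustration under our own assumptions—the helper names (conv_block, build_unetpp_grid), the two-convolution block, and the fixed filter counts are ours—and only the connectivity pattern follows Eq. 1; it is not the released implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x, filters):
    # H(.): convolution(s) followed by an activation (two 3x3 convs assumed here)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return layers.Conv2D(filters, 3, padding="same", activation="relu")(x)

def build_unetpp_grid(inputs, depth=4, filters=(32, 64, 128, 256, 512)):
    """Returns a dict x[(i, j)] holding the output of every node X^{i,j} (Eq. 1)."""
    x = {}
    # j = 0: the plain encoder column, x^{i,0} = H(D(x^{i-1,0}))
    x[(0, 0)] = conv_block(inputs, filters[0])
    for i in range(1, depth + 1):
        x[(i, 0)] = conv_block(layers.MaxPooling2D(2)(x[(i - 1, 0)]), filters[i])
    # j > 0: each node concatenates all same-resolution predecessors x^{i,0..j-1}
    # with the up-sampled output U(x^{i+1,j-1}) of the node below it.
    for j in range(1, depth + 1):
        for i in range(depth + 1 - j):
            up = layers.UpSampling2D(2)(x[(i + 1, j - 1)])
            x[(i, j)] = conv_block(
                layers.Concatenate()([x[(i, k)] for k in range(j)] + [up]),
                filters[i])
    return x
```

Feeding an input tensor whose spatial size is divisible by 2^4 yields the nodes X^{0,1} through X^{0,4}, which the deep supervision scheme below equips with 1×1 segmentation heads.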
2) Deep supervision: We introduce deep supervision in UNet++. For this purpose, we append a 1×1 convolution with C kernels followed by a sigmoid activation function to the outputs of nodes X^{0,1}, X^{0,2}, X^{0,3}, and X^{0,4}, where C is the number of classes observed in the given dataset. We then define a hybrid segmentation loss consisting of a pixel-wise cross-entropy loss and a soft Dice-coefficient loss for each semantic scale. The hybrid loss may take advantage of what both loss functions have to offer: smooth gradients and handling of class imbalance [28], [29]. Mathematically, the hybrid loss is defined as:

L(Y, P) = -\frac{1}{N} \sum_{c=1}^{C} \sum_{n=1}^{N} \left( y_{n,c} \log p_{n,c} + \frac{2\, y_{n,c}\, p_{n,c}}{y_{n,c}^{2} + p_{n,c}^{2}} \right) \quad (2)

where y_{n,c} ∈ Y and p_{n,c} ∈ P denote the target label and predicted probability for class c and the n-th pixel in the batch, and N indicates the number of pixels within one batch. The overall loss function for UNet++ is then defined as the weighted summation of the hybrid losses from the individual decoders: L = \sum_{i=1}^{d} \eta_i \cdot L(Y, P^i), where i indexes the decoder. In the experiments, we give the same balanced weight η_i to each loss, i.e., η_i ≡ 1, and do not process the ground truth differently for the different supervised outputs (e.g., with Gaussian blur).
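As a sanity check on Eq. 2, the snippet below expresses the hybrid loss and the deeply supervised total loss with plain TensorFlow operations. It is a sketch under our own assumptions (flattened (N, C) tensors and an eps term to avoid log(0) and division by zero); the released code may differ in reductions and smoothing constants.

```python
import tensorflow as tf

def hybrid_loss(y_true, y_pred, eps=1e-7):
    # y_true, y_pred: (N, C) target labels and predicted probabilities (Eq. 2)
    y_pred = tf.clip_by_value(y_pred, eps, 1.0 - eps)
    n = tf.cast(tf.shape(y_true)[0], y_pred.dtype)
    ce   = y_true * tf.math.log(y_pred)                                   # y log p
    dice = 2.0 * y_true * y_pred / (tf.square(y_true) + tf.square(y_pred) + eps)
    return -tf.reduce_sum(ce + dice) / n

def total_loss(y_true, side_outputs, etas=None):
    # Weighted sum of the hybrid loss over the d supervised outputs P^1..P^d
    etas = etas or [1.0] * len(side_outputs)
    return tf.add_n([eta * hybrid_loss(y_true, p)
                     for eta, p in zip(etas, side_outputs)])
```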
3) Model pruning: Deep supervision enables model pruning. Owing to deep supervision, UNet++ can be deployed in two operation modes: 1) ensemble mode, where the segmentation results from all segmentation branches are collected and then averaged, and 2) pruned mode, where the segmentation output is selected from only one of the segmentation branches, the choice of which determines the extent of model pruning and speed gain. Fig. 2 shows how the choice of the segmentation branch results in pruned architectures of varying complexity. Specifically, taking the segmentation result from X^{0,4} leads to no pruning, whereas taking the segmentation result from X^{0,1} leads to maximal pruning of the network.
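In a Keras implementation, pruned mode can be realized by rebuilding a model whose output is one of the supervised heads, so that the deeper, now-unreachable decoder nodes simply drop out of the inference graph. The snippet below is a hypothetical sketch: it assumes the four heads were given names such as "seg_head_1" ... "seg_head_4" when the full model was built.

```python
import tensorflow as tf

def prune_unetpp(full_model, depth):
    """Keep only the branch ending at X^{0,depth} (UNet++ L^depth)."""
    head = full_model.get_layer(f"seg_head_{depth}").output  # assumed layer name
    return tf.keras.Model(inputs=full_model.input, outputs=head)

# Example: UNet++ L^3 keeps most of the accuracy at a fraction of the cost (Fig. 5).
# pruned_l3 = prune_unetpp(full_model, depth=3)
```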
Fig. 3: Qualitative comparison among U-Net, wide U-Net, and UNet++, showing segmentation results for our six distinct biomedical image segmentation applications. They include various 2D and 3D modalities. The corresponding quantitative scores are provided at the bottom of each prediction (IoU | Dice).

III. EXPERIMENTS

A. Datasets

Table II summarizes the six biomedical image segmentation datasets used in this study, covering lesions/organs from the most commonly used medical imaging modalities, including microscopy, computed tomography (CT), and magnetic resonance imaging (MRI).

TABLE II: Summary of biomedical image segmentation datasets used in our experiments (see Section III-A for details).

Application    Images   Input Size   Modality     Provider
EM             30       96×96        microscopy   ISBI 2012 [30]
Cell           354      96×96        Cell-CT      VisionGate [31]
Nuclei         670      96×96        mixed        Data Science Bowl
Brain Tumor    66,348   256×256      MRI          BraTS2013 [32]
Liver          331      96×96        CT           MICCAI 2017 LiTS
Lung Nodule    1,012    64×64×64     CT           LIDC-IDRI [33]

1) Electron Microscopy (EM): The dataset is provided by the EM segmentation challenge [30] as a part of ISBI 2012. It consists of 30 images (512×512 pixels) from serial section transmission electron microscopy of the Drosophila first instar larva ventral nerve cord (VNC). Referring to the example in Fig. 3, each image comes with a corresponding fully annotated ground truth segmentation map for cells (white) and membranes (black). The labeled images are split into training (24 images), validation (3 images), and test (3 images) sets. Both training and inference are done on 96×96 patches, which are chosen to overlap by half of the patch size via sliding windows. Specifically, during inference, we aggregate predictions across patches by voting in the overlapping areas.

2) Cell: The dataset is acquired with a Cell-CT imaging system [31]. Two trained experts manually segmented the collected images, so each image in the dataset comes with two binary cell masks. For our experiments, we select a subset of 354 images that have the highest level of agreement between
the two expert annotators. The selected images are then split into training (212 images), validation (70 images), and test (72 images) subsets.

3) Nuclei: The dataset is provided by the Data Science Bowl 2018 segmentation challenge and consists of 670 segmented nuclei images from different modalities (brightfield vs. fluorescence). This is the only dataset used in this work with instance-level annotation, where each nucleus is marked in a different color. Images are randomly assigned into a training set (50%), a validation set (20%), and a test set (30%). We then use a sliding window mechanism to extract 96×96 patches from the images, with a 32-pixel stride for training and validating the model, and with a 1-pixel stride for testing.

4) Brain Tumor: The dataset is provided by BraTS 2013 [32], [34]. To ease the comparison with other approaches, the models are trained using 20 High-grade (HG) and 10 Low-grade (LG) cases with Flair, T1, T1c, and T2 scans of MR images from all patients, resulting in a total of 66,348 slices. We further pre-process the dataset by re-scaling the slices to 256×256. Finally, the 30 patients available in the dataset are randomly assigned into five folds, each having images from six patients. We then randomly assign these five folds into a training set (3 folds), a validation set (1 fold), and a test set (1 fold). The ground truth segmentation has four different labels: necrosis, edema, non-enhancing tumor, and enhancing tumor. Following BraTS-2013, the "complete" evaluation is done by considering all four labels as the positive class and others as the negative class.

5) Liver: The dataset is provided by the MICCAI 2017 LiTS Challenge and consists of 331 CT scans, which we split into training (100 patients), validation (15 patients), and test (15 patients) subsets. The ground truth segmentation provides two different labels: liver and lesion. For our experiments, we only consider liver as the positive class and others as the negative class.

6) Lung Nodule: The dataset is provided by the Lung Image Database Consortium image collection (LIDC-IDRI) [33] and consists of 1,018 cases collected by seven academic centers and eight medical imaging companies. Six cases with ground truth issues were identified and removed. The remaining cases were split into training (510), validation (100), and test (408) sets. Each case is a 3D CT scan, and the nodules have been marked as volumetric binary masks. We have re-sampled the volumes to 1-1-1 spacing and then extracted a 64×64×64 crop around each nodule. These 3D crops are used for model training and evaluation.
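Several of the 2D datasets above are consumed as overlapping patches (96×96 windows with half-overlap for EM, a 32-pixel stride for Nuclei training, and a 1-pixel stride for testing). A generic sliding-window extractor in this spirit is sketched below with NumPy; the function name and the lack of border padding are our own choices, not taken from the released code.

```python
import numpy as np

def extract_patches(image, patch=96, stride=48):
    """Slide a patch x patch window over a 2D (or 2D + channel) image with a given stride."""
    h, w = image.shape[:2]
    patches, coords = [], []
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            patches.append(image[y:y + patch, x:x + patch])
            coords.append((y, x))
    return np.stack(patches), coords

# During inference, per-patch predictions can be pasted back at their coordinates and
# averaged (voted) wherever windows overlap, as done for the EM dataset.
```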
B. Baselines and implementation

For comparison, we use the original U-Net [35] and a customized wide U-Net architecture for the 2D segmentation tasks, and V-Net [28] and a customized wide V-Net architecture for the 3D segmentation tasks. We choose U-Net (or V-Net for 3D) because it is a common performance baseline for image segmentation. We have also designed a wide U-Net (or wide V-Net for 3D) with a similar number of parameters to our suggested architecture. This is to ensure that the performance gain yielded by our architecture is not simply due to an increased number of parameters. Table III details the U-Net and wide U-Net architectures. We have further compared the performance of UNet++ against UNet+, which is our intermediate architecture proposal. The numbers of kernels in the intermediate nodes are also given in Table III.

Our experiments are implemented in Keras with the TensorFlow backend. We use an early-stopping mechanism on the validation set to avoid over-fitting and evaluate the results using the Dice coefficient and Intersection over Union (IoU). Alternative measurement metrics, such as pixel-wise sensitivity, specificity, F1, and F2 scores, along with the statistical analysis, can be found in Appendix Section A. Adam is used as the optimizer with a learning rate of 3e-4. Both UNet+ and UNet++ are constructed from the original U-Net architecture. All the experiments are performed using three NVIDIA TITAN X (Pascal) GPUs with 12 GB of memory each.

TABLE III: Details of the architectures used in our study. Wider versions of U-Net and V-Net are designed to have a comparable number of parameters to UNet++ and VNet++.

Architecture   Params   X^{0,0}/X^{0,4}   X^{1,0}/X^{1,3}   X^{2,0}/X^{2,2}   X^{3,0}/X^{3,1}   X^{4,0}
U-Net          7.8M     32                64                128               256               512
wide U-Net     9.1M     35                70                140               280               560
V-Net          22.6M    32                64                128               256               512
wide V-Net     27.0M    35                70                140               280               560

Architecture   Params   X^{0,0-4}         X^{1,0-3}         X^{2,0-2}         X^{3,0-1}         X^{4,0}
UNet+          8.7M     32                64                128               256               512
UNet++         9.0M     32                64                128               256               512
VNet+          25.3M    32                64                128               256               512
VNet++         26.2M    32                64                128               256               512
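The training setup described above maps onto a short Keras configuration. The sketch below is illustrative only: the IoU metric implementation and the callback settings are our assumptions; only the optimizer, the 3e-4 learning rate, and early stopping on the validation set come from the text.

```python
import tensorflow as tf

def iou_metric(y_true, y_pred, thr=0.5, eps=1e-7):
    # Binarized Intersection over Union, the primary metric reported in Tables I and IV
    y_pred = tf.cast(y_pred > thr, tf.float32)
    inter = tf.reduce_sum(y_true * y_pred)
    union = tf.reduce_sum(y_true) + tf.reduce_sum(y_pred) - inter
    return (inter + eps) / (union + eps)

# model and hybrid_loss are as sketched in Section II-B; the data tensors are placeholders.
# model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=3e-4),
#               loss=hybrid_loss, metrics=[iou_metric])
# early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
#                                               restore_best_weights=True)
# model.fit(x_train, y_train, validation_data=(x_val, y_val), callbacks=[early_stop])
```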


IV. RESULTS

A. Semantic segmentation results

Table IV compares U-Net, wide U-Net, UNet+, and UNet++ in terms of the number of parameters and the segmentation results measured by IoU (mean±s.d.) for the six segmentation tasks under study. As seen, wide U-Net consistently outperforms U-Net. This improvement is attributed to the larger number of parameters in wide U-Net. UNet++ without deep supervision achieves a significant IoU gain over both U-Net and wide U-Net for all six tasks: neuronal structure (↑0.62±0.10, ↑0.55±0.01), cell (↑2.30±0.30, ↑2.12±0.09), nuclei (↑1.87±0.06, ↑1.71±0.06), brain tumor (↑2.00±0.87, ↑1.86±0.81), liver (↑2.62±0.09, ↑2.26±0.02), and lung nodule (↑5.06±1.42, ↑3.12±0.88) segmentation. Using deep supervision and average voting further improves UNet++, increasing the IoU by up to 0.8 points. Specifically, neuronal structure and lung nodule segmentation benefit the most from deep supervision because the target objects appear at varying scales in EM and CT slices. Deep supervision, however, is at best only marginally effective for the other datasets. Fig. 3 depicts a qualitative comparison between the results of U-Net, wide U-Net, and UNet++.

TABLE IV: Semantic segmentation results measured by IoU (mean±s.d. %) for U-Net, wide U-Net, UNet+ (our intermediate proposal), and UNet++ (our final proposal). Both UNet+ and UNet++ are evaluated with and without deep supervision (DS). We have performed an independent two-sample t-test between U-Net [5] and the others over 20 independent trials and highlighted boxes in red when the differences are statistically significant (p < 0.05).

2D Application
Architecture   DS   Params   EM           Cell         Nuclei       Brain Tumor†   Liver
U-Net [5]      no   7.8M     88.30±0.24   88.73±1.64   90.57±1.26   89.21±1.55     79.90±1.38
wide U-Net     no   9.1M     88.37±0.13   88.91±1.43   90.47±1.15   89.35±1.49     80.25±1.31
UNet+          no   8.7M     88.39±0.15   90.71±1.25   91.73±1.09   90.70±0.91     79.62±1.20
UNet+          yes  8.7M     88.89±0.12   91.18±1.13   92.04±0.89   91.15±0.65     82.83±0.92
UNet++         no   9.0M     88.92±0.14   91.03±1.34   92.44±1.20   90.86±0.81     82.51±1.29
UNet++         yes  9.0M     89.33±0.10   91.21±0.98   92.37±0.98   91.21±0.68     82.60±1.11

3D Application
Architecture   DS   Params   Lung Nodule
V-Net [28]     no   22.6M    71.17±4.53
wide V-Net     no   27.0M    73.12±3.99
VNet+          no   25.3M    75.93±2.93
VNet+          yes  25.3M    76.72±2.48
VNet++         no   26.2M    76.24±3.11
VNet++         yes  26.2M    77.05±2.42

† The winner in BraTS-2013 holds a "complete" Dice of 92% vs. 90.83%±2.46% (our UNet++ with deep supervision).

We have further investigated the extensibility of UNet++ for semantic segmentation by applying the redesigned skip connections to an array of modern CNN architectures: vgg-19 [36], resnet-152 [8], and densenet-201 [9]. Specifically, we have turned each architecture above into a U-Net model by adding a decoder sub-network, and then replaced the plain skip connections of U-Net with the redesigned connections of UNet++. For comparison, we have also trained U-Net and UNet+ with the aforementioned backbone architectures. For a comprehensive comparison, we have used the EM, Cell, Nuclei, Brain Tumor, and Liver segmentation datasets. As seen in Fig. 4, UNet++ consistently outperforms U-Net and UNet+ across all backbone architectures and applications under study. Through 20 trials, we further present a statistical analysis based on the independent two-sample t-test on each pair among U-Net, UNet+, and UNet++. Our results suggest that UNet++ is an effective, backbone-agnostic extension to U-Net. To facilitate reproducibility and model reuse, we have released the implementation of U-Net, UNet+, and UNet++ for various traditional and modern backbone architectures (project page: https://github.com/MrGiovanni/UNetPlusPlus).

Fig. 4: Comparison between U-Net, UNet+, and UNet++ when applied to the state-of-the-art backbones for the tasks of neuronal
structure, cell, nuclei, brain tumor, and liver segmentation. UNet++, trained with deep supervision, consistently outperforms U-Net across all
backbone architectures and applications under study. By densely connecting the intermediate layers, UNet++ also yields higher segmentation
performance than UNet+ in most experimental configurations. The error bars represent the 95% confidence interval and the number of ∗ on
the bridge indicates the level of significance measured by p-value (“n.s.” stands for “not statistically significant”).
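The significance annotations in Table IV and Fig. 4 come from independent two-sample t-tests over 20 trials per configuration. A sketch of how such a comparison could be computed with SciPy is shown below; the star-coding thresholds are a common convention, not necessarily the one used for the figure.

```python
from scipy import stats

def significance_stars(p):
    # Common star coding: * p<0.05, ** p<0.01, *** p<0.001; "n.s." otherwise
    for stars, thr in (("***", 1e-3), ("**", 1e-2), ("*", 5e-2)):
        if p < thr:
            return stars
    return "n.s."

def compare_models(ious_a, ious_b):
    """Independent two-sample t-test between per-trial IoU scores of two models."""
    t, p = stats.ttest_ind(ious_a, ious_b)
    return t, p, significance_stars(p)

# Example with 20 hypothetical trials per model:
# t, p, stars = compare_models(unet_ious, unetpp_ious)
```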

B. Instance segmentation results

Instance segmentation consists in segmenting and distinguishing all object instances and is hence more challenging than semantic segmentation. We use Mask R-CNN [12] as the baseline model for instance segmentation. Mask R-CNN utilizes a feature pyramid network (FPN) as the backbone to generate object proposals at multiple scales, and then outputs the segmentation masks for the collected proposals via a dedicated segmentation branch. We modify Mask R-CNN by replacing the plain skip connections of the FPN with the redesigned skip connections of UNet++. We refer to this model as Mask RCNN++. We use resnet101 as the backbone for Mask R-CNN in our experiments.

TABLE V: Redesigned skip connections improve both semantic and instance segmentation for the task of nuclei segmentation. We use Mask R-CNN for instance segmentation and U-Net for semantic segmentation in this comparison.

Architecture       Backbone    IoU     Dice    Score
U-Net              resnet101   91.03   75.73   0.244
UNet++             resnet101   92.55   89.74   0.327
Mask R-CNN [12]    resnet101   93.28   87.91   0.401
Mask RCNN++†       resnet101   95.10   91.36   0.414
† Mask R-CNN with UNet++ design in its feature pyramid.


Fig. 5: Complexity (size ∝ parameters), inference time, and IoU of UNet++ under different levels of pruning. The inference time is calculated as the time taken to process 10K test images on a single NVIDIA TITAN X (Pascal) GPU with 12 GB of memory.
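The inference-time axis of Fig. 5 can be reproduced with a simple wall-clock measurement around batched prediction; the sketch below shows one way to do it and is not the authors' benchmarking code (the batch size and warm-up pass are our choices).

```python
import time

def measure_inference_time(model, images, batch_size=32):
    """Wall-clock time to predict on a stack of test images (e.g., 10K patches)."""
    model.predict(images[:batch_size], batch_size=batch_size)   # warm-up run
    start = time.perf_counter()
    model.predict(images, batch_size=batch_size)
    return time.perf_counter() - start

# Pruned models (Section IV-C) can be compared by timing each of UNet++ L^1 .. L^4.
```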
Fig. 6: We demonstrate that our architectural design improves the performance of each shallower network embedded in UNet++. The embedded shallower networks show improved segmentation when pruned from UNet++ in comparison with the same network trained in isolation. Due to no pruning, UNet++ L^4 naturally achieves the same level of performance in the isolated and embedded training modes.

Table V compares the performance of Mask R-CNN and Mask RCNN++ for nuclei segmentation. We have chosen the Nuclei dataset because multiple nucleus instances can be present in an image, in which case each instance is annotated in a different color and thus marked as a distinct object. Therefore, this dataset is amenable to both semantic segmentation, where all nuclei instances are treated as the foreground class, and instance segmentation, where each individual nucleus is to be segmented separately. As seen in Table V, Mask RCNN++ outperforms its original counterpart, achieving a 1.82-point increase in IoU (93.28% to 95.10%), a 3.45-point increase in Dice (87.91% to 91.36%), and a 0.013-point increase in the leaderboard score (0.401 to 0.414). To put this performance in perspective, we have also trained U-Net and UNet++ models for semantic segmentation with a resnet101 backbone. As seen in Table V, the Mask R-CNN models achieve higher segmentation performance than the semantic segmentation models. Furthermore, as expected, UNet++ outperforms U-Net for semantic segmentation.

C. Model pruning

Once UNet++ is trained, the decoder path for depth d at inference time is completely independent of the decoder path for depth d + 1. As a result, we can completely remove the decoder for depth d + 1, obtaining a shallower version of the trained UNet++ at depth d, owing to the introduced deep supervision. This pruning can significantly reduce the inference time, but segmentation performance may degrade. As such, the level of pruning should be determined by evaluating the model's performance on the validation set. We have studied the inference speed-IoU trade-off for UNet++ in Fig. 5. We use UNet++ L^d to denote UNet++ pruned at depth d (see Fig. 2 for further details). As seen, UNet++ L^3 achieves on average a 32.2% reduction in inference time and a 75.6% reduction in memory footprint while degrading IoU by only 0.6 points. More aggressive pruning further reduces the inference time but at the cost of significant IoU degradation. More importantly, this observation has the potential to exert an important impact on computer-aided diagnosis (CAD) on mobile devices, as the existing deep convolutional neural network models are computationally expensive and memory intensive.

D. Embedded vs. isolated training of pruned models

In theory, UNet++ L^d can be trained in two fashions: 1) embedded training, where the full UNet++ model is trained and then pruned at depth d to obtain UNet++ L^d, and 2) isolated training, where UNet++ L^d is trained in isolation without any interactions with the deeper encoder and decoder nodes. Referring to Fig. 2, embedded training of a sub-network consists of training all graph nodes (both yellow and grey components) with deep supervision, but we then use only the yellow sub-network at inference time. In contrast, isolated training consists of removing the grey nodes from the graph, basing the training and testing solely on the yellow sub-network.

We have compared the isolated and embedded training schemes for various levels of UNet++ pruning across two datasets in Fig. 6. We have discovered that the embedded training of UNet++ L^d results in a higher performing model than training the same architecture in isolation. The observed superiority is more pronounced under aggressive pruning, when the full UNet++ is pruned to UNet++ L^1. In particular, the embedded training of UNet++ L^1 for liver segmentation achieves a 5-point increase in IoU over the isolated training scheme. This finding suggests that the supervision signal coming from the deep downstream nodes enables training higher performing shallower models. This finding is also related to knowledge distillation, where the knowledge learned by a deep teacher network is learned by a shallower student network.

V. DISCUSSIONS

A. Performance analysis on stratified lesion sizes

Fig. 7 compares U-Net and UNet++ for segmenting different sizes of brain tumors. To avoid clutter in the figure, we group the tumors by size into seven buckets. As seen, UNet++ consistently outperforms U-Net across all the buckets. We also adopt a t-test on each bucket based on 20 different trials
to measure the significance of the improvement, concluding that 5 out of the 7 comparisons are statistically significant (p < 0.05). The capability of UNet++ in segmenting tumors of varying sizes is attributed to its built-in ensemble of U-Nets, which enables image segmentation based on multi-receptive-field networks.

Fig. 7: UNet++ can better segment tumors of various sizes than does U-Net. We measure the size of tumors based on the ground truth masks and then divide them into seven groups. The histogram shows the distribution of different tumor sizes. The box-plot compares the segmentation performances of U-Net (black) and UNet++ (red) in each group. The t-test for two independent samples has been further performed on each group. As seen, UNet++ improves segmentation for all sizes of tumors and the improvement is significant (p < 0.05) for the majority of the tumor sizes (highlighted in red).

B. Feature maps visualization

In Section II-A, we explained that the redesigned skip connections enable the fusion of semantically rich decoder feature maps with feature maps of varying semantic scales from the intermediate layers of the architecture. In this section, we illustrate this advantage of our redesigned skip connections by visualizing the intermediate feature maps.

Fig. 8 shows representative feature maps from early, intermediate, and late layers along the topmost skip connection (i.e., X^{0,i}) for a brain tumor image. The representative feature map for a layer is obtained by averaging all its feature maps. Also note that the architectures on the left side of Fig. 8 are trained using only the loss function appended to the deepest decoder layer (X^{0,4}), whereas the architectures on the right side of Fig. 8 are trained with deep supervision. Note that these feature maps are not the final outputs. We have appended an additional 1×1 convolutional layer on top of each decoder branch to form the final segmentation. We observe that the outputs of U-Net's intermediate layers are semantically dissimilar, whereas for UNet+ and UNet++ the outputs are formed gradually. The output of node X^{0,0} in U-Net undergoes slight transformation (a few convolution operations only), whereas the output of X^{1,3}, the input of X^{0,4}, goes through nearly every transformation (four down-sampling and three up-sampling stages) learned by the network. Hence, there is a large gap between the representation capability of X^{0,0} and X^{1,3}. So, simply concatenating the outputs of X^{0,4} and X^{1,3} is not an optimal solution. In contrast, the redesigned skip connections in UNet+ and UNet++ help refine the segmentation result gradually. We further present the learning curves of all six medical applications in Appendix Section B, revealing that the addition of dense connections in UNet++ encourages better optimization and reaches a lower validation loss.

C. Collaborative learning in UNet++

Collaborative learning is known as training multiple classifier heads of the same network simultaneously on the same training data. It is found to improve the generalization power of deep neural networks [37]. UNet++ naturally embodies collaborative learning through aggregating multi-depth networks and supervising segmentation heads from each of the constituent networks. Besides, the segmentation heads, for example X^{0,2} in Fig. 2, receive gradients from both strong (loss from ground truth) and soft (losses propagated from adjacent deeper nodes) supervision. As a result, the shallower networks improve their segmentation (Fig. 6) and provide more informative representations to their deeper counterparts. Basically, the deeper and shallower networks regularize each other via collaborative learning in UNet++. Training the multi-depth embedded networks together results in better segmentation than training them individually as isolated networks, as is evident in Section IV-D. The embedded design of UNet++ makes it amenable to auxiliary training, multi-task learning, and knowledge distillation [17], [38], [37].

VI. RELATED WORKS

In the following, we review the works related to redesigned skip connections, feature aggregation, and deep supervision, which are the main components of our new architecture.

A. Skip connections

Skip connections were first introduced in the seminal work of Long et al. [39], where they proposed fully convolutional networks (FCN) for semantic segmentation. Shortly after, building on skip connections, Ronneberger et al. [35] proposed the U-Net architecture for semantic segmentation in medical images. The FCN and U-Net architectures, however, differ in how the up-sampled decoder feature maps are fused with the same-scale feature maps from the encoder network. While FCN [39] uses the summation operation for feature fusion, U-Net [35] concatenates the features followed by the application of convolutions and non-linearities. The skip connections have been shown to help recover the full spatial resolution, making fully convolutional methods suitable for semantic segmentation [40], [41], [42], [43]. Skip connections have further been used in modern neural architectures such as residual networks [8], [44] and dense networks [9], facilitating the gradient flow and improving the overall performance of classification networks.

B. Feature aggregation

The exploration of aggregating hierarchical features has recently been the subject of research. Fourure et al. [45] propose GridNet, which is an encoder-decoder architecture wherein the feature maps are wired in a grid fashion, generalizing several classical segmentation architectures.


Fig. 8: Visualization and comparison of feature maps from early, intermediate, and late layers along the topmost skip connection for brain tumor images. Here, the dotted arrows denote the plain skip connections in U-Net and UNet+, while the dashed arrows denote the dense connections introduced in UNet++.

Although GridNet contains multiple streams with different resolutions, it lacks up-sampling layers between the skip connections and thus does not represent UNet++. Full-resolution residual networks (FRRN) [46] employ a two-stream system, where full-resolution information is carried in one stream and context information in the other, pooling stream. In [47], two improved versions of FRRN are proposed, i.e., incremental MRRN with 28.6M parameters and dense MRRN with 25.5M parameters. These 2D architectures, however, have a similar number of parameters to our 3D VNet++ and three times more parameters than 2D UNet++; thus, simply upgrading these architectures to 3D may not be amenable to the common 3D volumetric medical imaging applications. We would like to note that our redesigned dense skip connections are completely different from those used in MRRN, which consists of a common residual stream. Also, the design of MRRN cannot be flexibly applied to other backbone encoders and meta-frameworks such as Mask R-CNN [12]. DLA [48] (Deep Layer Aggregation, a simultaneous but independent work published in CVPR 2018), topologically equivalent to our intermediate architecture UNet+ (Fig. 1(f)), sequentially connects feature maps of the same resolution, without the long skip connections used in U-Net. Our experimental results demonstrate that by densely connecting the layers, UNet++ achieves higher segmentation performance than UNet+/DLA (see Table IV).

C. Deep supervision

He et al. [8] suggested that the depth of a network can act as a regularizer. Lee et al. [27] demonstrated that deeply supervised layers can improve the learning ability of the hidden layers, enforcing the intermediate layers to learn discriminative features and enabling fast convergence and regularization of the network [26]. DenseNet [9] performs a similar deep supervision in an implicit fashion. Deep supervision can be used in U-Net-like architectures as well. Dou et al. [49] introduce a deep supervision by combining predictions from varying resolutions of feature maps, suggesting that it can combat potential optimization difficulties and thus reach a faster convergence rate and more powerful discrimination capability. Zhu et al. [50] used eight additional deeply supervised layers in their proposed architecture. Our nested networks are, however, more amenable to training under deep supervision: 1) the multiple decoders automatically generate full-resolution segmentation maps; 2) U-Nets of various depths are embedded in the network so that it grasps multi-resolution features; 3) the densely connected feature maps help smooth the gradient flow and yield relatively consistent predicted masks; and 4) the high-dimensional features affect every output through back-propagation, allowing us to prune the network in the inference phase.

D. Our previous work

We first presented UNet++ in our DLMIA 2018 paper [51]. UNet++ has since been quickly adopted by the research community, either as a strong baseline for comparison [52], [53], [54], [55], or as a source of inspiration for developing newer semantic segmentation architectures [56], [57], [58], [59], [60], [61]; it has also been utilized for multiple applications, such as segmenting objects in biomedical images [62], [63], natural images [64], and satellite images [65], [66]. Recently, Shenoy [67] has independently and systematically investigated UNet++ in the contact prediction model PconsC4, demonstrating a significant improvement over the widely-used U-Net.

Nevertheless, to further strengthen UNet++ on our own, the current work presents several extensions to our previous work: (1) we present a comprehensive study on network depth, motivating the need for the proposed architecture (Section II-A); (2) we compare the embedded training schemes with the isolated ones at various levels of pruned UNet++, and discover that training embedded U-Nets of multiple depths leads to better performance than training them individually
in isolation (Section IV-D); (3) we strengthen our experiments by including a new magnetic resonance imaging (MRI) dataset for brain tumor segmentation (Section IV); (4) we demonstrate the effectiveness of UNet++ in Mask R-CNN, resulting in a more powerful model, namely Mask RCNN++ (Section IV-B); (5) we investigate the extensibility of UNet++ to multiple advanced encoder backbones for semantic segmentation (Section IV-A); (6) we study the effectiveness of UNet++ in segmenting lesions of varying sizes (Section V-A); and (7) we visualize the feature propagation along the redesigned skip connections to explain the performance (Section V-B).

VII. CONCLUSION

We have presented a novel architecture, named UNet++, for more accurate image segmentation. The improved performance of UNet++ is attributed to its nested structure and redesigned skip connections, which aim to address two key challenges of the U-Net: 1) the unknown depth of the optimal architecture and 2) the unnecessarily restrictive design of its skip connections. We have evaluated UNet++ using six distinct biomedical imaging applications and demonstrated consistent performance improvement over various state-of-the-art backbones for semantic segmentation and over the Mask R-CNN meta-framework for instance segmentation.

ACKNOWLEDGMENTS

This research has been supported partially by ASU and Mayo Clinic through a Seed Grant and an Innovation Grant, and partially by NIH under Award Number R01HL128785. The content is solely the responsibility of the authors and does not necessarily represent the official views of NIH. We thank Mohammad Reza Hosseinzadeh Taher and Fatemeh Haghighi for their verification of the liver segmentation performance and the ablation study of embedded and isolated UNet++. We also thank Michael G. Meyer for allowing us to test our ideas on the Cell-CT dataset. The content of this paper is covered by US patents pending.

REFERENCES

[1] S. K. Zhou, H. Greenspan, and D. Shen, Deep learning for medical image analysis. Academic Press, 2017.
[2] D. Shen, G. Wu, and H.-I. Suk, "Deep learning in medical image analysis," Annual review of biomedical engineering, vol. 19, pp. 221-248, 2017.
[3] G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. Van Der Laak, B. Van Ginneken, and C. I. Sánchez, "A survey on deep learning in medical image analysis," Medical image analysis, vol. 42, pp. 60-88, 2017.
[4] G. Chartrand, P. M. Cheng, E. Vorontsov, M. Drozdzal, S. Turcotte, C. J. Pal, S. Kadoury, and A. Tang, "Deep learning: a primer for radiologists," Radiographics, vol. 37, no. 7, pp. 2113-2131, 2017.
[5] T. Falk, D. Mai, R. Bensch, Ö. Çiçek, A. Abdulkadir, Y. Marrakchi, A. Böhm, J. Deubner, Z. Jäckel, K. Seiwald et al., "U-net: deep learning for cell counting, detection, and morphometry," Nature methods, p. 1, 2018.
[6] N. Tajbakhsh, L. Jeyaseelan, Q. Li, J. Chiang, Z. Wu, and X. Ding, "Embracing imperfect datasets: A review of deep learning solutions for medical image segmentation," arXiv preprint arXiv:1908.10454, 2019.
[7] M. Drozdzal, E. Vorontsov, G. Chartrand, S. Kadoury, and C. Pal, "The importance of skip connections in biomedical image segmentation," in Deep Learning and Data Labeling for Medical Applications. Springer, 2016, pp. 179-187.
[8] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770-778.
[9] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten, "Densely connected convolutional networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, no. 2, 2017, p. 3.
[10] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik, "Hypercolumns for object segmentation and fine-grained localization," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 447-456.
[11] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature pyramid networks for object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, no. 2, 2017, p. 4.
[12] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask r-cnn," in Proceedings of the IEEE International Conference on Computer Vision. IEEE, 2017, pp. 2980-2988.
[13] R. Hu, P. Dollár, K. He, T. Darrell, and R. Girshick, "Learning to segment every thing," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4233-4241.
[14] T. G. Dietterich, "Ensemble methods in machine learning," in International workshop on multiple classifier systems. Springer, 2000, pp. 1-15.
[15] S. Hoo-Chang, H. R. Roth, M. Gao, L. Lu, Z. Xu, I. Nogues, J. Yao, D. Mollura, and R. M. Summers, "Deep convolutional neural networks for computer-aided detection: Cnn architectures, dataset characteristics and transfer learning," IEEE transactions on medical imaging, vol. 35, no. 5, p. 1285, 2016.
[16] F. Ciompi, B. de Hoop, S. J. van Riel, K. Chung, E. T. Scholten, M. Oudkerk, P. A. de Jong, M. Prokop, and B. van Ginneken, "Automatic classification of pulmonary peri-fissural nodules in computed tomography using an ensemble of 2d views and a convolutional neural network out-of-the-box," Medical image analysis, vol. 26, no. 1, pp. 195-202, 2015.
[17] Y. Bengio et al., "Learning deep architectures for ai," Foundations and trends in Machine Learning, vol. 2, no. 1, pp. 1-127, 2009.
[18] Y. Zhang and Q. Yang, "A survey on multi-task learning," arXiv preprint arXiv:1707.08114, 2017.
[19] C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L.-J. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy, "Progressive neural architecture search," in Proceedings of the European Conference on Computer Vision, 2018, pp. 19-34.
[20] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, "Learning transferable architectures for scalable image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8697-8710.
[21] C. Liu, L.-C. Chen, F. Schroff, H. Adam, W. Hua, A. L. Yuille, and L. Fei-Fei, "Auto-deeplab: Hierarchical neural architecture search for semantic image segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 82-92.
[22] Y. Zhang, Z. Qiu, J. Liu, T. Yao, D. Liu, and T. Mei, "Customizable architecture search for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 11641-11650.
[23] X. Li, Y. Zhou, Z. Pan, and J. Feng, "Partial order pruning: for best speed/accuracy trade-off in neural architecture search," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 9145-9153.
[24] S. Xie and Z. Tu, "Holistically-nested edge detection," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1395-1403.
[25] H. Chen, X. J. Qi, J. Z. Cheng, and P. A. Heng, "Deep contextual networks for neuronal structure segmentation," in Thirtieth AAAI conference on artificial intelligence, 2016.
[26] Q. Dou, L. Yu, H. Chen, Y. Jin, X. Yang, J. Qin, and P.-A. Heng, "3d deeply supervised network for automated segmentation of volumetric medical images," Medical image analysis, vol. 41, pp. 40-54, 2017.
[27] C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu, "Deeply-supervised nets," in Artificial Intelligence and Statistics, 2015, pp. 562-570.
[28] F. Milletari, N. Navab, and S.-A. Ahmadi, "V-net: Fully convolutional neural networks for volumetric medical image segmentation," in 2016 Fourth International Conference on 3D Vision (3DV). IEEE, 2016, pp. 565-571.
[29] C. H. Sudre, W. Li, T. Vercauteren, S. Ourselin, and M. J. Cardoso, "Generalised dice overlap as a deep learning loss function for highly unbalanced segmentations," in Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support. Springer, 2017, pp. 240-248.

0278-0062 (c) 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TMI.2019.2959609, IEEE
Transactions on Medical Imaging
JOURNAL OF IEEE TRANSACTIONS ON MEDICAL IMAGING 11

and Multimodal Learning for Clinical Decision Support. Springer, [50] Q. Zhu, B. Du, B. Turkbey, P. L. Choyke, and P. Yan, “Deeply-supervised
2017, pp. 240–248. cnn for prostate segmentation,” in International Joint Conference on
[30] A. Cardona, S. Saalfeld, S. Preibisch, B. Schmid, A. Cheng, J. Pulokas, Neural Networks (IJCNN). IEEE, 2017, pp. 178–184.
P. Tomancak, and V. Hartenstein, “An integrated micro-and macroar- [51] Z. Zhou, M. M. R. Siddiquee, N. Tajbakhsh, and J. Liang, “Unet++:
chitectural analysis of the drosophila brain by computer-assisted serial A nested u-net architecture for medical image segmentation,” in Deep
section electron microscopy,” PLoS biology, vol. 8, no. 10, p. e1000502, Learning in Medical Image Analysis and Multimodal Learning for
2010. Clinical Decision Support. Springer, 2018, pp. 3–11.
[31] M. G. Meyer, J. W. Hayenga, T. Neumann, R. Katdare, C. Presley, [52] K. Sun, Y. Zhao, B. Jiang, T. Cheng, B. Xiao, D. Liu, Y. Mu, X. Wang,
D. E. Steinhauer, T. M. Bell, C. A. Lancaster, and A. C. Nelson, W. Liu, and J. Wang, “High-resolution representations for labeling pixels
“The cell-ct 3-dimensional cell imaging technology platform enables and regions,” CoRR, vol. abs/1904.04514, 2019.
the detection of lung cancer using the noninvasive luced sputum test,” [53] Y. Fang, C. Chen, Y. Yuan, and K.-y. Tong, “Selective feature aggrega-
Cancer cytopathology, vol. 123, no. 9, pp. 512–523, 2015. tion network with area-boundary constraints for polyp segmentation,” in
[32] B. H. Menze, A. Jakab, S. Bauer, J. Kalpathy-Cramer, K. Farahani, International Conference on Medical Image Computing and Computer-
J. Kirby, Y. Burren, N. Porz, J. Slotboom, R. Wiest et al., “The Assisted Intervention. Springer, 2019, pp. 302–310.
multimodal brain tumor image segmentation benchmark (brats),” IEEE [54] J. Fang, Y. Zhang, K. Xie, S. Yuan, and Q. Chen, “An improved
transactions on medical imaging, vol. 34, no. 10, p. 1993, 2015. mpb-cnn segmentation method for edema area and neurosensory retinal
[33] S. G. Armato III, G. McLennan, L. Bidaut, M. F. McNitt-Gray, C. R. detachment in sd-oct images,” in International Workshop on Ophthalmic
Meyer, A. P. Reeves, B. Zhao, D. R. Aberle, C. I. Henschke, E. A. Medical Image Analysis. Springer, 2019, pp. 130–138.
Hoffman et al., “The lung image database consortium (lidc) and image [55] C. Meng, K. Sun, S. Guan, Q. Wang, R. Zong, and L. Liu, “Multiscale
database resource initiative (idri): a completed reference database of dense convolutional neural network for dsa cerebrovascular segmenta-
lung nodules on ct scans,” Medical physics, vol. 38, no. 2, pp. 915–931, tion,” Neurocomputing, vol. 373, pp. 123–134, 2020.
2011. [56] J. Zhang, Y. Jin, J. Xu, X. Xu, and Y. Zhang, “Mdu-net: Multi-scale
densely connected u-net for biomedical image segmentation,” arXiv
[34] M. Kistler, S. Bonaretti, M. Pfahrer, R. Niklaus, and P. Büchler, “The
preprint arXiv:1812.00352, 2018.
virtual skeleton database: an open access repository for biomedical re-
[57] F. Chen, Y. Ding, Z. Wu, D. Wu, and J. Wen, “An improved framework
search and collaboration,” Journal of medical Internet research, vol. 15,
called du++ applied to brain tumor segmentation,” in 2018 15th Interna-
no. 11, p. e245, 2013.
tional Computer Conference on Wavelet Active Media Technology and
[35] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional net- Information Processing (ICCWAMTIP). IEEE, 2018, pp. 85–88.
works for biomedical image segmentation,” in International Conference [58] C. Zhou, S. Chen, C. Ding, and D. Tao, “Learning contextual and
on Medical Image Computing and Computer-Assisted Intervention. attentive information for brain tumor segmentation,” in International
Springer, 2015, pp. 234–241. MICCAI Brainlesion Workshop. Springer, 2018, pp. 497–507.
[36] K. Simonyan and A. Zisserman, “Very deep convolutional networks for [59] S. Wu, Z. Wang, C. Liu, C. Zhu, S. Wu, and K. Xiao, “Automatical
large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014. segmentation of pelvic organs after hysterectomy by using dilated
[37] G. Song and W. Chai, “Collaborative learning for deep neural networks,” convolution u-net++,” in 2019 IEEE 19th International Conference on
in Neural Information Processing Systems (NeurIPS), 2018. Software Quality, Reliability and Security Companion (QRS-C). IEEE,
[38] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural 2019, pp. 362–367.
network,” arXiv preprint arXiv:1503.02531, 2015. [60] T. Song, F. Meng, A. Rodrı́guez-Patón, P. Li, P. Zheng, and X. Wang,
[39] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks “U-next: A novel convolution neural network with an aggregation u-
for semantic segmentation,” in Proceedings of the IEEE Conference on net architecture for gallstone segmentation in ct images,” IEEE Access,
Computer Vision and Pattern Recognition, 2015, pp. 3431–3440. vol. 7, pp. 166 823–166 832, 2019.
[40] A. Chaurasia and E. Culurciello, “Linknet: Exploiting encoder repre- [61] C. Yang and F. Gao, “Eda-net: Dense aggregation of deep and shal-
sentations for efficient semantic segmentation,” in 2017 IEEE Visual low information achieves quantitative photoacoustic blood oxygenation
Communications and Image Processing (VCIP). IEEE, 2017, pp. 1–4. imaging deep in human breast,” in International Conference on Medical
[41] G. Lin, A. Milan, C. Shen, and I. D. Reid, “Refinenet: Multi-path Image Computing and Computer-Assisted Intervention. Springer, 2019,
refinement networks for high-resolution semantic segmentation.” in pp. 246–254.
Proceedings of the IEEE Conference on Computer Vision and Pattern [62] V. Zyuzin and T. Chumarnaya, “Comparison of unet architectures for
Recognition, vol. 1, no. 2, 2017, p. 5. segmentation of the left ventricle endocardial border on two-dimensional
[42] H. Zhao, X. Qi, X. Shen, J. Shi, and J. Jia, “Icnet for real-time ultrasound images,” in 2019 Ural Symposium on Biomedical Engineer-
semantic segmentation on high-resolution images,” in Proceedings of ing, Radioelectronics and Information Technology (USBEREIT). IEEE,
the European Conference on Computer Vision, 2018, pp. 405–420. 2019, pp. 110–113.
[43] N. Tajbakhsh, B. Lai, S. Ananth, and X. Ding, “Errornet: Learning error [63] H. Cui, X. Liu, and N. Huang, “Pulmonary vessel segmentation based on
representations from limited data to improve vascular segmentation,” orthogonal fused u-net++ of chest ct images,” in International Confer-
arXiv preprint arXiv:1910.04814, 2019. ence on Medical Image Computing and Computer-Assisted Intervention.
Springer, 2019, pp. 293–300.
[44] K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual
[64] K. Sun, B. Xiao, D. Liu, and J. Wang, “Deep high-resolution represen-
networks,” in Proceedings of the European Conference on Computer
tation learning for human pose estimation,” in Proceedings of the IEEE
Vision. Springer, 2016, pp. 630–645.
International Conference on Computer Vision, 2019.
[45] D. Fourure, R. Emonet, E. Fromont, D. Muselet, A. Trémeau, and [65] D. Peng, Y. Zhang, and H. Guan, “End-to-end change detection for high
C. Wolf, “Residual conv-deconv grid network for semantic segmenta- resolution satellite images using improved unet++,” Remote Sensing,
tion,” in Proceedings of the British Machine Vision Conference, 2017, vol. 11, no. 11, p. 1382, 2019.
2017. [66] Y. Zhang, W. Gong, J. Sun, and W. Li, “Web-net: A novel nest
[46] T. Pohlen, A. Hermans, M. Mathias, and B. Leibe, “Full-resolution resid- networks with ultra-hierarchical sampling for building extraction from
ual networks for semantic segmentation in street scenes,” in Proceedings aerial imageries,” Remote Sensing, vol. 11, no. 16, p. 1897, 2019.
of the IEEE Conference on Computer Vision and Pattern Recognition, [67] A. A. Shenoy, “Feature optimization of contact map predictions based
2017, pp. 4151–4160. on inter-residue distances and u-net++ architecture.”
[47] J. Jiang, Y.-C. Hu, C.-J. Liu, D. Halpenny, M. D. Hellmann, J. O. Deasy,
G. Mageras, and H. Veeraraghavan, “Multiple resolution residually
connected feature streams for automatic lung tumor segmentation from
ct images,” IEEE transactions on medical imaging, vol. 38, no. 1, pp.
134–144, 2019.
[48] F. Yu, D. Wang, E. Shelhamer, and T. Darrell, “Deep layer aggregation,”
in Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition. IEEE, 2018, pp. 2403–2412.
[49] Q. Dou, H. Chen, Y. Jin, L. Yu, J. Qin, and P.-A. Heng, “3d deeply su-
pervised network for automatic liver segmentation from ct volumes,” in
International Conference on Medical Image Computing and Computer-
Assisted Intervention. Springer, 2016, pp. 149–157.

APPENDIX A
ADDITIONAL MEASUREMENTS

TABLE VI: Pixel-wise sensitivity, specificity, F1, and F2 scores for all six applications under study. The p-values are calculated between UNet++ with deep supervision and the original U-Net. As seen, powered by the redesigned skip connections and deep supervision, UNet++ achieves a significantly higher level of segmentation performance than U-Net across all the biomedical applications under study.
EM Sensitivity Specificity F1 score F2 score
U-Net 91.21±2.18 83.55±1.62 87.21±1.88 89.56±2.06
UNet++ 92.87±2.08 84.94±1.55 88.73±1.79 91.17±1.96
p-value 0.018 0.008 0.013 0.016
Cell Sensitivity Specificity F1 score F2 score
U-Net 94.04±2.36 96.10±0.75 81.25±2.62 88.47±2.49
UNet++ 95.88±2.59 96.76±0.65 84.34±2.52 90.90±2.57
p-value 0.025 0.005 5.00e-4 0.004
Nuclei Sensitivity Specificity F1 score F2 score
U-Net 93.57±4.30 93.94±0.87 83.64±2.97 89.33±3.71
UNet++ 97.28±4.85 96.30±0.94 90.14±3.82 94.29±4.41
p-value 0.015 5.35e-10 6.75e-7 4.47e-4
Brain Tumor Sensitivity Specificity F1 score F2 score
U-Net 94.00±1.15 97.52±0.78 88.42±2.61 91.68±1.77
UNet++ 95.81±1.25 98.01±0.67 90.83±2.46 93.75±1.77
p-value 2.90e-5 0.042 0.005 7.03e-3
Liver Sensitivity Specificity F1 score F2 score
U-Net 91.22±2.02 98.48±0.43 86.19±2.84 89.14±2.37
UNet++ 93.15±1.88 98.74±0.36 88.54±2.57 91.25±2.18
p-value 0.003 0.046 0.010 0.006
Lung Nodule Sensitivity Specificity F1 score F2 score
U-Net 94.95±1.31 97.27±0.47 83.98±1.94 90.24±1.60
UNet++ 95.83±0.86 97.81±0.40 86.78±1.66 91.99±1.22
p-value 0.018 3.25e-3 1.92e-5 4.27e-3
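
For clarity, the following sketch illustrates how the pixel-wise metrics reported in Table VI can be computed from a pair of binary masks. It is a minimal NumPy illustration rather than the authors' evaluation code; the function name, the eps smoothing term, and the suggested paired test for the p-values are assumptions made for this example.

import numpy as np

def pixelwise_scores(pred, target, eps=1e-7):
    # pred, target: binary masks of identical shape, values in {0, 1}
    pred, target = pred.astype(bool), target.astype(bool)
    tp = np.logical_and(pred, target).sum()
    tn = np.logical_and(~pred, ~target).sum()
    fp = np.logical_and(pred, ~target).sum()
    fn = np.logical_and(~pred, target).sum()
    sensitivity = tp / (tp + fn + eps)   # recall / true positive rate
    specificity = tn / (tn + fp + eps)   # true negative rate
    precision = tp / (tp + fp + eps)
    def f_beta(beta):
        b2 = beta ** 2
        return (1 + b2) * precision * sensitivity / (b2 * precision + sensitivity + eps)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "f1": f_beta(1.0), "f2": f_beta(2.0)}  # F2 weights recall more heavily than precision

Per-case scores computed this way can then be compared between two models with a paired test (e.g., scipy.stats.ttest_rel), which is one plausible way to obtain p-values such as those reported above.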

APPENDIX B
LEARNING CURVES

Fig. 9: UNet++ enables better optimization than U-Net, as evidenced by the learning curves for the tasks of neuronal structure, cell, nuclei, brain tumor, liver, and lung nodule segmentation. We plot the validation losses averaged over 20 trials for each application. As seen, UNet++ with deep supervision accelerates convergence and yields a lower validation loss, owing to the redesigned intermediate layers and dense skip connections.
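
The averaging protocol described in the caption is straightforward to reproduce. The sketch below, run on synthetic data, shows one way to average per-epoch validation losses over 20 trials and plot the resulting curve with a one-standard-deviation band; the array shapes, variable names, and decay model are assumptions for illustration only and are not taken from the paper.

import numpy as np
import matplotlib.pyplot as plt

num_trials, num_epochs = 20, 50
rng = np.random.default_rng(42)
# Synthetic stand-in for recorded losses: one row per trial, one column per epoch.
val_losses = 0.6 * np.exp(-np.arange(num_epochs) / 15.0) + 0.05 * rng.random((num_trials, num_epochs))

mean_loss = val_losses.mean(axis=0)   # average over the 20 trials
std_loss = val_losses.std(axis=0)

epochs = np.arange(1, num_epochs + 1)
plt.plot(epochs, mean_loss, label="mean validation loss")
plt.fill_between(epochs, mean_loss - std_loss, mean_loss + std_loss, alpha=0.2)
plt.xlabel("epoch")
plt.ylabel("validation loss")
plt.legend()
plt.show()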
