
remote sensing

Article
An Integrated Method for Road Crack Segmentation and
Surface Feature Quantification under Complex Backgrounds
Lu Deng 1,2 , An Zhang 1 , Jingjing Guo 1,2 and Yingkai Liu 1, *

1 College of Civil Engineering, Hunan University, Changsha 410082, China


2 Key Laboratory for Damage Diagnosis of Engineering Structures of Hunan Province,
Hunan University, Changsha 410082, China
* Correspondence: lyk199343@hnu.edu.cn

Abstract: In the present study, an integrated framework for automatic detection, segmentation, and
measurement of road surface cracks is proposed. First, road images are captured, and crack regions
are detected based on the fifth version of the You Only Look Once (YOLOv5) algorithm; then, a
modified Residual Unity Networking (Res-UNet) algorithm is proposed for accurate segmentation at
the pixel level within the crack regions; finally, a novel crack surface feature quantification algorithm
is developed to determine the crack width and length in pixels. In addition, a road
crack dataset containing complex environmental noise is produced. Different shooting distances,
angles, and lighting conditions are considered. Validated through the same dataset and compared
with You Only Look at CoefficienTs ++ (YOLACT++) and DeepLabv3+, the proposed method shows
higher accuracy for crack segmentation under complex backgrounds. Specifically, the crack damage
detection based on the YOLOv5 method achieves a mean average precision of 91%; the modified
Res-UNet achieves 87% intersection over union (IoU) when segmenting crack pixels, 6.7% higher
than the original Res-UNet; and the developed crack surface feature algorithm has an accuracy of
95% in identifying the crack length and a root mean square error of 2.1 pixels in identifying the crack
width, with the accuracy being 3% higher in length measurement than that of the traditional method.

Keywords: road engineering; pavement; crack segmentation; deep learning; YOLOv5; feature quantification
Citation: Deng, L.; Zhang, A.; Guo, J.; Liu, Y. An Integrated Method for Road Crack Segmentation and Surface Feature Quantification under Complex Backgrounds. Remote Sens. 2023, 15, 1530. https://doi.org/10.3390/rs15061530

Academic Editors: Massimo Losa and Nicholas Fiorentini

Received: 7 February 2023; Revised: 5 March 2023; Accepted: 6 March 2023; Published: 10 March 2023

Copyright: © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

1. Introduction

Structural damage to roads may induce serious traffic accidents and substantial economic losses. In China in 2021, the number of traffic accidents was up to 273,098, and the total direct property damage was CNY 145,036,000 [1]; in 2022, the Chinese transportation department invested a total of CNY 1.29 trillion in road maintenance [2]. Therefore, it is important to monitor the typical signs of road damage, i.e., surface cracks in pavements, in a timely and accurate manner. Thus, it is necessary to detect and evaluate cracks in time at the early stage of their appearance, which enables road structures to become more durable and have a longer service life [3]. In the past decades, several contact sensor-based approaches for road crack detection have been proposed in the field of structural health monitoring [4,5]. However, the contact sensor-based detection techniques have some restrictions, such as low operational efficiency, unstable measurement accuracy, and vulnerability to temperature and humidity variations [6,7]. Therefore, it is of great significance to develop a new contact-free road crack detection and quantification method with better efficiency and accuracy.
To break the restrictions of the contact sensor-based methods, some vision-based damage detection methods have been developed in several studies [8–10]. A four-camera vision system and a novel global calibration method were proposed by Chen et al. [11], and the performance of the multi-vision system was improved by minimizing the global calibration error. Benefiting from the vision-based techniques, structural defects such as
spalling, cracks, and holes can be automatically detected from images. With the rapid
development of artificial intelligence, many deep learning algorithms based on deep convo-
lutional neural networks (CNNs) have been developed to explore the automatic detection
of road cracks and other various damage [12–14]. For instance, the faster region proposal
convolutional neural network (Faster R-CNN) was utilized by Hacıefendioğlu et al. [15] for
automatic crack identification. Moreover, a generative adversarial networks (GANs)-based
method and an improved VGG16-based algorithm were proposed by Que et al. [16] for
crack data augmentation and crack classification, respectively, which effectively solved the
training problem caused by insufficient datasets. With the You Only Look Once (YOLO)
algorithm, Du et al. [17] established a method for quickly identifying and classifying de-
fects in the road surface. It is worth noting that there are different versions of YOLO, and
YOLOv6 [18], YOLOv7 [19], and YOLOv8 [20] have been released more recently. Considering the
stability and applicability of the algorithm, the most widely used YOLOv5 algorithm is
employed in the present study [21]. A common feature of these deep learning algorithms
is the use of bounding boxes. The bounding box, which is essentially a rectangle around
an object, specifies the object’s predicted location, category, and confidence. In addition
to using bounding boxes for localization, crack detection at the pixel level has also been
implemented in some deep learning algorithms [22–24]. For instance, Yong et al. [25]
constructed an end-to-end real-time network for crack segmentation at the pixel level. To
better extract crack features, an asymmetric convolution enhancement (ACE) module and
the residual expanded involution module (REI) were embedded. In the studies of Sun
et al. [26], Shen and Yu [27], and Ji et al. [28], DeepLabv3+ was employed to automatically
detect pixel-level cracks. Zhang et al. [29] suggested Res-UNet as a method for the auto-
matic detection of cracks at the pixel level. Zhu et al. [30] and Kang et al. [31] quantified
the detected cracks at the pixel level and extracted the crack skeleton with the distance
transform method (DTM). Tang et al. [32] proposed a new crack backbone refinement
algorithm, and the average simplification rate of the crack backbone and the average error
rate of direction determination were both improved. However, since this refinement is based on the rough skeleton obtained from the traditional thinning algorithm [33], there is still room for improvement in measurement efficiency. Recently, domain adaptation has
been widely used to generate large amounts of perfectly supervised labelled synthetic data
for hard-to-label tasks such as semantic segmentation [34,35]. Stan et al. [36] developed
an algorithm adapted to the training of semantic segmentation models, which showed
good generalization in the unlabeled target domain; Marsden et al. [37] proposed a simple
framework using lightweight style transformation that allows pre-trained source models to
effectively prevent forgetting when adapting to a sequence of unlabeled target domains.
However, there are several restrictions on these studies: (1) All of these studies detect
cracks under ideal backgrounds, such as surfaces made entirely of concrete or asphalt.
However, such backgrounds are not in line with the most common engineering practices,
because actual crack detection tasks are always conducted on more complex backgrounds
mixed with surrounding objects such as trees and vehicles [38]. Thus, it is challenging
to distinguish cracks from complex backgrounds. (2) Due to the lack of sensitivity to
image details, previous deep learning methods are prone to giving false positives for
crack-like objects and expanding the detection range of crack edges. (3) In addition, the
post-quantitative processing, such as DTM, following the detection of cracks remains
an obstacle for pixel segmentation, because it always exhibits local branching and end
discontinuities when applied to irregular cracks. The research question of this study is
how to accurately segment and quantify road cracks under complex backgrounds and,
furthermore, how to achieve more accurate crack shape extraction and more accurate
calculation of crack length and width under various common realistic interferences, such
as vehicles, plants, buildings, shadows, and dark light conditions.
In the present study, an integrated framework for road crack segmentation and surface
feature quantification under complex backgrounds is proposed. Compared with the current
state-of-the-art research in the same field, the main contributions of the present study are
as follows:
• An integrated framework for road crack detection and quantification at the pixel level
is proposed. Compared with previous crack detection and segmentation algorithms,
the framework enables more accurate detection, segmentation, and quantification of
road cracks in complex backgrounds, where various common realistic interferences,
such as vehicles, plants, buildings, shadows, or dark light conditions, can be found;
• An attention gate module is embedded in the original Res-UNet to effectively im-
prove the accuracy of road crack segmentation. Compared with YOLACT++ and
DeepLabv3+ algorithms, the modified Res-UNet shows higher segmentation accuracy;
• A new surface feature quantification algorithm is developed to accurately detect
the length and width of segmented road cracks. Compared with the conventional
DTM method, the developed algorithm can effectively prevent problems such as local
branching and end discontinuity.
In summary, the purpose of the present study is to accurately detect cracks in roads
under more realistic conditions and to accurately quantify the detected cracks.
In the proposed framework, three separate computer vision algorithms are innova-
tively combined: (1) firstly, the real-time object detection algorithm YOLOv5 [39] is utilized
for object-level crack detection; (2) secondly, a modified Res-UNet is constructed by em-
bedding an attention gate module to more accurately segment the cracks at the pixel level;
(3) finally, a new surface feature quantification algorithm is developed to more accurately
calculate the length and width of segmental road cracks by removing local branching and
crack end loss. The proposed framework is compared with several existing methods to ver-
ify its accuracy. The comparison results show that the modified Res-UNet has higher crack
segmentation accuracy, and the developed crack quantification algorithm is more effective
than the conventional algorithm in preventing local branching and end discontinuities.
This study is organized as follows: Section 2 provides the proposed architecture;
Section 3 introduces the details of the implementation; experiment results and discussion
are presented in Section 4; concluding remarks are provided in Section 5.

2. Methodology
To detect, segment, and quantify road cracks from complex backgrounds, this study
proposes a fully automated architecture, which is shown in Figure 1. YOLOv5, as a
single-stage object detection algorithm, has the advantages of fast detection speed, easy
deployment, and good small target detection. It has been applied in many engineering
practices [40] and is very suitable for road detection tasks with tight time constraints
and high safety risks. Therefore, the YOLOv5-based approach [41] is first employed to
locate road crack areas with bounding boxes, as shown in Stage 1 of Figure 1. Then, the
area of the bounding box is extracted and sent into the modified Res-UNet algorithm in
Stage 2. For more accurate crack segmentation at the pixel level from the bounding boxes,
the original Res-UNet model is modified by embedding an attention gate and proposing
a new combined loss function in this study. Finally, in Stage 3, a novel surface feature
quantification algorithm is proposed to determine the length and width of the segmented
cracks. Note that all image binarization is carried out with Gaussian and Wiener filters
to reduce noise and uninteresting areas. The primary benefit of the proposed approach
over traditional methods is the significant improvement in accuracy and efficiency of road
crack segmentation in complex backgrounds. Meanwhile, a novel quantification algorithm
is developed to finely analyze the surface feature information with a focus on the crack
morphology. The details of each step of the proposed architecture are introduced in the
following subsections.

Figure 1. Flowchart of the proposed architecture for detecting and quantifying road cracks.
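Viewed as code, the three stages of Figure 1 chain together along the lines of the following minimal sketch; detector, segmenter, and quantifier are hypothetical callables standing in for YOLOv5, the modified Res-UNet, and the quantification algorithm of Sections 2.1–2.3, not the authors' released implementation.

```python
# Minimal sketch of the three-stage pipeline in Figure 1 (hypothetical helper names):
# Stage 1 detects crack bounding boxes, Stage 2 segments crack pixels inside each box,
# and Stage 3 quantifies crack length and width in pixels.
def run_pipeline(image, detector, segmenter, quantifier):
    results = []
    for box in detector(image):                      # Stage 1: YOLOv5-style detection
        x1, y1, x2, y2 = box
        crop = image[y1:y2, x1:x2]                   # extract the bounding-box region
        mask = segmenter(crop)                       # Stage 2: pixel-level segmentation
        length_px, width_px = quantifier(mask)       # Stage 3: surface feature quantification
        results.append({"box": box, "length_px": length_px, "width_px": width_px})
    return results
```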

2.1. YOLOv5 for Road Crack Detection

In the first stage of the proposed approach, YOLOv5 is utilized to detect road cracks in images with complex and various backgrounds. Specifically, the road crack images are first input to the backbone to extract crack features; then, feature fusion is performed in the neck using the Feature Pyramid Network (FPN) [42] and the Pyramid Attention Network (PAN) [43]; finally, the predicted values of class probability, item level, and bounding box location of road cracks are output. The architecture of YOLOv5 is demonstrated in Figure 2, including three parts: backbone, neck, and prediction.

Figure 2. Schematic representation of YOLOv5 architecture.

As illustrated in Figure 2, the first part of the architecture is the backbone, whose main function is to extract features from the input image or video frames. The backbone consists of three main modules, namely Focus, Convolution 3 (C3), and Spatial Pyramid Pooling (SPP) [41]. The raw data are first divided by the Focus module into four parts, with each part representing two downsamplings. A lossless binary downsampled feature map is then generated by convolutionally merging these components along the channel dimension. The C3 module consists of several structural modules referred to as bottleneck residuals. Note that to transfer the residual features while keeping the output depth constant, two convolution layers and an addition with the initial amount constitute the input to the remaining structural module. Finally, the SPP performs maximized collaboration with four core dimensions and combines the properties to obtain multi-scale feature information.
In order to fully extract the fusion features, the neck consisting of FPN and PAN feature pyramid structures is introduced between the backbone and prediction layers. By using an FPN architecture, robust semantic characteristics can be transmitted from the highest to the lowest feature maps. This architecture ensures not only that the details of small objects are correct, but also that large objects can be represented in an abstract way. In addition, the PAN architecture relays accurate localization information across feature maps with varying granularity. Through the integral operation of the FPN and PAN, the neck achieves a satisfactory feature fusion capability.
In the prediction, a vector containing the target object's category probability, item grade, and bounding box location is returned. There are four detection levels in the network, each of which has a size-specific feature map for detecting targets of varying sizes. After that, appropriate vectors are obtained from each detection layer, and finally the anticipated bounding box and item categorization are generated and labelled.
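As an illustration of how the Stage 1 output is handed to Stage 2, the sketch below runs inference with a YOLOv5 model through the public Ultralytics torch.hub interface and crops the detected regions. The weight file name and the confidence threshold are assumptions for illustration, not the authors' released configuration.

```python
import cv2
import torch

# Load a YOLOv5 model via the public Ultralytics torch.hub interface; the checkpoint
# name "crack_yolov5.pt" is a hypothetical crack-trained weight file.
model = torch.hub.load("ultralytics/yolov5", "custom", path="crack_yolov5.pt")
model.conf = 0.25  # assumed confidence threshold

image = cv2.imread("road.jpg")
rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)  # hub models expect RGB input
results = model(rgb)

# results.xyxy[0] holds one row per detection: x1, y1, x2, y2, confidence, class id.
crops = []
for x1, y1, x2, y2, conf, cls in results.xyxy[0].tolist():
    crop = image[int(y1):int(y2), int(x1):int(x2)]  # region passed on to the modified Res-UNet
    crops.append((crop, int(cls), conf))
```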
2.2. Modified Res-UNet for Crack Region Segmentation

In the second stage, the crack pixels are segmented from the bounding box with Res-UNet, which employs skip connections to transmit contextual and spatial data between the encoder and decoder [29]. This connection helps to retrieve vital spatial data lost during downsampling. However, considering the similarity between crack-like objects and crack edges and their near background, passing all the information in the image through the skip connections can result in poor crack segmentation, especially for some blurred cracks. Therefore, the architecture of Res-UNet is modified in the present study, as shown in Figure 3. Specifically, the network was improved in two main aspects: first, the extracted crack detail features were enhanced by embedding an attention gate [44]; second, a new combined loss function was proposed to improve the accuracy of segmentation. These two improvements are described in detail as follows.

Figure 3. Architecture of the modified Res-UNet.

2.2.1. Attention Gate

The attention mechanism of image segmentation is derived from the way human visual attention works, i.e., focusing on one region of an image and ignoring other regions [45]. In this study, attention gates are embedded to update model parameters in spatial regions relevant to crack segmentation, and its structure is shown in Figure 4. Two inputs are fed into the attention gate, namely the tensor of feature $p_E$ from the low-level encoder component, and the tensor of feature $p_D$ from the prior layer of the decoder component. Since $p_D$ comes from a more fundamental layer of the network, it has less dimensionality and reflects the features more accurately than $p_E$. Therefore, before adding the two features element by element, an upsampling operation on $p_D$ is required to ensure that the dimensions are equivalent, i.e., $l = \mathrm{Up}(p_D)$. Subsequently, to reduce the computational cost, the channels are compressed by feeding data from multiple sources into a linear conversion layer utilizing a channel-wise 1 × 1 × 1 convolutional layer, and then each piece is inserted individually. Note that throughout the summation process, the weight of alignment is greater, and the weight of misalignment is less. By using a Rectified Linear Unit (ReLU) activation function and a convolution for the expected features, the channel specification can be reduced to $F_{int}$. Afterwards, a sigmoid layer projects the attention coefficients (weights) onto the range [0, 1], with larger coefficients indicating greater significance. Eventually, the attentional parameter is multiplied element-wise with the primary source vector p to scale it according to significance. The entire attention-gating procedure is described as

$$\hat{p}^k = \delta^{T}\left(\varepsilon_1\left(W_p^{T} p_i^k + W_l^{T} l_i + b_l\right)\right) + b_\delta \qquad (1)$$

$$\lambda_i^k = \varepsilon_2\left(\hat{p}^k, l_i; \Theta_{att}\right) \qquad (2)$$

where $\varepsilon_1$ and $\varepsilon_2$ represent the ReLU and sigmoid activation functions, respectively. $\Theta_{att}$ represents the attention gate's parameters, which include linear transformations $W_l \in \mathbb{R}^{F_l \times F_{int}}$, $W_p \in \mathbb{R}^{F_k \times F_{int}}$, $\delta \in \mathbb{R}^{F_{int} \times 1}$, and corresponding bias terms $b_\delta \in \mathbb{R}$, $b_l \in \mathbb{R}^{F_{int}}$.

Figure 4. Illustration of the attention gate.
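For readers who wish to experiment with this mechanism, the following PyTorch sketch implements an additive attention gate of this general form (a simplified 2D variant in the spirit of [44]); the channel counts and layer shapes are illustrative assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionGate(nn.Module):
    """Additive attention gate applied on a skip connection (sketch, 2D layout assumed)."""
    def __init__(self, enc_channels, dec_channels, inter_channels):
        super().__init__()
        self.w_p = nn.Conv2d(enc_channels, inter_channels, kernel_size=1)  # W_p on encoder feature p_E
        self.w_l = nn.Conv2d(dec_channels, inter_channels, kernel_size=1)  # W_l on upsampled decoder feature l
        self.psi = nn.Conv2d(inter_channels, 1, kernel_size=1)             # delta: reduce to one attention map
        self.relu = nn.ReLU(inplace=True)                                  # epsilon_1
        self.sigmoid = nn.Sigmoid()                                        # epsilon_2, coefficients in [0, 1]

    def forward(self, p_enc, p_dec):
        # Upsample the decoder feature so both inputs share spatial dimensions (l = Up(p_D)).
        l = F.interpolate(p_dec, size=p_enc.shape[2:], mode="bilinear", align_corners=False)
        q = self.relu(self.w_p(p_enc) + self.w_l(l))   # additive alignment, cf. Eq. (1)
        alpha = self.sigmoid(self.psi(q))              # attention coefficients, cf. Eq. (2)
        return p_enc * alpha                           # rescale the skip feature by significance

# Example: gate a 64-channel encoder feature with a 128-channel decoder feature
# before it is concatenated in the decoder of the modified Res-UNet.
gate = AttentionGate(enc_channels=64, dec_channels=128, inter_channels=32)
p_e = torch.randn(1, 64, 112, 112)
p_d = torch.randn(1, 128, 56, 56)
gated_skip = gate(p_e, p_d)  # same shape as p_e
```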
2.2.2. Combined Loss

The difference between network predictions and ground truth is usually described by a loss function. In order to minimize the loss function, a stochastic gradient descent (SGD) optimizer can be used to optimize the network weights. In addition, segmenting cracks from the images is a binary classification problem. Therefore, in this study, the binary cross entropy (BCE) loss function is employed:

$$\mathrm{BCE}(h, v) = -\sum_i h_i \log v_i \qquad (3)$$

where i represents the category, in this study i ∈ {0, 1}; $v_i$ represents the prediction; and $h_i$ represents the actual category assigned to each identified pixel. The cracks and backgrounds are considered at the same level by BCE, which effectively solves the problem caused by the different sizes of these two categories.
In this study, the image segmentation results can be evaluated by the Dice coefficient, which can be described as:

$$D(h, v) = \frac{2|H \cap V|}{|H| + |V|} = \frac{2\sum_i v_i h_i}{\sum_i v_i + \sum_i h_i} \qquad (4)$$

where H and V represent the actual and anticipated item volumes, respectively.
To resolve the imbalance between the crack zone and the background, the Dice loss and BCE loss are merged as:

$$L(h, v) = (1 - \beta)\,\mathrm{BCE}(h, v) + \beta D(h, v) = (1 - \beta)\left(-\sum_i h_i \log v_i\right) + \beta\left(\frac{2\sum_i v_i h_i + \text{ï}}{\sum_i v_i + \sum_i h_i + \text{ï}}\right) \qquad (5)$$

where ï is a negligible quantity, typically 1 × 10⁻¹⁰, which is primarily utilized to avoid division by zero. β is set to 0.5 by experimental test.
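A minimal PyTorch sketch of such a combined objective is given below. It assumes a single-channel sigmoid output and a binary ground-truth mask, uses the library binary cross entropy for the BCE term, and replaces the Dice coefficient D with the common 1 − D form so that minimizing the loss improves both terms; these are implementation choices for illustration rather than the authors' released code.

```python
import torch
import torch.nn.functional as F

def combined_bce_dice_loss(pred, target, beta=0.5, eps=1e-10):
    """Sketch of a combined BCE + Dice objective in the spirit of Eq. (5).

    pred   -- predicted crack probabilities in [0, 1], shape (N, 1, H, W)
    target -- binary ground-truth mask of the same shape
    """
    bce = F.binary_cross_entropy(pred, target)                                       # BCE term, cf. Eq. (3)
    dice = (2.0 * (pred * target).sum() + eps) / (pred.sum() + target.sum() + eps)   # Dice coefficient, Eq. (4)
    return (1.0 - beta) * bce + beta * (1.0 - dice)

# Usage sketch with a sigmoid segmentation output and a random binary mask.
logits = torch.randn(2, 1, 64, 64, requires_grad=True)
pred = torch.sigmoid(logits)
mask = (torch.rand(2, 1, 64, 64) > 0.5).float()
loss = combined_bce_dice_loss(pred, mask)
loss.backward()
```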

2.3. Novel Algorithm for Crack Quantification


In this section, the segmented cracks are quantified at the pixel level with the proposed
algorithm. As a comparison, the traditional DTM procedure is first introduced, as shown
in Figure 5. Firstly, the binary image is converted into a binary matrix, i.e., with “1” or
“0” representing whether it is a crack pixel or not, respectively. Next, a labelling operator
examines all clusters of pixels with the same value (i.e., 1 or 0), starting from the top left
corner, and assigns a unique value to each cluster. In this way, all pixels are categorized
into different clusters and assigned with numbers (1–5), as the first operation is depicted in
Figure 5. Apparently, some crack pixels are connected but have different cluster numbers. Therefore, the same operator needs to be applied again to combine these pixels. After the above steps, the final image matrix can be obtained. To calculate the length and width of a crack, it is necessary to determine the center pixel of each cluster formed by the preceding labelling operator with the parallel thinning algorithm. Details can be found in the work of Lee et al. [46].

Figure 5. Traditional DTM approach.

Through the calculation, it is found that when extracting the crack center pixel with the traditional parallel thinning algorithm, there are issues of local branches and loss at the crack ends, as illustrated in Figure 6a, which obviously leads to an inaccurate calculation of the crack length. Therefore, the morphological features of cracks are fully considered, and a new crack quantification algorithm is developed in this study, as shown in Figure 6b. By extracting the coordinates of the crack edge and performing an average calculation, the aforementioned limitations can be well addressed, and the final result is depicted in Figure 6c. The complete flowchart of the proposed surface feature quantification algorithm is shown in Figure 7, and the specific implementation process is as follows:

Figure 6. Schematic diagram of crack skeleton extraction: (a) traditional approach has local branches and crack end loss in thinning; (b) schematic diagram of the proposed method; (c) thinning results of the proposed method.

(1) Image preprocessing: The image matrix shown in Figure 5 based on the traditional
crack skeleton detection method is obtained, and the object contour set is found by applying
the “findContours” [47] function in OpenCV. The optimal contour, including the maximum
number of internal closed pixels, is determined for each region:

$$T = \sum_{x=1} \sum_{y=1} \Gamma(x, y) \qquad (6)$$

where T represents the number of pixels in the target area; Γ (x, y) represents the grayscale
value of the target; “0” and “1” for the background and target, respectively. Based on the
extracted contour pixel coordinates (x, y), the crack contour point set Ð is generated. Note
that the top left corner of the image is chosen as the coordinate origin.

Figure 7. Flowchart of the proposed crack skeleton extraction algorithm.

(2) Positive contour: The pixels in Ð are sorted counterclockwise to obtain the full contour point set Ē. The first half of Ē is specified as the positive contour point set Ēpos, as shown in Figure 8. For longitudinal cracks, the starting point is the upper left corner of the contour, and for transverse cracks, its lower left corner is the starting point (transverse and longitudinal cracks are distinguished based on the rotation angle R of the crack: R ≥ 45° for longitudinal and R < 45° for transverse).

Figure 8. Steps of the proposed crack skeleton extraction and quantification algorithm: (a) obtain positive and negative contours; (b) extract center point; (c) skeleton extraction results.

(3) Negative contour: Similar to the previous step, the pixels in Ð are sorted clockwise to obtain the full contour point set Ĝ. The first half of Ĝ is specified as the negative contour point set Ĝneg.
(4) Center pixels: For each longitudinal crack contour, all points in the positive contour Ēpos and negative contour Ĝneg are traversed to find pairs of points with equivalent
y-values (x-values for transverse cracks), and the centroid coordinates of each pair of points
are determined as:

$$X_c = x_p + \frac{|x_n - x_p|}{2}, \qquad Y_c = y_p \ \ (\text{if } y_p = y_n) \qquad (7)$$

where Xc and Yc represent coordinates of the pixels in Ĉ; xp , yp and xn , yn are coordinates
of the pixels in Ēpos and Ĝneg , respectively.
(5) Post-processing: Due to the irregularity of the crack contour, multiple centroids
may be obtained at sharply varying edges, such as the 7th, 8th, and 9th edge points of the
positive contour in Figure 8. The point with the smallest average Euclidean distance from
its neighboring points is retained, while the rest are removed. In this way, the final set of
centroids λ is obtained.
Eventually, the crack length L can be obtained by calculating and accumulating the
distance between each pair of neighboring pixels in the center point set λ:
$$L = \sum_{i=1} \sqrt{\left(X^c_{i+1} - X^c_i\right)^2 + \left(Y^c_{i+1} - Y^c_i\right)^2} \qquad (8)$$

where $X^c_{i+1}$, $Y^c_{i+1}$ and $X^c_i$, $Y^c_i$ are coordinates of the pixels in the center point set λ. Moreover,
the crack width W can be determined as:

$$W = \mathrm{Min}(\tau) \times 2 + 1 \qquad (9)$$

where τ represents the distance between the crack’s edge and the corresponding center pixel.
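To make the geometric steps concrete, the sketch below applies Equations (6)–(9) to a binary crack mask with OpenCV and NumPy for the longitudinal case (the transverse case swaps the roles of x and y). It is a simplified illustration that omits the rotation-angle test and the centroid post-processing of step (5); all function and variable names are assumptions rather than the authors' implementation.

```python
import cv2
import numpy as np

def quantify_longitudinal_crack(mask):
    """Sketch of Eqs. (6)-(9) on a binary mask (crack pixels = 1), longitudinal case."""
    # Step (1): extract the contour enclosing the largest number of crack pixels (cf. Eq. (6)).
    contours, _ = cv2.findContours(mask.astype(np.uint8), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    contour = max(contours, key=cv2.contourArea).reshape(-1, 2)   # (x, y) points

    # Steps (2)-(3): split the closed contour into two halves acting as positive/negative edges.
    half = len(contour) // 2
    pos, neg = contour[:half], contour[half:]

    # Step (4): pair edge points with equal y-values and average their x-coordinates (Eq. (7)).
    centers = []
    for y in np.intersect1d(pos[:, 1], neg[:, 1]):
        xp = pos[pos[:, 1] == y, 0].min()
        xn = neg[neg[:, 1] == y, 0].max()
        centers.append((xp + abs(xn - xp) / 2.0, float(y)))        # X_c = x_p + |x_n - x_p| / 2
    if not centers:
        return 0.0, 0.0
    centers = np.array(sorted(centers, key=lambda c: c[1]))

    # Eq. (8): crack length as the accumulated distance between neighbouring center points.
    length = float(np.sum(np.linalg.norm(np.diff(centers, axis=0), axis=1)))

    # Eq. (9): crack width from the minimum center-to-edge distance tau.
    edge = np.vstack([pos, neg]).astype(float)
    tau = min(np.min(np.linalg.norm(edge - c, axis=1)) for c in centers)
    width = tau * 2.0 + 1.0
    return length, width
```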

3. Implementation Details
In this section, the proposed method is compared with the state-of-the-art crack
identification, extraction, and quantification methods, respectively. It is worth noting that,
to increase the confidence of the evaluation, the PyTorch framework, which is consistent
with the original network [48], is used in this study, and both training and testing are based
on the widely used publicly available datasets.

3.1. Datasets
To obtain the stable weights, the YOLOv5 and modified Res-UNet models are required
to be pre-trained first, and the information of the dataset used for training is listed in Table 1.
Specifically, the training and validation images are from the Road Damage Detection (RDD)
dataset [49], while the test images are from Hunan University [50]. In this study, two types
of cracks are considered, namely transverse cracks and longitudinal cracks. It is worth
noting that the cracks used for training and validation in this study are mostly wide cracks.
This is due to the fact that the wide cracks (i.e., width > 2 mm) [31] are more harmful to
road structure and are more visible for collection. In preprocessing for YOLOv5 training, all images are resized to a resolution of 1280 × 1280 pixels. Among the 120 images used for testing,
the wide (width > 2 mm), medium (1 mm < width < 2 mm), and thin (width < 1 mm) cracks
are 70%, 20%, and 10%, respectively.
The training images for the modified Res-UNet model are taken from the publicly
available road crack dataset [51–53], which was gathered under various illumination
circumstances (including shadow, occlusion, low contrast, and noise). In preprocessing,
the images used for both training and validation are resized to 448 × 448 pixels. Testing
images are those cropped by bounding boxes generated by YOLOv5. All images in the
dataset for training and validation are chosen at random, and part of the image samples
are shown in Figure 9.
Table 1. Image dataset for crack detection and segmentation.

(a) YOLOv5
                     Training        Validation      Test
Number of images     2200            240             120
Resolution           1280 × 1280     1280 × 1280     1920 × 1080, 4032 × 3024

(b) Modified Res-UNet
                     Training        Validation      Test
Number of images     3800            360             120
Resolution           448 × 448       448 × 448       307 × 706, 908 × 129, et al.

Figure 9. Results of identifying and segmenting cracks based on the publicly available dataset images: (a) identifying cracks; and (b) segmenting cracks.
cracks.

The images used to test the YOLOv5, the modified Res-UNet, and the developed crack quantification algorithm are collected with an iPhone 12 equipped with a Feiyu Vimble 3 Handheld Gimbal [54], as shown in Figure 10a.

Figure 10. Devices for image collection and network training: (a) Handheld Gimbal: Feiyu Vimble 3; and (b) Deep learning server: Super Cloud R8428 G11.

3.2. Training Configuration


YOLOv5 and the modified Res-UNet are trained on a deep learning server (Super
Cloud R8428 G11) with six Nvidia GeForce RTX 3060 (12 GB of memory), as shown in
Figure 10b. The operating system is Ubuntu 20.04 with Pytorch 1.9.1, CUDA 11.0, and
CUDNN 8.04.
The hyperparameters for YOLOv5 are as follows: batch size (32), learning rate (0.001),
momentum (0.9), weight decay (0.0005), and training epoch (1000). The adaptive moment
estimation optimizer is employed in the training process. As for the modified Res-UNet,
the tuned hyperparameters are as follows: batch size (64), weight decay (0.0001), and the
stochastic gradient descent (SGD) optimizer.
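Expressed as PyTorch optimizer settings, the quoted hyperparameters correspond roughly to the sketch below; the placeholder modules and the Res-UNet learning rate are assumptions for illustration only, not the released training script.

```python
import torch
import torch.nn as nn

# Placeholder modules standing in for the YOLOv5 and modified Res-UNet networks.
yolo_net = nn.Conv2d(3, 16, 3)
resunet = nn.Conv2d(3, 16, 3)

# YOLOv5 settings quoted above: learning rate 0.001, momentum 0.9 (used here as Adam's beta1),
# weight decay 0.0005; trained for 1000 epochs with batch size 32.
yolo_opt = torch.optim.Adam(yolo_net.parameters(), lr=0.001, betas=(0.9, 0.999), weight_decay=0.0005)

# Modified Res-UNet settings: SGD with weight decay 0.0001 and batch size 64;
# the learning rate (0.01) is an assumed value, not stated in the text.
resunet_opt = torch.optim.SGD(resunet.parameters(), lr=0.01, weight_decay=0.0001)
```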

3.3. Evaluation Metrics


To evaluate the experimental results of YOLOv5, the modified Res-UNet, and the
developed quantification algorithm, five performance metrics are considered: mean average
precision (mAP), mean intersection over union (IoU), pixel accuracy (PA), Dice coefficient
(DICE), and root mean square (RMS) error. In particular, the average precision (AP)
represents the area under the precision–recall curve (P-R curve), while mAP represents the
average value of different categories of AP:

$$\mathrm{mAP} = \frac{\sum AP}{N} = \frac{1}{N} \sum \int_0^1 P(R)\, dR \qquad (10)$$
where P is the proportion of all predicted positive samples that are correctly detected, and
R is the proportion of all actual positive samples that are successfully detected; N refers
to the number of crack categories, in this study mainly transverse cracks and longitudinal
cracks are considered, thus the value is taken as 2. IoU is the ratio between the intersection
and union of the candidate boxes generated and the original marked boxes, which can be
expressed as:
$$\mathrm{IoU} = \frac{\mathrm{area}(T_a \cap T_b)}{\mathrm{area}(T_a \cup T_b)} \qquad (11)$$
where Ta represents the ground-truth crack pixels, and Tb denotes the predicted crack
pixels. PA is the number of correctly predicted pixels out of the total pixels, which can be
expressed as:
$$\mathrm{PA} = \frac{\sum_{i=0}^{k} p_{ii}}{\sum_{i=0}^{k} \sum_{j=0}^{k} p_{ij}} \qquad (12)$$

Dice coefficient is adopted to evaluate the ensemble similarity, as shown below:

$$\mathrm{DICE} = \frac{2TP}{FP + 2TP + FN} \qquad (13)$$
where TP represents the number of true pixels predicted as positive, FP is the number of
false pixels predicted as positive, and FN is the number of false pixels predicted as negative.
The value of DICE ranges from 0 to 1, with a higher value indicating better model performance.
RMS error can be determined as:
$$\mathrm{RMS\ error} = \sqrt{\frac{\sum_{i=1}^{k} (P - T)^2}{k}} \qquad (14)$$

where k is the total number of test images (120 in this study), P represents the quantification
result, and T represents the ground truth.
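For the pixel-level metrics, a direct NumPy computation from a predicted and a ground-truth binary mask might look like the sketch below (single-class case; mAP is omitted because it requires the detector's full precision–recall curve):

```python
import numpy as np

def segmentation_metrics(pred, target):
    """IoU (Eq. 11), PA (Eq. 12), and DICE (Eq. 13) for binary masks (sketch)."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    tp = np.logical_and(pred, target).sum()
    fp = np.logical_and(pred, ~target).sum()
    fn = np.logical_and(~pred, target).sum()
    iou = tp / float(tp + fp + fn)                   # intersection over union
    pa = (pred == target).sum() / float(pred.size)   # correctly classified pixels / all pixels
    dice = 2.0 * tp / float(fp + 2.0 * tp + fn)      # Dice coefficient
    return iou, pa, dice

def rms_error(predicted, ground_truth):
    """RMS error of Eq. (14) over k test images."""
    diff = np.asarray(predicted, dtype=float) - np.asarray(ground_truth, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))
```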

4. Experiment Results and Discussion


To validate the performance of the proposed approach, real road cracks with various
backgrounds are collected and tested in this section. Furthermore, the results are compared
with two state-of-the-art deep learning algorithms, YOLACT++ [55] and DeepLabv3+ [56].

4.1. Road Crack Detection


The test images are collected in different real scenes with clear backgrounds, shadows,
dark light, and lane lines, respectively. The results of the crack detection using the YOLOv5
model are shown in Figure 11, where all transverse cracks (labelled as "CrackT") and longitudinal cracks (labelled as "CrackL") are accurately detected with an mAP of 91%.

Figure 11. Outcomes of YOLOv5-based road crack detection: (a) Road crack I; (b) Road crack II: shadow; (c) Road crack III: dark light; and (d) Road crack IV: lane line.

4.2. Region Crack Segmentation

To segment and evaluate cracks from images, the image boxes containing cracks detected by YOLOv5 are fed into the modified Res-UNet. For the purpose of illustrating the efficiency of the modified Res-UNet, seven different UNet-based models for segmenting cracks, namely U-Net [57], Res-UNet [29], CrackUNet15 [58], CrackUNet19 [58], UNet-VGG19 [59], UNet-InceptionResNetv2 [59], and UNet-EfficientNetb3 [59], are selected for comparison based on the testing datasets. The comparison results are listed in Table 2, and the outcomes of the original Res-UNet and the modified Res-UNet are provided in Figure 12. It can be seen from Table 2 that the modified Res-UNet achieves the highest IoU, PA, and DICE. Specifically, the values of average IoU obtained by UNet, Res-UNet, CrackUNet15, CrackUNet19, UNet-VGG19, UNet-InceptionResNetv2, UNet-EfficientNetb3, and the modified Res-UNet are 78.63%, 80.30%, 83.89%, 84.78%, 84.53%, 83.98%, 84.36%, and 87.00%, respectively; and the DICE obtained by the modified Res-UNet was improved by 7.86%, 6.06%, 3.93%, 2.65%, 2.88%, 3.60%, and 3.13%, respectively. It can be seen from Figure 12 that the random interference noise is effectively reduced by the embedded attention gates, and there is a significant improvement in the crack-like feature detection using the different weight distribution methods. Specifically, the IoU of the cracks segmented by the Res-UNet with embedded attention gates is improved by 6.7% compared with the original model.
Table 2. The results of different UNet-based models on the test dataset.

Model                       Threshold   IoU (%)   PA (%)   DICE (%)
UNet                        0.5         78.63     90.41    85.28
Res-UNet                    0.5         80.30     92.06    87.08
CrackUNet15                 0.5         83.89     94.63    89.21
CrackUNet19                 0.5         84.78     95.86    90.49
UNet-VGG19                  0.5         84.53     95.41    90.26
UNet-InceptionResNetv2      0.5         83.98     94.72    89.54
UNet-EfficientNetb3         0.5         84.36     95.16    90.01
Modified Res-UNet           0.5         87.00     98.47    93.14

Figure 12. Comparisons of the original and modified Res-UNet to segment cracks from the original images.

In addition, the proposed approach is compared with two existing crack segmentation neural networks (YOLACT++ and DeepLabv3+). The YOLACT++ and DeepLabv3+ networks are trained with publicly available crack segmentation datasets [51–53], which contain noise such as moss on cracks, tile lines, etc. These two networks were tested with the same 120 images as the proposed method, and several typical examples are shown in Figure 13. These images are taken from different environments with a variety of backgrounds, including lawns, vehicles, buildings, and night scenes. As can be seen in the last column of Figure 13, the cracks can be accurately detected and segmented even under weak illumination conditions. Meanwhile, the details of the three methods are listed in Table 3. As shown in the table, the proposed approach achieved an average IoU of 87.00%, while YOLACT++ and DeepLabv3+ achieved average IoUs of only 48.02% and 57.14%, respectively. This indicates that the proposed method significantly outperforms the DeepLabv3+ and YOLACT++ networks for crack identification and segmentation in complex environments.

Figure 13. Comparisons of the proposed approach, YOLACT++, and DeepLabv3+ for segmenting cracks from images with complex backgrounds.

Table 3. The information of the three methods and the comparison of the evaluation metrics.

                     YOLACT++         DeepLabv3+       Proposed Approach
Training data        Public           Public           Public
Label type           Pixel mask       Pixel mask       Bounding box + Pixel mask
Testing data         Self-collected   Self-collected   Self-collected
Test data            120              120              120
PA (%)               63.24            72.32            98.47
DICE (%)             57.21            64.49            93.14
Average IoU (%)      48.02            57.14            87.00

4.3. Quantification of Crack Surface Feature

In this section, the segmented cracks are analyzed by the proposed quantification algorithm to
determine their width and length in terms of pixels. To evaluate the effectiveness of the proposed
algorithm in crack quantification, a self-made dataset containing the ground truth is constructed.
This dataset contains 100 binary images, each of which has 130 × 130 pixels. In order to better fit
the actual scene, different distances and angles between the camera head and the objects are also
considered. The results of the proposed algorithm are demonstrated as binary graphs in Figure 14,
where the black pixels represent the extracted crack edges, the green pixels represent the results of
the thinning algorithm [46], and the orange pixels represent the results of this study. In addition, all
the identification results of the proposed algorithm are compared with the ground truth, as shown in
Table 4. The values in the table represent the minimum crack width, the maximum crack width, and
the crack length. It is shown in Table 4 that the proposed algorithm has a very low error in terms of
both width and length identification. Compared with the ground truth, the overall accuracy and total
RMS error of the developed algorithm are 95% and 2.1 pixels for length and width, respectively, while
the conventional thinning algorithm is only 92% accurate. This is due to the large error in the
calculation of crack length by the conventional thinning method, as shown in Figure 15. It can be seen
from Figure 15 that the method proposed in this study effectively avoids local branching and end loss
when extracting the crack skeleton.

Figure 14. Crack quantification results of the proposed algorithm: (a–h) represent different crack instances.
Table 4. Comparison of ground truth and the results of the proposed algorithm.

Instance Ground Truth Predicted Result Error


Crack-1 (3, 7, 152) * (2, 7, 150) (1, 0, 2)
Crack-2 (4, 11, 224) (4, 11, 223) (0, 0, 1)
Crack-3 (3, 17, 267) (2, 18, 264) (1, 1, 3)
Crack-4 (4, 26, 110) (4, 24, 105) (0, 2, 5)
Crack-5 (2, 20, 145) (2, 19, 143) (0, 1, 2)
Crack-6 (2, 11, 201) (2, 10, 197) (0, 1, 4)
Crack-7 (3, 134) (3, 131) (0, 3)
Crack-8 (8, 17, 297) (7, 19, 296) (1, 2, 1)
Crack-9 (14, 21, 129) (16, 19, 126) (2, 2, 3)
Crack-10 (9, 33, 276) (9, 31, 274) (0, 2, 2)
Crack-11 (7, 8, 277) (5, 8, 275) (2, 0, 2)
Crack-12 (4, 13, 56) (3, 12, 57) (1, 1, 1)
Crack-13 (3, 9, 298) (3, 8, 297) (0, 1, 1)
Crack-14 (9, 17, 335) (7, 15, 332) (2, 2, 3)
Crack-15 (4, 7, 227) (4, 8, 223) (0, 1, 4)
Crack-16 (8, 3, 124) (8, 3, 122) (0, 0, 2)
Crack-17 (4, 7, 378) (4, 8, 374) (0, 1, 4)
Crack-18 (12, 9, 194) (13, 9, 195) (1, 0, 1)
Crack-19 (5, 6, 325) (5, 7, 321) (0, 1, 4)
Crack-20 (3, 4, 102) (3, 5, 100) (0, 1, 2)
* The values in the table represent the minimum crack width, the maximum crack width, and the crack length.
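The summary statistics reported above can be recomputed from (minimum width, maximum width, length) tuples such as those in Table 4. The sketch below shows one plausible reading of these metrics, taking the length accuracy as the mean relative agreement between predicted and true lengths and pooling both width columns into a single RMS error; the exact definitions used in this study may differ, and the variable names are arbitrary.

```python
import numpy as np

def summarize_errors(ground_truth, predicted):
    """Length accuracy (%) and pooled width RMS error (pixels) for (w_min, w_max, length) tuples."""
    gt = np.asarray(ground_truth, dtype=float)
    pr = np.asarray(predicted, dtype=float)
    length_accuracy = 100.0 * np.mean(1.0 - np.abs(pr[:, 2] - gt[:, 2]) / gt[:, 2])
    width_errors = np.concatenate([pr[:, 0] - gt[:, 0], pr[:, 1] - gt[:, 1]])
    width_rmse = float(np.sqrt(np.mean(width_errors ** 2)))
    return float(length_accuracy), width_rmse

# First three rows of Table 4 as an example
gt = [(3, 7, 152), (4, 11, 224), (3, 17, 267)]
pr = [(2, 7, 150), (4, 11, 223), (2, 18, 264)]
print(summarize_errors(gt, pr))   # high length accuracy and a sub-pixel width RMS error for these rows
```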

Figure 15. Comparison between the traditional thinning method and the proposed algorithm in this study when extracting the crack skeleton: (a) Road images; (b) Traditional thinning results; and (c) Our thinning results.

4.4. Limitations and Future Discussion

The proposed framework performs well in the identification, segmentation, and measurement of
road cracks in complex environments. Depending on various requirements in practice, the proposed
method can either perform the crack detection task alone or directly detect cracks and segment them.
It takes about 42 ms per 640 × 640 sized image
for YOLOv5, while it takes only 0.64 s per 100 × 100 sized bounding box for the modified
Res-UNet. However, the proposed framework indeed has some limitations as follows:
(1) Although the cropped images provided by YOLOv5 have 91% accuracy, the remaining
9% may produce poor results in the modified Res-UNet; therefore, hyperparameter tuning
of the network is required. (2) In terms of accuracy, a minimum width of cracks in the
images should be guaranteed to be greater than two pixels when using the proposed
framework; therefore, pre-processing of the collected images is required. (3) Our current
research is at the pixel level, and the distance mapping relationship between the real world
and digital images is our next research focus.
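In deployment terms, the framework is a two-stage pipeline: the detector proposes crack bounding boxes, and segmentation and quantification are run only inside those boxes, which is what keeps the per-image cost low. The sketch below illustrates that flow; `detector`, `segmenter`, and `quantifier` are placeholders for the trained YOLOv5 model, the modified Res-UNet, and a quantification routine such as the one sketched in Section 4.3, and their call signatures are assumptions rather than the actual project API.

```python
import cv2
import numpy as np

def inspect_road_image(image_path, detector, segmenter, quantifier, conf_thres=0.5):
    """Two-stage crack inspection sketch: detect crack regions, then segment and quantify each.

    `detector`, `segmenter`, and `quantifier` are placeholders for the trained models and the
    quantification routine; their signatures here are assumptions, not the project API.
    """
    image = cv2.imread(image_path)
    results = []
    for (x1, y1, x2, y2, score) in detector(image):      # detection stage (~42 ms per 640 x 640 image)
        if score < conf_thres:
            continue
        x1, y1, x2, y2 = map(int, (x1, y1, x2, y2))
        crop = image[y1:y2, x1:x2]
        prob = segmenter(crop)                            # pixel-level segmentation inside the box
        mask = (prob >= 0.5).astype(np.uint8)
        length, w_min, w_max = quantifier(mask)           # surface-feature quantification (Section 4.3)
        results.append({"box": (x1, y1, x2, y2),
                        "length_px": length,
                        "width_px": (w_min, w_max)})
    return results
```

Depending on the application, the loop body can stop after the detection stage (crack localization only) or run all three stages, matching the two usage modes described above.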

5. Conclusions
To achieve an accurate assessment of road cracks under complex backgrounds, an
integrated framework that combines crack detection, segmentation, and quantification is
proposed in the present study. Crack regions were first detected with YOLOv5, then fed
into the modified Res-UNet model for crack segmentation, and finally the width and length
of the cracks were extracted based on the proposed crack quantification algorithm. Based
on the identification results, the following conclusions are obtained:

(1) The proposed method can accurately detect cracks at the pixel level and shows good robustness
under the interference of darkness, shadows, and various noises;
(2) The accuracy of Res-UNet for segmenting cracks is effectively improved by embedding an attention
gate and proposing a new combined loss function, and the IoU of the segmented cracks is improved
by 6.7%;
(3) Compared with YOLACT++ and DeepLabv3+, the proposed method shows higher accuracy for
crack segmentation under complex backgrounds, with an mAP of 91% and an average IoU of 87%;
(4) The developed crack quantification algorithm can effectively reduce local branching and crack end
loss, and improves the accuracy of measuring the length of cracks by 3% compared with the
traditional method.
In summary, the proposed integrated method makes contributions by boosting the
efficiency of segmentation and quantification of road cracks when the background is full
of other objects (e.g., vehicles, buildings, and plants). The cost of the proposed method is much lower
than that of an inspection vehicle [60], at roughly 4% of the latter. However, there are some limitations
to this study. First, there
is a lot of room for accuracy improvement to achieve reliable crack inspection in real
applications. Second, the tests were mainly conducted with cracks that were obvious and
large-scale. In addition to complex backgrounds, future studies can explore cracks with
more complex features.

Author Contributions: Conceptualization, L.D. and Y.L.; methodology, L.D. and Y.L.; validation,
L.D. and A.Z.; formal analysis, L.D. and J.G.; investigation, Y.L. and A.Z.; writing—original draft
preparation, A.Z. and J.G.; writing—review and editing, L.D. and Y.L.; visualization, A.Z.; supervision,
Y.L. All authors have read and agreed to the published version of the manuscript.
Funding: This work is supported by the National Natural Science Foundation of China (No.52278177)
and the Science and Technology Innovation Leader Project of Hunan Province, China (No. 2021RC4025).
Data Availability Statement: The data presented in this study are available from the corresponding author.
Acknowledgments: We would like to express our gratitude to the editor and reviewers for their
valuable comments.
Conflicts of Interest: The authors declare no conflict of interest.

References
1. National Bureau of Statistics. National Data. Available online: https://data.stats.gov.cn/ (accessed on 1 January 2022).
2. The State Council. Policy Analyzing. Available online: http://www.gov.cn/zhengce/2022-05/11/content_5689580.htm (accessed
on 11 May 2022).
3. Ministry of Transport and Logistic Services. Road Maintenance. Available online: https://mot.gov.sa/en/Roads/Pages/
RoadsMaintenance.aspx (accessed on 15 September 2022).
4. Kee, S.-H.; Zhu, J. Using Piezoelectric Sensors for Ultrasonic Pulse Velocity Measurements in Concrete. Smart Mater. Struct. 2013,
22, 115016. [CrossRef]
5. Zoidis, N.; Tatsis, E.; Vlachopoulos, C.; Gotzamanis, A.; Clausen, J.S.; Aggelis, D.G.; Matikas, T.E. Inspection, Evaluation and
Repair Monitoring of Cracked Concrete Floor Using NDT Methods. Constr. Build. Mater. 2013, 48, 1302–1308. [CrossRef]
6. Li, J.; Deng, J.; Xie, W. Damage Detection with Streamlined Structural Health Monitoring Data. Sensors 2015, 15, 8832–8851.
[CrossRef] [PubMed]
7. Dery, L.; Jelnov, A. Privacy–Accuracy Consideration in Devices that Collect Sensor-Based Information. Sensors 2021, 21, 4684.
[CrossRef] [PubMed]
8. Jiang, S.; Zhang, J.; Wang, W.; Wang, Y. Automatic Inspection of Bridge Bolts Using Unmanned Aerial Vision and Adaptive Scale
Unification-Based Deep Learning. Remote Sens. 2023, 15, 328. [CrossRef]
9. Fiorentini, N.; Maboudi, M.; Leandri, P.; Losa, M.; Gerke, M. Surface Motion Prediction and Mapping for Road Infrastructures
Management by PS-Insar Measurements and Machine Learning Algorithms. Remote Sens. 2020, 12, 3976. [CrossRef]
10. Zhu, Y.; Tang, H. Automatic Damage Detection and Diagnosis for Hydraulic Structures Using Drones and Artificial Intelligence
Techniques. Remote Sens. 2023, 15, 615. [CrossRef]
11. Chen, M.; Tang, Y.; Zou, X.; Huang, K.; Li, L.; He, Y. High-Accuracy Multi-Camera Reconstruction Enhanced by Adaptive Point
Cloud Correction Algorithm. Opt. Lasers Eng. 2019, 122, 170–183. [CrossRef]

12. Al Duhayyim, M.; Malibari, A.A.; Alharbi, A.; Afef, K.; Yafoz, A.; Alsini, R.; Alghushairy, O.; Mohsen, H. Road Damage Detection
Using the Hunger Games Search with Elman Neural Network on High-Resolution Remote Sensing Images. Remote Sens. 2022, 14,
6222. [CrossRef]
13. Lee, T.; Yoon, Y.; Chun, C.; Ryu, S. CNN-Based Road-Surface Crack Detection Model that Responds to Brightness Changes.
Electronics 2021, 10, 1402. [CrossRef]
14. Zhong, J.; Zhu, J.; Huyan, J.; Ma, T.; Zhang, W. Multi-Scale Feature Fusion Network for Pixel-Level Pavement Distress Detection.
Autom. Constr. 2022, 141, 104436. [CrossRef]
15. Hacıefendioğlu, K.; Başağa, H.B. Concrete Road Crack Detection Using Deep Learning-Based Faster R-Cnn Method. Iran. J. Sci.
Technol. Trans. Civ. Eng. 2022, 46, 1621–1633. [CrossRef]
16. Que, Y.; Dai, Y.; Ji, X.; Leung, A.K.; Chen, Z.; Tang, Y.; Jiang, Z. Automatic Classification of Asphalt Pavement Cracks Using A
Novel Integrated Generative Adversarial Networks and Improved Vgg Model. Eng. Struct. 2023, 277, 115406. [CrossRef]
17. Du, Y.; Pan, N.; Xu, Z.; Deng, F.; Shen, Y.; Kang, H. Pavement Distress Detection and Classification Based on YOLO Network. Int.
J. Pavement Eng. 2021, 22, 1659–1672. [CrossRef]
18. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. Yolov6: A Single-Stage Object Detection
Framework for Industrial Applications. arXiv 2022, arXiv:2209.02976.
19. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object
Detectors. arXiv 2022, arXiv:2207.02696.
20. Ultralytics. Yolov8. Available online: https://github.com/ultralytics/ultralytics (accessed on 12 January 2023).
21. Liu, C.; Sui, H.; Wang, J.; Ni, Z.; Ge, L. Real-Time Ground-Level Building Damage Detection Based on Lightweight and Accurate
Yolov5 Using Terrestrial Images. Remote Sens. 2022, 14, 2763. [CrossRef]
22. Shokri, P.; Shahbazi, M.; Nielsen, J. Semantic Segmentation and 3d Reconstruction of Concrete Cracks. Remote Sens. 2022, 14, 5793. [CrossRef]
23. An, Q.; Chen, X.; Wang, H.; Yang, H.; Yang, Y.; Huang, W.; Wang, L. Segmentation of Concrete Cracks by Using Fractal Dimension
and Uhk-Net. Fractal Fract. 2022, 6, 95. [CrossRef]
24. Zhang, Y.; Fan, J.; Zhang, M.; Shi, Z.; Liu, R.; Guo, B. A Recurrent Adaptive Network: Balanced Learning for Road Crack
Segmentation with High-Resolution Images. Remote Sens. 2022, 14, 3275. [CrossRef]
25. Yong, P.; Wang, N. RIIAnet: A Real-Time Segmentation Network Integrated with Multi-Type Features of Different Depths for
Pavement Cracks. Appl. Sci. 2022, 12, 7066. [CrossRef]
26. Sun, X.; Xie, Y.; Jiang, L.; Cao, Y.; Liu, B. DMA-Net: Deeplab with Multi-Scale Attention for Pavement Crack Segmentation. IEEE
Trans. Intell. Transp. Syst. 2022, 23, 18392–18403. [CrossRef]
27. Shen, Y.; Yu, Z.; Li, C.; Zhao, C.; Sun, Z. Automated Detection for Concrete Surface Cracks Based on Deeplabv3+ BDF. Buildings
2023, 13, 118. [CrossRef]
28. Ji, A.; Xue, X.; Wang, Y.; Luo, X.; Xue, W. An Integrated Approach to Automatic Pixel-Level Crack Detection and Quantification of
Asphalt Pavement. Autom. Constr. 2020, 114, 103176. [CrossRef]
29. Zhang, Z.; Liu, Q.; Wang, Y. Road Extraction by Deep Residual U-Net. IEEE Geosci. Remote Sens. Lett. 2018, 15, 749–753. [CrossRef]
30. Zhu, Z.; German, S.; Brilakis, I. Visual Retrieval of Concrete Crack Properties for Automated Post-Earthquake Structural Safety
Evaluation. Autom. Constr. 2011, 20, 874–883. [CrossRef]
31. Kang, D.; Benipal, S.S.; Gopal, D.L.; Cha, Y.-J. Hybrid Pixel-Level Concrete Crack Segmentation and Quantification Across
Complex Backgrounds Using Deep Learning. Autom. Constr. 2020, 118, 103291. [CrossRef]
32. Tang, Y.; Huang, Z.; Chen, Z.; Chen, M.; Zhou, H.; Zhang, H.; Sun, J. Novel Visual Crack Width Measurement Based on Backbone
Double-Scale Features for Improved Detection Automation. Eng. Struct. 2023, 274, 115158. [CrossRef]
33. Zhang, T.Y.; Suen, C.Y. A Fast Parallel Algorithm for Thinning Digital Patterns. Commun. ACM 1984, 27, 236–239. [CrossRef]
34. Guizilini, V.; Li, J.; Ambrus, R.; Gaidon, A. Geometric Unsupervised Domain Adaptation for Semantic Segmentation. In
Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021;
pp. 8537–8547.
35. Toldo, M.; Michieli, U.; Zanuttigh, P. Unsupervised Domain Adaptation in Semantic Segmentation via Orthogonal and Clustered
Embeddings. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–7
January 2021; pp. 1358–1368.
36. Stan, S.; Rostami, M. Unsupervised Model Adaptation for Continual Semantic Segmentation. In Proceedings of the AAAI
Conference on Artificial Intelligence, online, 2–9 February 2021; pp. 2593–2601.
37. Marsden, R.A.; Wiewel, F.; Döbler, M.; Yang, Y.; Yang, B. Continual Unsupervised Domain Adaptation for Semantic Segmentation
Using A Class-Specific Transfer. In Proceedings of the 2022 International Joint Conference on Neural Networks (IJCNN), Padua,
Italy, 18–23 July 2022; pp. 1–8.
38. Zhu, J.; Zhong, J.; Ma, T.; Huang, X.; Zhang, W.; Zhou, Y. Pavement Distress Detection Using Convolutional Neural Networks
with Images Captured via UAV. Autom. Constr. 2022, 133, 103991. [CrossRef]
39. Ruiqiang, X. YOLOv5s-GTB: Light-Weighted and Improved Yolov5s for Bridge Crack Detection. arXiv 2022, arXiv:2206.01498.
40. Jing, Y.; Ren, Y.; Liu, Y.; Wang, D.; Yu, L. Automatic Extraction of Damaged Houses by Earthquake Based on Improved YOLOv5:
A case study in Yangbi. Remote Sens. 2022, 14, 382. [CrossRef]
41. Ultralytics. Yolov5. Available online: https://github.com/ultralytics/yolov5 (accessed on 17 January 2022).

42. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; Volume 1,
pp. 2117–2125.
43. Li, H.; Xiong, P.; An, J.; Wang, L. Pyramid Attention Network for Semantic Segmentation. arXiv 2018, arXiv:1805.10180.
44. Oktay, O.; Schlemper, J.; Folgoc, L.L.; Lee, M.; Heinrich, M.; Misawa, K.; Mori, K.; McDonagh, S.; Hammerla, N.Y.; Kainz, B.
Attention U-Net: Learning Where to Look for the Pancreas. arXiv 2018, arXiv:1804.03999.
45. Hou, H.; Lan, C.; Xu, Q.; Lv, L.; Xiong, X.; Yao, F.; Wang, L. Attention-Based Matching Approach for Heterogeneous Remote
Sensing Images. Remote Sens. 2023, 15, 163. [CrossRef]
46. Lee, T.C.; Kashyap, R.L.; Chu, C.N. Building Skeleton Models via 3-D Medial Surface Axis Thinning Algorithms. Graph. Models
Image Process. 1994, 56, 462–478. [CrossRef]
47. Home—OpenCV. Available online: https://opencv.org (accessed on 1 November 2021).
48. Pytorch. Available online: https://pytorch.org/ (accessed on 15 June 2021).
49. Arya, D.; Maeda, H.; Ghosh, S.K.; Toshniwal, D.; Sekimoto, Y. RDD2020: An Annotated Image Dataset for Automatic Road
Damage Detection Using Deep Learning. Data Brief 2021, 36, 107133. [CrossRef] [PubMed]
50. Road-Crack-Images-Test. Available online: https://www.kaggle.com/datasets/andada/road-crack-imagestest (accessed on
25 January 2023).
51. Eisenbach, M.; Stricker, R.; Seichter, D.; Amende, K.; Debes, K.; Sesselmann, M.; Ebersbach, D.; Stoeckert, U.; Gross, H.-M. How to
Get Pavement Distress Detection Ready for Deep Learning? A Systematic Approach. In Proceedings of the 2017 International
Joint Conference on Neural Networks (IJCNN), Anchorage, AK, USA, 14–19 May 2017; pp. 2039–2047. [CrossRef]
52. Shi, Y.; Cui, L.; Qi, Z.; Meng, F.; Chen, Z. Automatic Road Crack Detection Using Random Structured Forests. IEEE Trans. Intell.
Transp. Syst. 2016, 17, 3434–3445. [CrossRef]
53. Zou, Q.; Cao, Y.; Li, Q.; Mao, Q.; Wang, S. Cracktree: Automatic Crack Detection from Pavement Images. Pattern Recognit. Lett.
2012, 33, 227–238. [CrossRef]
54. Feiyu. Vimble 3. Available online: https://www.feiyu-tech.cn/vimble-3/ (accessed on 23 March 2022).
55. Bolya, D.; Zhou, C.; Xiao, F.; Lee, Y.J. YOLACT++: Better Real-Time Instance Segmentation. IEEE Trans. Pattern Anal. Mach. Intell.
2022, 44, 1108–1121. [CrossRef] [PubMed]
56. Chen, L.C.; Zhu, Y.; Papandreou, G. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In
Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 801–818.
57. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image
Computing and Computer-Assisted Intervention—MICCAI 2015; Springer: Cham, Switzerland, 2015; pp. 234–241.
58. Zhang, L.; Shen, J.; Zhu, B. A Research on An Improved Unet-Based Concrete Crack Detection Algorithm. Struct. Health Monit.
2021, 20, 1864–1879. [CrossRef]
59. Liu, F.; Wang, L. Unet-Based Model for Crack Detection Integrating Visual Explanations. Constr. Build. Mater. 2022, 322, 126265. [CrossRef]
60. Radopoulou, S.C.; Brilakis, I. Automated Detection of Multiple Pavement Defects. J. Comput. Civ. Eng. 2017, 31, 04016057. [CrossRef]

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.
