
Journal of Cleaner Production 407 (2023) 136968

Modeling the chloride migration of recycled aggregate concrete using ensemble learners for sustainable building construction
Emadaldin Mohammadi Golafshani a,*, Alireza Kashani a, Ali Behnood b, Taehwan Kim a
a School of Civil and Environmental Engineering, University of New South Wales, Sydney, NSW, 2052, Australia
b Indiana Department of Transportation, IN, USA


Handling Editor: Zhen Leng

Keywords: Recycled aggregate concrete; Rapid chloride migration test; Ensemble learners; SHAP analysis

The use of supplementary cementitious materials such as slag and recycled aggregate in concrete can mitigate some of the negative environmental impacts of using virgin materials. However, the durability of recycled aggregate concrete (RAC) and its resistance to harsh environmental conditions such as chloride penetration must be investigated before practical applications. The rapid chloride migration test (RCMT) is one of the well-established tests that can provide valuable estimations of the concrete quality against chloride penetration. RCMT coupled with machine learning techniques can lead to reliable models, which could save time, cost, and materials and reduce the need for skilled technicians. In this study, five homogeneous ensemble learners, including two types of bagging and three types of boosting techniques, were developed to model the RCMT output using a comprehensive database collected from the literature. Different types of analysis, including statistical measures, SHapley Additive exPlanations (SHAP) sensitivity analysis, SHAP parametric study, and comparison study, were conducted to examine the performance of the developed models and the effects of the input features on predictions. The results show that the developed extreme gradient boosting learner, with a mean absolute percentage error of about 9%, possesses excellent capability for modeling the RCMT of RAC. Besides, the RCMT testing age is the most influential factor affecting the RCMT output, followed by the amounts of natural fine aggregate and superplasticizer. Finally, a graphical user interface (GUI) was designed, which allows the users to insert the input features and obtain the RCMT output in a user-friendly environment.

1. Introduction

The consumption of virgin materials in concrete mixtures, as well as the greenhouse gas emissions induced by cement production, are two of the main factors making the construction industry one of the major environmental pollutants worldwide. Many endeavors have been conducted to reduce the devastating effects of this industry (Chen et al., 2023). Partially replacing cement with supplementary cementitious materials (SCMs) has been one of the effective solutions for producing concretes with better long-term mechanical properties and durability performance (Pazouki et al., 2021). As a by-product of iron manufacturing, ground granulated blast furnace slag (called slag in this study) is used as an SCM in concrete (Tahwia et al., 2022). Because slag contains high contents of calcium oxide and silicon dioxide, it has both pozzolanic and cementing properties. This, in turn, can lead to a denser interfacial transition zone (ITZ) through mechanisms such as converting calcium hydroxide to calcium silicate hydrate products. Thus, concrete mixtures containing slag can be expected to have higher strength and reduced permeability, especially in the long term, while being more economical compared to control mixtures (Shah and Huseien, 2020).

On the other hand, the partial or full replacement of virgin coarse aggregates with recycled coarse aggregates (RCAs) obtained from the demolition of concrete structures is another sustainable approach to reducing the pressure on the consumption of natural resources (Moghaddas et al., 2022). However, because of the presence of residual mortar in RCA, it has lower density and higher water absorption compared to virgin quarry aggregates (Rahal, 2007). This can lead to concretes with inferior properties and different behavior compared to normal concrete (Bahraq et al., 2022). Numerous attempts have been carried out to scrutinize the influence of RCA on the mechanical properties and durability performance of recycled aggregate concrete (RAC) (Guo et al., 2018). There is a general agreement in research studies indicating that the increase in the RCA contents and RCA water

E. Mohammadi Golafshani et al. Journal of Cleaner Production 407 (2023) 136968

Nomenclature

RAC  Recycled aggregate concrete
RCMT  Rapid chloride migration test
SHAP  SHapley Additive exPlanations
SCM  Supplementary cementitious material
RCA  Recycled coarse aggregate
RCPT  Rapid chloride penetration test
DC  Direct current
CMC  Chloride migration coefficient
ML  Machine learning
C  Cement
S  Slag
W  Water
NCA  Natural coarse aggregate
NFA  Natural fine aggregate
SP  Superplasticizer
RWA  Recycled coarse aggregate water absorption
CS  Compressive strength at 28 days
TA  Testing age
RAR  Recycled coarse aggregate ratio
DTR  Decision tree regressor
SDR  Standard deviation reduction
RFL  Random forest learner
BETL  Bagged extra trees learner
ABL  Adaptive boost learner
GBL  Gradient boosting learner
XGBL  Extreme gradient boosting learner
RMSE  Root mean squared error
RRMSE  Relative RMSE
MAE  Mean absolute error
R-squared  Coefficient of determination
PI  Performance index

absorption negatively impacts the mechanical properties and durability performance of RAC (Guo et al., 2018).

Among various types of chemical attacks that a concrete structure can experience during its service life, chloride penetration is the dominant damage that causes the corrosion of steel bars and deteriorates the mechanical performance of reinforced concrete structures (Liang et al., 2021). Environmental factors such as high temperature and relative humidity, as well as harsh exposure zones, including submerged, tidal, splash, and salt spray zones, exacerbate chloride ion penetration into concrete and reduce the service life of reinforced concrete structures. The chloride ingress into concrete is a long process that may occur over the course of several years. However, to simulate this process in the laboratory, a few standardized short-term chloride penetration resistance tests have been proposed to investigate the concrete quality against chloride ion penetration. The rapid chloride penetration test (RCPT), rapid chloride migration test (RCMT), resistivity test, and pressure penetration test are the most popular short-term laboratory-based tests of chloride penetration resistance of concrete. RCPT and RCMT are two well-established short-term chloride penetration resistance tests designed based on imposing a direct current (DC) voltage on concrete specimens exposed to chemical solutions. There are several criticisms of the RCPT, including chloride ion movement, the role of other ions like hydroxide on the results, the non-steady-state migration measurement, and concrete sample heating because of the high level of applied voltage (Bagheri and Zanganeh, 2012). In addition, the use of SCMs in concrete mixtures reduces the hydroxide ion concentration in the pore solution, so that a higher resistance to chloride ion penetration is observed by the RCPT measurement. To address these shortcomings, the RCMT was proposed, which has a good correlation with the results of long-term chloride penetration tests (Bagheri and Zanganeh, 2012). In RCMT, as per NT Build 492 (NT Build 492, 1999), a standard concrete specimen is exposed to limewater on one side and 3% NaCl on the other side under an applied voltage, and the chloride migration coefficient (CMC) is calculated using the Nernst-Einstein equation (NT Build 492, 1999). Although this experimental test can provide reliable information on the durability and chloride penetration resistance of concretes, the preparation of concrete specimens and apparatus, as well as the experimental process of the RCMT, are time and resource intensive.

Several studies have been conducted in the last few years to model the mechanical properties and durability of various types of concretes using machine learning (ML) techniques (Behnood and Golafshani, 2021). Ensemble ML techniques, by combining several base ML models, have been employed successfully in previous years because of their highly reliable predictions. Examples of previous studies for modeling the properties of RAC using ensemble techniques are listed in Table 1. All studies given in this table declare that the ensemble models outperform other types of base ML models. Regarding the chloride penetration resistance of RAC, K. H. Liu et al. (2022) developed several individual and ensemble ML models for predicting the charge passed in the rapid chloride penetration test (RCPT) using 226 experimental data gathered from the literature. They concluded that the developed gradient-boosting decision tree is the best predictive model compared to other developed ML models. However, while developing the gradient-boosting decision tree, they did not investigate the effect of some hyperparameters (e.g., the maximum tree depth), which can significantly affect the accuracy of the developed model. For the chloride penetration resistance of concrete mixtures containing SCMs, Quan Tran (2022) employed several ML techniques, modeled the chloride diffusion coefficient of concrete using 127 data samples, and claimed that the developed gradient boosting model had the best accuracy among all other developed ML models. However, the developed models were built with the default hyperparameter values given in the sklearn library of Python, not the optimal values.

Table 1
The ensemble models used for modeling the mechanical properties and durability of RAC.
Concrete type | Modeled property | Number of data | Ref
Self-compacting RAC | Compressive strength | 515 | (de-Prado-Gil et al., 2022)
RAC with slag and fly ash | RCPT | 226 | (K. H. Liu et al., 2022)
RAC without SCMs | Sulfate resistance | 143 | (K. Liu et al., 2022)
RAC with slag and fly ash | Elastic modulus | 526 | Han et al. (2020)
RAC without SCMs | Compressive strength | 721 | Quan Tran et al. (2022)
RAC with slag | Compressive strength | 126 | Imran et al. (2022)
RAC without SCMs | Compressive strength | 209 | Duan et al. (2020)
RAC without SCMs | Compressive and flexural strengths | 638–139 | Yuan et al. (2022)
RAC with silica fume, slag, fly ash, and metakaolin | Carbonation depth | 713 | Nunez and Nehdi (2021)

A considerable number of experimental studies have been carried out to investigate the chloride penetration of RAC using RCMT. Collecting the data from relevant studies and modeling the RCMT results of RAC


using ML techniques can provide several advantages, including (1) development of a unique predictive model integrating the individual studies, (2) reduction of the effect of noisy data of the experimental studies on predictions, (3) generation of a reliable durability prediction model for engineers, and (4) determination of the most influential factors affecting the chloride migration for optimizing RAC mix design. Because of the better performance of the ensemble models compared to individual ML models, two types of bagging ensemble techniques and three types of boosting ensemble techniques were employed to model the RCMT of RAC. In addition, the influential hyperparameters of the ensemble techniques were tuned in order to achieve accurate models. This paper is organised into the following sections. First, the description of the gathered data is given in Section 2. Then, Section 3 presents the research methodology used in this study. Finally, the results and discussion are presented in Section 4, followed by the summarized findings of this study in Section 5.

2. Data description

A comprehensive database of the experimental observations of RCMT, as one of the well-established accelerated chloride penetration tests, was gathered from the literature. The total number of samples gathered for the experimental observations of RCMT is 227, obtained from 10 scientific studies. The references and the sample IDs used for developing ML models in this study are given in Appendix A. For modeling the RCMT of RAC, the contents (in kg/m³) of cement (C), slag (S), water (W), natural coarse aggregate (NCA), natural fine aggregate (NFA), recycled coarse aggregate (RCA), and superplasticizer (SP), as well as the values of recycled coarse aggregate water absorption (RWA), compressive strength (MPa) at 28 days (CS), and the testing age in days (TA), were selected as the candidate input features of the ML models. The linear correlation coefficients of the initial candidate input features are illustrated in Fig. 1(a), indicating that NCA and RCA are highly correlated. Hence, the RCA ratio (RAR) was defined as the ratio of the amount of RCA to the total amount of coarse aggregate, and it replaced NCA and RCA in the final input features to reduce the multi-collinearity risk. As shown in Fig. 1(b), the correlation coefficients between the final input features are less than 0.7, denoting a low multi-collinearity risk in the defined system. The output of the RCMT system is the chloride migration coefficient (CMC) of RAC. The negative linear correlation value (−0.35 in Fig. 1(b)) of slag with the output indicates the positive influence of this SCM in decreasing the chloride migration of RAC. Moreover, the adverse effect of RCA on chloride migration can be observed from the positive linear correlation values of RAR (0.35 in Fig. 1(b)) and RWA (0.27 in Fig. 1(b)) with the output.

The statistical parameters of the input and output features in the RCMT database are given in Table 2. More than 55% and 29% of samples in the gathered database include RCA and slag, respectively. This shows the importance of slag as an SCM in concrete mixtures for the chloride migration reduction of RAC in past studies. Besides, the mean and maximum values of the compressive strength of RAC show the promise of achieving high-performance RAC, especially in practical applications. Fig. 2 visualizes the contour plots between the input and output features using the probability distributions. As shown in this figure, increasing slag, NFA, SP, CS, and TA generally decreases the CMC of RAC. By contrast, W, RAR, and RWA tend to increase the CMC of RAC, which is an adverse effect. However, the general trend is not apparent regarding C, and more investigations are required. The density of C contents is higher in the range of almost [330 kg/m³, 450 kg/m³], and the corresponding CMC values are in the range of about [10 × 10⁻¹² m²/s, 14 × 10⁻¹² m²/s]. The probability distributions of S and RAR are denser around zero, indicating a zero amount of S and RAR in a significant number of data records in the database. The probability distribution of W is higher between about 140 kg/m³ and 160 kg/m³, and the corresponding CMC values vary between almost 9 × 10⁻¹² m²/s and 14 × 10⁻¹² m²/s. For NFA, two spots with high density can be observed, a small spot with a center of almost 700 kg/m³ and a big spot with a center of about 1000 kg/m³. The big spot has a lower CMC compared to the small spot, denoting that a higher amount of NFA can reduce chloride migration. In the case of RWA, two dense regions can be seen. The first region, with a center of zero, is related to data records without RCA, and the second region, with a center of almost 6.8%, corresponds to the data records with RCA. The centers of the dense clusters of SP and CS are almost 2.3 kg/m³ and 52 MPa, respectively. Three distinct dense regions with centers of 7, 28, and 91 days are observed in the probability distributions of TA, indicating the focus of researchers on measuring the CMC at these standard ages.

3. Research methodology

Fig. 3 summarizes the general framework of this study to develop reliable models for estimating the chloride penetration of RAC. Gathering a reliable database and cleaning data substantially impact the quality and credibility of the developed ML models. Next, the collected database is partitioned into several groups for various purposes. After developing the ensemble models, the reliability of the ML models is examined using unseen data, and the best one is chosen for more analyses. A SHapley Additive exPlanations (SHAP)-based interpretation method is used to find the relationships between the input and output

Table 2
Statistical parameters of the input and output features of the RCMT database.
Input | Mean | std | Min | 25% | 50% | 75% | Max
C (kg/m³) | 346.02 | 73.61 | 195 | 300 | 350 | 400 | 500
S (kg/m³) | 32.36 | 54.91 | 0 | 0 | 0 | 65 | 170
W (kg/m³) | 155.07 | 23.52 | 114 | 142 | 150 | 170 | 205.90
NFA (kg/m³) | 880.61 | 173.33 | 529 | 462.5 | 721 | 901 | 1105
RAR | 0.30 | 0.35 | 0 | 0 | 0.15 | 0.37 | 1
RWA (%) | 3.37 | 3.26 | 0 | 0 | 4.30 | 7.1 | 9.28
SP (kg/m³) | 2.40 | 1.47 | 0 | 1.85 | 2.28 | 2.55 | 6
CS (MPa) | 49.13 | 10.06 | 21.83 | 42.83 | 49.88 | 56.64 | 72.60
TA (Days) | 40.53 | 31.17 | 7 | 28 | 28 | 90 | 91
CMC (× 10⁻¹² m²/s) | 12.41 | 4.19 | 3.99 | 9.6 | 12 | 14.55 | 27.29

Fig. 1. The correlation coefficients of a) the initial features and b) the final features of the defined RCMT system.
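
The feature-engineering step described in Section 2 (replacing NCA and RCA by the ratio RAR and screening pairwise linear correlations against the 0.7 limit, as in Fig. 1) can be illustrated with a minimal sketch. The function names and the four aggregate contents below are hypothetical illustration values, not records from the RCMT database, and the code is not the authors' implementation.

```python
import math

def rca_ratio(nca, rca):
    """RAR: recycled coarse aggregate divided by total coarse aggregate."""
    total = nca + rca
    return rca / total if total else 0.0

def pearson(x, y):
    """Pearson linear correlation coefficient of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical coarse aggregate contents (kg/m3), chosen only for illustration.
nca = [1000.0, 700.0, 500.0, 0.0]
rca = [0.0, 300.0, 500.0, 1000.0]

rar = [rca_ratio(n, r) for n, r in zip(nca, rca)]
r_nca_rca = pearson(nca, rca)

# NCA and RCA move in opposite directions, so they are flagged as collinear
# (0.7 is the limit quoted in the text) and replaced by the single feature RAR.
is_collinear = abs(r_nca_rca) >= 0.7
```

In this toy case the total coarse aggregate content is constant, so NCA and RCA are perfectly (negatively) correlated; real mixtures would give a weaker but still high correlation.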


Fig. 2. Probability distribution contour plots between the input and output features.

3.1. Data preparation

After gathering the data through a comprehensive literature review and determining the influential input features, the data preparation phase was carried out. Duplicated, irrelevant, and inconsistent samples were removed from the database. In addition, samples with more than three missing values were discarded from the database. A small portion of missing data was observed in the SP feature among the 9 input features of the RCMT database. Univariate and multivariate imputation techniques are generally used to deal with missing data (Liang et al., 2022). Since less than 5% of the data were missing for the SP feature, the missing values of this feature were filled with the mean value of SP, as per the univariate imputation technique.
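
The two cleaning rules described above (discarding samples with more than three missing values and filling the remaining gaps in SP with the feature mean) can be sketched as follows. This is a minimal illustration under stated assumptions: missing entries are represented by None, the helper names are hypothetical, and the SP values are invented for the example.

```python
from statistics import mean

def drop_sparse_samples(rows, max_missing=3):
    """Keep only samples (lists of feature values) with at most max_missing gaps."""
    return [row for row in rows if sum(v is None for v in row) <= max_missing]

def impute_mean(values):
    """Univariate imputation: replace None by the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a feature with no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]

# Invented superplasticizer contents (kg/m3); None marks a missing record.
sp = [2.0, None, 3.0, 2.5, None, 2.5]
sp_filled = impute_mean(sp)

# Two invented samples with 9 features each; the second has four gaps.
samples = [
    [350, 0, 150, 900, 0.3, 4.3, None, 50, 28],
    [None, None, 150, 900, None, 4.3, None, 50, 28],
]
kept = drop_sparse_samples(samples)
```

Mean imputation leaves the feature mean unchanged but slightly shrinks its variance, which is usually acceptable when only a few percent of the values are missing, as reported here.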

features to interpret the outcomes of the final ensemble model, which is followed by developing a graphical user interface. More explanations of the research methodology are given in the following subsections.

Fig. 3. The research methodology.

3.2. Data division

The database is divided randomly into two main groups for the development and testing phases. In the development phase, the training-validating samples are used to develop the ensemble models, while the performances of the developed models are verified using the testing samples. For the development phase, the k-fold cross-validation technique is used, allowing all samples to be used in both the training and validating stages, as demonstrated in Fig. 4. In this regard, the training-validating samples are partitioned randomly into k folds with almost equal sizes, and k ensemble models are created in total. In each iteration, one ensemble model is generated in which (k-1) folds are


used to train the ensemble model, and the remaining fold is used to validate the developed model. The process of generating ensemble models is repeated k times so that each of the k folds is used once in the validation phase. The average performance of the k ensemble models in the validation phases gives the performance of the developed algorithm.

Fig. 4. k-fold cross-validation technique.

3.3. Homogeneous ensemble learners

The bias-variance trade-off is a challenging problem in developing ML models. Two types of homogeneous ensemble learners exist to address this issue, namely bootstrap aggregating (bagging) and boosting techniques. A bagging learner is an ensemble meta-ML technique that tackles the deficiency of base ML algorithms concerning overfitting. This learner type reduces the variance of the predictions of base ML models by aggregating several base ML models. For developing each base ML model, bootstrapped samples are used, obtained by randomly selecting samples with replacement from a given database. The average of the predictions of the independent base ML models is the prediction of the ensemble bagging learner. The boosting learners are generally designed to achieve more accurate ML models by reducing the biases of the predictions of the base ML models. To do this, they use a sequential training procedure in which the samples with high errors in the most recent base ML model are boosted in the subsequent base ML model. A schematic configuration of bagging and boosting learners is illustrated in Fig. 5. In the following, the base ML algorithm used in this study and the five homogeneous ensemble models developed to model the CMC of RAC are briefly explained.

Fig. 5. A schematic configuration of a) bagging learner and b) boosting learner methodologies. [Panels: (a) bootstrap phase, base learner training phase, aggregating phase; (b) boosting phase, base learner training phase, final ensemble learner phase.]

3.3.1. Decision tree regressor (DTR) algorithm

DTR with tree structure formulation is an ML technique for modeling regression problems. The simplicity, interpretability, and insensitivity to multi-collinearity of DTR, as well as its robustness against overfitting, outliers, and noise, and its relatively inexpensive computations compared to other ML methods, have made this algorithm one of the most popular ML techniques. As depicted in Fig. 6, this technique gradually splits the database into smaller subsets, commencing from the topmost node (the root node), assigns a unique feature to each tree node, and assigns conditional terms to the tree branches. At the end of a tree, there are leaf nodes represented by constant values. The tree prediction can be obtained for a given sample by following the features assigned to the tree nodes and the conditions assigned to the tree branches until a leaf node is reached. It is worth mentioning that the standard deviation reduction (SDR) is a factor used in DTR to find the suitable features of nodes in order to generate homogeneous subsets.

Fig. 6. An example of a decision tree regressor (DTR) model.

3.3.2. Random forest learner (RFL)

RFL is a bagging ensemble technique that uses DTR as the base ML algorithm. In this algorithm, random decision trees generate a population of decision trees called a forest. Several important hyperparameters affect the quality of the generated forest, including the number of base ML models, the maximum number of features used for splitting the nodes, the maximum tree depth, the minimum number of samples required for splitting the nodes, and the minimum number of samples in the newly generated nodes. These hyperparameters should be set before implementing the RFL algorithm. The ranges of hyperparameters can be determined using previous studies and trial and error, and their optimal values can be specified using the grid search technique. RFL has been used successfully in the field of concrete technology, such as modeling the CS of sustainable rice husk ash concrete (Iftikhar et al., 2022), the CS of high-performance concrete (Han et al., 2019), and internal damages of reinforced concrete (Chun et al., 2020), with R-squared values of 0.91, 0.93, and 0.89, respectively.

3.3.3. Bagged extra trees learner (BETL)

BETL is another bagging ensemble technique in which DTR is used as the base ML algorithm. BETL adds more randomness than the RFL to the DTR generation process. There are generally two main differences between the RFL and the BETL. First, the BETL uses the whole database to generate DTRs rather than bootstrap sampling, which may reduce the bias of the ensemble model. Second, the BETL selects the splitting point of features in nodes randomly instead of using the optimal search employed in the RFL, which may reduce the variance of the developed ensemble model. In addition, because the optimal search is eliminated in the BETL, this ensemble technique is faster than the RFL. Almost all hyperparameters of BETL and RFL are the same. BETL has been used in several concrete technology studies, such as modeling the CS of concrete (Ekanayake et al., 2022), the CS of phase change materials integrated cementitious composites (Marani and Nehdi, 2020), and the CS of steel fiber reinforced concrete (SFRC) subjected to elevated temperatures (Shafighfard et al., 2022), with R-squared values of 0.98, 0.96, and 0.92, respectively.
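
The SDR criterion of Section 3.3.1 and the bootstrap-and-average idea behind bagging can be illustrated with a deliberately small sketch. It assumes one common definition of SDR (the standard deviation of the parent node minus the size-weighted standard deviations of its children), uses depth-1 trees on a single feature as base learners, and works on invented testing-age and CMC values. It omits the feature subsampling of the RFL and the random thresholds of the BETL, so it is not the authors' implementation of either learner.

```python
import random
from statistics import mean, pstdev

def sdr(parent, left, right):
    """Standard deviation reduction obtained by splitting parent into two children."""
    n = len(parent)
    weighted = len(left) / n * pstdev(left) + len(right) / n * pstdev(right)
    return pstdev(parent) - weighted

def fit_stump(xs, ys):
    """Depth-1 regression tree: choose the threshold with the largest SDR."""
    levels = sorted(set(xs))
    if len(levels) < 2:                      # no split possible: single leaf
        return None, mean(ys), mean(ys)
    best_t, best_gain = None, float("-inf")
    for lo, hi in zip(levels, levels[1:]):
        t = (lo + hi) / 2                    # candidate threshold between two levels
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        gain = sdr(ys, left, right)
        if gain > best_gain:
            best_t, best_gain = t, gain
    left = [y for x, y in zip(xs, ys) if x <= best_t]
    right = [y for x, y in zip(xs, ys) if x > best_t]
    return best_t, mean(left), mean(right)   # leaves hold constant values

def predict_stump(stump, x):
    t, left_value, right_value = stump
    return left_value if t is None or x <= t else right_value

def bagged_predict(xs, ys, x_new, n_models=25, seed=0):
    """Bagging: average the stumps trained on bootstrap resamples."""
    rng = random.Random(seed)
    n = len(xs)
    preds = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]   # sampling with replacement
        stump = fit_stump([xs[i] for i in idx], [ys[i] for i in idx])
        preds.append(predict_stump(stump, x_new))
    return mean(preds)

# Invented illustration data: testing age (days) and CMC (x 1e-12 m2/s).
ages = [7, 7, 28, 28, 91, 91]
cmc = [20.0, 18.0, 12.0, 11.0, 6.0, 5.0]
threshold, early_mean, late_mean = fit_stump(ages, cmc)
```

On these values the split between 7 and 28 days gives the larger SDR, so the single stump separates early-age from later-age specimens; `bagged_predict(ages, cmc, 7)` and `bagged_predict(ages, cmc, 91)` then return the averaged predictions of 25 such stumps.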


3.3.4. Adaptive boost learner (ABL)

ABL is the first proposed boosting ensemble technique, in which more weight is assigned in the subsequent iteration to the samples that were problematic for the current base ML model, and vice versa. In addition, more weight is given to the stronger base ML models in the final ensemble model. Some influential hyperparameters to be tuned in the ABL include the control parameters of the base ML algorithm, the number of base ML models, and the learning rate. The learning rate determines the influence of each base ML model on the overall ensemble model, so that the contribution of each base ML model to the overall ensemble model increases for a higher learning rate. ABL has been used in several studies in the field of concrete technology, including modeling the CS of high-performance concrete from industrial wastes (Farooq et al., 2021), the flexural strength of ultra-high-strength concrete (Wang et al., 2022), and the CS of high-performance concrete (Li and Song, 2022), and the reported R-squared values are 0.92, 0.93, and 0.85, respectively.

3.3.5. Gradient boosting learner (GBL)

GBL is a generic ML algorithm that addresses the weakness of the developed base ML model in each iteration using gradients. This boosting learner creates the first base ML model using the gathered database, calculates the prediction errors, builds the new base ML model based on the errors obtained in the previous step, and combines the developed base ML models. The algorithm repeats the construction of the base ML models based on the obtained errors until the specified number of base ML models has been constructed. For combining the base ML models, the learning rate is defined to specify the contribution of the developed base ML models. GBL has been used in several concrete technology studies, including prediction models for the flexural strength of ultra-high-strength concrete (Wang et al., 2022), the CS of high-performance concrete (Li and Song, 2022), and the CS of normal and high-performance concretes (Jia et al., 2022), with reported R-squared values of 0.93, 0.94, and 0.96, respectively.

3.3.6. Extreme gradient boosting learner (XGBL)

XGBL uses the gradient boosting principle in a more sophisticated fashion compared to the GBL. For this, the XGBL uses an advanced regularized formulation, which can enhance the generalization of the ensemble model. In addition, this boosting technique calculates the second-order gradients of the loss function rather than only the first-order gradient used in the GBL, which gives more information about the gradient direction. In addition to the hyperparameters of the GBL, there are several hyperparameters in the XGBL that should be optimized, including the L1 and L2 regularization terms on weights. The former term can reduce the algorithm's run time when the problem has high dimensionality, while the latter handles the regularization of the learner. XGBL has been employed successfully for predicting the CS of high-performance concrete from industrial wastes (Farooq et al., 2021), the flexural strength of ultra-high-strength concrete (Wang et al., 2022), and the CS of high-performance concrete (Li and Song, 2022), with reported R-squared values of 0.90, 0.85, and 0.93, respectively.

3.4. Hyperparameters of ensemble models

The hyperparameters of the ensemble models include control parameters regarding the base ML model and those related to the ensemble model. These parameters affect the performance of the developed ensemble models. For this reason, a grid search procedure was utilized in this study to find the optimal hyperparameters. Several rounds of trial and error were conducted to find suitable ranges for continuous and discrete hyperparameters. These parameters should be set before running an ensemble algorithm, and the average error of the developed ensemble model over the k validation folds determines the suitability of the selected hyperparameters.

3.5. Evaluation of ensemble models

This study uses several error indices to evaluate the developed ensemble models. These indices include root mean squared error (RMSE), relative RMSE (RRMSE), mean absolute error (MAE), coefficient of determination (R-squared), and performance index (PI), which are defined in Table 3. The R-squared of an ML model is between zero and one, where zero and one indicate the model's least and perfect prediction capability, respectively. For the other error indices, the lower the values, the better the prediction capability. RMSE and MAE are two error indices with dimensions, in which the former considers the squares of the prediction errors and the latter considers the absolute prediction errors. The RMSE is always equal to or greater than the MAE. In the particular case in which the magnitudes of the prediction errors of all samples are the same, RMSE is equal to MAE. A larger difference between the RMSE and MAE can signify high variance in the prediction errors and the existence of samples with significant prediction errors. RRMSE and PI are two dimensionless error indices. RRMSE≤0.1, 0.1<RRMSE≤0.2, 0.2<RRMSE≤0.3, and 0.3<RRMSE denote excellent, good, fair, and poor prediction capability of the model, respectively (Golafshani et al., 2022). The PI considers both RRMSE and R-squared simultaneously, and a PI value close to zero shows the excellent efficiency of the ML model. It is recommended that the PI value for a successful model should be less than 0.2 (Iftikhar et al., 2022).

3.6. Verifying the ensemble model

After evaluating the performance of the developed ensemble models based on the errors of the training-validating sample group, the efficiencies of the ensemble models are evaluated according to the testing samples. This phase determines the generalization performance of the developed models against unknown samples. In addition, different types of statistical and visualization tests are carried out to examine the performances of the ensemble models. Moreover, the accuracy of the best ensemble model is compared to other existing ML models to show the prediction capability of the ensemble model developed in this study compared to other studies.

3.7. Interpretation of ensemble models using SHAP

After developing an ML model, it is necessary to interpret it and extract valuable results using an analytical technique. The "gain" parameter is used in the ensemble-tree-based models for assigning features to nodes and evaluating the importance of features. This parameter suffers from inconsistency in determining the feature importance and may assign less importance to a remarkable feature (Lundberg et al., 2018). Motivated by game theory, Lundberg and Lee (2017) proposed the SHapley Additive exPlanations (SHAP) technique to solve this problem. SHAP

Table 3
Error indices used in this study for evaluating ensemble models.
Statistical indicator | Equation
RMSE | $\sqrt{\frac{1}{N}\sum_{i=1}^{N}(EO_i - MO_i)^2}$
MAE | $\frac{1}{N}\sum_{i=1}^{N}\left|EO_i - MO_i\right|$
RRMSE | $\mathrm{RMSE}/\overline{EO}$
R-squared | $1 - \frac{\sum_{i=1}^{N}(EO_i - MO_i)^2}{\sum_{i=1}^{N}(EO_i - \overline{EO})^2}$
PI | $\mathrm{RRMSE}/(\ldots)$ (denominator lost in the source)
$EO_i$: the ith experimental observation; $MO_i$: the ith model output; N: the number of samples; $\overline{EO}$: the average experimental observation.
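
The error indices of Table 3 and the RRMSE rating bands quoted in Section 3.5 can be checked numerically with a short sketch. Two points are assumptions rather than facts from the source: RRMSE is taken as the RMSE divided by the mean experimental observation, and PI is omitted because its denominator is not legible in the table. The three observation and prediction pairs are invented.

```python
import math

def rmse(eo, mo):
    """Root mean squared error between observations eo and model outputs mo."""
    return math.sqrt(sum((e - m) ** 2 for e, m in zip(eo, mo)) / len(eo))

def mae(eo, mo):
    """Mean absolute error."""
    return sum(abs(e - m) for e, m in zip(eo, mo)) / len(eo)

def r_squared(eo, mo):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_eo = sum(eo) / len(eo)
    ss_res = sum((e - m) ** 2 for e, m in zip(eo, mo))
    ss_tot = sum((e - mean_eo) ** 2 for e in eo)
    return 1 - ss_res / ss_tot

def rrmse(eo, mo):
    """Assumed definition: RMSE normalized by the mean experimental observation."""
    return rmse(eo, mo) / (sum(eo) / len(eo))

def rate(value):
    """Rating bands for RRMSE as quoted in the text (Golafshani et al., 2022)."""
    if value <= 0.1:
        return "excellent"
    if value <= 0.2:
        return "good"
    if value <= 0.3:
        return "fair"
    return "poor"

# Invented CMC observations and predictions (x 1e-12 m2/s).
eo = [10.0, 12.0, 14.0]
mo = [11.0, 12.0, 13.0]
```

For these values the RMSE exceeds the MAE because the error magnitudes differ between samples (1, 0, and 1); with `mo = [11.0, 13.0, 15.0]`, where every error has magnitude 1, the two indices coincide, which is the special case mentioned in Section 3.5.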


uses the Shapley value to calculate the feature importance, makes the
developed ML models interpretable, and gives clear insight into the
contributions of input features in predicting the output feature. The
Shapley value of the jth feature value (φj) is calculated by the weighted
summation of the marginal contribution of the jth feature to the model
output over all possible subsets of features (S), excluding the jth feature,
as follows:

$$\varphi_j = \sum_{S \subseteq \{1,\ldots,M\} \setminus \{j\}} \frac{|S|!\,(M - |S| - 1)!}{M!}\,\left[f(S \cup \{j\}) - f(S)\right]$$

where M is the total number of features, |S| is the cardinality of S,
and f is the developed ensemble model. To compute the Shapley value, it
is necessary to calculate the developed ML model's predictions for each
possible feature subset, which increases the computational complexity.
For this, the SHAP value is suggested, in which the prediction of each
subset of feature values is calculated using the expected value of the ML
prediction (Lundberg and Lee, 2017). In this study, the best ensemble
model found in the previous step is employed for further evaluations,
including the SHAP sensitivity and parametric analyses.

3.8. Graphical user interface

A graphical user interface is designed in this study to provide a
user-friendly environment for engineers to estimate the CMC of RAC through
a well-defined interaction.

4. Results and discussion

The Python programming language was used in this study to model
the CMC of RAC because of its user-friendly environment, available
packages, and flexibility. 80% of the gathered database was selected
randomly for the model development phase, and the rest was reserved for
the testing phase. For developing the ensemble models, the 5-fold
cross-validation method was used, and the hyperparameters of the ensemble
models were obtained as per the average errors of the validating phases
in the k-fold cross-validation process, as given in Table 4. The results
show that the GBL and XGBL algorithms need a remarkably smaller
maximum tree depth for modeling the RCMT system compared to the other
ensemble algorithms. The highest optimal maximum depth is for the
ABL model. Besides, considerably more base models are required to
achieve the least errors for the GBL and XGBL algorithms.

Table 4
Optimal hyperparameter values for all developed ensemble models.
Models  Hyperparameters
RFL     Max depth = 15, Min samples split = 2, Min samples leaf = 1,
        Max number of features = 3, Number of base models = 34
BETL    Max depth = 15, Min samples split = 2, Min samples leaf = 1,
        Max number of features = 4, Number of base models = 21
ABL     Max depth = 19, Min samples split = 3, Min samples leaf = 1,
        Max number of features = 8, Learning rate = 1.1,
        Number of base models = 55
GBL     Max depth = 2, Min samples split = 3, Min samples leaf = 1,
        Max number of features = 4, Learning rate = 0.4,
        Number of base models = 173
XGBL    Max depth = 3, Subsample ratio of the training instance = 0.4,
        Subsample ratio of columns for constructing trees = 0.6,
        L1 regularization term = 0, L2 regularization term = 0.5,
        Learning rate = 0.2, Number of base models = 160

Fig. 7 plots the RMSEs of the five ensemble algorithms developed for
modeling the RCMT system in the validating phase. As illustrated in this
figure, the errors of the developed models decrease as the number of base
models increases. Since the RFL algorithm is a specific case of the
BETL algorithm and the XGBL algorithm is an extension of the GBL
algorithm, the RMSE curves of the RFL-BETL and GBL-XGBL models are
similar. The RMSE curves of the RFL, BETL, and ABL models are steep at
low numbers of base models and stabilize quickly. The errors of
the GBL and XGBL models are higher than those of the other models for the
smallest number of base models, and the slopes of their RMSE curves are
less than those of the RFL, BETL, and ABL models for small numbers of base
models. However, the slopes of the RMSE curves of the GBL and XGBL
models decrease gradually, and their RMSEs fall below the RMSEs of
the RFL, BETL, and ABL models after a certain number of base models.
The reason can be attributed to the smaller optimal maximum depths of
the GBL and XGBL models compared to the RFL, BETL, and ABL models,
which means that more base models are required to offset the
negative effect of the less complex tree models on model precision.

Fig. 7. The performance of the ensemble models according to the number of
base models.

The error indices of all ensemble models for predicting the RCMT
system are given in Table 5, in which the best values in each phase are
bolded. The table shows that the XGBL model outperforms the other
developed models in all phases. The lower values of the error indices of
the developed models in the testing phase compared to the validating
phase denote the reasonable generalization capability of the developed
ensemble models. For all developed models, the RRMSE values are
between 0.1 and 0.2 in the validating phase, indicating the "Good"
performance of the proposed models. The PI values for all developed models
are less than 0.2, which shows the success of the proposed ensemble
models in modeling the RCMT system. In addition, the MAPE values of
the XGBL model for the validating and testing phases of the RCMT system
are 8.93% and 6.69%, which are lower than those of the other developed
ensemble models. The R-squared values of all ensemble models are more than
0.7, showing good agreement between the predictions and experimental
results. However, the developed boosting models are more precise
than the developed bagging models in both the validating and testing
phases. The developed XGBL model, with an R-squared value of 0.94 in
the testing phase, has the best performance among all developed
ensemble models. For the testing phase, the developed XGBL model has
25.46%, 33.45%, 46.68%, and 46.51% lower RMSE values than the ABL,
GBL, BETL, and RFL models, respectively.

To visualize the performance of the developed ensemble models, the
CMC predictions by the five ensemble models against the experimental
observations for the testing phase are shown in Fig. 8. The closer the
data points of an ensemble model scatter around the 1:1 line, the
better the prediction ability of the model. The observations show that
the developed XGBL model possesses higher efficiency in comparison
with the other ensemble models in modeling the RCMT system. In addition,
±20% lines are drawn in the sub-figures, showing which data points are
located in the prediction range of ±20%. The data points outside this
range are distinguished by red color. The developed XGBL model, with
only three out-of-range data samples (less than 7% of all testing data
samples), has fewer weak prediction points than the
other models. Four data points positioned outside the ±20% prediction
range are common to four or all of the developed ensemble models, which
can signify outliers. By removing these data points, the RMSE reductions
are 33.82%, 41.16%, 27.69%, 11.03%, and 8.52% for the RFL, BETL, ABL,
GBL, and XGBL models, respectively. The developed XGBL model with


Table 5
Error indices of the developed ensemble models.
Models  Phases               RMSE (×10⁻¹² m²/s)  MAE (×10⁻¹² m²/s)  MAPE (%)  R-squared  RRMSE   PI
RFL     Validating           2.1411              1.3290             13.4017   0.7379     0.1721  0.0926
        Training-validating  0.9288              0.6071             5.7185    0.9507     0.0747  0.0378
        Testing              1.9490              1.1791             12.2830   0.7794     0.1585  0.0842
BETL    Validating           2.1794              1.3263             13.2951   0.7284     0.1752  0.0945
        Training-validating  0.9330              0.5857             5.7037    0.9502     0.0750  0.0380
        Testing              1.9552              1.1067             12.0615   0.7780     0.1590  0.0845
ABL     Validating           2.0314              1.1061             10.5163   0.7641     0.1633  0.0871
        Training-validating  0.5166              0.3204             2.7077    0.9847     0.0415  0.0208
        Testing              1.3986              0.8302             8.3911    0.8864     0.1137  0.0586
GBL     Validating           1.6978              0.9724             9.7009    0.8352     0.1365  0.0713
        Training-validating  0.5112              0.3356             2.8758    0.9851     0.0411  0.0206
        Testing              1.5666              0.9670             9.2887    0.8574     0.1274  0.0661
XGBL    Validating           1.5814              0.9442             8.9368    0.8570     0.1271  0.0660
        Training-validating  0.4738              0.3093             2.6559    0.9872     0.0381  0.0191
        Testing              1.0425              0.7317             6.6951    0.9369     0.0848  0.0431
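
As a rough illustration of how the indices in Table 5 relate to one another, the sketch below computes all six for a pair of observation and prediction lists. RMSE, MAE, MAPE, and R-squared follow their standard definitions. The forms RRMSE = RMSE / mean(EO) and PI = RRMSE / (1 + R), with R taken as the square root of R-squared, are assumptions: they reproduce the PI column of Table 5 from its RRMSE and R-squared columns to rounding, but the extracted text does not confirm them. The function name `error_indices` and the sample values are hypothetical.

```python
import math


def error_indices(observed, predicted):
    """Compute six error indices for paired observations (EO) and model outputs (MO)."""
    n = len(observed)
    residuals = [eo - mo for eo, mo in zip(observed, predicted)]
    mean_eo = sum(observed) / n

    ss_res = sum(r * r for r in residuals)              # sum of squared errors
    ss_tot = sum((eo - mean_eo) ** 2 for eo in observed)

    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(r) for r in residuals) / n
    mape = 100.0 * sum(abs(r) / abs(eo) for r, eo in zip(residuals, observed)) / n
    r_squared = 1.0 - ss_res / ss_tot

    # Assumed definitions (not confirmed by the text): relative RMSE and
    # a performance index that penalizes low correlation.
    rrmse = rmse / mean_eo
    pi = rrmse / (1.0 + math.sqrt(max(r_squared, 0.0)))

    return {"RMSE": rmse, "MAE": mae, "MAPE": mape,
            "R2": r_squared, "RRMSE": rrmse, "PI": pi}


# Hypothetical CMC values (same units for both lists).
observed = [10.0, 12.0, 14.0, 16.0]
predicted = [11.0, 11.0, 15.0, 15.0]
indices = error_indices(observed, predicted)

# Cross-check of the assumed PI form against the XGBL testing row of Table 5.
table_pi = 0.0848 / (1.0 + math.sqrt(0.9369))
```

Under these assumptions, `round(table_pi, 4)` gives 0.0431, which matches the PI reported for the XGBL model in the testing phase.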

Fig. 8. The CMC predictions against the experimental observations for a) RFL, b) BETL, c) ABL, d) GBL, and e) XGBL models (out-of-range data points are marked in red).
(For interpretation of the references to color in this figure legend, the reader is referred to the Web version of this article.)
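
The ±20% check behind the red points in Fig. 8 can be sketched as follows. This is a minimal illustration that assumes the band is measured relative to the experimental observation, that is, between the lines y = 0.8x and y = 1.2x. The function name `outside_band` and the data are hypothetical and are not taken from the study's database.

```python
def outside_band(observed, predicted, tolerance=0.20):
    """Return the indices of samples predicted outside the +/- tolerance band."""
    flagged = []
    for i, (eo, mo) in enumerate(zip(observed, predicted)):
        lower = (1.0 - tolerance) * eo   # lower bound of the band
        upper = (1.0 + tolerance) * eo   # upper bound of the band
        if not (lower <= mo <= upper):
            flagged.append(i)
    return flagged


# Hypothetical observations and predictions.
observed = [5.0, 10.0, 20.0, 8.0]
predicted = [5.5, 13.0, 15.0, 8.0]

flagged = outside_band(observed, predicted)
share = 100.0 * len(flagged) / len(observed)   # percentage of out-of-range samples
```

Here the second and third samples fall outside the band, so `flagged` is `[1, 2]` and `share` is 50.0.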

the RMSE of 0.95 × 10⁻¹² m²/s has the best RMSE without considering
outliers in the testing phase.

Fig. 9 depicts the error probability distributions of the developed
ensemble models for the RCMT system. The results indicate that all
developed models possess good distributions around zero error. However,
the mean bias errors of the GBL and XGBL models are lower than those of
the other ensemble models. The error probability distributions of the RFL
and BETL models are similar, showing only a slight difference between their
performances. The XGBL model ranks second regarding both the least
mean bias error and the standard deviation, while the GBL model
ranks first and third, respectively. In addition, the developed ABL model
ranks third for the least mean bias error and first for the least
prediction variability.

Only the developed XGBL model, as the best ensemble model, was used
for the sensitivity analysis. For this purpose, the
SHAP technique was used to determine the importance of each input
feature in the RCMT prediction. The average impact of each input
feature on the output was calculated based on the mean of the
absolute SHAP values, as demonstrated in Fig. 10. The testing age, with the


highest mean SHAP value, is the most noteworthy feature affecting the
RCMT prediction, followed by the natural fine aggregate and
superplasticizer. There is no significant difference between the mean SHAP
values of the water, slag, and cement features influencing the CMC. In
addition, recycled coarse aggregate water absorption has slightly more
influence than the recycled coarse aggregate ratio on the model output.

Fig. 9. Error probability distributions of the developed ensemble models for
the RCMT system.

Fig. 10. The sensitivity analysis based on the best ensemble model for the
RCMT system.

To investigate the effects of the input features on the RCMT system, a
parametric study was conducted for all the input features, as shown in
Fig. 11. For this, the SHAP values for each input feature were plotted
against the feature values. Increasing cement, slag, natural fine
aggregate, superplasticizer, compressive strength, and testing age reduces
the chloride migration of RAC. In contrast, increasing water, recycled
coarse aggregate ratio, and recycled coarse aggregate water absorption
adversely affects the chloride migration of RAC. It is worth mentioning
that increasing the recycled coarse aggregate ratio to more than 0.5
exacerbates the chloride migration of RAC. The increment rate of
chloride migration for the recycled coarse aggregate water absorption of
less than about 5% is remarkable. A significant difference exists between
the chloride migration of RAC with and without slag. For compressive
strengths of less than almost 40 MPa, the chloride migration of RAC is
more critical than for concrete with compressive strengths of more than 40
MPa. The negative impacts of the recycled coarse aggregate ratio and
recycled coarse aggregate water absorption in the concrete mixture can
be related to the weaker interfacial transition zone (ITZ) of concrete for
the higher range of RCA in the concrete mixture and the more porous
structure of RCA with higher water absorption compared to normal
concrete. Consistent with the trends above, this negative impact can be
compensated by using less water and more slag, cement, natural fine
aggregate, and superplasticizer in the concrete mixture. The purpose of
using RCA in concrete mixtures is environmental, namely the preservation
of precious virgin resources and the prevention of landfilling, not the
improvement of concrete quality. In order to reduce the negative impacts
of RCA on the CMC for a given mix design and constant environmental
conditions, the best solution is to reduce the recycled aggregate water
absorption using surface treatment techniques, which enhance the quality
of recycled aggregates by reducing their porous structure and making a
better ITZ in concrete, as also suggested in other studies (Bahraq et al.,
2022).

For the sake of comparison, the errors obtained by the XGBL model
for the RCMT system were compared with the errors of ML models
reported by other researchers for other types of concrete, as given in
Table 6. Generally, the database collected from the literature in this
study is remarkably larger than the databases in other studies,
i.e., 78.7% and 29.0% larger than the databases used in the studies carried
out by Quan Tran (2022) and Taffese and Espinosa-Leal (2022),
respectively. In the case of the RMSE values of the testing phase, the
RMSE value of the XGBL model developed in this study is 72.0% and 51.4%
lower than the RMSE values in the studies conducted by Quan Tran (2022)
and Taffese and Espinosa-Leal (2022), respectively. Both Quan Tran
(2022) and Taffese and Espinosa-Leal (2022) developed boosting models
for predicting the chloride migration coefficient of a similar type of
concrete, i.e., concrete containing silica fume, slag, and fly ash.
However, Quan Tran (2022) used the default values of hyperparameters
given by Python, while Taffese and Espinosa-Leal (2022) optimized the
hyperparameters of their developed boosting model. This can be the reason
behind the better performance of the ensemble model developed by
Taffese and Espinosa-Leal (2022) compared to the model proposed by
Quan Tran (2022), and a justification for the hyperparameter optimization
used in this study.

One of the main concerns of civil engineers about the development of
ML techniques is their applicability in the real world. To alleviate this
concern, as illustrated in Fig. 12, a graphical user interface was designed
in which users can easily insert the requested input features, run the
toolbox, and see the predictions by different ensemble models. The
designed interface can be shared upon reasonable request.

5. Conclusions

Rapid chloride penetration test of recycled aggregate concrete (RAC)
can provide valuable information about the durability of concrete and
its resistance to chloride ion ingress. In this study, a comprehensive
database of the rapid chloride migration test (RCMT) of RAC was
gathered from the literature, and the chloride migration coefficient
(CMC) of RAC was the system output. Five homogeneous ensemble
models, including Random forest learner (RFL), Bagged extra trees
learner (BETL), Adaptive boost learner (ABL), Gradient boosting learner
(GBL), and Extreme gradient boosting learner (XGBL), were developed
to estimate the CMC of RAC. The following conclusions were drawn from
this study.

• All developed ensemble models in this study possess "good
  performance" in the validating phase, which shows the high
  generalization capability of ensemble models in estimating the rapid
  chloride migration test output of RAC.
• The developed XGBL and GBL models need a smaller maximum tree depth
  and a higher number of base models compared to the developed
  RFL, BETL, and ABL models.
• The developed XGBL model outperforms the other ensemble models
  with a root mean squared error, mean absolute percentage
  error, and R-squared of 1.04 × 10⁻¹² m²/s, 6.69%, and 0.94 in
  the testing phase, respectively. In addition, the prediction
  performance of the XGBL model in the testing phase is excellent,
  considering the relative root mean squared error of 0.08.


Fig. 11. Parametric study of a) C, b) S, c) W, d) NFA, e) RAR, f) RWA, g) SP, h) CS, and i) TA on the CMC obtained by the XGBL model.
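
The SHAP values plotted in Fig. 11 are built on Shapley values. As a minimal sketch of that underlying quantity, the code below computes exact Shapley values by enumerating every feature subset with the weights |S|!(M − |S| − 1)!/M!. It assumes that a feature outside the subset is replaced by a fixed baseline value, which is a simplification of the expected-value treatment used by SHAP. The names `shapley_values` and `toy_model`, the baseline, and the inputs are hypothetical, and the toy function merely stands in for the XGBL model.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values of each feature of x for a callable model."""
    m = len(x)

    def value(subset):
        # Features in the subset keep their instance value; the others are
        # set to the baseline (an assumption made for this sketch).
        z = [x[k] if k in subset else baseline[k] for k in range(m)]
        return model(z)

    phi = []
    for j in range(m):
        others = [k for k in range(m) if k != j]
        total = 0.0
        for size in range(m):  # subset sizes 0 .. M-1
            weight = factorial(size) * factorial(m - size - 1) / factorial(m)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {j}) - value(s))
        phi.append(total)
    return phi


def toy_model(z):
    # Hypothetical model with one interaction term between features 0 and 1.
    return 2.0 * z[0] + 3.0 * z[1] - z[2] + z[0] * z[1]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
```

The enumeration visits 2^(M−1) subsets per feature, which is the computational burden that motivates the approximation used by SHAP. For this toy case the contributions sum to the difference between the prediction at `x` and at the baseline, and the interaction term is split equally between the two features involved.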

Table 6
Comparison of different ML models for the RCMT system.
Best developed ML model             Concrete type                                 Input feature number  Data number  RMSE (×10⁻¹² m²/s)  MAE (×10⁻¹² m²/s)  R-squared  Refs
GBL                                 Concrete with silica fume, slag, and fly ash  9                     127          3.72                2.71               0.87       Quan Tran (2022)
XGBL                                Concrete with silica fume, slag, and fly ash  12                    176          2.14                1.35               0.86       Taffese and Espinosa-Leal (2022)
Developed XGBL model in this study  RAC with slag                                 9                     227          1.04                0.73               0.94       –
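
The percentage comparisons drawn from Table 6 can be checked with a few lines of arithmetic. The sketch below uses only the data numbers and RMSE values listed in the table; the helper name `percent_change` is hypothetical.

```python
def percent_change(reference, value):
    """Signed percentage change of value relative to reference."""
    return 100.0 * (value - reference) / reference


# Database size of this study (227) relative to the two earlier studies.
size_vs_quan_tran = percent_change(127, 227)
size_vs_taffese = percent_change(176, 227)

# Testing RMSE of this study (1.04) relative to the two earlier studies;
# the sign is flipped so that a positive number means "lower RMSE".
rmse_drop_vs_quan_tran = -percent_change(3.72, 1.04)
rmse_drop_vs_taffese = -percent_change(2.14, 1.04)

summary = [round(v, 1) for v in (size_vs_quan_tran, size_vs_taffese,
                                 rmse_drop_vs_quan_tran, rmse_drop_vs_taffese)]
```

Rounded to one decimal place, `summary` is `[78.7, 29.0, 72.0, 51.4]`, which agrees with the percentages quoted in the comparison with Quan Tran (2022) and Taffese and Espinosa-Leal (2022).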

Fig. 12. A graphical user interface designed for modeling the RCMT output.


• Testing age and the amounts of natural fine aggregate and
  superplasticizer are the three crucial features affecting the CMC of RAC.
  There is no significant difference between the importance of water, slag,
  and cement contents affecting the CMC of RAC. Moreover, the
  recycled coarse aggregate ratio and the recycled coarse aggregate
  water absorption have the least impact on the CMC of RAC.
• The CMC of RAC is more critical for a recycled coarse aggregate
  ratio of more than 0.5, a recycled coarse aggregate water absorption
  of less than about 5%, and a compressive strength of less than
  almost 40 MPa. In addition, there is a significant difference between
  the chloride migration of RAC with and without slag.
• Recycled aggregate ratio and water absorption were two parameters
  used in this study as input features to model the CMC of RAC. These
  two parameters adversely affect the CMC of RAC, and the effect of
  the recycled coarse aggregate water absorption is slightly greater than
  that of the recycled coarse aggregate ratio.
• The recycled aggregate can alleviate the environmental effects of
  RAC, including the preservation of precious virgin resources and the
  prevention of landfilling. Surface treatment techniques can reduce
  the negative impacts of recycled aggregate on the CMC by reducing
  its porous structure and making a better interfacial transition zone
  in concrete.

For future studies, it is suggested to use other types of machine
learning techniques, including fuzzy-inference systems and newly
proposed ensemble techniques such as CatBoost and LightGBM, to model the
chloride migration coefficient. In addition, the determination of
hyperparameters of ensemble algorithms using existing techniques is a
challenging task. To solve this problem, it is suggested to use
metaheuristic optimization algorithms to tune the hyperparameters. Other
chloride ion penetration resistance tests of concrete are proposed to be
modeled using machine learning techniques. The correlation between
predictions of different machine learning models of various tests can
provide valuable information. Because of the importance of recycled
aggregate water absorption in the chloride permeability of concrete, as
concluded in this paper, it is suggested to improve the quality of recycled
aggregate using novel surface treatment techniques to reduce chloride
penetration into RAC. The chloride ion ingress resistance of concrete
containing recycled coarse and/or fine aggregates, waste materials, and
other types of supplementary cementitious materials is another
interesting topic that can be modeled using machine learning techniques.

CRediT authorship contribution statement

Emadaldin Mohammadi Golafshani: Conceptualization, Data
curation, Formal analysis, Methodology, Software, Validation,
Visualization, Writing – original draft. Alireza Kashani: Investigation,
Supervision, Writing – review & editing. Ali Behnood: Investigation,
Supervision, Writing – review & editing. Taehwan Kim: Investigation,
Supervision, Writing – review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial
interests or personal relationships that could have appeared to influence
the work reported in this paper.

Data availability

Data will be made available on request.

Appendix A. The description of the RCMT database used in this study

Sample IDs Refs

RAPC-0-0, RAPC-30-0, RAPC-50-0, RAPC-100-0 Wu et al. (2022)

S1, …, S45 Amiri and Hatami
S1, …, S30 Meng et al. (2021)
NAC, RAC, A-RAC, AR-RAC, C-RAC, AC-RAC, LC-RAC Kazmi et al. (2020)
R0F0 Nawaz et al. (2020)
NC-0.33–0, NC-0.39–0, RC–II–0.33–30, RC–II–0.33-50, RC–II–0.33-100, RC–III–0.33–30, RC–III–0.33-50, RC–III–0.33-100, RC–III–0.39–30, RC–III–0.39- Bao et al. (2020)
50, RC–III–0.39-100
RC, C100C-LC, C100C-RW Pedro et al. (2017)
N-SSD-0, N-SSD-30, N-SSD-100, N-OD-100 Kim et al. (2013)
RC, C20, C50, C100 Amorim et al. (2012)

