
Atmospheric Environment 310 (2023) 119987
Air pollution prediction using machine learning techniques – An approach to replace existing monitoring stations with virtual monitoring stations
A. Samad a,*, S. Garuda b, U. Vogt a, B. Yang b
a Institute of Combustion and Power Plant Technology (IFK), Department of Flue Gas Cleaning and Air Quality Control, University of Stuttgart, Germany
b Institute of Signal Processing and System Theory (ISS), University of Stuttgart, Germany
Highlights
• Machine Learning models are suitable for pollutant concentration prediction.
• Pollutant concentrations from nearby monitoring stations proved the most effective input parameter.
• The developed methodology is applicable to estimate pollutant concentrations at other locations.
• Virtual monitoring stations can substitute existing monitoring stations.
Keywords: Machine learning, Prediction modelling, Air pollution prediction, Multiple linear regression, Random forest, XGBoost, Air quality

Abstract
Air pollution in the modern world is a matter of grave concern. Due to rapid expansion in commercial, social, and economic aspects, the pollutant concentrations in different parts of the world continue to increase and disrupt human life. Thus, monitoring the pollutant levels is of primary importance to keep the pollutant concentrations under control. Regular monitoring enables the authorities to take appropriate measures in case of high pollution. However, monitoring the pollutant concentrations is not straightforward as it requires installing monitoring stations to collect the relevant pollutant data, which comes with high installation and maintenance costs. In this research, an attempt has been made to simulate the concentrations of PM2.5, PM10, and NO2 at two sites in Stuttgart (Marienplatz and Am Neckartor) using Machine Learning methods. These pollutants are measured with the help of monitoring stations at these locations. Five Machine Learning methods, namely ridge regressor, support vector regressor, random forest, extra trees regressor, and extreme gradient boosting, were adopted for this study. Meteorological parameters, traffic data, and pollutant information from nearby monitoring stations for the period from January 01, 2018 to March 31, 2022 were considered as inputs to model the pollutants. From the results, it was concluded that the pollutant information from the nearby stations has a significant effect in predicting the pollutant concentrations. Further, it was investigated if a similar methodology can be applied at other locations to estimate pollutant concentrations. This procedure was tested on the data of the monitoring station Karlsruhe-Nordwest, which is located in another German city, Karlsruhe. The results demonstrated that this method is applicable in other areas as well.
* Corresponding author. Institute of Combustion and Power Plant Technology (IFK), Department of Flue Gas Cleaning and Air Quality Control, University of Stuttgart, Pfaffenwaldring 23, 70569 Stuttgart, Germany.
E-mail address: abdul.samad@ifk.uni-stuttgart.de (A. Samad).
https://doi.org/10.1016/j.atmosenv.2023.119987
Received 3 April 2023; Received in revised form 21 July 2023; Accepted 27 July 2023
Available online 28 July 2023
1352-2310/© 2023 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

1. Introduction

Air pollution is the largest environmental health risk in all of Europe (European Environment Agency (EEA), 2022). It mainly causes cardiovascular and respiratory diseases, leading to loss of healthy years of life and premature deaths. In 2019 alone, air pollution took a toll of nearly 307,000 lives in Europe (Khomenko et al., 2021). Any substance that changes the natural composition of the air is considered a pollutant (Baumbach, 1996). Apart from living organisms, pollutants also affect property, for example by corroding buildings and structures. In urban areas, the combustion of fossil fuels for various transport modes, industries, and household activities accounts for the main share of emissions released into the atmosphere (Mosley, 2014). Insufficient air quality monitoring is always
a matter of concern (Duyzer et al., 2015). To tackle pollution, the European Union (EU) came up with an approach in 2008 to measure the air quality in areas where people are affected adversely. In case of exceedances of the legal limit values, an air pollution control plan (Luftreinhalteplan) needs to be established to reduce air pollution and comply with the limit values (EU, 2015a). One way to monitor air quality is with the help of the air quality monitoring network, also referred to as a monitoring station. The ambient air quality directive lays down objectives for ambient air quality and methods and criteria for assessing air quality in the member states (EU, 2015b). Achieving adequate coverage with an air quality monitoring network depends on factors such as population density, location, cost, and maintenance life-cycle of measuring devices. Increasing the number of monitoring stations is not feasible given the limited public administration budgets (Spangl et al., 2007a). Machine Learning (ML) can provide a solution to this problem. By utilizing ML techniques, robust models can be developed and specific relationships between data collected from monitoring stations and pollutant concentrations at other spatial locations can be proposed.

1.1. Machine learning and its components

Artificial intelligence is the highest paradigm where an element can sense, reason, make decisions, and adapt from its mistakes. ML is a subset of artificial intelligence which enables the element to learn without explicitly being coded by a set of rules (Xu et al., 2021). Features or independent variables are the inputs provided to the algorithm, enabling it to capture the relationship with the variable of interest. The output to be estimated is called a target or dependent variable. At times features can be the input data itself, and sometimes new features can be created to provide new hints for the algorithm to learn. This concept of creating new features with the help of existing features is called feature engineering (Dong and Liu, 2018). An ML model is a method applied to the input data that tries to detect its relationship with the target value. This entire process is referred to as Training. Applying the model to new unseen inputs to evaluate the learned relationship is referred to as Testing (Goodfellow et al., 2016). The bias and variance hold fundamental importance in evaluating the performance of the ML model. Bias is the degree of effectiveness with which the model learns from the input data (training data). High bias means the model was unable to capture the relationship between the training data and the target value. This phenomenon is referred to as underfitting (Goodfellow et al., 2016). When the ML model is established and new unseen data (test data) is supplied to make predictions, the extent to which the predictions correspond to the test data refers to variance. The model is said to be overfitting if it learns the training data to the extent that it negatively affects the model performance on new data (Goodfellow et al., 2016). Thus, the goal of a model is to have low bias and low variance. The hyperparameters are the external configuration of the model that can be tuned to optimize the ML model algorithm, which minimizes the loss when applied to a particular dataset (Goodfellow et al., 2016).

1.2. Machine learning models

Estimating the pollutant concentration can be carried out with the help of traditional models, such as chemical transport and dispersion models. However, these models depend on several physical and chemical formulas, which makes it a challenging task (Vlasenko et al., 2021). These models involve complex flow control equations. Although processing these has become easier with advancements in data-driven methods, working with them is still a challenging task. Applying ML models has also produced reliable estimates lately and is thus used widely. The advantages of ML models are the ease of computing and inexpensiveness compared to the traditional methods (Xing et al., 2020). Estimating pollutant concentration is an active area of research because one can try to reduce the dependency on networks or sensors used, making an approximate estimation. Following are the ML models that were applied to predict pollutant concentrations.

1.2.1. Ridge regression
Ridge regression is a parameter estimation method. In linear regression, the parameters are the weights learned by the model and the output is a linear combination of inputs. In the case of non-linearity among the inputs, linear regression tries to fit a line. Though it has a low bias, it might suffer from high variance. However, a ridge regressor tries to add a bias into the coefficients to have a minimizing effect on variance. The advantage of this method is that if there is noise in the original data, the bias forces the loss to be small by minimizing the weights (McDonald, 2009).

1.2.2. Support vector regression
The concept of support vectors was originally introduced by Vapnik et al. (1996) to solve pattern recognition problems on multidimensional data. Using support vector regression, it is possible to reduce error or loss within a defined range. Unlike ridge regression, outputs are often approximated with a line. This method offers the flexibility that it fits not only on a line but also on a curve depending on the type of function used.

1.2.3. Ensemble methods
The ensemble is a particular way to combine different models strategically to solve a particular problem (Zhang and Ma, 2012). The main goal of the ensemble is to eliminate the weakness of individual models by integrating them. There are many forms of ensembles, such as Bagging, Boosting and Stacking (Zhang and Ma, 2012). In Boosting, a sequence of models is built so that the residuals of one model are provided as targets for the next model, and so on, which in turn reduces the bias (Zhang and Ma, 2012). In Stacking, the outputs from individual models are provided to another estimator, called the meta-model, to obtain the final output (Wolpert, 1992). By far, the ensemble methods have shown promising results compared to any other ML models (Zhang and Ma, 2012). In a decision tree, branching at any node depends on a feature that should minimize the loss function.

1.2.3.1. Bagging. Bagging is the short form for Bootstrapped Aggregation. In Bagging, individual models are trained separately and the output of all models is averaged to reduce variance (Zhang and Ma, 2012). The base models are decision trees that easily tend to overfit. To eliminate overfitting, the concept of Bagging is introduced. There are various forms of Bagging, such as random forest and extra trees regressor. In the end, the output of different base learners is combined by taking the average (He et al., 2021).

1.2.3.2. Random forest and extra trees regressor. Random forest is a variant of bagging that builds a multitude of decision trees to obtain the output (Liaw and Wiener, 2001). Sampling of features is termed column sampling and sampling of data points row sampling. Trees are built with row and column samples. The advantage of building the model in such a way is that it is robust in estimating new data points. Extra trees regressor performs similarly to random forest, with one more level of randomization (Geurts et al., 2006). This method helps to achieve slightly better performance compared to the random forest, while also reducing the training time.

1.2.3.3. Boosting. In Boosting, unlike Bagging, the fundamental difference is that the models are built successively (Zhang and Ma, 2012). The main idea is to build a base model and estimate the residual error (the difference between actual and estimated target values). The next model is built with the residuals of the previous stage as the target. This process is continued iteratively until the lowest residual error is reached. Thus, the final model combines all models in each iteration (Han et al.,
2019). XGBoost is a variant of the boosting algorithm which includes both row and column sampling.

1.3. Performance evaluation metrics

Different metrics are available to evaluate the model performance. This research work primarily focused on predicting the pollutant concentration, which is a real-valued number. Hence, the regression error metrics were considered for this work as suggested in the study by Botchkarev (2018). For all the metric demonstrations, N denotes the number of data points, y_i represents the actual value of each data point, and ŷ_i represents the predicted value of the data point.

• R Square (R²): R² demonstrates how much variability in the dependent variable can be explained by the model. It is represented as shown in Equation A1 in the Appendix, where N is the total number of data points and ȳ is the average of all actual values (Botchkarev, 2018). It is a good metric to check the fitting of the model with data points. An R² value near 1 represents the best possible accuracy.
• Mean Absolute Error (MAE): It is the mean magnitude of the difference between the actual and predicted outcomes. For N data points, MAE is defined as shown in Equation A2 in the Appendix (Botchkarev, 2018). This performance metric is robust to outliers.
• Root Mean Squared Error (RMSE): It is expressed as the square root of the mean square error. The advantage of RMSE is that it is differentiable and can also be used as a loss function. RMSE can never be negative and is defined in Equation A3 in the Appendix (Botchkarev, 2018).
• Mean Absolute Percentage Error (MAPE): It defines the percentage deviation of the predicted value from the actual value. For N data points, it is expressed as shown in Equation A4 in the Appendix (Botchkarev, 2018).

1.4. Literature review

In a research study, the mentioned regression model was implemented to estimate PM10 concentration (target) in Chonburi, Thailand with the help of independent input data, namely meteorological parameters and pollutants. The meteorological inputs included air pressure, precipitation, air temperature, relative humidity, and wind speed. The pollutants included carbon monoxide (CO), nitrogen monoxide (NO), nitrogen dioxide (NO2), sulfur dioxide (SO2), Black Carbon (BC), methane (CH4), Non-Methane Hydro Carbon (NMHC) and ozone (O3) from 2006 to 2008 (Saithanu and Mekparyup, 2014). Another study by Rybarczyk and Zalakeviciute estimated PM2.5 concentration with regression models based on time. Primarily the regression models were built by segregating the day into three time segments, namely 6 a.m. to 10 a.m., 10 a.m. to 2 p.m. and 2 p.m. to 7 p.m., for the capital city of Quito, Ecuador. In this study, initially, the models were built for each of the time periods based on ease of data availability. Traffic was considered for the three time periods and was segregated into high, medium and low. The model showed an R² score of 0.27 with these settings. By adding meteorological data including solar radiation, air temperature, air pressure, precipitation, relative humidity, and wind speed as features, the R² score improved to 0.38. Finally, the trace gas concentrations SO2, NO2, O3 and CO were considered as well, which improved the R² score to 0.8. The limitation of this study was the extra cost associated with measuring the trace gas concentrations (Rybarczyk and Zalakeviciute, 2017).
Support vector regression was used to estimate CO concentration for the region of New South Wales in Australia, dividing the entire region into 100 grids. The authors estimated CO concentration in all 100 grids using four different sets of features that included CO concentrations from four monitoring stations within the study area, latitude and longitude of the grids, hour, day of the week and season. The authors did not include meteorological parameters for this estimation. In this manner, the same spatial pollutant monitoring networks were employed in modelling (Hu et al., 2016). Support vector regression was also applied to predict O3 concentration in Delhi, India. Different kernels related to support vector regression, such as linear, polynomial, and radial basis functions were employed. The work stated the best possible feature set to forecast O3 concentration with five input parameters, namely the ozone for the previous two days and meteorological inputs of air temperature, relative humidity, and sunshine hours. Finally, the performance metrics of linear regression, multiple layer perceptron and support vector regression were compared, concluding that support vector regression with a radial basis function kernel captured non-linear trends more effectively than linear regression and multiple layer perceptron (Chelani, 2009). In another study, similar experiments were carried out to forecast the air quality index in Tehran utilizing the pollutant information of various monitoring stations located across Tehran, from the past two days to forecast the hourly air quality index for the next 24 h. The study was conducted for the period of 2008–2013. This study explored different kernel functions, and a forecast of a pollutant map with different AQI for different locations in Tehran was obtained (Ghaemi et al., 2018).
The ensemble methods have been employed in the context of air pollution estimation because of their wide popularity and applicability. In most of the studies using this method, meteorological parameters such as air temperature, air pressure, relative humidity, and wind speed were considered as input parameters. These variables vary depending on location and play a crucial role in rapidly varying pollutant concentrations. PM2.5 forecasting was performed on a single monitoring station in Delhi, India, by including the mentioned meteorological parameters. Overall eleven models were utilized and the R² scores were compared. In addition, the outputs from two different models were combined in this study to check for improved performance. It was concluded that combining two algorithms can produce slightly enhanced performance overall when compared to a standalone algorithm (Kumar et al., 2020a). In another study, a total of 23 features along with PM2.5 concentration from 37 monitoring stations were considered. The R² score obtained with various ensemble ML models and artificial neural networks was compared with and without the inclusion of aerosol optical depth. This study concluded that because of missing values in PM2.5 and aerosol optical depth data obtained from satellite, the performance capabilities of the artificial neural network were reduced (Zamani Joharestani et al., 2019). A random forest model for ozone estimation was built at the Research Academy for Environmental Sciences in Beijing, China (Zhan et al., 2022). A linear hybrid machine learning model was applied for PM2.5 concentration estimation in China (Song et al., 2021). In London, PM2.5 has been estimated using the widely available PM10 and NOX emissions with the help of regression modelling as well as the machine learning method (Random Forest) and a combination of both (Analitis et al., 2020). In another study, different ML models were developed to estimate PM2.5 and NOX for three different monitoring networks using local pollution estimates, meteorological data, and emissions from vehicles. The main objective of this study was to check which of the ML models provide the best possible performance and which were the influential variables. Six ML models were investigated to estimate the prediction capability of PM2.5 and NO2 (Li et al., 2020). Daily CO concentration was estimated in Taiwan for the study period from 2000 to 2018, of which the last two years were used for evaluation. Three models using a deep neural network, random forest and XGBoost were used. The authors concluded that XGBoost had the highest R² score of 0.85, followed by random forest and neural network with 0.84 and 0.81 respectively. In comparison, a simple regression model yielded an R² score of 0.69 (Wong et al., 2021). A machine learning method to estimate PM2.5 concentrations was applied across China with remote sensing, meteorological parameters and land use information (Chen et al., 2018). In a study in Munich, Germany, the XGBoost model was built using meteorological parameters, precursors and simulations of
ozone concentration obtained from the CAMS2 dataset to estimate ozone concentration. The objective of this study was to investigate the significance of precursor information in modelling surface ozone using ML. The meteorological parameters such as air temperature, relative humidity, boundary layer height, wind speed and wind direction as well as in-situ ozone precursors (NO, NO2 and CO), and satellite ozone precursors (column NO2 and HCHO) along with CTM simulations (CAMS model surface O3) were used as input parameters. Additionally, day of the week and season were also considered (Balamurugan et al., 2022). Satellite-based estimates of daily NO2 exposure in China were tested using a hybrid random forest and spatiotemporal kriging model (Zhan et al., 2018).
A study by Isam Drewil and Jabbar Al-Bahadili (2022) proposed a model that combines the Genetic Algorithm (GA) with Long Short Term Memory (LSTM) to optimize hyperparameters and predict pollution levels for the next day, focusing on four key pollutants: PM10, PM2.5, CO, and NOX. One of the primary challenges associated with LSTM is the selection of appropriate parameters, such as window size and the number of units in LSTM. The application of the metaheuristic GA offers a successful solution to this issue, allowing for more flexible performance in predicting pollution levels (Isam Drewil and Jabbar Al-Bahadili, 2022). Another study performed by Du et al. (2020) covered the effectiveness of four advanced machine learning methods for spatial data handling: SVM, semi-supervised and active learning, ensemble learning, and deep learning. These methods have been applied to address classification, regression, and inversion problems, showcasing their ability to improve performance in spatial data analysis. However, it should be noted that the scope of machine learning and spatial data handling is broad, and this review only covers a subset of methods (Du et al., 2020). Another method to investigate the spatial and temporal variations of atmospheric pollutants is the use of satellite-based measurement data with ground-based measurement results. Such research was performed to investigate seven cities located near the South Gobi deserts (Filonchyk et al., 2020). The analysis covered the period from January 1, 2016 to December 31, 2018. The main pollutants examined were particulate matters (PM2.5 and PM10) and gaseous pollutants (SO2, NO2, CO) (Filonchyk et al., 2020). Kumar et al. focused on evaluating different interpolation techniques for air quality mapping in Mumbai, India (Kumar et al., 2020b). The authors compared the effectiveness of several interpolation methods, including inverse distance weight (IDW), Kriging (spherical and Gaussian), and spline techniques using data collected from air quality monitoring stations in the city. In terms of statistical assessments, the IDW method indicated a better fit between predicted and observed values. These findings suggested that the IDW approach performs favorably among the interpolation techniques tested in this study (Kumar et al., 2020b).

1.5. Objectives

Forecasting and estimation of pollutants with the help of ML models are becoming an active area of research. The detailed literature review paved the path to knowing which inputs would be influential in estimating the pollutant concentrations. Very few studies have tried to transfer an ML model built for one location to another. In light of the background information provided, this research addresses the knowledge gap concerning the estimation of pollutant concentrations. The dependency of pollution concentration variation on the location along with pollution sources particular to the locality that lead to pollution makes it a challenging task to apply such techniques. In general, the traffic trend can relate to the pollutant concentration variation during peak and off-peak times. The motivation behind this study lies in the need for accurate estimates of pollutant concentrations, especially in areas without monitoring stations.
The aim of this research is to model the pollutant concentrations using ML models at selected locations of Marienplatz and Am Neckartor in Stuttgart, Germany to provide an opportunity to replace the existing monitoring stations with a virtual monitoring station. Another objective was to check the applicability of the developed methodology in other locations apart from Stuttgart. To achieve these objectives, multiple ML models were tested with meteorological parameters, traffic data in the form of the number of vehicles passing per minute, and pollutant concentrations from other monitoring stations as input variables.
The findings of this study may hold relevance for policymakers and researchers, enabling evidence-based decision-making and targeted pollution control measures, for instance to know at which locations a monitoring station is important and whether, in its absence, a model can be used to continue the estimation. By fulfilling these objectives, this research sets the stage for future investigations and the development of effective environmental management strategies, where monitoring is the starting point.

2. Methodology

2.1. Conceptualization

As mentioned before, the main goal of this research was to predict the pollutant concentrations at a certain location. The pollutants of interest in this study were PM2.5, PM10, and NO2. To investigate the applicability of the models, two locations in the city of Stuttgart were chosen. The data for this research was collected from different sources. After data acquisition, an initial analysis of the pollutants PM2.5, PM10, and NO2 concerning outliers and missing values was conducted. As a next step, the relevant features and their relation to pollutant concentration were investigated in detail. Further, the data were split into training and test sets. Then the ML models were trained with the chosen hyperparameter settings. In the end, it was tested if the applied method can be used for other locations.

2.2. Study area

The pollutants of interest PM2.5, PM10 and NO2 were to be modeled at two locations in Stuttgart, namely Am Neckartor and Marienplatz, which have different characteristics. In Fig. 1, the locations Am Neckartor and Marienplatz are shown with blue and black location pins respectively. Image 1 in this figure represents the aerial view of the Am Neckartor location and the small red circle indicates the position of the monitoring station. The monitoring station at Am Neckartor can be seen in Image 2. Similarly, Images 3 and 4 in this figure depict the aerial view of Marienplatz and the monitoring station at Marienplatz respectively. The distance between the two stations is around 3 km.
To perform any ML modelling, data acquisition is the first step. Data were gathered from different sources. The pollutant and meteorological data at Marienplatz were obtained from the Department of Flue Gas Cleaning and Air Quality Control at the Institute of Combustion and Power Plant Technology, University of Stuttgart (Samad and Vogt, 2020). The remaining pollutant and meteorological data were gathered from the Baden-Württemberg State Institute for the Environment (Landesanstalt für Umwelt Baden-Württemberg – LUBW). The LUBW measures the pollutants by establishing a fixed air quality monitoring network at different locations in the city. The traffic data were obtained from the integrated traffic control center in Stuttgart (IVLZ).

2.3. Machine learning workflow

2.3.1. Data processing
The period for the entire study was from January 01, 2018 till March 31, 2022. After data accumulation, the next step was data preprocessing, which made it possible to assess data quality. In this step, missing value analysis and outlier removal were carried out. Missing values are the number of hourly observations that are not available for reasons such as maintenance procedures or malfunctions in the monitoring
network. The missing values of the pollutants at the monitoring station Am Neckartor were below 5% of the total data and at Marienplatz, this count was below 20%. These values were not imputed, so that no bias is introduced into the model (Demertzis et al., 2015). For removing the outliers, the interquartile range method was used with varying values of fences as suggested in a study by Hubert and Vandervieren (2008).

Fig. 1. Am Neckartor and Marienplatz monitoring stations, Stuttgart.

2.3.2. Independent variables/features
The selection of variables that were considered for this research study was based on the literature review and the data availability. The goal was to select the most informative and influential variables that have an impact on the outcome of interest while excluding irrelevant or redundant variables. By considering this, domain knowledge was vital in identifying the variables, as the pollutant concentration of one component can vary depending on temporal factors, meteorological factors, traffic situation, topography and concentration of other pollutants. To visualize this, the concentration variation of the pollutants concerning time, meteorological parameters, traffic and other pollutant concentrations is shown in the form of plots in Figs. 2, 3 and 5.

Fig. 2. Scatter plot of meteorological parameters and pollutants at monitoring station Am Neckartor.
Fig. 3. Scatter plot of meteorological parameters and pollutants at monitoring station Marienplatz.

An amalgamation of features was derived from the data collected to estimate the hourly pollutant values. The features are broadly classified into temporal, meteorological, traffic, and pollutants from other spatially distributed monitoring networks. The temporal features include hourly, daily, weekly and monthly values, as the target output varies with these features. Automobiles play an important role in air quality and account for pollution (Long and Carlsten, 2022; Sun and Zhu, 2019). The traffic data obtained from the integrated traffic control
center (IVLZ) Stuttgart provided the traffic information, i.e. the minute average of the number of vehicles passing by the monitoring stations. As mentioned in the literature, meteorological parameters play a vital role in the variation of pollutant concentrations. Occasionally these meteorological conditions change rapidly, resulting in the transportation of pollutants. High wind speed causes the dispersion of pollutants, transporting them from a few meters to kilometers (Latini et al., 2002). Precipitation causes pollutants to settle down, and at times the pollutants are trapped inside the snow (Tian et al., 2021). Both scenarios lower the pollutant concentrations. Apart from the climatic conditions, the topography of the city can account for increased concentrations, as in the case of Stuttgart (2008).
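The temporal and traffic features described above can be illustrated with a minimal sketch. It assumes that "hourly, daily, weekly and monthly values" correspond to hour of day, day of week, ISO week and month, and that the minute-resolution traffic counts are averaged to hourly means to match the hourly pollutant data; neither detail is stated explicitly in the text. The column name `vehicles_per_min`, the helper functions and the synthetic counts are hypothetical and serve only as an example.

```python
import numpy as np
import pandas as pd


def aggregate_to_hourly(minute_counts: pd.DataFrame) -> pd.DataFrame:
    """Average minute-resolution traffic counts over each hour (assumed step)."""
    return minute_counts.resample("60min").mean()


def add_temporal_features(hourly: pd.DataFrame) -> pd.DataFrame:
    """Derive calendar features from a DatetimeIndex (assumed interpretation)."""
    out = hourly.copy()
    out["hour"] = out.index.hour                # hour of day, 0 to 23
    out["day_of_week"] = out.index.dayofweek    # Monday = 0
    out["week"] = out.index.isocalendar().week.astype("int64")  # ISO week
    out["month"] = out.index.month              # 1 to 12
    return out


# Synthetic example: three hours of made-up minute counts starting
# on January 01, 2018 (the first day of the study period).
minutes = pd.date_range("2018-01-01 00:00", periods=180, freq="min")
traffic = pd.DataFrame({"vehicles_per_min": np.arange(180) % 30}, index=minutes)

features = add_temporal_features(aggregate_to_hourly(traffic))
print(features)
```

In such a setup the resulting table would be joined with the meteorological and pollutant columns on the hourly timestamp before model training.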
The considered meteorological features were air temperature, relative humidity, air pressure, wind speed, global radiation and precipitation. The pair plots for the pollutants PM2.5, PM10 and NO2 and their relationship with the meteorological parameters are shown in Figs. 2 and 3 for the monitoring stations Am Neckartor and Marienplatz, respectively. It can be seen that the pollutant concentrations decrease with high amounts of precipitation and high wind speed, and vice versa. Pollutants can alter the amount of light that reaches the earth's surface (Khodakarami and Ghobadi, 2016). Thus, low radiation levels may result from high pollutant levels (Khodakarami and Ghobadi, 2016). This trend was particularly seen in the case of PM2.5 and was less pronounced for PM10 and NO2. Another notable observation was the impact of humidity on all three pollutants: low pollutant concentrations were observed with low humidity.
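The monotonic trends read off the pair plots could be quantified with a rank correlation, the same kind of measure the study applies to pollutants at different stations. The following is a minimal standard-library sketch of the Spearman coefficient, not the authors' implementation; the function names and the six wind speed and PM2.5 values are invented for illustration only.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman coefficient: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical hourly values (assumption, not measured data).
wind_speed = [0.5, 1.2, 2.0, 3.1, 4.4, 5.0]
pm25 = [30.0, 24.0, 25.0, 15.0, 11.0, 9.0]
rho = spearman(wind_speed, pm25)
print(round(rho, 3))  # strongly negative: PM2.5 falls as wind speed rises
```

Because only ranks enter the calculation, the coefficient captures any monotonic relationship, linear or not, which suits trends such as the decrease of concentrations with wind speed.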
Every city has a different topology. Thus, monitoring with the help of a single monitoring network is hardly possible (Verghese and Nema, 2022). Monitoring stations can be broadly categorized into hot spot, background, and commercial stations depending on the monitoring site. A hot spot station is one situated right next to the source, like traffic. In contrast, a background station is one where pollution levels are not directly affected by emission sources and are represented by land cover and population (Spangl et al., 2007b). At a commercial station, pollutant levels are accounted for in both scenarios, i.e., hot spot and background. Fig. 4 shows the air pollution monitoring network in Stuttgart. The monitoring stations marked by blue pins are operated by LUBW and the monitoring station marked by a black pin is operated by IFK, University of Stuttgart. The name, category and measured pollutants of each monitoring station are listed in Table 1.
Fig. 4. Air pollutant monitoring network in Stuttgart showing continuous monitoring stations operated by LUBW (blue pin) and IFK, University of Stuttgart (black pin).
Fig. 5. Spearman rank correlation matrix between pollutants.
Since the pollutants to be modeled are also measured by monitoring stations at different sites, using this spatial information can assist in estimating the pollutants of interest PM2.5, PM10 and NO2 at the two locations Marienplatz and Am Neckartor. The pollutant concentration is affected by certain factors that are not considered in the ML models, to keep them less complex, such as the location characteristics, topography, varying emission sources, and occasional activities, e.g. construction, festivals, etc. One of the reasons to include pollutant concentrations from other stations as an input was to consider such factors indirectly. The relationship between pollutants measured at different monitoring stations is shown with a Spearman rank correlation matrix (Akoglu, 2018) in Fig. 5. This plot
measures the relationship between two variables in order of their ranks. Thus, it essentially provides a measure of the monotonic relationship between those two variables.

Table 1
Monitoring networks along with the list of pollutants being measured.
Name of monitoring station   Type               Measured pollutants
Bad Cannstatt                Background         PM10, PM2.5, NO2, NO and O3
Am Neckartor                 Hot spot, traffic  PM10, PM2.5, NO2 and NO
Arnulf Klett Platz           Commercial         PM10, PM2.5, NO2 and NO
Hohenheimer Straße           Hot spot, traffic  NO2 and NO
Hauptstatter Straße          Hot spot, traffic  NO2 and NO
Marienplatz                  Commercial         PM10, PM2.5, NO2, NO and O3

In the correlation matrix in Fig. 5, the pollutant at each station is denoted as pollutant_stationname. The correlation ranges from −1 to 1. A correlation near 1 means the features are positively correlated, whereas a correlation near −1 means they are negatively correlated. It can be seen from the correlation matrix that PM2.5_Marienplatz has a decent correlation of 0.76 and 0.7 with PM2.5_BadCanstatt and PM2.5_ArnulfKlettPlatz, respectively. As a common observation, PM2.5 is the most highly correlated between different locations, followed by PM10 and then NO2. Ozone is negatively correlated with PM2.5, PM10 and NO2. This matrix serves as substantial support for including pollutants from other monitoring networks in establishing the ML models at Marienplatz and Am Neckartor.

2.3.3. Train and test split
After identifying the features, the next step was to divide the available hourly data into training and test data. The training data was selected from January 01, 2018 to March 31, 2021 and the test period from April 01, 2021 to March 31, 2022. Approximately 75% of the data were used for training and the remaining test data (25%) covered one complete year. The advantage of selecting an entire year for test data is that it covers the seasonal variation that can influence the pollutants over a year. The train and test split percentages were based on the literature review, in which many authors suggested a division in this range. A 10-fold cross-validation technique was employed, i.e., dividing the train set into ten equal parts and, each time, training on nine parts and evaluating on the remaining part as the validation set. In this way, the generalizing ability of the model increases during training, and hence the model may yield plausible results when evaluated on the test set. The purpose of evaluating on one part after training on nine parts is to find the best set of hyperparameters for each of the models.

2.3.4. Model training and hyperparameter tuning
Five different ML models were trained with the same training data obtained after the split. The training strategy is shown in Fig. 6, in which each row contains various temporal, meteorological, and traffic features and the corresponding target (one pollutant measurement) fed to each ML model. Here each row represents an hourly timestamp, named T1 to Tn. For every timestamp, say Tx, the corresponding model outputs are named O1 to O5 for the ML models M1 to M5. The minimum, average, and maximum values of the obtained outputs for every timestamp were analyzed. Obtaining the minimum, average, and maximum value gives a range of pollutant concentrations at each timestamp. Averaging the results provides more robust outputs; several studies (Wichard and Ogorzalek, 2004; Maqsood et al., 2004; Talebizadeh and Moridnejad, 2011) suggest that averaging can enhance performance.
The individual ML models along with their hyperparameter settings
Fig. 6. ML workflow including input features, models and outputs.
are listed in Table 2. The selection of ML models for this study was based on the literature review. These ML models were proposed in previous studies predicting air pollutant concentrations. All the models used are available in the Scikit-Learn³ library. The best possible hyperparameter set was obtained after several trials, avoiding overfitting. Finally, the built models were evaluated on the test set.

3. Results and discussion

To estimate the hourly pollutant concentration in the study areas, the ML models were built for the following scenarios depending on the data availability and the variation of pollutant concentration with different features.

• Scenario 1: ML models were built with the meteorological and temporal features as input.
• Scenario 2: Traffic was added as an input feature to the features of scenario 1.
• Scenario 3: Along with the features of scenario 2, the same pollutant concentration from the background monitoring station (Bad Cannstatt) was introduced.
• Scenario 4: Along with the features of scenario 2, pollutants from the stations mentioned in Table 1, other than monitoring station Marienplatz and monitoring station Am Neckartor, were considered.

3.1. Scenario 1

In Scenario 1, temporal features such as month, day and hour were considered. The meteorological features air pressure, precipitation, global radiation, temperature, wind speed, and humidity were given as input features. The ML models were tuned with a ten-fold cross-validation technique to obtain the optimal set of hyperparameters. After training, the models were evaluated using the test data set to assess their performance. The performance metrics for individual pollutants at both locations were established in order to investigate the model performance.

Table 2
ML models along with Hyperparameters used.
ML models                            Hyperparameters and range
Ridge Regressor (RIDGE)              Regularizer λ
Support Vector Regressor (SVR)       Kernel, C, degree, epsilon
Random Forest Regressor (RFR)        Estimators, max_depth, min_samples_leaf, max_features
Extra Trees Regressor (ETR)
Xtreme Gradient Boosting (XGBOOST)   Estimators, max_depth, min_samples_leaf, learning_rate, subsample, column_sample_tree

The performance metrics of the different models used in this study for the PM2.5, PM10, and NO2 pollutants, illustrating the values for R2, MAE, and RMSE on training and test data for each pollutant, are shown in Fig. 7 for the monitoring station Marienplatz, Stuttgart. The error metric results for the monitoring station Am Neckartor, Stuttgart are presented in Figure A1 in Appendix. Training and test data for each pollutant were applied to the following ML models independently: RIDGE, SVR, RFR, ETR and XGBOOST. First, the training R2 for PM2.5 was always higher than the test R2 for all models. In particular, the RFR and ETR models showed significantly higher R2 values for the training data than for the test data. Generally, bagging methods tend to overfit, which was observed in this case. This trend was also observed for PM10 and NO2. However, the RIDGE and SVR models recorded lower R2 values for the training data. This behavior was common to all the pollutants measured at the monitoring station Marienplatz, Stuttgart. For PM10, the RIDGE model resulted in a higher R2 value for the test data than for the training data. Overall, the models except for ETR and RFR were found to be underfitting with very low R2.
The right-side graph of Fig. 7 demonstrates substantial error values for both training and test data for NO2 in comparison with PM2.5 and PM10. This observation was recorded for all the models used in this study at the Marienplatz, Stuttgart location. Also, the difference in error metrics between training and test data was insignificant for all the pollutants. This indicates that all models, after tuning hyperparameters, performed similarly on train and test data. The error values for NO2 were found to be considerably larger than those for PM10 and PM2.5.
In Figure A1, the performance metrics of Am Neckartor are shown. The R2 score was similar to that of PM10 and PM2.5 at Marienplatz. However, for NO2, the models performed poorly regarding the test R2 score. It can be seen that the R2 score for NO2 on the test data is negative, a clear case of underfitting, which indicated that the models did not learn properly and new features were required. On the other hand, the MAE and RMSE for both train and test data for PM10 at Am Neckartor were twice those for PM2.5 at Am Neckartor. Common observations in the performance metrics in Fig. 7 and A1 include overfitting of the bagging-based methods, especially for PM2.5 and PM10. The performance metrics obtained for NO2 showed unsatisfactory results when compared to the other two pollutants. The RIDGE model performed poorly compared to the remaining four models, with a low R2 score and high MAE and RMSE for both train and test data.
In a related study (Zamani Joharestani et al., 2019), focusing solely on Delhi and employing meteorological parameters for PM2.5 estimation, the best performing model, a combination of extra trees and AdaBoost, achieved an MAE of 14.3 μg/m3. In contrast, the MAE values for Scenario 1 ranged between 6.2 and 6.4. In terms of NO2 estimation, the results in Scenario 1 demonstrated better R2 scores at both Marienplatz and Am Neckartor compared to the models presented in the study done by Zhiyuan et al. (Li et al., 2020).
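The Scenario 1 procedure (a chronological train and test split, ten-fold cross-validation on the training set to choose hyperparameters, then R2, MAE and RMSE on both sets) can be sketched as follows. This is a minimal standard-library illustration rather than the study's Scikit-Learn pipeline: the single-feature ridge model, the candidate λ values, the contiguous fold layout and the synthetic data are all assumptions made for the example.

```python
import math
import random


def mae(y, p):
    return sum(abs(a - b) for a, b in zip(y, p)) / len(y)


def rmse(y, p):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, p)) / len(y))


def r2(y, p):
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, p))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot  # negative when worse than predicting the mean


def fit_ridge_1d(x, y, lam):
    """Closed-form ridge fit for one feature; the slope shrinks as lam grows."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / (sxx + lam)
    return slope, my - slope * mx


def predict(model, x):
    slope, intercept = model
    return [slope * a + intercept for a in x]


def k_fold_indices(n, k=10):
    """Yield (train_idx, val_idx); the k contiguous validation folds cover range(n)."""
    bounds = [round(i * n / k) for i in range(k + 1)]
    for i in range(k):
        val = list(range(bounds[i], bounds[i + 1]))
        train = list(range(0, bounds[i])) + list(range(bounds[i + 1], n))
        yield train, val


def cross_val_rmse(x, y, lam, k=10):
    """Mean validation RMSE over k folds for one candidate hyperparameter."""
    scores = []
    for tr, va in k_fold_indices(len(x), k):
        model = fit_ridge_1d([x[i] for i in tr], [y[i] for i in tr], lam)
        scores.append(rmse([y[i] for i in va], predict(model, [x[i] for i in va])))
    return sum(scores) / k


# Synthetic data (assumption): one input feature and a noisy linear target.
random.seed(0)
x = [random.uniform(5, 40) for _ in range(400)]
y = [0.8 * a + 3.0 + random.gauss(0, 2.0) for a in x]

# Chronological split: first 75% for training, last 25% held out for testing.
split = int(0.75 * len(x))
x_tr, y_tr, x_te, y_te = x[:split], y[:split], x[split:], y[split:]

# Hyperparameter selection uses the training set only.
candidates = [0.0, 1.0, 10.0, 100.0, 1000.0]
best_lam = min(candidates, key=lambda lam: cross_val_rmse(x_tr, y_tr, lam))
model = fit_ridge_1d(x_tr, y_tr, best_lam)

for name, xs, ys in (("train", x_tr, y_tr), ("test", x_te, y_te)):
    p = predict(model, xs)
    print(name, round(r2(ys, p), 3), round(mae(ys, p), 2), round(rmse(ys, p), 2))
```

Comparing the train and test rows is what the discussion above relies on: a large gap in R2 points to overfitting, while low or negative R2 on both sets points to underfitting.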
Fig. 7. Scenario 1 performance metrics for pollutants PM2.5, PM10 and NO2 at monitoring station Marienplatz, Stuttgart.
3.2. Scenario 2

In Scenario 2, in addition to the temporal and meteorological features, traffic counts, i.e., vehicles passing on the roads adjacent to the monitoring station, were also considered. Traffic is one of the contributors to air pollution (Long and Carlsten, 2022). Hence, the models were retrained as described in Scenario 1 with the addition of traffic as a new parameter to the existing meteorological and temporal features.
Figure A2 in Appendix shows the performance metrics at Marienplatz after adding the traffic feature. The R2 score, MAE and RMSE for train and test data for PM2.5 and PM10 show no improvement compared to Scenario 1. Here too, the test MAE and RMSE were comparable to the train MAE and RMSE values. All the ensemble methods still tend to overfit while the other two are underfitting. For NO2, with respect to the R2 score, an improvement was observed across all the algorithms on both train and test data. However, only a small drop in MAE and RMSE values was noticed when compared to Fig. 7.
Figure A3 in Appendix shows the performance metrics at Am Neckartor after providing the traffic feature. For PM2.5 and PM10, there was no change in performance metrics with respect to Scenario 1; all the performance metrics remained unchanged even after the addition of traffic. However, for NO2, there was an improvement
observed. It is assumed that, since the monitoring station is adjacent to the traffic source (federal highway) and traffic emissions contribute more to NO2 than to particulate matter (LUBW and Landesanstalt für Umwelt Baden-Württemberg, 1999), the NO2 concentrations can be predicted better than particulate matter. The R2 score of the training data set increased and a reduction in MAE and RMSE was seen when compared to NO2 in Figure A1. The R2 nearly doubled from 0.35 to 0.6, while the train MAE and RMSE were reduced by approximately 15%–20%. When the same models were evaluated on the test data set, the R2 was still negative but a slight decrease was noticed. A similar scenario was observed with the test RMSE and MAE.
The key observations in Scenario 2 were that adding traffic brought no significant improvement in the performance metrics of PM2.5 and PM10 at both Marienplatz and Am Neckartor. However, there was an improvement in NO2 performance metrics at both locations. This positive effect was more noticeable at Am Neckartor than at Marienplatz, as Am Neckartor is a traffic hot spot and hence more sensitive to traffic.
The study performed by Rybarczyk and Zalakeviciute (2017) primarily focused on developing a regression model with weighted coefficients for estimating PM2.5, for which an RMSE value of 7.3 was obtained by incorporating meteorological parameters and traffic. Interestingly, this is similar to the results of Scenario 2, where all the models yielded RMSE values ranging from 5.9 to 6.5.
The difference between the two studies is the treatment of traffic: the current research study used vehicle count data measured through sensors, whereas the study mentioned above (Rybarczyk and Zalakeviciute, 2017) categorized traffic as low, medium, and high.

3.3. Scenario 3

In Scenario 3, apart from the features mentioned in the previous scenario, the concentration of the same pollutant at the background station at Bad Cannstatt was added as an extra feature. Hence, to estimate the PM2.5 concentration at Marienplatz and Am Neckartor, the PM2.5 concentration at Bad Cannstatt was provided in addition to the temporal, meteorological and traffic features. Similarly, to estimate PM10 and NO2 concentrations at both Marienplatz and Am Neckartor, PM10 and NO2 from Bad Cannstatt were provided, respectively. The key idea was that the air pollutant concentration at nearby stations is correlated with the concentration at the site under consideration due to the dispersion and advection of air pollutants in the area. The distance between the monitoring stations Am Neckartor and Bad Cannstatt is around 4 km, while that between Marienplatz and Bad Cannstatt is around 7 km.
Figure A4 in Appendix shows that the performance metrics at Marienplatz improved when compared to the previous two scenarios. For all the pollutants, one common observation was that the effect of underfitting was reduced compared to Scenario 2. This can be seen especially by comparing the R2 scores of all the models for each pollutant. Starting with PM2.5, improved R2 scores were observed for the train data compared to PM2.5 in Scenario 2. The ensemble methods outperformed RIDGE and SVR, and the MAE and RMSE for the test data were reduced to half. When the same models were applied to the test data, a similar performance was noticed in all performance metrics. The ML models were able to capture the trends and learn better, hence generalizing well on the new unseen test data. The main reason could be attributed to the fact that PM2.5 at Marienplatz, the pollutant to be estimated, had a strong correlation with PM2.5 at Bad Cannstatt (0.76). A similar pattern was noticed for PM10 and NO2. In the case of PM10, the MAE and RMSE were reduced by nearly 40%. However, for NO2 a reduction of 25% was seen.
Figure A5 in Appendix depicts the performance metrics of Am Neckartor. For PM2.5 at Am Neckartor, a decent generalization was observed with respect to training and test data across all models. This can be seen by observing the respective performance metrics of R2, MAE and RMSE for the train and test data. When compared to Marienplatz, better performance was achieved due to the higher correlation of respective parameters. For PM10, compared to Scenario 2, the MAE and RMSE were reduced by 4 μg/m3, and an increase in the R2 score was also noticed. The ensemble models, especially XGBOOST, showed the best performance across train and test data in all performance metrics. The R2 scores were positive for NO2 at Am Neckartor; still, a substantial difference between train and test results existed. The train and test MAE and RMSE were reduced by 20% when compared to the performance metrics of NO2 for Scenario 2. Finally, from Scenario 3 it was observed that providing the background pollutants had a positive effect on the prediction model results, which was highly prominent in the case of PM2.5, followed by PM10 and then NO2. This also indicated that ML models can achieve a reasonable pollutant concentration estimation even with the pollutant from only one monitoring station included as a feature. Thus, pollutant concentration as an input feature showed a significant impact on predicting the pollutant concentrations.

3.4. Scenario 4

In Scenario 4, in addition to the features mentioned in Scenario 2 (temporal, meteorological and traffic), pollutant concentrations from the monitoring stations mentioned in Table 1 were provided as input features to check whether adding them can further improve the model evaluations. No pollutants from Marienplatz and Am Neckartor were provided as input features, as the pollutants from those monitoring stations were to be modeled. To estimate the NO2 concentration at Am Neckartor and Marienplatz, not only the NO2 from the remaining stations but also PM10, NO, and O3 concentrations were provided as input features. In this manner, the effect of cross-sensitivity between the pollutants can be established.
In Fig. 8, the performance metrics for all three pollutants for Scenario 4 are displayed. Compared to scenario 3, there is a slight improvement in the PM2.5 and PM10 performance metrics. For PM2.5 the MAE was below 2 μg/m3 and the RMSE was below 2.5 μg/m3 for all models. In the case of PM10, the MAE was around 2.5 μg/m3 for all models on both train and test data. All the models seem to be performing well for PM2.5. For PM10 and NO2, RFR, ETR, and XGBOOST performed better because of their ability to capture non-linearity. Another notable difference from the scenario 3 performance metrics was that NO2 prediction improved by nearly 20%, keeping the RMSE value below 10 μg/m3 for all the models except RIDGE. This further signifies the ability of ensemble methods to capture the non-linear relationship.
The performance metrics for all three pollutants for Am Neckartor are given in Figure A6 in Appendix. Similar to Marienplatz, the PM2.5 and PM10 performance metrics improved slightly. However, a significant improvement was seen in the NO2 metrics. Compared to NO2 in Scenario 3, the performance of the models improved considerably. The R2 for the train data, previously in the range of 0.6–0.8, increased up to 0.9, whereas for the test data an improvement was noticed, reaching a value between 0.7 and 0.8. Even the phenomenon of underfitting was eliminated, achieving a good amount of generalization. Also, the error metrics MAE and RMSE were halved for the train and test data. Thus, pollutant concentrations from other monitoring stations played an outstanding role in NO2 concentration estimation.
In the study performed by Rybarczyk and Zalakeviciute (2017), the authors observed an improvement in PM2.5 concentration prediction after including trace gases as a feature. This outcome aligns with the results of Scenario 4. It is worth mentioning that this study (Rybarczyk and Zalakeviciute, 2017) considered data for two months, whereas the prediction set of the current research comprised a complete year. Additionally, the study done by Kumar et al. (2020b) estimated pollutant concentrations using conventional methods such as Inverse Distance Weighting (IDW) and kriging on a monthly basis at various sites in Delhi. The IDW and kriging methods exhibited an average percentage error of around 45% for NO2. In contrast, the results in Scenario 4 showed an absolute percentage error of approximately 25% for NO2 at
Fig. 8. Scenario 4 performance metrics for pollutants PM2.5, PM10 and NO2 at monitoring station Marienplatz, Stuttgart.
Marienplatz and Am Neckartor. Consequently, when compared to conventional methods, the developed approach resulted in an average reduction in percentage error of around 20 percentage points.
Since Scenario 4 proved to be the best one compared to the other scenarios, a residual error plot between the actual and predicted PM2.5, PM10 and NO2 concentrations was made for this scenario; it is shown in Fig. 9 for the location Am Neckartor. The results for the location Marienplatz can be seen in Figure A7 in Appendix. Residuals are the difference between actual and predicted outcomes. The advantage of the residual plot is that the overall range of the error for individual time steps can be observed. These residual plots are based on the test data. For every time step (hour), three estimations, namely minimum, average and maximum, were obtained from the models. For the following results, only the average values of all model outputs were considered as the predicted outcome. One reason to use averaged outcomes was that the overall MAE decreased slightly. After averaging the results, the MAE for PM2.5 at Marienplatz and Am Neckartor was 1.4 and 1.1 μg/m3, respectively. For NO2, at Marienplatz and Am Neckartor, the average MAE was 6.3 and 5.3 μg/m3, respectively. Thus, averaging yielded a small reduction of MAE when compared with the test MAE of the individual models in Scenario 4.
When the predictions were near to actual concentrations, the
Fig. 9. Residual error plots of pollutants (a) PM2.5, (b) PM10 and (c) NO2 at Am Neckartor, Stuttgart.
residual error was close to zero (indicated via the red line). For all three pollutants a common observation was that the residuals were centered around zero, not inclining heavily towards either side, which indicated that the models can be used for future evaluation. However, a few observations at the Marienplatz location had residual errors of ±15 μg/m3, which could potentially be linked to specific events that are particular to that location.
The PM2.5 residual errors at Am Neckartor showed a better fit compared to PM2.5 at Marienplatz. However, for PM10 (green line) in December to January, some extreme outliers were noticed, which were not captured by the models. For NO2, the residual errors observed were within the range of ±15 μg/m3.

3.5. Summary

In this section, a comparison of every pollutant across all four scenarios is presented. The predicted outcomes are the minimum, average and maximum pollutant concentrations. Fig. 10 presents the predicted and actual concentrations of PM2.5 across all four scenarios on the test data at Marienplatz. Hourly pollutant concentrations were estimated using the ML models and were averaged over 24 h for these graphs. Since the test data spans April 2021 to March 2022, the complete year is divided into four quarters from Q1 to Q4 for better understanding. For scenario 1 and scenario 2, a notable deviation was observed between predicted and actual concentration values. Also, the predicted outcomes were unable to cover sudden increases in concentration values, which is visible in Q2 and Q4. For better understanding, a traceback of the train data was performed, where the range of PM2.5 during June was found to be between 7 and 12 μg/m3. So, estimating PM2.5 with meteorological and traffic parameters led to a mediocre performance. However, in scenario 3, when pollutant concentration was provided, the trends were captured across all four quarters. However, in
Fig. 10. Comparison of PM2.5 across different scenarios at Marienplatz, Stuttgart.
Q4, there was a slight overestimation of pollutant concentration. In scenario 4, even after providing all the pollutants, no significant improvement was seen. Nevertheless, the performance metrics show a small improvement in the R2 score and an overall decrease in test MAE and RMSE values. A similar phenomenon was noticed concerning PM2.5 at Am Neckartor across all four scenarios, shown in Figure A8 in Appendix.
In Fig. 11 and A9 in Appendix, PM10 is compared across all four scenarios at the locations Marienplatz and Am Neckartor, respectively. For Scenarios 1 and 2, the models were unable to capture the sudden changes, similar to the previous results. However, for PM10 concentration at Am Neckartor in scenarios 1 and 2, the sudden changes were followed better compared to the PM10 concentration at Marienplatz. Still, a deviation with respect to the actual values was detected. By adding the spatial pollutant concentrations, the performance was improved, which can be observed in scenarios 3 and 4.
The NO2 comparisons across all scenarios for the locations Marienplatz and Am Neckartor are presented in Fig. 12 and A10 in Appendix, respectively. The impact of providing traffic was found to be minimal at Marienplatz. However, the effect of providing the pollutants from other monitoring stations was clearly noticeable for NO2 at both locations. For the NO2 results at Am Neckartor, a significant
Fig. 11. Comparison of PM10 across different scenarios at Marienplatz, Stuttgart.
improvement was observed between the predicted and actual values in every scenario. Finally, the best performance was obtained in scenario 4, where the predicted and actual values showed a strong correlation.

3.6. Feasibility of the proposed concept

In scenario 4, it was observed that adding pollutants from other monitoring stations resulted in enhanced performance of the ML models. A similar concept was applied to the monitoring station Nordwest in Karlsruhe, to check the feasibility of the developed method. The distance between this station and the two stations in Stuttgart is around 75 km. Two main reasons for choosing this particular location were the ease of data access and the availability of nearby monitoring stations measuring the required parameters. However, no traffic data was available for this location. Meteorological parameters were available at Karlsruhe Nordwest, which included air temperature, air pressure, precipitation and wind speed. The same pollutants PM2.5, PM10 and NO2 were modeled at this location. Table 3 shows the list of monitoring stations in Karlsruhe along with the measured parameters.
To estimate PM2.5, PM10 and NO2 at Karlsruhe Nordwest,
Fig. 12. Comparison of NO2 across different scenarios at Marienplatz, Stuttgart.
Table 3
LUBW Monitoring stations in Karlsruhe with the list of pollutants being measured.
Monitoring stations            Pollutants monitored
Karlsruhe Nordwest             PM2.5, PM10, NO2, NO, O3
Eggenstein                     PM2.5, PM10, NO2, NO, O3
Reinhold Frank Straße          PM2.5, NO2, NO
Pflinztal Karlsruher Straße    NO2, NO

meteorological parameters and pollutants from the remaining three monitoring stations were given as inputs. To estimate PM10 at Karlsruhe Nordwest, all four measured pollutants from Eggenstein, three from Reinhold Frank Straße and two from Pflinztal Karlsruher Straße were provided to all five ML models. Figure A11 in Appendix shows the performance metrics of the results. The results show that the performance for PM2.5 was comparatively better than for the other two pollutants because of its homogeneous distribution. For PM2.5, all the models obtained an R2 score above 0.9 on both train and test data. The error metrics displayed similar results, and a very low MAE in the range of 1–2 μg/m3 was obtained on train and test data. For PM10, higher
R2 values were obtained for SVR and ensembles compared to RIDGE. For all algorithms, the R2 was in the range of 0.92–0.94 on the train data and 0.83–0.84 on the test data. From the PM10 performance metrics, the MAE for all the models was around 1.3–1.6 μg/m3 on the train data and 2.1–2.9 μg/m3 on the test data. With this performance, the models were neither underfitting nor overfitting, and decent generalization was noticed on new unseen data. The performance metrics for NO2 showed similar results.
The comparison between the predicted and actual pollutant values is presented in Fig. 13. Here, instead of the predicted minimum, average and maximum hourly values, a daily plot is presented for better visualization. It can be seen from all three subplots that the models were able to learn the trends and also capture the fluctuations during the entire duration of the test data.
In Figure A12 in Appendix, the residual plot of pollutants PM2.5, PM10 and NO2 at Nordwest, Karlsruhe is shown. Here, the average values from all 5 models were taken and subtracted from the actual value. Most of the residuals lie in the range of ±7.5 μg/m3. However, some deviations were noticed from January to March, where the models were underperforming, with room for improvement. Also, for PM10 and NO2 the residuals were observed to lie within the range of ±20 μg/m3. When compared to the PM10 and NO2 residual plots of Marienplatz and Am Neckartor, the residuals at Karlsruhe Nordwest were lower.

4. Conclusions

In this research, four different scenarios were explored to investigate the performance of ML models for estimating pollutant concentration. From the results, it can be concluded that the pollutants from other monitoring stations, as an input feature, played a significant role in estimation. In each scenario, an improvement in performance was seen with the addition of a new feature.
In scenario 1, a mediocre performance was obtained, as the models were unable to capture any fluctuations and were only able to detect simple moving averages. Also, in this scenario overfitting of the ensemble models (RF, ETR, XGBoost) and underfitting of RIDGE and SVR were observed across all three pollutants at both locations. The MAE for PM2.5 was similar for both locations; however, the MAE for PM10 and NO2 at Am Neckartor was twice that at Marienplatz. To enhance the predicting ability, in scenario 2, traffic data was added as an extra feature to explore its impact. An improvement was observed for NO2 at Am Neckartor. However, no improvements were observed for the remaining pollutants at both locations. After adding
Fig. 13. Comparison of performance metrics for pollutants (a) PM2.5, (b) PM10 and (c) NO2 predicted and actual values at Nordwest, Karlsruhe.
16
pollutants from a background monitoring station (Bad Cannstatt) in scenario 3, significant improvement was observed across all pollutants at both locations. At Am Neckartor, the impact of providing the background pollutants was more visible. A similar phenomenon was observed at Marienplatz, where the MAE for PM10 and NO2 was reduced. Another notable aspect of scenario 3 was that the problem of overfitting and underfitting was eliminated. Finally, in scenario 4, when pollutants from other monitoring stations were also added to the existing features in scenario 2, the best performance was obtained, with the lowest MAE for all the pollutants. The impact was more prominent for NO2 at both Marienplatz and Am Neckartor. However, in the case of PM2.5 and PM10, only a slight decrease in MAE was observed at both locations. The results from the residual plots for scenario 4 showed that the models were able to capture most of the trends and achieve decent generalizing ability. The comparison of this research with existing approaches for pollutant estimation revealed the effectiveness of the developed method in achieving accurate results. Additionally, this technique outperformed conventional methods such as inverse distance weighting and kriging.

Finally, to estimate PM2.5, PM10 and NO2 at Karlsruhe Nordwest, a similar set of features as described in scenario 4 was employed without traffic data, as it was unavailable. After evaluating the results, it was concluded that the ML models were able to estimate pollutant concentrations at other locations as well. Hence, the developed technique can be transferred to any location where pollutant prediction is required.

The scope of this research in the field of air quality monitoring can be significant, as by applying this method the monitoring stations can be replaced with ML models, creating a virtual monitoring station. In this manner, dependency on the monitoring stations can be reduced and high costs can be avoided. This can also assist the respective authorities in identifying the minimum number of monitoring stations needed to achieve maximum coverage within a city. Furthermore, this study can benefit from incorporating neural networks such as CNN-LSTM (Convolutional Neural Network - Long Short-Term Memory) to capture temporal dependencies and patterns in the data, which could further enhance the accuracy of pollutant estimation. The limitation of this study is that forecasting of pollutant concentrations is not possible, as the data from other monitoring stations is required for prediction. Further research can be done in this regard so that the pollutant concentrations can be forecast for the following days, which would enable proactive measures and decision-making based on anticipated air quality conditions.

CRediT authorship contribution statement

A. Samad: Conceptualization, Methodology, Software, Data curation, Investigation, Writing – original draft, Supervision, Writing – review & editing. S. Garuda: Conceptualization, Methodology, Software, Data curation, Investigation, Writing – original draft, Formal analysis, Visualization, Writing – review & editing. U. Vogt: Conceptualization, Methodology, Supervision, Writing – review & editing. B. Yang: Supervision, Writing – review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.

Appendix A. Supplementary data

Supplementary data to this article can be found online at https://doi.org/10.1016/j.atmosenv.2023.119987.
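As an illustration of the residual analysis described in Section 3 (the average of the values from all 5 models subtracted from the actual value), a minimal sketch is given below. It is not the code used in this study: the function names, the example concentrations and the choice of plain Python are assumptions made purely for illustration, and only the five model names and the ±7.5 μg/m3 band are taken from the text.

```python
from statistics import mean


def ensemble_residuals(actual, predictions_by_model):
    """Return actual minus the mean prediction across all models, per time step."""
    n = len(actual)
    for name, preds in predictions_by_model.items():
        if len(preds) != n:
            raise ValueError(f"{name}: expected {n} predictions, got {len(preds)}")
    residuals = []
    for i, observed in enumerate(actual):
        # Average the predictions of all models for this time step
        avg_pred = mean(preds[i] for preds in predictions_by_model.values())
        residuals.append(observed - avg_pred)
    return residuals


def share_within(residuals, band):
    """Fraction of residuals whose absolute value lies within +/- band."""
    return sum(abs(r) <= band for r in residuals) / len(residuals)


# Hypothetical daily PM2.5 values in ug/m3 (illustrative numbers, not study data)
actual = [12.0, 18.0, 32.0, 9.0]
predictions = {
    "RIDGE": [11.0, 16.0, 20.0, 10.0],
    "SVR": [13.0, 17.0, 22.0, 9.0],
    "RF": [12.0, 19.0, 24.0, 8.0],
    "ETR": [12.0, 18.0, 23.0, 9.0],
    "XGBoost": [12.0, 20.0, 21.0, 9.0],
}

residuals = ensemble_residuals(actual, predictions)
mae = mean(abs(r) for r in residuals)       # MAE of the averaged prediction
in_band = share_within(residuals, 7.5)      # share of residuals within +/- 7.5 ug/m3
print(residuals, mae, in_band)
```

In this toy example the third day is underestimated by the averaged prediction, so its residual falls outside the ±7.5 μg/m3 band, which is the kind of deviation reported for January to March.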
