Microchim. Acta 148, 293–298 (2004)
DOI 10.1007/s00604-004-0279-2
Original Paper
Multivariate Statistical Assessment of Air Quality: A Case Study
Vasil Simeonov1,*, Stefan Tsakovski1, Tomaz Lavric2, Pavlina Simeonova3, and Hans Puxbaum2
1 Chair of Analytical Chemistry, Faculty of Chemistry, University of Sofia “St. Kl. Okhridski”,
J. Bourchier Blvd. 1, 1164 Sofia, Bulgaria
2 Institute of Chemical Technologies and Analytics, Vienna University of Technology, Getreidemarkt 9/164,
1060 Vienna, Austria
3 Institute of Solid State Physics, Bulgarian Academy of Sciences, Tzarigradsko Chaussee 72, 1784 Sofia, Bulgaria
Received March 12, 2004; accepted August 31, 2004; published online November 5, 2004
© Springer-Verlag 2004
Abstract. The present paper deals with the application of several chemometrical methods (cluster and
principal components analysis, source apportioning on
absolute principal components scores) to an aerosol
data collection from Unterloibach, Austria. It is shown
that seven latent factors explaining almost 80% of the
total variance are responsible for the data structure and
are conditionally identified as “secondary aerosol”,
“mineral dust”, “oil burning”, “lead smelter”, “coal
burning”, “salt” and “fertilizer” emission sources.
Furthermore, the contribution of each identified source
to the formation of the particle total mass and chemical
compounds total concentration is calculated. Thus, a
reliable assessment of the air quality in the region is
performed. In this case, the requirements of the sustainability concept for ecological indicators are easily
transformed into a multivariate statistical problem that takes into account not separate indicators but the specific
multivariate nature of aerosol pollution.
Key words: Chemometrics; air quality; cluster analysis; principal
components analysis; source apportioning.
* Author for correspondence. E-mail: VSimeonov@chem.unisofia.bg
In recent years the concept of sustainable development has been successfully introduced and exploited
not only as a legislative and political formula but also
as the basis for various research projects. The current
discussion about sustainability is limited mostly to
descriptive and normative elements. The question is
not whether a certain activity is sustainable but what
recommendation is needed in order for an activity to
be considered sustainable. In terms of environmental
problems, this seems quite formal and undefined [1].
For instance, the traditionally accepted idea of sustainability is to find a sound compromise between
the constantly increasing needs of mankind for energy
and raw materials on the one hand and, on the other, the
social requirement for a clean environment and better chances for the coming generations. In order to
achieve sustainability, people will need to reach a
mutual agreement or reliable solutions concerning a
range of individual problems, such as the future development of clean technologies and products, control of
the prices of natural energy resources, or funding of alternative energy supplies.
Obviously rapid technological and social development requires some kind of sustainability metrics
in order to control and establish the system of
sustainable development. The concept requires the
implementation of standard sustainability measures
in the industrial sector, in environmental policy, in
economics, and in social life. The challenge to
become environmentally relevant has led to the
development of two important concepts for sustainability indicators:
– The P–S–R indicator concept (the pressure of
socio-economic activities on natural systems
leads to observable changes in the state of the
environmental systems, which in turn calls for a
response, i.e. socio-economic measures to reduce
the hazardous effects);
– The D–P–S–R indicator concept (the socio-economic drivers cause the pressure, which changes
the state and calls for response).
Support for these basic concepts is reflected in the
creation of various eco-efficiency indicators, which
track and report the energy, waste and water parameters
required for the definition of sustainability [2]. Such
indicators have been introduced in the economy and
there are efforts to find appropriate eco-efficiency
indicators for social life. However, these metrics lead
to univariate estimates of the real environmental problems: keeping a certain value within the allowable
limits; following a trend; resolving a local pollution
problem; comparing a product with a price; assessment of public opinion about an environmental issue
etc. At present, all efforts concentrate on the definition
and calculation of indicators for separate applications only – production, economics or the social
sciences.
The natural environment is, indeed, a multivariate
complex system, and its quality assessment with
respect to sustainability requires multivariate approaches and metrics. The capability of chemometrics
and environmetrics to handle multivariate systems
and objects has helped many environmental studies
[3–6] in reaching the correct data classification,
modeling and interpretation. Thus, chemometrics
turns out to be a very effective tool for problem solving and decision-making. It is the aim of the present
study to illustrate some of the multivariate solutions
offered by chemometrics when applied to environmental studies. This is how chemometrics could be
used as a metrics tool in sustainability research. The
multivariate statistical treatment of the monitoring
data of aerosols from the border between Austria
and Slovenia made it possible to extract important
information about environmental pollution in the
region and to perform a reliable assessment of the
air quality.
Experimental
Sampling Site and Sampling Procedure
The sampling site of Unterloibach is located in the Austrian province of Carinthia at a height of 629 m a.s.l. (latitude 46°32′18″ and
longitude 14°48′52″). Unterloibach is a typical rural site located
near the border with Slovenia. Nearly 15 km to the northeast there is a lead
smelter (on Slovenian territory) which is still active, and to the east,
again in Slovenia, a steelworks produces special quality steels. To
the south lies the largest Slovenian coal power station,
which supplies over 75% of the country’s electricity.
The aerosol data was gathered in the period between March 1999
and February 2000. Sampling was performed with a high-volume
sampler (Digitel DHA-80), which is a completely automated device
described in detail elsewhere [7]. The aerosol particles of class PM10
were collected daily on quartz fiber filters (QAT-UP, Pallflex, USA),
thus allowing the determination of the carbon content. The complete
description of the sampling device and the pre-sampling preparation
of the filters can be found in [7].
Analytical Procedures
The particle total mass was determined by weighing the sampling
filters before and after sampling according to the CEN standard [8].
Determination of the water-soluble ions (cations: sodium,
ammonium, potassium, magnesium and calcium; anions: chloride,
nitrate, sulfate) was performed by using two ion-chromatographic
systems after extraction of the filters with deionised water in an ultrasonic bath for 20 min.
The concentration of the heavy metals was determined by atomic
absorption spectrometry. One quarter of the filter was cut with
ceramic scissors, and the sample was weighed and extracted with
10 mL of 10% HNO3. The analytical procedures are described in detail
elsewhere [9, 10].
The analytical procedure for the determination of carbon (total
carbon, TC; black carbon, BC; and organic carbon, OC) followed the
well-established approaches of Puxbaum [9]:
sample combustion in an oxygen atmosphere for TC, optical determination
for BC, and the difference between TC and BC for OC.
Chemometrical Methods
In the data treatment approaches of environmetrics, both unsupervised and supervised techniques are used. In the first case, data
mining is performed directly on the data set, e.g. in a hierarchical way,
without prior knowledge of group membership. In the latter case, a preliminary step of learning (training) is necessary to derive a treatment (classification) rule based
on grouping of objects with known origin or behaviour. This rule
allows new objects of unknown origin or behaviour to be assigned to
the classes offered by the classification rule. This case study uses
three major classification procedures only.
Cluster analysis is a well-known and widely used classification
approach for environmetrical purposes with its hierarchical and nonhierarchical algorithms [3, 5].
In order to cluster objects characterized by a set of variables (e.g.
sampling sites by chemical concentrations or pollutants), one has to
determine their similarity. To avoid the influence of data size, a
preliminary step of data scaling is necessary (e.g. autoscaling or
z-transform, range scaling, logarithmic transformation), in which normalized dimensionless numbers replace the real data values.
Thus, even serious differences in absolute (concentration) values
are reduced to close numbers. Then the similarity (or more strictly,
the distance) between the objects in the variable space can be
determined. Very often the Euclidean distance (ordinary, weighted,
standardized) is used for clustering purposes. Another way of measuring similarity is to calculate the correlation coefficient between
two row-vectors x1 and x2 characterizing objects 1 and 2. Thus,
from the input matrix (raw data) a similarity matrix is calculated.
There is a wide variety of hierarchical algorithms, but the typical
ones include the single linkage, the complete linkage and the average linkage methods. The results of cluster analysis are
represented either by a tree-like scheme called a dendrogram, which has
a hierarchical structure (large groups are divided into small ones), or
by tables containing different possible clusterings. The hierarchical
methods of clustering mentioned above are called agglomerative.
Good results are also obtained when using hierarchical divisive
methods, i.e. methods that first divide the set of all objects into
two groups (clusters). Then each group
(cluster) is again divided into two, and so on, until all objects are separated.
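The scaling and similarity steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data matrix is invented, the helper names `autoscale` and `euclidean_distance_matrix` are hypothetical, and NumPy stands in for whatever software is actually used.

```python
import numpy as np

def autoscale(X):
    """Column-wise z-transform: subtract the mean, divide by the standard deviation."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def euclidean_distance_matrix(Z):
    """Pairwise Euclidean distances between the rows (objects) of Z."""
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

# Hypothetical data: 4 objects (rows) described by 3 variables (columns)
# that are measured on very different scales.
X = np.array([
    [0.02, 150.0, 3.1],
    [0.03, 180.0, 2.9],
    [0.10, 900.0, 7.5],
    [0.12, 950.0, 8.0],
])

Z = autoscale(X)                    # dimensionless, comparable columns
D = euclidean_distance_matrix(Z)    # distance (dissimilarity) matrix
R = np.corrcoef(Z)                  # correlation between the row-vectors of the objects
```

Either `D` or `R` can then serve as the input to a hierarchical algorithm; without the scaling step the second variable would dominate every distance simply because of its magnitude.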
The aim of classification by non-hierarchical clustering is to
classify the objects under consideration into a predetermined number of
groups, e.g. K clusters. For instance, in order
to obtain 2 clusters, one selects 2 seed points from among the
objects and assigns each of the objects to the nearest seed
point. Thus, an initial clustering is obtained. For each of these clusters
one determines the centroid (the point of mean values of the
variables xi for each cluster). The objects are then reassigned to the nearest centroid and
new centroids are calculated for the new clusters. The new centroids have new co-ordinates, which leads to reclassification of the
objects; the procedure is repeated until the assignments no longer change.
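The seed-and-centroid procedure can be illustrated with a minimal sketch. The function name `k_means`, the toy data and the stopping rule (stop when the assignments no longer change) are assumptions made for this example, not details taken from the study.

```python
import numpy as np

def k_means(X, seeds, max_iter=100):
    """Non-hierarchical clustering: assign objects to the nearest centroid, then update."""
    X = np.asarray(X, dtype=float)
    centroids = np.asarray(seeds, dtype=float).copy()
    labels = None
    for _ in range(max_iter):
        # Distance of every object to every current centroid.
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dist.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments are stable, so the centroids will not move
        labels = new_labels
        # New centroid = mean of the variables over the objects in each cluster.
        for k in range(len(centroids)):
            members = X[labels == k]
            if len(members) > 0:
                centroids[k] = members.mean(axis=0)
    return labels, centroids

# Hypothetical data: two well separated groups of three objects each.
X = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
              [5.0, 5.2], [5.1, 4.9], [4.8, 5.0]])

# Two seed points chosen from among the objects.
labels, centroids = k_means(X, seeds=X[[0, 3]])
```

The result depends on the choice of seed points, which is why such procedures are often repeated with several different seeds.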
Daszykowski et al. [11] describe the density-based clustering algorithms
known as density-based spatial clustering of applications with noise
(DBSCAN) and ordering points to identify the clustering structure
(OPTICS), which are already applied in data mining [12].
Principal components analysis (PCA) is a typical display method
which allows estimating the internal relations in the data set and
modeling the ecosystem under consideration. There are different variants of
PCA, but basically their common feature is that they produce linear
combinations of the original columns in the data matrix (data set)
responsible for the description of the variables characterizing the
objects of observation. These linear combinations represent a type of
abstract measurements (factors, principal components) which are
better descriptors of the data structure (data pattern) than the original
(chemical or physical) measurements. Usually, the new abstract
variables are referred to as latent factors and they differ from the
original ones called manifest variables. It is commonly found that
just a few of the latent variables account for a large part of the data
set variation. Thus, the data structure in a reduced space can be
observed and studied [5].
Generally, when analysing a data set consisting of n objects for
which m variables have been measured, PCA can extract m principal
components PCs (factors or latent variables) where m < n. The first
PC represents the direction in the data containing the largest variation. PC 2 is orthogonal to PC 1 and represents the direction of the
largest residual variation around PC 1. PC 3 is orthogonal to the first
two and represents the direction of the largest residual variation
around the plane formed by PC 1 and PC 2. The projections of the
data on the plane of PC 1 and PC 2 can be computed and shown as a
plot (score plot). In such a plot it is possible to distinguish similarity
groups. According to the theory of PCA, the scores on the PCs
(the new coordinates of the data space) are a weighted sum of the
original variables (e.g. chemical concentrations):
$$\text{Score (value of object } i \text{ along PC } p) = a_{1p} Y_1 + a_{2p} Y_2 + \dots + a_{kp} Y_k$$
where $Y_j$ indicates the value of variable $j$ (e.g. concentration) and $a_{jp}$ is the
weight (called loading). The information hidden in the loadings can
also be displayed in loading plots. It is important to note that PCA
very often requires scaling the input raw data to eliminate dependence on the scale of the original values.
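A minimal sketch of how scores, weights and explained variance can be obtained is given here. It assumes autoscaled data and an eigen-decomposition of the correlation matrix; the synthetic data, the function name `pca` and the use of NumPy are illustrative assumptions, and no rotation of the components is applied.

```python
import numpy as np

def pca(X):
    """PCA on autoscaled data via eigen-decomposition of the correlation matrix."""
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    corr = (Z.T @ Z) / (len(Z) - 1)
    eigval, eigvec = np.linalg.eigh(corr)       # eigenvalues in ascending order
    order = np.argsort(eigval)[::-1]            # largest variance first
    eigval, eigvec = eigval[order], eigvec[:, order]
    scores = Z @ eigvec                         # weighted sums of the scaled variables
    explained = eigval / eigval.sum()           # fraction of total variance per PC
    return scores, eigvec, explained

rng = np.random.default_rng(0)
# Hypothetical data: 50 objects, 4 variables driven by 2 hidden factors plus noise.
factors = rng.normal(size=(50, 2))
mixing = np.array([[1.0, 0.9, 0.0, 0.1],
                   [0.0, 0.1, 1.0, 0.8]])
X = factors @ mixing + 0.05 * rng.normal(size=(50, 4))

scores, weights, explained = pca(X)
```

Each column of `weights` holds the coefficients of one component, so `scores[:, 0]` is exactly the weighted sum in the score equation for PC 1, and plotting `scores[:, 0]` against `scores[:, 1]` gives the score plot.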
Multiple regression on principal components (apportioning models) is a very important environmetric approach [13]. It permits
apportioning the contribution of each latent factor identified by
PCA (emission source) to the total mass (concentration) of a certain
chemical variable. The procedure comprises four steps: performing PCA, identifying the latent factors, determining the absolute principal
components scores (APCS), and carrying out a multiple regression of the total mass
(dependent variable) on the APCSs (independent variables).
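The apportioning steps can be sketched as follows. This is a simplified illustration and not the computation performed in the study: it uses unrotated components, synthetic data and a hypothetical function name `apcs_apportion`, and it assumes that the absolute scores are obtained by subtracting the scores of an artificial sample with zero concentration of every variable.

```python
import numpy as np

def apcs_apportion(X, mass, n_factors):
    """Regress total mass on absolute principal component scores (APCS)."""
    X = np.asarray(X, dtype=float)
    mass = np.asarray(mass, dtype=float)
    mean, std = X.mean(axis=0), X.std(axis=0, ddof=1)
    Z = (X - mean) / std

    # Unrotated PCA weights from the correlation matrix (no Varimax in this sketch).
    eigval, eigvec = np.linalg.eigh(np.corrcoef(X, rowvar=False))
    order = np.argsort(eigval)[::-1][:n_factors]
    W = eigvec[:, order]

    scores = Z @ W
    # Scores of an artificial sample with zero concentration of every variable.
    zero_scores = ((0.0 - mean) / std) @ W
    apcs = scores - zero_scores

    # Multiple linear regression: mass = intercept + sum_p b_p * APCS_p.
    A = np.column_stack([np.ones(len(apcs)), apcs])
    coef, *_ = np.linalg.lstsq(A, mass, rcond=None)
    intercept, slopes = coef[0], coef[1:]

    contributions = apcs * slopes              # per sample and per source
    fitted = intercept + contributions.sum(axis=1)
    r2 = 1.0 - ((mass - fitted) ** 2).sum() / ((mass - mass.mean()) ** 2).sum()
    return intercept, contributions.mean(axis=0), r2

rng = np.random.default_rng(1)
# Hypothetical data: 100 samples, 2 sources with fixed chemical profiles.
strength = rng.uniform(1.0, 10.0, size=(100, 2))
profiles = np.array([[0.5, 0.4, 0.0, 0.05],
                     [0.0, 0.05, 0.6, 0.3]])
X = strength @ profiles + 0.01 * rng.normal(size=(100, 4))
mass = strength.sum(axis=1) + 0.1 * rng.normal(size=100)

intercept, mean_contrib, r2 = apcs_apportion(X, mass, n_factors=2)
```

In this setting the intercept plays the role of the unexplained mass, and the mean contributions together with the intercept add up to the mean observed mass.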
Results and Discussions
The monitoring data is available from the authors
on request. Multivariate statistical data treatment
was performed using the STATISTICA 6.0 package.
Data clustering to determine possible relationships
between the variables (chemical components of the
PM10 aerosol collection) gives the following significant
clusters (data matrix consisting of 113 objects or sampling days and 20 variables or chemical components):
C1: NO₃⁻, NH₄⁺, Zn, BC, OC
C2: Ca²⁺, SO₄²⁻
C3: Cr, Ni, V
C4: As, Cd, Pb
C5: Mg²⁺, Fe, Mn
C6: Cl⁻, Na⁺
C7: K⁺, Cu
Cluster analysis was performed according to
Ward’s method of linkage with the squared Euclidean distance as similarity measure. Cluster significance was
determined by separation at distances of 1/3 Dmax and
2/3 Dmax (Sneath’s criterion).
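For illustration, clustering of variables with Ward's method and cuts at 1/3 and 2/3 of the maximum linkage distance could look like the sketch below. It is an assumption-laden stand-in rather than the procedure used here: the data are synthetic, SciPy replaces the STATISTICA package, and SciPy's Ward linkage works on plain (not squared) Euclidean distances, so the linkage heights are not directly comparable with those of the study.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(2)
# Hypothetical data: 60 sampling days, 6 variables forming three correlated pairs.
base = rng.normal(size=(60, 3))
X = np.repeat(base, 2, axis=1) + 0.1 * rng.normal(size=(60, 6))

# Autoscale, then transpose so that the variables (columns) become the objects.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
tree = linkage(Z.T, method="ward")

d_max = tree[:, 2].max()   # height of the final merge (Dmax)
labels_one_third = fcluster(tree, t=d_max / 3, criterion="distance")
labels_two_thirds = fcluster(tree, t=2 * d_max / 3, criterion="distance")
```

Clusters that are still separate at 2/3 Dmax are the most distinct groups, while the cut at 1/3 Dmax gives a finer partition.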
The linkage of the chemical variables into 7 clusters is an indication of the complex character of the
pollution emitters in the region. It may be assumed
that seven factors determine the aerosol
composition in the region of Unterloibach.
In order to obtain information about the data structure and identify latent factors responsible for it, the
data collection was treated with principal components
analysis (Varimax rotation, scree plot validation and
Malinowski’s test for significance of the factor loadings). Table 1 presents the factor loadings for seven
principal components which explain nearly 80% of
the total variance of the system.
Table 1. Factor loadings
Variable      PC1     PC2     PC3     PC4     PC5     PC6     PC7
Cl⁻           0.169   0.095   0.081   0.096   0.147   0.869   0.114
NO₃⁻          0.806   0.119   0.195   0.073   0.002   0.290   0.114
SO₄²⁻         0.341   0.780   0.213   0.238   0.027   0.146   0.032
Na⁺           0.048   0.082   0.315   0.180   0.224   0.758   0.084
NH₄⁺          0.732   0.445   0.232   0.187   0.058   0.094   0.177
K⁺            0.523   0.051   0.002   0.042   0.033   0.187   0.764
Ca²⁺          0.046   0.830   0.074   0.107   0.278   0.030   0.154
Mg²⁺          0.171   0.072   0.131   0.065   0.822   0.363   0.025
As            0.094   0.101   0.011   0.870   0.133   0.029   0.016
Cd            0.321   0.400   0.282   0.527   0.036   0.158   0.065
Cr            0.013   0.374   0.600   0.320   0.233   0.138   0.121
Cu            0.142   0.197   0.155   0.063   0.013   0.063   0.898
Fe            0.013   0.201   0.334   0.090   0.825   0.048   0.013
Mn            0.284   0.309   0.399   0.315   0.627   0.098   0.063
Ni            0.438   0.103   0.748   0.159   0.200   0.084   0.203
Pb            0.017   0.110   0.187   0.843   0.014   0.007   0.131
V             0.371   0.257   0.789   0.034   0.048   0.018   0.006
Zn            0.564   0.049   0.238   0.221   0.136   0.194   0.047
BC            0.748   0.407   0.001   0.059   0.227   0.126   0.290
OC            0.853   0.207   0.197   0.096   0.124   0.092   0.272
Expl. var.    18.9%   11.6%   11.6%   11.0%   10.5%   8.7%    8.6%
Note: Marked loadings are statistically significant (Malinowski’s test).
It is shown that seven latent factors determine the
data structure. These factors are related to the existing
emission sources in the region and are conditionally
named “secondary emission” (explaining nearly 19%
of the variance), “mineral dust” (accounting for about
12% of the total variance), “oil burning” (same percentage as the previous one), “lead smelter” (11%), “coal
burning” (10.5%), and the last two with an almost
equal contribution to the explanation of the total variance (8.7%) are, respectively, “salt” and “fertilizer”.
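As a quick arithmetic check, the explained-variance row of Table 1 can be summed. The short sketch below uses only the percentages listed in that row; the variable names are chosen for this example.

```python
# Explained variance (%) of the seven components, as listed in Table 1.
explained = {"PC1": 18.9, "PC2": 11.6, "PC3": 11.6, "PC4": 11.0,
             "PC5": 10.5, "PC6": 8.7, "PC7": 8.6}

total = sum(explained.values())

# Cumulative share after each component.
cumulative = []
running = 0.0
for name, share in explained.items():
    running += share
    cumulative.append((name, round(running, 1)))

print(f"Total explained variance: {total:.1f}%")
```

The seven shares add up to 80.9%, in line with the figure of about 80% quoted for the total variance explained.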
The first latent factor shows high loadings for
nitrate, ammonium, zinc, black and organic carbon
and could be identified as a combination of various
effects of industrial activity and atmospheric transfer.
It is known that nitrate and ammonium are related to
the formation of secondary aerosol, which is dispersed as fine
dust. The carbonaceous components are remains of
incomplete combustion processes (traffic, domestic
burning, chemical industry etc.).
The high factor loadings of calcium and sulfate in
the second latent factor are indicative of the contribution of the earth crust (explaining the calcium impact).
It is known that in the Alpine region, calcium compounds are mainly carbonates and sulfates.
The third latent factor is a kind of diffuse source of
anthropogenic activities, but keeping in mind the role
of the trace components vanadium and nickel as tracers for oil burning [14] we ascribed the role of the oil
burning source to this factor. Chromium emissions are
typical of coal burning but are also reported as products of oil burning [15].
In the fourth identified source dominant loadings
are those of As, Cd and Pb, which are typical of lead
smelter emissions. As already mentioned, such a
smelter was in operation in the region of interest.
The significant loadings for iron and manganese in
factor five lead to the assumption that the emission
source could be related to steel production or coal
burning. Magnesium also shows a high loading and
could be ascribed to steel production [16]. Yet no steel
plant is located in the proximity of the relevant site.
That is why a more probable source is local coal burning with tracers such as iron, manganese, calcium,
arsenic and black carbon [16]. This assumption is
supported by the values of the factor loadings for
Ca²⁺ and BC in PC5. Although not significant, they
have relatively high values.
The sixth latent factor clearly indicates the influence of long distance transport of marine aerosols
(high loadings for sodium and chloride).
The last latent factor is characterized by high loadings for potassium and copper. Since there is no refuse
incineration plant in the neighborhood of the site
(along with Zn and Pb, potassium and copper could
be tracers for such a source), we assume that it represents a fertilizer influence.
In the next stage of the chemometric study, a
source apportioning procedure was applied [13],
which allows determining the contribution (in % and
quantity) of each identified source in the formation of
Table 2. Source apportioning for the particle total mass and chemical concentrations
Intcpt
Cl
0.005
26.3%
1.68
88.1%
1.15
32.8%
NO3
SO4 2
Naþ
NH4 þ
Kþ
Ca2þ
Mg2þ
As
Cd
Cr
Cu
Fe
Mn
Ni
Pb
V
Zn
BC
OC
Total mass
Secondary
aerosol
Mineral
dust
1.49
42.6%
0.03
36.9%
1.11
55.2%
0.09
46.2%
0.40
19.9%
Oil burning
0.35
10.1%
0.01
13.8%
0.17
8.5%
Lead
smelter
0.51
14.4%
0.009
10.1%
0.18
8.7%
0.11
24.5%
0.19
0.07
41.8%
14.7%
0.32
0.25
17.0%
13.5%
7.33
14.5%
0.66
0.46
20.2%
14.1%
No adequate model obtained
9.66
27.4%
0.53
0.22
40.7%
16.7%
14.88
15.51
49.0% 51.0%
0.27
0.52
26.3% 51.0%
0.70
2.51
0.36
14.4% 51.5%
7.4%
2.95
7.62
2.56
14.6% 37.8%
12.7%
Salt
0.004
16.4%
0.013
57.3%
0.23
11.9%
0.012
13.9%
0.025
20.0%
0.06
13.6%
0.09
19.7%
0.16
8.9%
9.79
19.4%
0.51
15.5%
3.77
10.7%
0.56
42.6%
1.17
86.0%
0.17
35.8%
0.06
13.5%
Fertilizer
0.02
25.4%
0.01
6.2%
0.03
0.06
19.9%
50.6%
No adequate model obtained
0.12
26.2%
Coal
burning
0.16
7.8%
0.09
47.7%
0.01
9.6%
0.19
14.0%
0.05
10.3%
1.13
60.7%
0.50
15.3%
33.41
66.1%
1.14
34.9%
21.82
61.9%
Estimated
0.29
5.9%
1.38
6.8%
0.18
3.7%
1.77
8.8%
0.10
2.1%
0.12
11.9%
0.48
9.9%
1.86
9.2%
R2
0.022
0.025
0.84
1.91
1.71
0.75
3.50
3.57
0.86
0.09
0.08
0.72
2.02
2.00
0.89
0.19
0.17
0.90
0.13
0.12
0.81
1.36
1.32
0.80
0.48
0.42
0.68
0.45
0.48
0.69
1.86
1.88
0.90
50.53
51.64
0.89
3.26
3.20
0.89
35.25
35.26
0.77
1.31
1.29
0.87
30.4
0.11
10.8%
0.25
5.2%
2.02
10.0%
Observed
28.5
0.35
1.03
1.03
0.70
4.87
4.87
0.90
20.2
20.3
0.90
Note: Apportioned values in µg Nm⁻³ (ng Nm⁻³ for metals) and in %; estimated mass – mass or concentration calculated by the model;
observed mass – measured mass or concentration.
the particle total mass or measured chemical concentrations. Table 2 presents the regression models
(regression using the absolute principal components
scores) for each chemical parameter and for the total
mass. The intercept indicates the unexplained mass or
concentration. The determination coefficient R² is a
measure of the model validity.
The largest part of the particle total mass is
explained by the contribution of the secondary emission sources, then “mineral dust”, “coal burning”,
“fertilizer”, “lead smelter” and “oil burning”. No
contribution of marine salt is found, and the unexplained part amounts to nearly 15%. The model shows
good validity (R² = 0.84).
Similarly, the contribution of the emission sources
to the formation of the total concentrations of the
chemical parameters can be found and estimated.
No adequate models could be obtained for the apportioning of magnesium and nickel.
Conclusions
Sustainability concepts require respective technological, ecological and economic responses to possible
environmental pollution. But in most cases the
responses are univariate, e.g. calculation of eco-efficiency indicators, comparison with allowable levels of
pollution, instructions for immediate action etc. The
multivariate statistical assessment of the air quality in
the region of the site Unterloibach, Austria, seems to
be a good example of how the multivariate approach
to sustainability works. Monitoring (analytics) and
chemometrics are the combination which considers
the environmental system in its whole complexity
and, in this particular case study, reveals many
“hidden” details about pollution sources and their
effects on the environment. It is our deep conviction
that chemometrics could contribute significantly to
the concept of sustainability in all of its aspects –
ecological, technological, economic and even social.
References
[1] Siemann W (2003) Umweltgeschichte, Themen und
Perspektiven. Verlag C.H. Beck, München
[2] National round table on the environment and the economy,
Canada. Eco-efficiency indicators workbook (2003) http://www.nrtee.ca/publications/eco-efficiency_workbook/
[3] Einax J W, Zwanziger H W, Geiß S (1997) Chemometrics in
environmental analysis. VCH, Weinheim
[4] Simeonov V (2002) Encyclopedia of environmetrics. Wiley,
New York
[5] Massart D L, Vandeginste B G M, Buydens L M C, De Jong S,
Lewi P J, Smeyers-Verbeke J (1998) Handbook of chemometrics and qualimetrics; data handling in science and technology, parts A and B. Elsevier, Amsterdam
[6] Hopke P K (1991) Receptor modeling for air quality management. Elsevier, New York
[7] Berner A (1978) Chem Ing Techn 50: 399–412
[8] CEN Norm – (1998) pr EN 1234
[9] Puxbaum H, Rendl J (1983) Microchim Acta I: 263–266
[10] Hansen A D, Rosen H, Novakov T (1984) The Sci Total Envir
36: 191–198
[11] Daszykowski M, Walczak B, Massart D L (2001) Chemom
Intell Lab Syst 56: 83–91
[12] Stanimirova I, Daszykowski M, Massart D L, Questier F,
Simeonov V, Puxbaum H (2004) J Envir Manag (in press)
[13] Thurston G D, Spengler J D (1985) Atmos Environ 19:
9–15
[14] Pacyna J M, Semb A, Hanssen J E (1984) Tellus 36B:
163–173
[15] Subcommittee on chromium. Chromium. Medical and
biological effects of environmental pollutants, Division of
medical sciences assembly of life sciences national research
council (1974) National Academy of Sciences Washington
D.C.
[16] Steiger M (1991) PhD Dissertation, University of Hamburg