Multivariate Statistical Assessment of Air Quality: A Case Study

Stefan Tsakovski


2004, Mikrochimica Acta

The present paper deals with the application of several chemometrical methods (cluster and principal components analysis, source apportioning on absolute principal components scores) to an aerosol data collection from Unterloibach, Austria. It is shown that seven latent factors explaining almost 80% of the total variance are responsible for the data structure and are conditionally identified as “secondary aerosol”, “mineral dust”, “oil burning”, “lead smelter”, “coal burning”, “salt” and “fertilizer” emission sources. Furthermore, the contribution of each identified source to the formation of the particle total mass and chemical compounds total concentration is calculated. Thus, a reliable assessment of the air quality in the region is performed. The requirements of the sustainability concept for ecological indicators are in this case easily transformed into a multivariate statistical problem that takes into account not separate indicators but the specific multivariate nature of aerosol pollution.

Microchim. Acta 148, 293–298 (2004)
DOI 10.1007/s00604-004-0279-2
Original Paper

Vasil Simeonov (1, author for correspondence, e-mail: VSimeonov@chem.unisofia.bg), Stefan Tsakovski (1), Tomaz Lavric (2), Pavlina Simeonova (3), and Hans Puxbaum (2)

(1) Chair of Analytical Chemistry, Faculty of Chemistry, University of Sofia “St. Kl. Okhridski”, J. Bourchier Blvd. 1, 1164 Sofia, Bulgaria
(2) Institute of Chemical Technologies and Analytics, Vienna University of Technology, Getreidemarkt 9/164, 1060 Vienna, Austria
(3) Institute of Solid State Physics, Bulgarian Academy of Sciences, Tzarigradsko Chaussee 72, 1784 Sofia, Bulgaria

Received March 12, 2004; accepted August 31, 2004; published online November 5, 2004
© Springer-Verlag 2004

Key words: Chemometrics; air quality; cluster analysis; principal components analysis; source apportioning.

In recent years the concept of sustainable development has been successfully introduced and exploited not only as a legislative and political formula but also as the basis for various research projects. The current discussion about sustainability is limited mostly to descriptive and normative elements. The question is not whether a certain activity is sustainable but what recommendation is needed in order for an activity to be considered sustainable. In terms of environmental problems, this seems quite formal and undefined [1]. For instance, the traditionally accepted idea of sustainability is to find a sound compromise between the constantly increasing needs of mankind for energy and raw materials on the one hand and, on the other, the social requirement for a clean environment and better chances for the coming generations. In order to achieve sustainability, people will need to reach mutual agreement on reliable solutions to a range of individual problems, such as the future development of clean technologies and products, controlling the prices of natural energy resources, or funding alternative energy supplies.

Rapid technological and social development obviously requires some kind of sustainability metrics in order to establish and control the system of sustainable development. The concept requires the implementation of standard sustainability measures in the industrial sector, in environmental policy, in economics, and in social life. The challenge to become environmentally relevant has led to the development of two important concepts for sustainability indicators:

- The P–S–R indicator concept (the pressure of socio-economic activities on natural systems leads to observable changes in the state of the environmental systems, which causes a respective response, i.e. socio-economic measures to reduce the hazardous effects);
- The D–P–S–R indicator concept (the socio-economic drivers cause the pressure, which changes the state and calls for a response).
Support for these basic concepts is reflected in the creation of various eco-efficiency indicators, which track and report the energy, waste and water parameters required for the definition of sustainability [2]. Such indicators have been introduced in the economy, and there are efforts to find appropriate eco-efficiency indicators for social life. However, these metrics lead to univariate estimates of the real environmental problems: keeping a certain value within the allowable limits; following a trend; resolving a local pollution problem; comparing a product with a price; assessing public opinion about an environmental issue, etc. All present efforts concentrate on the definition and calculation of indicators for separate applications: production, economics or the social sciences.

The natural environment is, however, a complex multivariate system, and its quality assessment with respect to sustainability requires multivariate approaches and metrics. The capability of chemometrics and environmetrics to handle multivariate systems and objects has helped many environmental studies [3–6] to reach correct data classification, modeling and interpretation. Thus, chemometrics turns out to be a very effective tool for problem solving and decision-making.

It is the aim of the present study to illustrate some of the multivariate solutions offered by chemometrics when applied to environmental studies, and thereby to show how chemometrics could be used as a metrics tool in sustainability research. The multivariate statistical treatment of the aerosol monitoring data from the border between Austria and Slovenia made it possible to extract important information about environmental pollution in the region and to perform a reliable assessment of the air quality.

Experimental

Sampling Site and Sampling Procedure

The sampling site of Unterloibach is located in the Austrian province of Carinthia at a height of 629 m a.s.l. (latitude 46°32′18″ and longitude 14°48′52″). Unterloibach is a typical rural site located near the border with Slovenia. Nearly 15 km to the northeast there is a lead smelter (on Slovenian territory) which is still active, and to the east, again in Slovenia, a steelworks produces special quality steels. To the south lies the largest Slovenian coal power station, which supplies over 75% of the country’s electricity.

The aerosol data were gathered in the period between March 1999 and February 2000. Sampling was performed with a high-volume sampler (Digitel DHA-80), a completely automated device. The aerosol particles of class PM10 were collected daily on quartz fiber filters (QAT-UP, Pallflex, USA), thus allowing the determination of the carbon content. The complete description of the sampling device and the pre-sampling preparation of the filters can be found in [7].

Analytical Procedures

The particle total mass was determined by weighing the sampling filters before and after sampling according to the CEN standard [8]. Determination of the water-soluble ions (cations: sodium, ammonium, potassium, magnesium and calcium; anions: chloride, nitrate, sulfate) was performed by using two ion-chromatographic systems after extraction of the filters with deionised water in an ultrasonic bath for 20 min. The concentration of the heavy metals was determined by atomic absorption spectrometry. One quarter of the filter was cut with ceramic scissors, and the sample was weighed and extracted with 10 mL of 10% HNO3. The analytical procedures are described in detail elsewhere [9, 10]. The determination of carbon (total carbon, TC; black carbon, BC; and organic carbon, OC) followed the well-established approaches of Puxbaum [9]: sample burning in an oxygen atmosphere for TC, optical determination for BC, and the difference between TC and BC for OC.
Chemometrical Methods

In the data treatment approaches of environmetrics, both unsupervised and supervised techniques are used. In the first case, data mining is performed spontaneously, in a hierarchical way, from the data set. In the latter case, a preliminary step of learning (training) is necessary to derive a treatment (classification) rule based on the grouping of objects with known origin or behaviour. This rule allows new objects of unknown origin or behaviour to be assigned to the classes offered by the classification rule. This case study uses three major classification procedures only.

Cluster analysis, with its hierarchical and non-hierarchical algorithms, is a well-known and widely used classification approach for environmetrical purposes [3, 5]. In order to cluster objects characterized by a set of variables (e.g. sampling sites by chemical concentrations or pollutants), one has to determine their similarity. To avoid the influence of data size, a preliminary step of data scaling is necessary (e.g. autoscaling or z-transform, range scaling, logarithmic transformation), in which normalized dimensionless numbers replace the real data values. Thus, even serious differences in absolute (concentration) values are reduced to close numbers. Then the similarity (or, more strictly, the distance) between the objects in the variable space can be determined. Very often the Euclidean distance (ordinary, weighted, standardized) is used for clustering purposes. Another way of measuring similarity is to calculate the correlation coefficient between two row-vectors x1 and x2 characterizing objects 1 and 2. Thus, from the input matrix (raw data) a similarity matrix is calculated. There is a wide variety of hierarchical algorithms, but the typical ones include the single linkage, the complete linkage and the average linkage methods.
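The scaling and similarity steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data values, the function names (autoscale, euclidean, correlation, distance_matrix) and the use of the sample standard deviation are hypothetical choices, not taken from the study, which used the STATISTICA package.

```python
import math
import statistics


def autoscale(matrix):
    """Column-wise z-transform: subtract the mean, divide by the sample standard deviation."""
    columns = list(zip(*matrix))
    means = [statistics.mean(col) for col in columns]
    stdevs = [statistics.stdev(col) for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stdevs)]
        for row in matrix
    ]


def euclidean(x1, x2):
    """Ordinary Euclidean distance between two row-vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))


def correlation(x1, x2):
    """Pearson correlation coefficient between two row-vectors."""
    m1, m2 = statistics.mean(x1), statistics.mean(x2)
    cov = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    norm1 = math.sqrt(sum((a - m1) ** 2 for a in x1))
    norm2 = math.sqrt(sum((b - m2) ** 2 for b in x2))
    return cov / (norm1 * norm2)


def distance_matrix(rows):
    """Symmetric matrix of pairwise Euclidean distances between objects."""
    return [[euclidean(r1, r2) for r2 in rows] for r1 in rows]


# Hypothetical data: 4 sampling days (objects) x 3 chemical variables
# measured on very different concentration scales.
raw = [
    [1.0, 200.0, 0.010],
    [1.2, 210.0, 0.012],
    [5.0, 900.0, 0.050],
    [5.5, 950.0, 0.055],
]
scaled = autoscale(raw)              # dimensionless values, comparable across variables
distances = distance_matrix(scaled)  # input for a hierarchical linkage algorithm
```

After autoscaling, every variable has zero mean and unit standard deviation, so no single variable dominates the distance simply because it is reported in larger numbers. A linkage algorithm would then merge the objects (or clusters) with the smallest entries in the distance matrix.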
The results of cluster analysis are represented either by a tree-like scheme called a dendrogram, which has a hierarchical structure (large groups are divided into small ones), or by tables containing different possible clusterings. The hierarchical methods of clustering mentioned above are called agglomerative. Good results are also obtained with hierarchical divisive methods, i.e. methods that first divide the set of all objects into two groups (clusters). Then each group (cluster) is again divided into two, and so on, until all objects are separated.

The aim of non-hierarchical clustering is to classify the objects under consideration into a predefined number of groups, e.g. K clusters. For instance, in order to obtain 2 clusters, one selects 2 seed points from among the objects and assigns each object to the nearest seed point. Thus, initial clusters are obtained. For each of these clusters one determines the centroid (the point of mean values of the variables xi for each cluster). The whole procedure is then repeated: the objects are reassigned to the nearest centroid, and new centroids are calculated for the new clusters. The new centroids have new co-ordinates, which leads to reclassification of the objects. Daszykowski et al. [11] offer new original clustering algorithms called density-based spatial clustering of applications with noise (DBSCAN) and ordering points to identify the clustering structure (OPTICS), which have already been applied in data mining [12].

Principal components analysis (PCA) is a typical display method which allows the internal relations in the data set to be estimated and the ecosystem under consideration to be modeled. There are different variants of PCA, but their common feature is that they produce linear combinations of the original columns in the data matrix (data set), i.e. of the variables characterizing the objects of observation.
These linear combinations represent a type of abstract measurement (factors, principal components) which describes the data structure (data pattern) better than the original (chemical or physical) measurements. Usually, the new abstract variables are referred to as latent factors, in contrast to the original ones, which are called manifest variables. It is commonly found that just a few of the latent variables account for a large part of the data set variation. Thus, the data structure can be observed and studied in a reduced space [5].

Generally, when analysing a data set consisting of n objects for which m variables have been measured, PCA can extract m principal components, PCs (factors or latent variables), where m < n. The first PC represents the direction in the data containing the largest variation. PC 2 is orthogonal to PC 1 and represents the direction of the largest residual variation around PC 1. PC 3 is orthogonal to the first two and represents the direction of the highest residual variation around the plane formed by PC 1 and PC 2. The projections of the data on the plane of PC 1 and PC 2 can be computed and shown as a plot (score plot). In such a plot it is possible to distinguish similarity groups.

According to the theory of PCA, the scores on the PCs (the new coordinates of the data space) are a weighted sum of the original variables (e.g. chemical concentrations):

$$\text{Score (value of object } i \text{ along PC } p) = a_{1p} Y_1 + a_{2p} Y_2 + \dots + a_{kp} Y_k$$

where $Y$ indicates the variable value (e.g. concentration) and $a$ is the weight (called loading). The information hidden in the loadings can also be displayed in loading plots. It is important to note that PCA very often requires scaling of the input raw data to eliminate dependence on the scale of the original values.

Multiple regression on principal components (apportioning models) is a very important environmetric approach [13].
It permits apportioning the contribution of each latent factor identified by PCA (emission source) to the total mass (concentration) of a certain chemical variable. The first step is the performance of PCA and the identification of latent factors, followed by determination of the absolute principal components scores (APCS) and multiple regression of the total mass (dependent variable) on the APCSs (independent variables).

Results and Discussions

The monitoring data is available from the authors on request. Multivariate statistical data treatment was performed using the STATISTICA 6.0 package.

Data clustering to determine possible relationships between the variables (chemical components of the PM10 aerosol collection) gives the following significant clusters (data matrix consisting of 113 objects or sampling days and 20 variables or chemical components):

C1: NO3−, NH4+, Zn, BC, OC
C2: Ca2+, SO4(2−)
C3: Cr, Ni, V
C4: As, Cd, Pb
C5: Mg2+, Fe, Mn
C6: Cl−, Na+
C7: K+, Cu

Cluster analysis was performed according to Ward’s method of linkage with the squared Euclidean distance as similarity measure. Cluster significance was determined by separation at distances of 1/3 Dmax and 2/3 Dmax (Sneath’s criterion). The linkage of the chemical variables into 7 clusters is an indication of the complex character of the pollution emitters in the region. It may be assumed that seven factors determine the aerosol composition in the region of Unterloibach.

In order to obtain information about the data structure and identify the latent factors responsible for it, the data collection was treated with principal components analysis (Varimax rotation, scree plot validation and Malinowski’s test for significance of the factor loadings). Table 1 presents the factor loadings for seven principal components which explain nearly 80% of the total variance of the system. It is shown that seven latent factors determine the data structure. These factors are related to the existing emission sources in the region and are conditionally named “secondary emission” (explaining nearly 19% of the variance), “mineral dust” (accounting for about 12% of the total variance), “oil burning” (same percentage as the previous one), “lead smelter” (11%), “coal burning” (10.5%), and the last two, with an almost equal contribution to the explanation of the total variance (8.7%), are, respectively, “salt” and “fertilizer”.

Table 1. Factor loadings

Variable     PC1     PC2     PC3     PC4     PC5     PC6     PC7
Cl−          0.169   0.095   0.081   0.096   0.147   0.869   0.114
NO3−         0.806   0.119   0.195   0.073   0.002   0.290   0.114
SO4(2−)      0.341   0.780   0.213   0.238   0.027   0.146   0.032
Na+          0.048   0.082   0.315   0.180   0.224   0.758   0.084
NH4+         0.732   0.445   0.232   0.187   0.058   0.094   0.177
K+           0.523   0.051   0.002   0.042   0.033   0.187   0.764
Ca2+         0.046   0.830   0.074   0.107   0.278   0.030   0.154
Mg2+         0.171   0.072   0.131   0.065   0.822   0.363   0.025
As           0.094   0.101   0.011   0.870   0.133   0.029   0.016
Cd           0.321   0.400   0.282   0.527   0.036   0.158   0.065
Cr           0.013   0.374   0.600   0.320   0.233   0.138   0.121
Cu           0.142   0.197   0.155   0.063   0.013   0.063   0.898
Fe           0.013   0.201   0.334   0.090   0.825   0.048   0.013
Mn           0.284   0.309   0.399   0.315   0.627   0.098   0.063
Ni           0.438   0.103   0.748   0.159   0.200   0.084   0.203
Pb           0.017   0.110   0.187   0.843   0.014   0.007   0.131
V            0.371   0.257   0.789   0.034   0.048   0.018   0.006
Zn           0.564   0.049   0.238   0.221   0.136   0.194   0.047
BC           0.748   0.407   0.001   0.059   0.227   0.126   0.290
OC           0.853   0.207   0.197   0.096   0.124   0.092   0.272
Expl. var.   18.9%   11.6%   11.6%   11.0%   10.5%   8.7%    8.6%

Note: Marked loadings are statistically significant (Malinowski’s test).

The first latent factor reveals high loadings for nitrate, ammonium, zinc, black and organic carbon and could be identified as a combination of various effects of industrial activity and atmospheric transfer. It is known that nitrate and ammonium are related to secondary aerosol formation, which is dispersed as fine dust. The carbonaceous components are remains of incomplete combustion processes (traffic, domestic burning, chemical industry, etc.).
The high factor loadings of calcium and sulfate in the second latent factor are indicative of the contribution of the earth crust (explaining the calcium impact). It is known that in the Alpine region, calcium compounds are mainly carbonates and sulfates.

The third latent factor is a kind of diffuse source of anthropogenic activities, but keeping in mind the role of the trace components vanadium and nickel as tracers for oil burning [14], we ascribed the role of the oil burning source to this factor. Chromium emissions are typical of coal burning but are also reported as products of oil burning [15].

In the fourth identified source the dominant loadings are those of As, Cd and Pb, which are typical of lead smelter emissions. As already mentioned, such a smelter was in operation in the region of interest.

The significant loadings for iron and manganese in factor five lead to the assumption that the emission source could be related to steel production or coal burning. Magnesium also shows a high loading and could be ascribed to steel production [16]. Yet no steel plant is located in the proximity of the relevant site. That is why a more probable source is local coal burning, with tracers such as iron, manganese, calcium, arsenic and black carbon [16]. This assumption is supported by the values of the factor loadings for Ca2+ and BC in PC5. Although not significant, they are relatively high.

The sixth latent factor clearly indicates the influence of long distance transport of marine aerosols (high loadings for sodium and chloride).

The last latent factor is characterized by high loadings for potassium and copper. Since there is no refuse incineration plant in the neighborhood of the site (along with Zn and Pb, potassium and copper could be tracers for such a source), we assume that it represents a fertilizer influence.
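The way factor scores and explained variance arise from autoscaled data can be illustrated with a short, self-contained Python sketch. It is a simplified stand-in, not the procedure used in the study: it extracts unrotated components from the correlation matrix by power iteration with deflation (no Varimax rotation, no Malinowski test), and the data values and function names are hypothetical.

```python
import math
import statistics


def autoscale(matrix):
    """Column-wise z-transform of the raw data matrix."""
    columns = list(zip(*matrix))
    means = [statistics.mean(c) for c in columns]
    stdevs = [statistics.stdev(c) for c in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stdevs)] for row in matrix]


def correlation_matrix(z):
    """Correlation matrix of the variables, computed from autoscaled data."""
    n, k = len(z), len(z[0])
    return [[sum(r[i] * r[j] for r in z) / (n - 1) for j in range(k)] for i in range(k)]


def leading_eigenpair(matrix, iterations=500):
    """Power iteration: dominant eigenvalue and unit eigenvector of a symmetric matrix."""
    k = len(matrix)
    vec = [float(i + 1) for i in range(k)]  # arbitrary non-zero start vector
    for _ in range(iterations):
        product = [sum(matrix[i][j] * vec[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(p * p for p in product))
        vec = [p / norm for p in product]
    value = sum(vec[i] * matrix[i][j] * vec[j] for i in range(k) for j in range(k))
    return value, vec


def pca(z, n_components):
    """Return (explained variance fraction, weight vector) for each component."""
    matrix = correlation_matrix(z)
    k = len(matrix)
    total = sum(matrix[i][i] for i in range(k))
    components = []
    for _ in range(n_components):
        value, vec = leading_eigenpair(matrix)
        components.append((value / total, vec))
        # Deflation: remove the variance already described by this component.
        matrix = [[matrix[i][j] - value * vec[i] * vec[j] for j in range(k)] for i in range(k)]
    return components


# Hypothetical data: 6 sampling days x 3 variables; the first two variables
# vary together, the third one is nearly independent of them.
raw = [
    [1.0, 2.1, 5.0],
    [2.0, 3.9, 1.0],
    [3.0, 6.2, 4.0],
    [4.0, 7.8, 2.0],
    [5.0, 10.1, 6.0],
    [6.0, 12.0, 3.0],
]
z = autoscale(raw)
components = pca(z, n_components=2)
explained = [fraction for fraction, _ in components]
# Score of each object on each component: weighted sum of its scaled variables.
scores = [[sum(w * y for w, y in zip(vec, row)) for _, vec in components] for row in z]
```

In this toy case the two correlated variables collapse into the first component, so two components describe almost all of the variance of three variables. The same mechanism lets seven factors summarize the 20 measured chemical components.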
In the next stage of the chemometric study, a source apportioning procedure was applied [13], which allows determining the contribution (in % and quantity) of each identified source to the formation of the particle total mass or measured chemical concentrations.

Table 2. Source apportioning for the particle total mass and chemical concentrations

Intcpt Cl− 0.005 26.3% 1.68 88.1% 1.15 32.8% NO3− SO4(2−) Na+ NH4+ K+ Ca2+ Mg2+ As Cd Cr Cu Fe Mn Ni Pb V Zn BC OC Total mass
Secondary aerosol Mineral dust 1.49 42.6% 0.03 36.9% 1.11 55.2% 0.09 46.2% 0.40 19.9%
Oil burning 0.35 10.1% 0.01 13.8% 0.17 8.5%
Lead smelter 0.51 14.4% 0.009 10.1% 0.18 8.7% 0.11 24.5% 0.19 0.07 41.8% 14.7% 0.32 0.25 17.0% 13.5% 7.33 14.5% 0.66 0.46 20.2% 14.1% No adequate model obtained 9.66 27.4% 0.53 0.22 40.7% 16.7% 14.88 15.51 49.0% 51.0% 0.27 0.52 26.3% 51.0% 0.70 2.51 0.36 14.4% 51.5% 7.4% 2.95 7.62 2.56 14.6% 37.8% 12.7%
Salt 0.004 16.4% 0.013 57.3% 0.23 11.9% 0.012 13.9% 0.025 20.0% 0.06 13.6% 0.09 19.7% 0.16 8.9% 9.79 19.4% 0.51 15.5% 3.77 10.7% 0.56 42.6% 1.17 86.0% 0.17 35.8% 0.06 13.5%
Fertilizer 0.02 25.4% 0.01 6.2% 0.03 0.06 19.9% 50.6% No adequate model obtained 0.12 26.2%
Coal burning 0.16 7.8% 0.09 47.7% 0.01 9.6% 0.19 14.0% 0.05 10.3% 1.13 60.7% 0.50 15.3% 33.41 66.1% 1.14 34.9% 21.82 61.9%
Estimated 0.29 5.9% 1.38 6.8% 0.18 3.7% 1.77 8.8% 0.10 2.1% 0.12 11.9% 0.48 9.9% 1.86 9.2%
R2 0.022 0.025 0.84 1.91 1.71 0.75 3.50 3.57 0.86 0.09 0.08 0.72 2.02 2.00 0.89 0.19 0.17 0.90 0.13 0.12 0.81 1.36 1.32 0.80 0.48 0.42 0.68 0.45 0.48 0.69 1.86 1.88 0.90 50.53 51.64 0.89 3.26 3.20 0.89 35.25 35.26 0.77 1.31 1.29 0.87 30.4 0.11 10.8% 0.25 5.2% 2.02 10.0%
Observed 28.5 0.35 1.03 1.03 0.70 4.87 4.87 0.90 20.2 20.3 0.90

Note: Apportioned values in µg Nm−3 (ng Nm−3 for metals) and in %; estimated mass: mass or concentration calculated by the model; observed mass: measured mass or concentration.
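The apportioning step can be sketched in Python as follows. The sketch rests on assumptions that should be read as such: it follows the usual APCS recipe associated with [13], in which the score of an artificial sample with zero concentration for every variable is subtracted from each sample score; the component weights, the data and the function names are hypothetical, and the study itself was carried out in STATISTICA.

```python
import math
import statistics


def solve(a, b):
    """Solve the linear system a·x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def apcs(data, weights):
    """Absolute principal component scores: the score of each sample minus the
    score of an artificial sample with zero concentration for every variable."""
    columns = list(zip(*data))
    means = [statistics.mean(c) for c in columns]
    stdevs = [statistics.stdev(c) for c in columns]

    def scores(row):
        z = [(v - m) / s for v, m, s in zip(row, means, stdevs)]
        return [sum(w * zj for w, zj in zip(component, z)) for component in weights]

    zero = scores([0.0] * len(means))
    return [[s - s0 for s, s0 in zip(scores(row), zero)] for row in data]


def regress(y, x_rows):
    """Ordinary least squares of y on the columns of x_rows, with an intercept."""
    design = [[1.0] + list(row) for row in x_rows]
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)  # [intercept, b_1, ..., b_p]


# Hypothetical data: 5 samples x 2 chemical variables, and a total mass that
# contains 2.0 units not related to either variable.
data = [[1.0, 4.0], [2.0, 1.0], [3.0, 5.0], [4.0, 2.0], [5.0, 6.0]]
mass = [2.0 + 3.0 * a + 0.5 * b for a, b in data]

# Hypothetical component weights (in practice taken from the PCA step).
w = math.sqrt(0.5)
weights = [[w, w], [w, -w]]

abs_scores = apcs(data, weights)
coefficients = regress(mass, abs_scores)
intercept = coefficients[0]  # mass not explained by any source
# Mean contribution of each source: regression coefficient times mean APCS.
contributions = [
    b * statistics.mean(column)
    for b, column in zip(coefficients[1:], zip(*abs_scores))
]
```

In this constructed example the intercept recovers the 2.0 units of mass that are unrelated to the measured variables, and the intercept plus the mean source contributions adds up to the mean observed mass. This mirrors how an intercept and the per-source values of an apportioning table relate to the estimated mass.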
Table 2 presents the regression models (regression using the absolute principal components scores) for each chemical parameter and for the total mass. The intercept indicates the unexplained mass or concentration. The determination coefficient R2 is a measure of the model validity. The largest part of the particle total mass is explained by the contribution of the secondary emission sources, followed by “mineral dust”, “coal burning”, “fertilizer”, “lead smelter” and “oil burning”. No contribution of marine salt is found, and the unexplained part amounts to nearly 15%. The model shows good validity (R2 = 0.84). Similarly, the contribution of the emission sources to the formation of the total concentrations of the chemical parameters can be found and estimated. No adequate models could be obtained for the apportioning of magnesium and nickel.

Conclusions

Sustainability concepts require respective technological, ecological and economic responses to possible environmental pollution. But in most cases the responses are univariate, e.g. calculation of eco-efficiency indicators, comparison with allowable levels of pollution, instructions for immediate action, etc. The multivariate statistical assessment of the air quality in the region of the site Unterloibach, Austria, seems to be a good example of how the multivariate approach to sustainability works. Monitoring (analytics) combined with chemometrics considers the environmental system in its whole complexity and, in this particular case study, reveals many “hidden” details about pollution sources and their effects on the environment. It is our deep conviction that chemometrics could contribute significantly to the concept of sustainability in all of its aspects: ecological, technological, economic and even social.

References

[1] Siemann W (2003) Umweltgeschichte, Themen und Perspektiven. Verlag C.H. Beck, München
[2] National round table on the environment and the economy, Canada. Eco-efficiency indicators workbook (2003) http://www.nrtee.ca/publications/eco-efficiency_workbook/
[3] Einax J W, Zwanziger H W, Geiß S (1997) Chemometrics in environmental analysis. VCH, Weinheim
[4] Simeonov V (2002) Encyclopedia of environmetrics. Wiley, New York
[5] Massart D L, Vandeginste B G M, Buydens L M C, De Jong S, Lewi P J, Smeyers-Verbeke J (1998) Handbook of chemometrics and qualimetrics; data handling in science and technology, parts A and B. Elsevier, Amsterdam
[6] Hopke P K (1991) Receptor modeling for air quality management. Elsevier, New York
[7] Berner A (1978) Chem Ing Techn 50: 399–412
[8] CEN Norm (1998) pr EN 1234
[9] Puxbaum H, Rendl J (1983) Microchim Acta I: 263–266
[10] Hansen A D, Rosen H, Novakov T (1984) The Sci Total Envir 36: 191–198
[11] Daszykowski M, Walczak B, Massart D L (2001) Chemom Intell Lab Syst 56: 83–91
[12] Stanimirova I, Daszykowski M, Massart D L, Questier F, Simeonov V, Puxbaum H (2004) J Envir Manag (in press)
[13] Thurston G D, Spengler J D (1985) Atmos Environ 19: 9–15
[14] Pacyna J M, Semb A, Hanssen J E (1984) Tellus 36B: 163–173
[15] Subcommittee on chromium. Chromium. Medical and biological effects of environmental pollutants, Division of medical sciences assembly of life sciences national research council (1974) National Academy of Sciences, Washington D.C.
[16] Steiger M (1991) PhD Dissertation, University of Hamburg
