Understanding The Molecular Information Contained in Principal Component Analysis of Vibrational Spectra of Biological Systems
DOI: 10.1039/c1an15821j
K-means clustering followed by Principal Component Analysis (PCA) is employed to analyse Raman
spectroscopic maps of single biological cells. K-means clustering successfully identifies regions of
cellular cytoplasm, nucleus and nucleoli, but the mean spectra do not differentiate their biochemical
composition. The loadings of the principal components identified by PCA shed further light on the
spectral basis for differentiation but they are complex and, as the number of spectra per cluster is
imbalanced, particularly in the case of the nucleoli, the loadings under-represent the basis for
differentiation of some cellular regions. Analysis of pure bio-molecules, both structurally and spectrally
distinct, in the case of histone, ceramide and RNA, and similar, in the case of the proteins albumin,
collagen and histone, shows the relatively strong representation of spectrally sharp features in the spectral
loadings, and the systematic variation of the loadings as one cluster becomes reduced in number. The
more complex cellular environment is simulated by weighted sums of spectra, illustrating that although
the loadings become increasingly complex, their origin in a weighted sum of the constituent molecular
components is still evident. Returning to the cellular analysis, the number of spectra per cluster is
artificially balanced by increasing the weighting of the spectra of smaller number clusters. While it
renders the PCA loading more complex for the three-way analysis, a pairwise analysis illustrates clear
differences between the identified subcellular regions, and notably the molecular differences between
nuclear and nucleoli regions are elucidated. Overall, the study demonstrates how appropriate
consideration of the data available can improve the understanding of the information delivered by PCA.
data pre-processing have considerably increased the relevancy of the information contained in the data, facilitating the standardisation of methods for data analysis.30 Different approaches such as PCA or K-means clustering are commonly used for the analysis of large amounts of data, allowing a discrimination of different samples or regions of a sample, according to differences in their biochemical content, and identification of the spectral features which manifest the highest degree of variability.1,7,31–33 Although these methods are used routinely and are quite well developed, the question of the relevance and molecular specificity of the information contained in the data remains largely unaddressed. Notably, in an increasingly multidisciplinary field, the results and underlying processes are seldom scrutinised.
K-means clustering has been employed widely in tissue analysis, providing the possibility of grouping the spectra into different clusters based on spectral similarity and therefore identifying common biochemical signatures and their spatial distributions. The clustering is based on the molecular information contained in the individual spectra and the results are commonly displayed as false colour maps of the average spectra of each cluster. While the technique is valuable for visualisation of the spectrally differentiated regions of the sample, the maps themselves do not show the original spectral data, and the mean spectrum representing each cluster can often under-represent subtle differences between the different regions of the sample.
PCA is a powerful approach for the analysis of large spectral data sets. It represents the spectra in data groupings of similar variability, allowing the identification and differentiation of different spectral groups. This approach is widely used to evaluate the possibility of discriminating different data sets (and therefore samples or regions of samples) using scores plots.29 However, the strength of this technique resides in the loadings, which give a representation of the spectral origin of the variations which differentiate the data groupings according to the wavenumbers.34–36 A combination of K-means clustering followed by PCA constitutes a well adapted tool for analysis of spectral data.19 Nevertheless, the analysis of the PCA loadings is not trivial and the relevance of the information contained in the plots often remains enigmatic.
In this study, Raman spectral mapping of single biological cells is presented and analysed using a combination of K-means clustering followed by PCA. In order to better understand the information obtained through the analysis, Raman spectra of individual biomolecular components are analysed using PCA. The observations made from the scores plots are correlated with the different loadings calculated using simple models based on these samples, and the influence of spectral differences and the number per dataset group are examined. The observations made are applied to the more complex data sets derived from sub-cellular maps recorded from single cells, demonstrating that a better understanding of the molecular contributions to the spectral variances can improve the data analysis process.
2. Materials and methods
2.1 Sample preparation
A549 cells from a human lung adenocarcinoma with alveolar type II phenotype were obtained from ATCC (Manassas, VA, USA). Cells were cultured in DMEM (Sigma), penicillin and streptomycin (Gibco) and 10% foetal calf serum (FCS, Biochrom, Berlin) in a humidified atmosphere containing 5% CO2 at 37 °C. Cells were loaded, at a concentration of 4 × 10⁴ cells, onto CaF2 and were incubated for 24 h at 37 °C, 95% CO2. Before measurements, the cell samples were fixed using a 10% formalin solution for 10 min. A number of studies on the effect of cell fixation using Raman spectroscopy can be found in the literature.34,37,38 By comparison between live cells and fixed cells using different fixation procedures, it has been demonstrated that the closest model for live cells is achieved using 10% formalin fixation. This is explained by the fact that while many fixation techniques require drying of the sample, formalin fixation keeps the cells in a hydrated state. Moreover, although the formalin could affect the protein, it also maintains the lipid content relatively intact compared to fixation methods employing alcohol. After fixation, the cells were washed 3 times using PBS and kept in this solution during the measurements.
Pure samples of biomolecular components were analysed by Raman spectroscopy to simulate biologically variable systems. In order to achieve the best homogeneity, samples were dissolved in appropriate solvents before deposition on CaF2 substrates. Lyophilised ribonucleic acid, albumin and histones (Sigma-Aldrich, Ireland) were dispersed in water. Lyophilised deoxyribonucleic acid from calf thymus, ceramide and L-phosphatidyl-ethanolamine (Sigma-Aldrich, Ireland) were dispersed in chloroform. Collagen solution in acetic acid at a concentration of 5 mg mL⁻¹ (Gibco, Ireland) was deposited from acetic acid solution. For all biomolecular samples, the Raman signals were stable over prolonged measurement periods (1 h). Signal variations were largely due to inhomogeneity in the deposited sample and variation in the background present in the spectra collected. The spectra were also utilised to help identify the biochemical origins of the features of PC loading spectra. Isolation of the different biomolecular components from cells would present the best references for the characterisation of the cellular content. However, comparisons are qualitative only, and moreover, the similarity existing between the pure bio-molecules tested and the spectra recorded from the cells is satisfactory and gives a good representation of the information delivered by the PCA.
2.2 Raman spectroscopic measurements
A Horiba Jobin-Yvon LabRAM HR800 spectrometer was used throughout this work. For the measurements, either a ×100 objective (MPlanN, Olympus), for the recording of spectra from pure compounds, or a ×100 immersion objective (LUMPlanF1, Olympus), for cell mapping, was employed, each providing a spot diameter of 1 µm at the sample. The confocal hole was set at 100 µm for all measurements, the specified setting for confocal operation. The 785 nm laser line was used for this study, delivering a power of 40 mW at the sample. The system was spectrally calibrated to the 520.7 cm⁻¹ spectral line of silicon. The LabRAM system is a confocal spectrometer that contains two interchangeable gratings (300 and 600 lines/mm respectively). In the following experiments, the 300 lines/mm grating was used, providing a spectral dispersion of approximately 1.5 cm⁻¹ per pixel. The backscattered Raman signal was integrated for 10 s intervals over the spectral range from 400 to 1800 cm⁻¹.
The detector used was a 16-bit dynamic range Peltier cooled CCD detector. 100 spectra were collected for each different biomolecular component tested. The spectra recorded from single cells have been obtained using the mapping function of the Labspec software using a 1 µm step size. The Raman cellular map presented in this paper has been selected for illustration purposes. It is a typical example and matches observations made in a large cell screening study as presented in previous work.19
2.3 Data preprocessing
Data analysis was performed using Matlab (Mathworks, USA).
2.4 Data analysis
K-means clustering analysis is one of the simplest unsupervised learning algorithms used for spectral image analysis. It groups the spectra according to their similarity, forming clusters, each one representing regions of the image with identical molecular properties. The distribution of chemical similarity can then be visualised across the sample image. The number of clusters (k) has to be determined a priori by the operator before initiation of the classification of the data set. K centroids are defined, ideally as far as possible from each other, and then each point belonging to a data set is associated to the nearest centroid. When all the points have been associated with a centroid, the initial grouping is done. The second step consists of the calculation of new centroids as barycentres of the clusters resulting from the previous step. A new grouping is implemented between the same data points and the new centroids. These operations are repeated until convergence is reached and there is no further movement of the centroids. Finally, k clusters are determined, each containing the most similar spectra from the image, and represented by the mean of all spectra of that cluster. False colour maps can then be constructed to visualise the organisation of the clusters in the original image.
PCA is a method of multivariate analysis broadly used with datasets of multiple dimensions.30 It allows the reduction of the number of variables in a multidimensional dataset, although it retains most of the variation within the dataset. The order of the principal components (PCs) denotes their importance to the dataset. PC1 describes the highest amount of variation, PC2 the second highest, and so on. Therefore, var(PC1) ≥ var(PC2) ≥ … ≥ var(PCp), where var(PCi) represents the variance of PCi in the considered data set. Generally, the first three PCs represent the highest variance present in the data sets, up to 99%, giving the best visualisation of the differentiation of the different clusters.39,40 However, when recording Raman data from single cells, the noise present in the spectra can increase the intragroup variability, thus reducing the specificity of the PCA. In such cases, typically the first 10 PCs can be taken into account for specific analysis.41 Nevertheless, the PCs contribute less in decreasing order, meaning that the first PCs contain the most information.42 In order to simplify interpretation of experimental observations, this study is aimed at understanding the molecular information contained in the first two PCs. It is considered that the observations made in this work can be applied for all the different PCs calculated using PCA.
PCA was employed for this study to highlight the variability existing in the spectral data set recorded during the different experiments. The other advantage of this method is the observation of loadings which represent the variance for each variable (wavenumber) for a given PC. Analysing the loadings of a PC can give information about the source of the variability inside a dataset, derived from variations in the molecular components contributing to the spectra.
For the purpose of this study, rather than a full cellular analysis,18 the objective was to differentiate only three different types of spectra, and for this reason the acquisition was focused on the nucleus including some neighbouring cytoplasm. An optical image of a typical A549 cell is shown in Fig. 1 IA. It has been previously demonstrated that Raman spectroscopy can effectively identify and discriminate the cytoplasm as well as nucleoli within the nucleus.18,19 Using K-means clustering analysis, three different clusters can thus be identified, corresponding to nucleoli, the nucleus and cytoplasm, as shown in the example of Fig. 1 IB. Note, the black region represents areas which were not sampled spectroscopically, rather than a fourth K-means cluster.
Mean spectra corresponding to the different clusters were derived from the K-means clustering and are shown in Fig. 1 II. The three different spectra are rather similar and only small variations can be seen between the different structures, illustrating that although the analytical technique is very efficient at identifying sub-cellular regions, the mean spectra give little insight into the molecular basis of differentiation.
PCA can provide further insight into the source of the spectral variability and therefore differentiation of the different sub-cellular regions. In Fig. 1 III, the data are colour coded according to their classification by the K-means clustering analysis, but the PCA clusters are widely spread and not clearly differentiated. The spectra have been recorded from cells and the signal is relatively weak, resulting in relatively noisy spectra. Also, variations can be present due to the spatial non-uniformity of the sample. Unavoidably, spectra of the nucleus will have varying contributions from the overlying cell membrane and cytoplasm, and the spatial separation of the sub-cellular regions is not necessarily distinct, within the measurement resolution and step size (1 µm), although this can be partially alleviated in confocal operation. Nevertheless, the three different groups are relatively well discriminated using this method. PC1, which accounts for 35% of the variance, discriminates the nuclear spectra from those of the cytoplasm, whereas PC2 accounts for 8% of the variance and allows discrimination of the spectra from within the nucleus.
Beyond differentiation and classification, the potential of PCA lies in the possibility to derive information regarding the basis for discrimination from the loadings corresponding to each PC. A clear representation of the spectral variability can be seen, and
Fig. 1 I: (A) Typical confocal microscope image of an A549 cell. The different structures such as membrane (a), cytoplasm (b) and nucleus (c) are
clearly identifiable. The nucleolus present inside the nucleus can also be seen. The area delineated by red indicates a "typical area" selected for Raman
mapping. (B) Example of K-means reconstructed image from a Raman map recorded on the nuclear area of an A549 cell. In both X and Y directions, 1 pixel
corresponds to a mapping step of 1 µm. II: Mean spectrum calculated for the different clusters obtained after K-means clustering analysis corresponding
to the nucleoli (A), nucleus (B) and cytoplasm (C). III: Scores plot of the first two principal components after PCA performed on Raman spectra
recorded from A549 cells. The individual data points have been colour coded according to the results of K-means cluster analysis; nucleus (green),
nucleolus (blue) and cytoplasm (red). IV: Plot of the loadings of PC1 (A) and PC2 (B). Different features corresponding to the lipids, proteins and nucleic
acids can be identified.
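The K-means procedure described in Section 2.4 can be sketched in a few lines. This is a minimal illustration in plain Python, not the Matlab implementation used for the study: the function names (`sq_dist`, `init_centroids`, `kmeans`), the farthest-point initialisation and the toy three-channel "spectra" are all assumptions made for the example.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two spectra (lists of floats)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def init_centroids(spectra, k):
    """Pick k starting centroids that lie far from each other.

    Assumption: the first spectrum is taken as the first centroid, and each
    further centroid is the spectrum farthest from those already chosen.
    """
    centroids = [list(spectra[0])]
    while len(centroids) < k:
        far = max(spectra, key=lambda s: min(sq_dist(s, c) for c in centroids))
        centroids.append(list(far))
    return centroids


def kmeans(spectra, k, max_iter=100):
    """Group spectra into k clusters; returns (labels, centroids)."""
    centroids = init_centroids(spectra, k)
    labels = None
    for _ in range(max_iter):
        # Step 1: associate every spectrum with its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(s, centroids[c])) for s in spectra
        ]
        # Converged: the grouping (and hence the centroids) no longer changes.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 2: recompute each centroid as the barycentre (mean) of its cluster.
        for c in range(k):
            members = [s for s, lab in zip(spectra, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy data: three groups of five nearly identical three-channel "spectra".
bases = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
spectra = [[v + 0.01 * j for v in base] for base in bases for j in range(5)]
labels, centroids = kmeans(spectra, k=3)
print(labels)
```

In a mapping experiment, each label would be assigned back to the pixel its spectrum came from to build the false colour map, and each returned centroid plays the role of the cluster mean spectrum.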
moreover these loadings can be compared to pristine Raman spectra.18
The loadings of the principal components are shown in Fig. 1 IV. The plots are offset for clarity, the dotted line indicating the zero point in each case. PC1 has peaks which can be attributed to biochemical constituents such as nucleic acids (788, 1080, 1339 cm⁻¹), proteins (1003, 1268, 1339, 1437 cm⁻¹) and lipids (715, 872, 1066, 1080, 1299, 1437 cm⁻¹) (for detailed assignments, see for example8,10,43,44). Their respective negative and positive loadings contribute substantially to the differentiation of the nuclear (negative) and cytoplasmic (positive) spectra in the scores plot of Fig. 1 III. Differentiation of nucleus and cytoplasm based on nucleic acid and lipidic content is somewhat trivial, however, and the loading is rich in features which further contribute to the differentiation, although a detailed analysis is complex.
Furthermore, PC2 is rather noisy, and it is difficult to extract specific information relating to biomolecular constituents which differentiate the nuclear and nucleoli datasets in the scores plot. It should be noted that PCA is an unsupervised technique, and does not differentiate between variability within the dataset groupings and variability between the groupings. Thus, intra-group variability can contribute substantially to the variance and loadings which differentiate the groups.40,41 Nevertheless, in the case of the differentiation of cytoplasm and nucleus/nucleoli by PC1 and nucleus and nucleoli by PC2, there is a clear separation of the groups according to positive and negative scores, indicating that the loadings are representative of underlying biological differences.
PCA, in combination with K-means clustering, sheds further light on the basis for differentiation of the cellular regions. However, the loadings are complex or inconclusive, rendering interpretation of the underlying biology nontrivial. In order to further elucidate factors which govern the differentiation of spectral groups, a series of studies on pure biomolecular compounds was conducted.
3.2 Understanding the PCA
In an effort to better elucidate the process of differentiation by PCA, based on biochemical content, a comparison was made of the Raman spectra collected from 3 structurally distinct bio-molecules: RNA, histone (protein) and ceramide (lipids). 100 spectra were recorded for each sample. Fig. 2 I presents mean spectra calculated for each of the samples recorded, offset for
clarity. These molecules are commonly found in biological samples and have been specifically selected for their high degree of dissimilarity in the spectral range 400–1800 cm⁻¹.
The three data sets were loaded in Matlab and PCA was performed on the entire spectral window used for the acquisition. As expected, the three groups are well discriminated in the scores plot, as shown in Fig. 2 II. PC1 represents 70% of the explained variance and allows the discrimination between the three groups. Notably, the histones are grouped at zero on the PC1 axis, while the RNA and ceramide spectra are symmetrically grouped at the negative and positive extremes, respectively. PC2 represents 29% of the variance and discriminates the histone spectra from the RNA and ceramide spectra. For this PC, little or no discrimination between RNA and ceramide is observed. In the example presented in Fig. 2 II, the bio-molecules tested present highly different Raman signatures, therefore the intra-group variability is very low and the different spectra for each group are closely grouped. The RNA sample was observed to be physically slightly more heterogeneous than the other materials after drying, resulting in significant differences in the intensity of the signal recorded. Thus, even after background and baseline correction, some variation persists. However, the groups are well defined and discriminated across the scores plots, indicating that the PCs are a clear representation of inter-group variance.
Fig. 2 III compares the loading of PC1 (Fig. 2 IIIB) with the spectra of pure ceramide (Fig. 2 IIIA) and RNA (Fig. 2 IIIC). The primary observation is that all the peaks corresponding to the ceramide appear as positive features in PC1, whereas those from the RNA correspond to negative features of PC1. In relation to the zero-line, the loadings are almost a perfect representation of each pure spectrum, the one of RNA being inverted. Notably, histone features are totally absent in this plot and have absolutely no influence on the information contained in PC1, as might be expected from Fig. 2 II. To illustrate this effect, the loading obtained from the PCA of the data sets of the RNA and ceramide alone was calculated and was found to overlap exactly with the loading of PC1 obtained using the 3 data sets (data not shown). However, this loading overlaps perfectly with Fig. 2 IIIB and so is not discernible. Notably, although all spectra have been normalised before analysis, PCA identifies the histone as having the lowest variability, perhaps because of the lack of sharp individual spectral features in comparison with the other species (Fig. 2 I).
An interesting observation is the correlation existing between the values of the loading and the positions of the spectra in the PCA plot. The spectra of the RNA are negative with respect to the loading of PC1 whereas the spectra from the ceramide are positive. Thus, a link exists between the composition of the samples analysed and the profile of the loading of PC1, or, more precisely, the loading of PC1 in this case is a representation of the molecular composition of each sample. To understand the information contained in the loadings calculated from the PCA, a simple comparison is made in Fig. 2 III. The loading of PC1, which discriminates the RNA from the ceramide, is compared to the difference spectrum between the mean spectrum of ceramide
Fig. 2 I: Mean Raman spectra recorded from RNA (A), histone (B) and ceramide (C) on CaF2 windows. II: Scores plot of the 2 first principal
components after PCA performed on Raman spectra recorded from RNA, histone and ceramide. III: Plot of the loadings of PC1 (blue dot line)
compared with the difference spectrum calculated from the mean spectrum of ceramide minus the mean spectrum of RNA (red dash line). The loadings
are compared with spectra recorded from ceramide (A) and RNA (C), both offset for clarity. IIIB: Plot of the loadings of PC2 (blue dot line) compared
with the difference between the mean spectrum of histone minus the average mean spectrum of RNA and ceramide (red dash line).
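The comparison drawn in Fig. 2 III, a PC1 loading set against a difference of mean spectra, can be illustrated on synthetic data. The sketch below is an assumption-laden toy, not the analysis performed in the study: the two "pure spectra" are single Gaussian bands, the noise level is arbitrary, and the helper names (`mean_spectrum`, `first_loading`, `cosine`, `band`, `noisy_copies`) are hypothetical. The leading loading is obtained by power iteration rather than by a library PCA routine.

```python
import math
import random


def mean_spectrum(spectra):
    """Channel-wise mean of a list of equal-length spectra."""
    return [sum(col) / len(spectra) for col in zip(*spectra)]


def first_loading(spectra, n_iter=50):
    """Unit-length PC1 loading, by power iteration on the mean-centred data."""
    mu = mean_spectrum(spectra)
    X = [[x - m for x, m in zip(s, mu)] for s in spectra]
    v = [1.0 / (j + 1) for j in range(len(mu))]  # arbitrary non-zero start vector
    for _ in range(n_iter):
        scores = [sum(x * w for x, w in zip(row, v)) for row in X]  # t = X v
        v = [sum(t * row[j] for t, row in zip(scores, X)) for j in range(len(mu))]  # X^T t
        norm = math.sqrt(sum(w * w for w in v))
        v = [w / norm for w in v]
    return v


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))


def band(centre, width, n_channels=30):
    """Toy spectrum: one Gaussian band sampled on n_channels points."""
    return [math.exp(-(((j - centre) / width) ** 2)) for j in range(n_channels)]


rng = random.Random(0)


def noisy_copies(pure, n, sigma=0.01):
    """n replicate measurements of a pure spectrum with small Gaussian noise."""
    return [[x + rng.gauss(0.0, sigma) for x in pure] for _ in range(n)]


group_a = noisy_copies(band(8, 2.0), 30)   # stands in for one compound, e.g. RNA
group_b = noisy_copies(band(20, 3.0), 30)  # stands in for another, e.g. ceramide

loading = first_loading(group_a + group_b)
difference = [b - a for a, b in zip(mean_spectrum(group_a), mean_spectrum(group_b))]
similarity = abs(cosine(loading, difference))  # the sign of a loading is arbitrary
print(round(similarity, 3))
```

With two balanced groups and low intra-group variability, the similarity is expected to be very close to 1, which is the toy counterpart of the near-exact overlap reported for the PC1 loading and the ceramide minus RNA difference spectrum.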
and the mean spectrum of RNA. The two spectra overlap almost exactly and no major variations can be seen.
A similar relationship between the source molecular spectra and the PC loadings can also be demonstrated for PC2. In Fig. 2 IIIB, the loading of PC2 has been plotted, but in order to find a match, the mean spectra of RNA and ceramide have first been averaged before being subtracted from the mean spectrum of the histones. The resulting spectrum contains all the same peaks as the loading of PC2 and only slight variations in relative peak intensities are observed.
Thus, PCA can provide information on the molecular composition or underlying biochemical differences of the data sets analysed and the results presented are comparable to the difference spectra that can be calculated by simple subtraction. A direct correlation exists between the position of the spectra in the scores plot and the value of the loadings. The negative scores are as informative as the positive ones, but correspond to two different directions in the scatter plot. Using the scale present for each PC, the negative and positive peaks can be attributed to the different groups of the plot and using different reference spectra these peaks can be then matched with specific features for the molecular characterisation of the samples.
3.3 Sensitivity of PCA
In the previous section, PCA was applied to three significantly different bio-molecules having strong dissimilarities in their Raman signatures. In this section, the histone spectra are compared to those of albumin and collagen. These three samples are proteins and therefore have more similar Raman profiles. Fig. 3 I presents the mean spectra of albumin and collagen compared to the histone spectrum, offset for clarity. The three spectra contain many similar features and only the regions around 1300–1350 cm⁻¹ and 800–950 cm⁻¹ exhibit obvious differences between the three different molecules.
Despite the similarities of their spectra, PCA clearly discriminates the molecules as shown in the scores plot of Fig. 3 II. As observed in the previous section, the intra-group variance is very low in comparison to the inter-group differentiation, suggesting that the loading is a good representation of the spectral variation between the different bio-molecules tested. PC1 describes 71% of the observed variance, and PC2 26%. Albumin and collagen are largely discriminated by PC1, although they are not placed symmetrically about the origin, and they are also discriminated somewhat by PC2. Histone is differentiated from the other two proteins largely by PC2, although in this instance it does not sit at the origin of PC1 and so is somewhat discriminated by PC1. As in this case both PC1 and PC2 contribute to the differentiation of all three species, the loadings are more complex and are not as clearly derived from the individual molecular components as for the previous example. However, in comparison to the spectra of collagen and albumin, specific features for each of them can be identified in the loadings. As expected, the loading of PC1 is dominated by these two components, as shown in Fig. 3 III. Indeed, the loading of the PC which differentiates a dataset of just collagen and albumin is almost identical to that which discriminates the three proteins (data not shown). As in the previous example, the loading of PC1 can be accurately reproduced by taking the difference of the mean spectra of albumin and collagen, as shown in Fig. 3 III.
In the case of PC2, however, the loading is not easily associated with the spectrum of any one of the individual components, although some of the stronger features can be identified, including the disulfide stretching at 509 cm⁻¹, the C–C twisting mode of Phe (proteins) at 623 cm⁻¹, the C–C stretching (proteins) at 816 cm⁻¹, the C–C aromatic ring stretching in Phe at 1005 cm⁻¹ or the amide I band region around 1655 cm⁻¹ (Fig. 3 IV).10,45–47 However, when summed according to their scores in the scores plot, (0.065 × (albumin spectrum) + 0.035 × (collagen spectrum)) − 0.1 × (histone spectrum), an excellent match to the loading of PC2 is achieved, as shown in Fig. 3 V. Notably, when weighted by the number of datapoints per group (100), the respective sums of the negative and positive loadings each equals 1. Thus, although the principal components for more complex systems are not easily identifiable with the spectra of constituent biomolecular components of the sample, they are clearly determined by weighted sums of the spectra of those components.
3.4 Influence of imbalanced spectral datasets on PCA
As described in the previous section, the weighted sum of the spectra from the different components determines the profiles of the loadings when the number of spectra in each dataset is equal. However, when working on biological samples, especially when the spectra are extracted from large maps, the number of spectra present in each group is usually determined by the relative sizes of the tissue or cellular regions, the laser spot size and sampling step size, and thus can be imbalanced. To illustrate the effect of such an imbalance in the datasets on the PCA, the number of spectra composing the dataset recorded from the albumin was gradually reduced from 100 to 75, 50, 25 and finally 10 spectra, while the number of spectra of histone and collagen was kept constant. PCA was run for each dataset and the final plots are merged in Fig. 4 I, to visualise the evolution of the position of each cluster in the scatter plot.
The first observation is that although the number of spectra in only one group has been varied, the entire scores plot is affected and the positions of the data sets corresponding to the histone and collagen vary considerably. As the number of albumin spectra is systematically varied, they both move toward the zero position of PC2, approaching zero when the group corresponding to the albumin is reduced to 10 spectra. At this point they are predominantly discriminated by PC1 alone. Their positions also evolve according to PC1, their relative spacing increasing such that in the last data set of 10 albumin spectra, they are almost equidistant from the zero position of PC1. Simultaneously, the cluster corresponding to albumin is also significantly shifted when its number of spectra is reduced. When the groups have equal numbers, this cluster is positioned near the zero position of PC2, but gradually moves away from it as its constituent number is reduced. Although less pronounced, the cluster drifts to lower (negative) values of PC1, such that the positions of the albumin and histone data sets are almost aligned with respect to PC1 when the number of spectra for albumin is reduced to 10 spectra.
These modifications in the relative positions of the different groups reflect the variation in the relative contribution of each
Published on 24 November 2011 on http://pubs.rsc.org | doi:10.1039/C1AN15821J
Fig. 3 I: Mean Raman spectra recorded from albumin (A) collagen (B) and histone (C) on CaF2 windows. II: Scores plots of the first two principal
components after PCA performed on Raman spectra recorded from albumin (green), histone (red) and collagen (blue). III: Plot of the loadings of PC1
(blue dotted line) compared with the difference between the mean spectrum of collagen minus the mean spectrum of albumin (red dashed line). The
loadings are compared with spectra recorded from collagen (A) and albumin (C), both offset for clarity. IV: Plot of the loadings of PC2 (blue line)
compared with the difference between the mean spectrum of collagen minus the mean spectrum of albumin (red line). V: Plot of the loadings of PC2 (blue
dotted line) compared with the simulated weighted sum of the different spectra according to their position on the scatter plot (red dashed line).
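The agreement between a loading and a score-weighted sum of spectra (Fig. 3 V) follows from the algebra of PCA itself. The derivation below is a standard linear-algebra sketch and is not taken from the paper. All symbols are notation assumed here for illustration: $X$ is the $n \times p$ matrix of mean-centred spectra with rows $\tilde{x}_i = x_i - \bar{x}$, $v$ is a unit-length loading, $t = Xv$ are the scores on that PC, $\lambda$ is the corresponding eigenvalue of $X^{\top}X$, and $n_g$, $t_g$, $m_g$ are the size, common score and mean spectrum of group $g$.

```latex
% A loading v is an eigenvector of X^T X; t holds the scores of the n spectra.
\begin{align}
X^{\top} X \, v &= \lambda v, \qquad t = X v, \\
X^{\top} t &= \lambda v, \qquad t^{\top} t = v^{\top} X^{\top} X \, v = \lambda, \\
v &= \frac{X^{\top} t}{t^{\top} t}
   = \frac{\sum_{i=1}^{n} t_i \, \tilde{x}_i}{\sum_{i=1}^{n} t_i^{2}}
   = \frac{\sum_{i=1}^{n} t_i \, x_i}{\sum_{i=1}^{n} t_i^{2}},
   \qquad \text{since } \sum_{i=1}^{n} t_i = 0 .
\end{align}
% If intra-group variance is negligible, every spectrum in group g has
% (approximately) the same score t_g, and the sum collapses onto group means:
\begin{equation}
v \approx \frac{\sum_{g} n_g \, t_g \, m_g}{\sum_{g} n_g \, t_g^{2}},
\qquad \sum_{g} n_g \, t_g = 0 .
\end{equation}
```

Under these assumptions the positive and negative weights balance, as do the weights of 0.065 and 0.035 against 0.1 quoted in the text for equal group sizes. The group size $n_g$ also enters each weight directly, which is one way to see why imbalanced datasets reshape the loadings.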
Fig. 4 I: Scores plot of the first two principal components after PCA performed on Raman spectra recorded from albumin (Alb - green), histone
(H - red) and collagen (COL - blue). The data set corresponding to the albumin has been systematically reduced from 100 to 75, 50, 25 and 10 spectra. II:
Plot of the loadings of PC1 (B) corresponding respectively to the first principal component resulting from the PCA analysis for Alb n = 100 (blue) and
Alb n = 10 (red). These loadings have been compared with spectra recorded from collagen (A) and histone (C) and albumin (D). III: Scores plot of the
first two principal components of PCA performed on Raman spectra recorded from protein mixtures. For all data sets, filled circles correspond to PCA
with 50/30/20 histone/DNA/RNA, whereas unfilled squares correspond to PCA with 50/45/5 histone/DNA/RNA. IV: Plot of the loading of PC1 (C) of
the PCA analysis for proteins mixture. This loading is compared with spectra recorded from different compounds such as ceramide (A) and albumin (B)
and RNA (D), DNA (E) and histone (F). V: Plot of the loadings of PC2 of the PCA analysis for protein mixtures with the data set H 50% DNA 30%
RNA 20% (C) and H 50% DNA 45% RNA 5% (D). The loadings are compared to spectra recorded from different compounds such as histone (A) and
RNA (B) and DNA (E).
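The influence of group size illustrated in Fig. 4 I can be reproduced qualitatively with a toy model. The sketch below is only an illustration under stated assumptions: the three-channel "spectra", their band intensities, the noise level and the helper names (`first_loading`, `cosine`, `noisy_copies`, `pc1_alignment`) are all hypothetical, and PC1 is computed by power iteration rather than with the Matlab routines used in the study. It measures how closely the PC1 loading aligns with the histone minus collagen difference, first for balanced groups and then with the albumin group cut to 10 spectra.

```python
import math
import random


def first_loading(spectra, n_iter=100):
    """Unit-length PC1 loading, by power iteration on the mean-centred data."""
    mu = [sum(col) / len(spectra) for col in zip(*spectra)]
    X = [[x - m for x, m in zip(s, mu)] for s in spectra]
    v = [1.0 / (j + 1) for j in range(len(mu))]  # arbitrary non-zero start vector
    for _ in range(n_iter):
        scores = [sum(x * w for x, w in zip(row, v)) for row in X]  # t = X v
        v = [sum(t * row[j] for t, row in zip(scores, X)) for j in range(len(mu))]  # X^T t
        norm = math.sqrt(sum(w * w for w in v))
        v = [w / norm for w in v]
    return v


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))


rng = random.Random(1)


def noisy_copies(pure, n, sigma=0.05):
    """n replicate measurements of a pure spectrum with small Gaussian noise."""
    return [[x + rng.gauss(0.0, sigma) for x in pure] for _ in range(n)]


# Toy "spectra": one characteristic band per protein. Albumin is given a
# stronger band so that the balanced case has a well-defined PC1.
albumin = [2.0, 0.0, 0.0]
histone = [0.0, 1.0, 0.0]
collagen = [0.0, 0.0, 1.0]
histone_vs_collagen = [h - c for h, c in zip(histone, collagen)]


def pc1_alignment(n_albumin):
    """|cosine| between the PC1 loading and the histone - collagen difference."""
    data = (noisy_copies(albumin, n_albumin)
            + noisy_copies(histone, 100)
            + noisy_copies(collagen, 100))
    return abs(cosine(first_loading(data), histone_vs_collagen))


balanced = pc1_alignment(100)   # 100 / 100 / 100 spectra
imbalanced = pc1_alignment(10)  # albumin reduced to 10 spectra
print(round(balanced, 2), round(imbalanced, 2))
```

In this toy case PC1 mainly separates albumin from the other two groups when the groups are balanced, and switches to separating histone from collagen once albumin is under-represented. This resembles the reported trend in which histone and collagen become predominantly discriminated by PC1, although the real spectra overlap far more than these toy bands.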
loading of PC1 are also affected and specific features corresponding to the collagen are also better defined when the groups are imbalanced. This is due to the increased differentiation of the two clusters according to PC1. Notably, therefore, in the reduced dataset, the molecular origin of the discrimination of the two majority clusters becomes clearer.
3.5 PCA of complex mixtures
The studies outlined above illustrate how PCA can differentiate structurally (and therefore spectrally) distinct and similar molecular species. In biological samples such as tissues or cells, however, these species are spatially mixed, and therefore within a typical sampling area are spectrally mixed. In order to simulate such mixed spectra, mixed data sets were constructed using weighted sums of the spectra of individual compounds. Initially, a dataset of the three groups was constructed: (i) 50/50 albumin/ceramide, (ii) 50/50 histone/DNA, and (iii) 50/30/20 histone/DNA/RNA. The mixtures thus mimic regions of equal protein content, but which are relatively rich in either lipidic (i) or nucleic acid (ii), or have differing nucleic acid (iii) content. In a simplistic approach, these can be proposed to represent (i) cytoplasmic/membrane, (ii) nuclear and (iii) nucleolar regions.
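The construction of the three simulated mixtures can be sketched as a weighted sum of pure-component spectra. This is a minimal illustration, assuming the quoted ratios act as fractional weights on intensity-normalised spectra. The `mix` helper and the five-channel toy "spectra" are hypothetical stand-ins for the measured data.

```python
def mix(components, weights):
    """Weighted sum of pure-component spectra.

    components: dict mapping compound name -> spectrum (list of floats)
    weights:    dict mapping compound name -> fraction (fractions sum to 1)
    """
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights should sum to 1")
    n_channels = len(next(iter(components.values())))
    return [
        sum(w * components[name][j] for name, w in weights.items())
        for j in range(n_channels)
    ]


# Toy "spectra" with one characteristic channel per compound (made-up values).
pure = {
    "albumin":  [1.0, 0.0, 0.0, 0.0, 0.0],
    "ceramide": [0.0, 1.0, 0.0, 0.0, 0.0],
    "histone":  [0.0, 0.0, 1.0, 0.0, 0.0],
    "DNA":      [0.0, 0.0, 0.0, 1.0, 0.0],
    "RNA":      [0.0, 0.0, 0.0, 0.0, 1.0],
}

cytoplasm_like = mix(pure, {"albumin": 0.5, "ceramide": 0.5})         # group (i)
nucleus_like = mix(pure, {"histone": 0.5, "DNA": 0.5})                # group (ii)
nucleolus_like = mix(pure, {"histone": 0.5, "DNA": 0.3, "RNA": 0.2})  # group (iii)
print(nucleolus_like)
```

To build a full simulated dataset, the same weights would be applied to each of the individual spectra recorded for the pure compounds, so that every mixture group retains a realistic spread of spectra.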
This journal is © The Royal Society of Chemistry 2012. Analyst, 2012, 137, 322–332
scores plot according to PC1 which indicates a better differentiation.
The main limitation in this example is the presence of a high intra-group variability which can distort the information contained in the loadings. The nuclear spectra are distributed on the negative and positive side of PC1 and thus the contribution to the loading of PC1 is difficult to interpret. In terms of PC2, the intra-group variability of the cytoplasm spectra is higher than the difference existing between the nucleoli and nucleus clusters. Therefore the specificity of the loading of PC2 is reduced. Therefore, as shown for the pure bio-molecules in section 3.2, a clearer interpretation of the loadings is provided by a direct comparison
and thus molecular contributions to lower number clusters can be misrepresented. Balanced clusters provide the best representation of relative contributions to loadings, but ultimately, the best results are provided by a pairwise analysis of identified sub-cellular regions, which yields a clearer representation of the underlying biochemical differences.
Acknowledgements
This research was supported by the National Biophotonics and Imaging Platform (NBIP) Ireland funded under the Higher Education Authority PRTLI (Programme for Research in Third Level Institutions) Cycle 4, co-funded by the Irish Government
