SIMCA-P+ 12 Tutorial
Background
Data were collected to investigate the consumption pattern of a number of provisions in different
European countries. The purpose of the investigation was to examine similarities and differences between
the countries, and their possible explanations.
Objective
The objective of this study is to understand how the variation in food consumption among a number of
industrialized countries is related to culture and tradition, and hence to find the similarities and
dissimilarities among the countries. To this end, data have been collected on 20 variables and 16 countries.
The data show what percentage of households use each of the 20 food items regularly.
Data
The data set consists of 20 variables (the different foods) and 16 observations (the European countries).
The values are the percentages of households in each country where a particular product was found. For
the complete data table, see below. This table is a good example of how to organise your data. There are
two secondary observation identifiers, Location (geographic) and Latitude (of capital). The coding of
Location is: C = central; S = south; N = north; U = UK & Ireland; X = beneluX. The coding of Latitude
is: 1 = < 45; 2 = 45-50; 3 = 50-55; 4 = 55-60; 5 = > 60.
Outline
The steps to follow in SIMCA-P are:
Import the data set.
Prepare the data (Workset menu).
Fit a PC model and review the fit (Analysis menu).
Interpret the results (Analysis menu).
Define project
Start SIMCA-P and create a new project from FILE | NEW
The import wizard detects that there is an empty row and asks if you want to exclude that row.
Choose Yes.
The setting for column 2 is now Primary ID. The first column is set to exclude, which is fine. The 3rd and
4th columns (Geographic location and Capital Latitude) are not unique and will both be set as secondary
IDs.
The rest of the columns are the data (X-variables) and are not changed.
Click on Next to give the project a name and a destination directory. Missing values are indicated.
Analysis
After finishing the import wizard the primary dataset is created in SIMCA. The primary dataset is the data
used to create models from. By default the whole dataset is selected with UV (unit variance) scaling. The
primary dataset itself will not change; when you want to make models where you change observations
and/or variables, change scaling, etc., you work on a copy of it.
The primary dataset can be shown by choosing menu Dataset: Open: FOODS_update, or by using the
speed button.
Here it is possible to do several things. If you right click in the table several options are available.
In this case we want to fit a model to the data, so we use menu Analysis: Autofit or the speed button.
This will calculate components one at a time and check the significance of each component
(based on cross-validation). When a component is not significant the procedure stops.
A summary window opens up showing the R2 and Q2 for the significant components.
To see the details of the model, double click on the model row in the project window.
The plot with the summary of the fit of the model is displayed with R2X(cum) (fraction of the variation of
the data explained after each component) and Q2(cum) (cross validated R2X(cum)).
The summary of the fit of the model is displayed with R2X (fraction of the variation of the data explained
by each component) and cumulative R2X(cum), Q2 and Q2(cum) (cross validated R2X and R2X(cum))
as well as the eigenvalues. The food variables are, as expected, correlated, and fairly well summarized by
three new variables, the scores, explaining 65% of the variation.
These plots are the score plot (upper left, t1 vs. t2), the loading plot (lower left, p1 vs. p2), DModX
(distance to model) and the X/Y Overview plot (showing R2 and Q2 for each variable).
The DModX plot shows that no observation is far away from the model (projection); statistically they are
all below the critical limit (Dcrit). The X/Y overview plot shows that some variables have relatively high
R2/Q2, indicating systematic behavior. Some have low values (even negative Q2), indicating low variation
(consumption almost constant over all countries).
The ellipse represents the Hotelling T2 with 95% confidence (see statistical appendix).
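For reference, a common formulation of the statistic behind this ellipse, for N observations and A model components, is the following (the exact critical limit used by SIMCA may differ slightly; see the statistical appendix):

\[
T_i^2=\sum_{a=1}^{A}\frac{t_{ia}^{2}}{s_{t_a}^{2}},
\qquad
T^2_{\mathrm{crit}}(\alpha)=\frac{A(N-1)}{N-A}\,F_{\alpha}(A,\,N-A)
\]

where s_{t_a}^2 is the variance of the score vector t_a and F is the F-distribution.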
The scores t1 and t2, one vector for each of components 1 and 2, are new variables computed as linear
combinations of all the original variables to provide a good summary.
The weights combining the original variables are called loadings (p1 and p2), see below.
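To make these quantities concrete, below is a minimal sketch in Python of UV scaling followed by PCA, using scikit-learn. The file name foods.csv and the table layout are assumptions for illustration; SIMCA's NIPALS and cross-validation machinery is not reproduced.

import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical file: 16 countries (rows) x 20 food variables (columns)
foods = pd.read_csv("foods.csv", index_col=0)

X = StandardScaler().fit_transform(foods)   # UV scaling: mean 0, variance 1 per variable
pca = PCA(n_components=3).fit(X)

scores = pca.transform(X)                   # t1..t3, one value per country
loadings = pca.components_.T                # p1..p3, one weight per food variable
print("R2X(cum):", pca.explained_variance_ratio_.cumsum())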
The score plot shows 3 groups of countries. One group with the Scandinavian countries (the North), the
second with countries from the South of Europe, and a third more diffuse with countries from Central
Europe. It seems a little odd that Austria is in the south Europe group but maybe the Tyrol region (close
to Italy) has a big impact.
To enhance the information in the plot we can use colors. Right click in the plot, select Properties and
then the Colors tab. Choose to color according to the secondary ID Geographic location.
We could have used coding according to latitude (also a secondary ID) and obtained about the same
coloring.
Loadings
The loadings are the weights with which the X-variables are combined to form the X-scores, t (see above).
This plot shows which variables describe the similarity and dissimilarity between countries.
Here we can see the influence of each variable on the 1st component. To inspect the second component
use the up arrow on the keyboard. The uncertainty of the loading estimates is shown as confidence
intervals (computed by jack-knifing in the cross-validation procedure).
Third Component
The cross-validation procedure gives three components in the model. In the score and loading plots
(default component 1 vs. component 2), use the keyboard arrows to shift: up and down for the Y-axis in
the plot and left and right for the X-axis in the plot.
Plot the scores (t1 vs. t3) and loadings (p1 vs. p3). The third component explains 13.8% of the variation in
the data, and mainly shows high consumption of Tea, Jam and canned soups in England and Ireland.
Consumption of ground coffee goes down while consumption of tea and jam goes up.
Summary
In conclusion, a three-component model of the data summarizes the variation in three major latent
variables, describing the main variation of food consumption in the investigated European countries.
This example shows simple PC modeling to get an overview of a data table. The user is encouraged to
continue to play around with the data set: take away observations and/or variables, refit new models, and
interpret the results.
Background
Complex liquid samples can be characterized, compared and classified with the help of a non-selective
analytical method, for instance one which takes advantage of the samples' ability to absorb visible light.
From the characterization of samples of known origin, predictive models can be built and tested with new
samples of unknown composition. In this tutorial a range of distilled spirits are investigated using VIS
spectroscopy. We are grateful to Johan Trygg and colleagues at Umeå University for granting us access to
this data set.
Objective
The objective of this example is to provide an illustration of multivariate characterization based on
spectral data. To this end, spectra measured on a set of alcoholic spirits, among them whisky and cognac,
will be used. The spirits can be compared and classified by investigating if there are clusters relating to,
for example, product type or country of origin.
A growing problem in the beverage and brewing industry is fraud and plagiarism; see for example
reference 1 below, in which sparkling wines (champagne and cava) were differentiated using a
multivariate model of their mineral content. Chemometric methods can greatly assist in identifying
incorrectly labeled or fake products.
1) Jos, A., Moreno, I., Gonzalez, A.G., Repetto, G., and Camean, A.M., Differentiation of sparkling
wines (cava and champagne) according to their mineral content, Talanta, 63, 377-382, 2004.
Data
For each sample (spirit), the visible spectrum (200-600 nm) was acquired using a Shimadzu
spectrometer. Signal amplitude readings were taken at 0.5 nm intervals, yielding 801 variables. There
were 46 unique samples plus a few replicates giving 50 observations in total. The secondary observation
ID designates country of origin and product type as follows: XXYY where XX indicates country and YY
product type. The suffix R indicates a replicated sample.
Country of Origin: USa, SCotland, IReland, CAnada, FRance, ITaly, JApan.
Product Type: BOurbon, BRandy, COgnac, WHisky, Single Malt (SM), BLended, RUm.
One mixed (MIXT) sample is also present in the data set.
Outline
The analysis of these data will be divided into three parts. Each part is created as a separate project in
SIMCA.
Import data
All new projects in SIMCA start by importing the data.
Start a new project in SIMCA by selecting File: New or click on the New speed button .
The following window opens:
Data can be imported as a file, from an ODBC database (using MS Query) or pasted into an empty
spreadsheet. Supported file formats will be shown in the file list (see the User Guide for a more detailed
explanation about different file formats).
In this example we choose the Excel 2007 file called Spirits.xlsx (XML format).
Next a new window opens where you can select between a normal and a batch type of project. In this
case choose the first alternative:
Variables can also be defined as X/Y, qualitative (X/Y) and date/time (X/Y). In this example we only
have X variables, which is the default setting.
In this particular case SIMCA has made a suggestion that is acceptable directly; we don't have to change
anything.
The next step is to go on in the wizard, so press Next.
This primary dataset will always be available. Later when you create different models with different
selections of observations and variables etc., you always use a copy of the original primary dataset.
Modeling
At this stage it is of interest to quickly see what type of information exists in the imported data.
SIMCA has already prepared a PCA model. At the import we did not declare any Y-variables etc., so all
variables are considered as X-variables.
The project window in SIMCA shows the prepared model:
This model is prepared with all observations and variables present. The scaling of the variables is the
default (UV = unit variance).
99.9% of the variation in the data is explained, of which 97.2% is explained by the first two components
(normal for spectroscopic data). More detailed information about the model can be found by double-
clicking on the model row in the project window.
Next step is to show the information from the PCA-model. This can easily be done using a speed button.
This will show components one and two for the scores and loadings.
Score plot
A look at the score plot shows labels from the primary ID of the observations. A more informative
labeling is to use the secondary ID. This can be achieved by right-clicking in the picture and choosing
Properties and then the Labels tab:
The length and part of the label string can also be set (default start=1 length=10).
Select the secondary ID and click OK:
With the names of the observations a much better interpretation can be made. In the plot it can be seen
that the different types of spirits are clustered. To emphasize that there are groups of spirits it is possible
to use color on the labels. Right click and choose Properties: Color:
Choose to use characters 1 to 4 in the name (characters 1-2 show country and 3-4 type of spirit).
X/Y Overview
The R2-Q2 plot of the variables shows that most of the variation in the variables is used in the model. To
see the individual variables, enlarge the X-direction of the plot using the magnifying tool: mark a region
with the mouse (press the left mouse button) and then release the mouse button.
Loadings
The loading plot p1 vs. p2 (which is default) is not informative when you have this type of data. In the
next part (scaling) the loadings will be shown one at a time.
Summary
To get a quick overview of a data table, import the data into SIMCA, create a PCA model using Autofit,
present the 4 overview plots and interpret the information in the score plot, DModX plot, loading plot and
the summary plot.
Workset
From the primary dataset we can make changes to which variables and observations to use
(include/exclude, make classes, X/Y variables), transform variables, lag variables, expand variables.
The primary dataset will remain unchanged and each workset created will be a new model.
To create a new workset from scratch (full copy of the primary dataset) we use menu Workset: New.
Under the Overview tab there is a list of the present variables and observations. This list will be updated
when we make changes under the other tabs. A missing data tolerance level can be set (if that value is
exceeded, SIMCA will warn you). The model type can also be specified. In this example PCA-X is the
only alternative (all variables were defined as X at the import, and no classes were defined for the
observations).
Now we want to change the scaling of the variables and this is done under tab Scale.
Press Set
Press OK at the bottom.
Now a new model is prepared where the variables are Pareto scaled.
The project window shows the new unfitted model. Use Analysis: Autofit (or speed button) to calculate
components:
The next step is to create a model where we use only centering (Ctr) for the variables.
Right click on the model 2 line in the project window and select New as model 2.
The workset dialogue opens again. Go to the scale tab and set the scaling to Ctr (centering).
Autofit model 3
Now we have three models in the project, and to make it clearer what we have we will change the titles
of the models so that we remember what we have done. Right click on a model row and choose Change
Model Title. Set the title to UV for model 1, Par for model 2 and Ctr for model 3.
The model with Pareto scaling has 4 components and the others 5 components.
We will now investigate the effect of scaling. The four plots below show the raw data prior to and after
the three different scaling approaches.
The plots below can be created in the following way:
Raw data can be plotted by opening the primary dataset (menu Dataset: Open, choose Spirits, or use the
speed button). Right click somewhere in the data table and choose Create: Plot Xobs. The emphasized
line for observation 15 (JARU) is created by right-clicking in the plot and selecting Plot settings: Plot
Area. Choose No. 15 and change to color black and width 5.
The three PCA models using UV scaling (model M1), Pareto scaling (M2) and centering but no scaling
(M3) are summarized below and are very similar in terms of variance explained. Cross-validation
suggests 5, 4 and 5 significant components, respectively, but for comparison purposes we forced a fifth
component into the Pareto model.
When examining the explained variances in more detail, it is apparent that only two components are
really necessary for obtaining a good overview of the data. Hence, in the following, we consider only
these two components.
The scores, loadings and DModX plots of the three models are given below (top triplet: UV scaling;
middle triplet: Pareto scaling; bottom triplet: centering without scaling).
[Figures: one triplet of plots per model (M1 PCA UV, M2 PCA Par, M3 PCA Ctr). Each triplet shows
the score plot t[1] vs. t[2] (with the Hotelling T2 (0.95) ellipse), the loading line plot p[1]/p[2] against
wavelength (200-600 nm, Var ID Primary), and DModX[2] (Norm) per observation with the
D-Crit(0.05) line.]
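To see numerically what the three scaling alternatives do, here is a minimal sketch; the stand-in data matrix is an assumption (rows are observations, columns are variables):

import numpy as np

def scale(X, mode="uv"):
    # Center each column; then 'uv' divides by the column standard
    # deviation, 'par' (Pareto) by its square root, and 'ctr' only centers.
    Xc = X - X.mean(axis=0)
    if mode == "ctr":
        return Xc
    s = X.std(axis=0, ddof=1)
    return Xc / s if mode == "uv" else Xc / np.sqrt(s)

X = np.random.default_rng(0).normal(size=(50, 801))   # stand-in for the spirits spectra
X_uv, X_par, X_ctr = scale(X, "uv"), scale(X, "par"), scale(X, "ctr")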
Import data
Start a new project in SIMCA and choose the same dataset (Spirits.xlsx). The difference now is that
classes will be declared directly at the import. When the following window appears at the import, change
the setting for the second column (the former secondary ID) to Class ID.
Class identification can be made on any column in the data table. Here we want to use the first 4
characters of the name to define classes.
This column will then be called ClassID and the original secondary ID will be copied to a new column
and marked excluded. Change that so it becomes an active secondary ID again (useful for plots).
With the help of the first 4 characters in the name, SIMCA has identified the classes and shows how many
observations there are in each. Here it is possible to change names and order, and even merge classes.
For some of the classes identified from the name there is only one (1) observation. For these classes it is
impossible to create models. In this situation there are two possibilities: one is to mark the classes with
only one observation as deleted, the other is to keep them. In the first alternative the observations in
classes with one observation will not be imported into the project; in the second they will be. In this
example we keep them as they are (no delete).
Press Next in the wizard.
Give a new name to this SIMCA project (e.g. SpiritClassification) and press Finish in the wizard.
In SIMCA the following project window will open:
Overview model
In a classification situation you often start by creating a model containing all classes to get an overview.
This can be done in the same way as in part 1 (overview) by creating a separate project, but it is more
practical to do it directly where the classes are defined.
Mark CM1, right click and choose New as model group CM1. In the workset dialogue change model
type to PCA-X.
Press OK.
In the project window the new PCA-X model (12) is prepared with scaling = Pareto.
t1 vs. t2 is chosen.
Click on the Labels tab and change label types to secondary id for point labels.
Press OK.
The score plot is the same as in Part 1. The colors in a score plot come automatically when classes have
been defined. In Part 1 the colors were defined from the name (secondary ID).
4 components are calculated and it is possible to shift axis in the score plot using the arrows on the
keyboard. Keys up/down will shift the Y-axis and keys left/right will shift the X-axis.
In the interpretation remember that component one describes almost all variation.
In this window it is possible to exclude models and set the number of components to use. Here we use
Autofit.
Press OK and SIMCA will autofit all marked classes, showing the summary plot for each.
The result is shown in the project window.
All classes with one observation are unfitted (as intended). One class (12, USBL) has only two
observations, leading to a zero-component model.
Now there are several possibilities. Each class model can be examined in the usual way (score and
loading plots, DModX etc.) but the interesting interpretation is to see how the different observations fit to
different classes and how classes fit to each other. This is done using the prediction menu. The first step is
to select a model to use (to begin with; later it is possible to shift models). Mark the model in the project
window (e.g. model 2). Go to menu Prediction and Specify Predictionset. Here there are many
possibilities depending on which data are available. The first choice is to see how all observations fit
model 2 (SCBL); therefore Dataset is chosen.
The names on the X-axis can be activated using the property Axis label (90° rotation).
We can see that USBO, FRCO, ITBR and JARU have high DModX values, much higher than the Dcrit
line (red) for the model, indicating differences compared to SCBL, which is the actual model. In the
model based on SCBL it can thus be seen that USBO, FRCO and JARU do not fit the model.
Keep these two plots open and change to another model using the property toolbar for the plot.
Select model 4 and the active plot will update. To update another plot make it active and change model
again using the same procedure.
If you want to update both plots simultaneously you have to go through the dockable window called
Favorites. Open the Favorites dockable window by pointing at the vertical tab to the left; the window
will open. Then follow the steps below.
Click on the pin to lock this dockable window.
Observe that the above described procedure does not work for the unfitted models (3, 5, 8, 9, 10).
Under the prediction menu there are more items that can be used.
Classification list
This will show the probability for an observation to belong to the different classes.
Right click in the window and select Properties. Change labels to secondary ID and the number format to
decimal with two decimal places.
Misclassification Table
A misclassification table shows the overall classification.
Cooman plot
A Cooman plot can be used to compare 2 classes at a time. The plot shows DModX for 2 models. Below
is an example where model 2 (SCBL) and model 7 (FRCO) are compared.
The list to the right shows the present prediction set (all observations). Start by removing all in this list.
Then right click on the column with the primary ID in the list to the left, choose Observation ID and
check Class ID. These will then appear in the list.
The next step is to fill in class ID names in the search box. Start with SCBL (not case sensitive). All
SCBL observations are found and marked.
Click on the arrow between the lists and all SCBL observations will appear in the right list. Do the
same for FRCO.
Now we have created a new prediction set and we can use the different menu items under the Prediction
menu in the same way as before.
With this new prediction set we will show the Cooman plot for these two models (SCBL and FRCO).
Background
The following example is taken from a mineral sorting plant at LKAB in Malmberget, Sweden. Research
engineer Kent Tano, at LKAB was responsible for this investigation.
In this process, raw iron ore (TON_IN) is divided into finer material (<100 mm, 50% Fe) by passing
through several grinders. After grinding, the material is sorted and concentrated in several steps by
magnetic separators.
The separation flow is divided in several parallel lines and there are also feedback systems to get as high
Fe concentration as possible. The concentrated material is divided into two products, one (PAR) which is
sent to a flotation process and another part (FAR, fines) which is sold as is. For both these products high
Fe content is important.
Twelve process factors were identified. Of these, three important factors were used to set up a statistical
design (RSM). The results of each experiment were measured in 6 response variables. Several
observations were collected for each design point.
The process is equipped with an ABB Master system with a SuperView 900 connected to the process data
system. Data were transferred from the ABB system to a personal computer with the SIMCA-P software
for modeling. Models were transferred back to the SuperView system for on-line monitoring (predictions,
score and loading plots) of the process. The investigation was made in 1992. The multivariate on-line
control of the process is still in use, with very good results concerning the quality of the products.
Objective
The objective of this study is to investigate the relationship between the process variables and the 6 output
variables describing the quality of the final product.
Analysis outline
An Overview of the Responses
A PC model of the responses is made to understand:
How the responses relate to each other and to the observations.
The similarity and dissimilarity between the observations, and if there are outliers.
The explanatory power of the variables.
Relating the process conditions to the responses
Understand and interpret the relationship between the process variables and the responses.
Predict the output of new process conditions.
The steps to follow in SIMCA-P
Define the project: Import the primary data set.
Prepare the data (Workset menu).
Specify which variables are process variables (X) and which are responses (Y).
Expand the X matrix with the squares and cross terms of the 3 designed variables.
Fit the models, first PC-Y and then PLS, and review the fit (Analysis menu).
Refine models if necessary by removing outliers (Workset menu).
Use the PLS model for predictions (Prediction menu).
Variables
Data from 18 process variables (X) were collected.
[Table of the process variables, giving an explanation and abbreviation (Abbr.) for each; the original
table is not reproduced here.]
Observations
A subset of 231 observations (with full representation of the Y-variables) from the total of 572
observations was used for modeling. Each observation has a name referring to the date and time when
the data were collected.
SIMCA has interpreted the data as seen above. However, we want to make some changes. The 3rd column
should be set as a Secondary ID (use the arrow for the column and change it).
The 2nd column is marked as a secondary ID but could be changed to a time variable (X, new in ver. 12.0).
But we also want to use this column as a secondary ID, so we have to make a copy of the column first.
An empty column will appear in the spreadsheet. Copy the content of the column with Date/Time
information to the empty column and mark it as a secondary ID (now there are two identical columns
marked as secondary ID).
Use the arrow for one of these columns and change the setting to Date/Time variable. To do that you have
to specify the format so SIMCA can parse the cells and convert them to numbers. In this case the
resolution is minutes. The expression to use is yyMMdd HH:mm.
Click on any of the column arrows (for these variables) and choose Y-variable.
The color is changed showing that these are Y-variables. We can also see that for the last 3 Y-variables
there are a lot of missing data which can be seen in the next view.
The temporary spreadsheet looks like this.
Give a name and location for the project and press Finish. SIMCA prepares the primary dataset. SIMCA
detects that there are variables with more than 50% missing data and therefore gives a warning. We know
that the last 3 Y-variables have this structure, so we select No to All.
Analysis
Workset
SIMCA-P's default workset consists of all the observations in the primary data set with all variables,
scaled to unit variance and defined as X's or Y's as specified at import.
Now we want to make some changes to the default prepared model. Right click on model M1 in the
project window and select Edit Model 1.
The X- and Y-variables are correct. The Date/Time variable is by default marked excluded, which is OK.
In this application we will not use this variable in the models, only for line plots where time is on the
X-axis.
At the bottom of the form there is a list box where you can choose model type. The default in this case is
PLS (we have defined Y-variables at the import). However, before we run the PLS it is wise to extract
information from the X and Y data separately. This can be achieved using model type PCA-X and/or
PCA-Y. Use the Model type list box and select PCA-X. Click on OK to exit the workset menu.
PCA-X
When choosing PCA-X, the model type shifts in the project window.
Select Autofit from the analysis menu or use speed button in the toolbar.
The cross validation procedure finds 2 components which we will plot in a score plot. Use menu
Analysis: Scores: Line plot and change the axis of the plot according to this:
Press OK.
Here we see the same thing: t1 is low in the beginning and for a while thereafter. To find out the reason
for this we can use the contribution plot to see how the original X-variables have changed in these time
periods. Mark the left part of the score plot t1 vs. t2 with the mouse and press the contribution tool.
PC of Y
We also want to get an overview of the responses to see how the Y-variables are correlated to each other.
To do that, we have to shift to a PCA-Y model. This can be done by right-clicking on model 1 in the
project window and selecting New as model 1. A new workset dialogue will appear.
There are a lot of observations which don't have values for the Y-variables: samples for analysis in the
lab have been taken out only at certain time points, which leads to missing values in the Y-part of the data
table. We only want to use observations where there are Y-values. Click on the Observations tab, right
click on the first column and choose to add the secondary ID Select.
Observations marked O don't have any Y-values (missing) and observations marked S have. Start by
clicking on one observation, use Ctrl-A to mark all, and press the Exclude button to the right. Using the
Find and select function we can now seek the observations marked with S. First we must choose the
Select column, which is done by clicking on the arrow to the right of the search box and choosing Find in
Select column. Write S in the search box and all observations with S in the Select column will be
highlighted. Press the Include button to the right.
Change Model type to PCA-Y at the bottom of the workset and press OK.
Click on the model summary line to open a table with the summary of the fit of the model. This table
displays R2X (fraction of the variation of the data explained by each component) and cumulative
R2X(cum), as well as the eigenvalues and the Q2 and Q2(cum) (cross-validated R2). The six Y's are
correlated, and are summarized by two new variables, the scores t1 and t2, explaining 71.7% of their
variation.
The loadings are the weights with which the variables are combined to form the scores, t. The loadings, p,
for a selected PC component, represent the importance of the variables in that component and show the
correlation structure between the variables, here the responses Y.
In this plot we see that PAR, FAR and %P_FAR are positively correlated with each other and negatively
correlated with %Fe_FAR. r_Far dominates the second component, is there negatively correlated to PAR,
and has only a small correlation to the other variables in component 2. %Fe-Malm is not correlated to any
of these variables in the first two components.
PLS MODELING
The main objective is to develop a predictive model, relating the process variables X's to the output
measurements (responses) Y. The experimental design in three of the process variables accounts for an
important part of the variation of the Y's.
Autofit
To prepare the PLS model, right click on model 2 (PCA-Y) in the project window and select New as
model 2. In the workset dialogue change Model type to PLS. The reason we do this is that in model 2 we
have everything we want defined (X and Y, expanded X, and the selection of observations which have
values for Y).
Click on Analysis: Autofit, or the speed button , to fit a PLS model, with cross validation.
The Model Overview Plot displays R2Y(cum), the fraction of the variation of Y (all the responses)
explained by the model after each component, and Q2(cum), the fraction of the variation of Y that can
be predicted by the model according to the cross-validation. Values of R2Y(cum) and Q2Y(cum) close
to 1.0 indicate an excellent model.
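As a rough illustration of what R2Y(cum) and Q2(cum) correspond to outside SIMCA, the sketch below computes both for a PLS model with scikit-learn. The stand-in data, the plain 7-fold cross-validation and the preprocessing are assumptions; SIMCA's own NIPALS algorithm and cross-validation scheme differ in detail.

import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(231, 18))   # stand-in for the process variables
Y = rng.normal(size=(231, 6))    # stand-in for the six responses

def r2y_q2(X, Y, a, cv=7):
    # R2Y(cum): explained Y-variation after a components;
    # Q2(cum): its cross-validated counterpart.
    pls = PLSRegression(n_components=a).fit(X, Y)
    ss = ((Y - Y.mean(axis=0)) ** 2).sum()
    r2y = 1 - ((Y - pls.predict(X)) ** 2).sum() / ss
    q2 = 1 - ((Y - cross_val_predict(pls, X, Y, cv=cv)) ** 2).sum() / ss
    return r2y, q2

print(r2y_q2(X, Y, a=6))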
Use the tool for the 4 overview plots, and 4 plots will be created: score and loading plots, distance to
model in X (DModX) and the X/Y summary. Below is an explanation of these plots.
Summary: X/Y Overview
This plot displays the cumulative R2Y and Q2Y for every response. With the exception of %Fe-FAR and
%P-FAR, all responses have an excellent R2 and Q2.
Scores t1 vs. t2
Click on Scores: Scatter plot and use the default t1 vs. t2. The plot shows the reference line (slope 1, 45°).
Loadings
The loading plot w*c1 vs. w*c2 shows the correlation between the X- and Y-variables.
In this plot we see the two first components and can make interpretations, such as that PAR and FAR are
positively correlated with several X-variables lying to the right in the plot. We can also see that, of the
expanded variables, the squared term for HS_2 (speed of magnetic separator 2) has a significant
contribution. All others are close to the origin, meaning that their influence is low.
As in the case of the scatter plot we have 6 components to consider. Change the Y-axis of the plot to
component 3, and the Y-variable %Fe_malm will be strong in this component.
Remember that the loading plot is mostly used to get an overview of the correlations between variables.
If we want to see the quantitative relationships, we use the coefficients. In this case, with 6 Y-variables,
there will be 6 coefficient plots to look at (below).
DModX
Distance to the model is a measure of the residuals for each observation (standardized by default). In this
case there are no observations far outside the critical limit (Dcrit, 95% by default), indicating no outliers.
Even where no critical level is drawn, no observation has a high value compared to the others, again
indicating no outliers.
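For reference, below is a simplified sketch of how a normalized DModX can be computed, illustrated with a PCA model (for a PLS model, the X-residuals after the PLS components are used instead). The degrees-of-freedom corrections are one common choice; SIMCA additionally applies a leverage correction not reproduced here.

import numpy as np
from sklearn.decomposition import PCA

def dmodx(X, a):
    # Per-observation RMS X-residual divided by the pooled residual SD.
    pca = PCA(n_components=a).fit(X)
    E = X - pca.inverse_transform(pca.transform(X))            # X-residuals
    N, K = X.shape
    s_i = np.sqrt((E ** 2).sum(axis=1) / (K - a))              # per observation
    s0 = np.sqrt((E ** 2).sum() / ((N - a - 1) * (K - a)))     # pooled
    return s_i / s0

X = np.random.default_rng(0).normal(size=(50, 20))             # stand-in data
print(dmodx(X, a=2).round(2))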
Coefficients
Regression coefficients show the quantitative relationship between the X-variables and the responses. Use
menu Analysis: Coefficients: Plot to show a plot.
Use the list box for the Y-variables and shift between them. It is also possible to use the up/down arrows
on the keyboard (select one response first).
Variable Importance
A drawback with regression coefficients is that there is one set for each Y-variable. However, there is
another measure which summarizes the influence on all Y-variables simultaneously.
Use menu Analysis: Variable Importance: Plot. This plot shows the importance of the terms in the model,
as they correlate with Y (all the responses) and approximate X.
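This measure is known as VIP (Variable Importance in the Projection). Below is a sketch of the standard VIP formula, computed from a fitted scikit-learn PLS model; the stand-in data are assumptions, and this is an illustration rather than SIMCA's exact computation.

import numpy as np
from sklearn.cross_decomposition import PLSRegression

def vip(pls, X):
    # VIP_k = sqrt(K * sum_a(w_norm_ka^2 * SSY_a) / sum_a(SSY_a))
    T = pls.transform(X)          # X-scores, shape (N, A)
    W = pls.x_weights_            # X-weights, shape (K, A)
    Q = pls.y_loadings_           # Y-loadings, shape (M, A)
    K = W.shape[0]
    ssy = (T ** 2).sum(axis=0) * (Q ** 2).sum(axis=0)   # Y-SS explained per component
    Wn = W / np.linalg.norm(W, axis=0)                  # normalized weights
    return np.sqrt(K * ((Wn ** 2) @ ssy) / ssy.sum())   # one VIP value per X-variable

X = np.random.default_rng(0).normal(size=(60, 18))      # stand-in data
Y = np.random.default_rng(1).normal(size=(60, 6))
print(vip(PLSRegression(n_components=3).fit(X, Y), X).round(2))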
Prediction
Create a prediction set
To test the prediction properties of a PLS model you often have test data sets (prediction sets). There are
several ways to handle this. One is to use external data which you import as secondary datasets to
SIMCA. Another way is to split the primary data so that you create models on one part and use the other
part for prediction. In this case we use the second approach.
Use the pin to lock this window (afterwards you can un-pin it so it will disappear to the left).
Doing this you can have plots with observations (score plots, DModX etc.) open and see what happens.
In the dockable Observation window, hold the Ctrl key and mark observations 140, 176, 177, 199, 200,
256, 257, 326, 327, 351, 352, 371, 372, 401, 402, 427, 428, 451, 452, 491, 492, 501, 502, 526, 527, 546,
547, 566, 567. If you have a plot open you can see that we have marked observations evenly distributed
over all observations.
The deleted observations can be seen in the active plots (i.e. score plot t1 vs. u1)
Use the exclude tool to remove these observations. Un-pin the dockable window (it disappears to the left
of the window). In the project window a new model is created where the marked observations have
disappeared.
Use Autofit to fit the PLS model.
The model becomes less complicated (4 components), but if we look at the X/Y Overview plot it is very
like the one where all observations are present.
We can look at this model in the usual way showing score plots, loading plots etc.
We now have an excellent relationship between t1 and u1 with no outliers.
In the first two components, PAR, and FAR are positively correlated with all the load variables and
negatively correlated with r_PAR, %Fe-FaR and %Fe_Malm. The model is almost linear except for HS_2
and its squared term dominating the second component.
Remove all observations from the right list. In the left list, Workset Complement is selected (all
observations not in model 4). In this case we don't want observations with no Y-values. Right click in the
left list and select to show the observation label Select. Use the Find function (first click on the arrow and
choose to find in the Select column) and search for S (observations with Y-values).
29 observations with Y-values (S) are selected as the prediction set. Press OK.
Use menu Predictions: Y Predicted: Scatter plot. The observed vs. predicted plot, for PAR, is displayed.
Test data and prediction data can be seen in the same plot.
To the right in the list the actual and predicted Y-variables can be seen.
Summary
This example shows that statistical design in the dominating process variables gives data with high
quality that can be used to develop good predictive process models. With multivariate analysis we extract
and display the information in the data.
Background
This example deals with a polymer production plant. It is a continuous process which went out of control
at around time point 80 after a fairly successful campaign to decrease the side product (response variable
y6).
Objective
The manufacturing objective was to minimize the yield of side product (y6) and maximize product
strength (y8). The data analysis objective is to investigate the use of hierarchical modeling to:
overview the process
detect the process upset
understand the relationship between the two most important Y-variables (y6 = impurity and y8 = yield)
and the three steps of the process: feed (x1-x7), reactor (x8-x15), and purification and work-up (x16-x25).
We shall do the following, using obs. 1-79 as a training set:
1. PLS model of X = feed (x1-x7) with y6 and y8 (Block 1)
2. PLS model of X = reactor (x8-x15) with y6 and y8 (Block 2)
3. PLS model of X = purification (x16-x25) with y6 and y8 (Block 3)
4. PCA model of the less important Y's (y1 to y7, not including y6) (Block 4)
5. Top level hierarchical model with the scores of blocks 1-3 as X and the scores of block 4 plus y6 and
y8 as Y.
The objective of Block Models 1 to 4 is to summarize the various steps of the process by scores, to then
be used as X-variables in the top level model.
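A minimal sketch of this two-level idea with scikit-learn is given below. The array names, the stand-in data and the component counts are assumptions taken from the outline above; in SIMCA the PLS-Hierarchical model type handles this bookkeeping automatically.

import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
N = 79   # training set, observations 1-79
x_feed, x_reactor, x_purif = (rng.normal(size=(N, k)) for k in (7, 8, 10))
y_minor = rng.normal(size=(N, 6))   # stand-in for y1-y5 and y7
y_key = rng.normal(size=(N, 2))     # stand-in for y6 and y8

def block_scores(X, Y, a):
    # Base-level PLS block model; its X-scores summarize the block.
    return PLSRegression(n_components=a).fit(X, Y).transform(X)

T = [block_scores(B, y_key, 4) for B in (x_feed, x_reactor, x_purif)]   # blocks 1-3
T4 = PCA(n_components=3).fit_transform(y_minor)                         # block 4

# Top level: base-model scores as X; block-4 scores plus y6 & y8 as Y
top = PLSRegression(n_components=4).fit(np.hstack(T), np.hstack([T4, y_key]))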
Data
The data set contains 33 variables and 92 hourly observations. The measured variables comprise seven
controlled process variables (x1-x7), 18 intermediate process variables (x8-x25), and eight output
variables (y1-y8). The controlled process variables (x1-x7) relate to the feed of the process, variables
x8-x15 reflect reaction conditions and variables x16-x25 correspond to a purification step. As mentioned
above, y6 and y8 are the key outputs. All the data are coded so as not to reveal any proprietary
information.
Outline
The steps to follow in SIMCA-P are:
Create the project by importing the data set.
Generate and fit the three PLS models for X-blocks 1-3 (obs. 1-79), and mark them as base hierarchical.
Generate and fit the PC model for block 4, and mark it as base hierarchical.
Generate and fit the top level hierarchical model.
Interpret the hierarchical model.
[Figure: the five blocks of variables.
Block X1: Input and feed (x1 - x7), 7 variables.
Block X2: Reaction conditions (x8 - x15), 8 variables.
Block X3: Purification step (x16 - x25), 10 variables.
Block Y1: Less important Y's (y1-y5 & y7), 6 variables.
Block Y2: Important Y's (y6 & y8), 2 variables.]
Analysis
Create the project
Start SIMCA-P and create a new project from File: New. The data set name is HI-PROC.XLS.
SIMCA has interpreted the spreadsheet correctly. Mark the variables called y1-y8 as Y-variables.
Click on Next, give the project a name and choose a destination directory. Select Finish and the project
is created. In the project window the first unfitted model is prepared as a PLS model (we have defined
Y-variables).
Overview of data
The first step is to use PCA to get an overview of the data table. To do that, right click on the model and
choose Edit Model 1. Change Model type to PCA X&Y and press OK. The model type shifts to PCA
X&Y in the project window.
Another way to do it is to mark model 1 in the project window and use menu Analysis: Change Model
Type.
The score plot of t2 versus t1 (below left) indicates a clear trend towards the end of the time period. The
corresponding loading plot (below right) demonstrates how the different variables contribute to the first
two principal components. The location of the two important responses (y6 and y8) is highlighted by an
increased font size. y6 is modeled by the first component and y8 mostly by the second component.
The main conclusion is that hierarchical modeling should be applied to observations 1-79, as these
represent normal process behavior. We want to model the process when it behaves within the allowed
variation.
From this overview, and from knowledge of the process, it would be interesting to model the process in
logical blocks. Below is a figure showing the system in blocks, illustrating the approach: calculate
hierarchical base models of the blocks and finally combine the base models into a top (super) model
summarizing the system.
[Figure: hierarchical model structure.
Base level, as above: Block X1, Input and feed (x1 - x7, 7 variables); Block X2, Reaction conditions
(x8 - x15, 8 variables); Block X3, Purification step (x16 - x25, 10 variables); Block Y1, Less important
Y's (y1-y5 & y7, 6 variables); Block Y2, Important Y's (y6 & y8, 2 variables). The base models
contribute 4, 4, 4 and 3 score vectors respectively.
Top level PLS model: X = scores from models 1, 2 and 3; Y = scores from model 4 plus y6 & y8.]
The next step is to define variable classes. Click on the variable tab. Mark the variables for block 1 and
check block 1 (to the right).
Do the same for blocks 2 and 3, choosing the correct X- and Y-variables.
The last block is the rest of the Y-variables: mark them and change the variable type to X and block 4.
Change the model type (at the bottom of the workset dialogue) to PLS-Hierarchical.
Click on OK to exit the workset. The project window is updated showing a class model hierarchy CM1.
Press OK and the models are re-fitted. In this way we have extracted 60-90% of the variation in the
X-variables. In the project window we can also see that the models have been marked with a B
(hierarchical Base model).
To make it clearer, give the models names: right click on each model and choose Change Model Title.
The next step is to go through the 4 models and see if there are any outliers.
In the score plot of t1 versus t2 (below left), the observations now display a much better distribution. The
corresponding loading plot (below right) indicates that y6 is modeled by the first and y8 mainly by the
second component.
Score and loading plots of the first two components are displayed below. The observations are now much
more homogeneously distributed. Basically, the first component accounts for variables y2 and y4 and, to
a lesser extent, y1 and y7. The second component is largely related to y1 and y5.
The top level PLS model yields four significant components explaining 51.7% of the Y-variance.
The two most important responses, Y6 and Y8, are modeled well (below, top right). The score plot of t1
versus t2 (below, bottom left) indicates that the process starts down to the right (high values of y6),
moves up to the left (lower Y6), and is then manipulated to give lower values of y6 (lower left quadrant).
The process then becomes unstable and lurches back to the right.
In the corresponding loading plot (below, bottom right), y6, the side product, is on the right-hand side of
the plot, positively correlated with the first component of the feed model ($M6.t1) and the second
components of both the reactor ($M7.t2) and purification models ($M8.t2). Y6 is also negatively
correlated with the first component of the model of the less important Y-variables ($M10.t1).
Meanwhile, Y8 is related to the first components of both the reactor ($M7.t1) and the purification models
($M8.t1). Y8 is also related to the second component of the model of the less important Y-variables
($M10.t2).
With the contribution tool we can investigate the patterns suggested by the loading plot above in more
detail. Simply double click on any score variable (e.g. $M6.t1) and the corresponding base-level loading
plot will open, revealing which of the original variables influence that particular score.
This gives us zoom-in/zoom-out functionality. In the top-level loading plot, we get a feel for the
relationships between the two important y-variables and the various stages of the process (feed, reaction,
purification). In the base-level loading plot, we understand which of the original variables influence these
outputs.
Use the top-level loading plot to investigate the underlying relationships for other score variables. Just
double click on the appropriate points.
Prediction
It is known that the process became unstable after time point 80, and it would be interesting to fit all data
to the top model. Mark the top model in the project window. Use menu Predictions: Specify and select
the original data table.
Under menu Predictions score plots and DModX plots can be used to see how the original data fits to the
hierarchical top model.
The DModXPS plot (below, top right), in particular, illustrates this very clearly. The contribution plot
(below, bottom left) for time point 91 (large DModXPS) shows the feed (model M6, components 1, 2 and
3) as being the main cause of the problem. Zooming in (double click on a column in the contribution plot)
on the feed in all 3 components (plot only shown for t2, below bottom right) points to variable 6, which is
much too low.
Background
The following example originates from a research project on peat in Sweden. Peat is formed by an
aerobic microbiological decomposition of plants followed by a slow anaerobic chemical degradation. Peat
in Sweden (northern hemisphere in general) is mainly formed from two types of plants, Sphagnum
mosses and grass of Carex type. Within the main groups there is variation among the species. Depending
on location, climate etc. there are several other plants involved in the peat forming process.
In the project many different types of chemical analyses were performed to get detailed information about
the material and to investigate differences among different peat types. Chemical analysis was performed
according to traditional methods (GC, HPLC, etc.) which often were laborious and time consuming. To
speed up the analysis of samples, Near Infrared Spectroscopy (NIR) together with multivariate calibration
was introduced. This strategy was found to work very well and after the calibration phase, samples were
analyzed in minutes instead of weeks.
In this tutorial we selected a subset of samples, which represents the typical variation of peat in Sweden.
Objective
The objective of this study is to model and predict different constituents of samples of peat directly from
their NIR spectra. 41 samples of peat, mainly of two types Sphagnum and Carex, were subjected to NIR
spectroscopy. The spectra were recorded at 19 wavelengths (19 filters) with a reflectance instrument
(log(abs)) and scatter corrected before the analysis.
For this objective, we will now develop a PLS model relating the X variables (NIR spectra) to the Y
variables (peat constituent concentrations measured by traditional analysis).
Data
Variables
Variables 1-19 represent spectra from the NIR instrument, which in this case was a 19 channel filter
instrument. Spectra are recorded as Log (Absorbance) and then scatter corrected by a MSC procedure.
Variables 20-46 represent different chemical analyses, which the NIR spectra can be calibrated against.
Variable 27 (Klason l) is Klason Lignin (the residue after hydrolysis) and variable 28 is Bitumen, which
represents carbohydrates soluble in acetone.
Observations
From a huge number of peat samples, 41 were selected, representing the main variation of peat in Sweden.
The sample (observation) names are coded in 20 characters, and each position in the name carries certain
information. In the plots, a sub-string of two characters (positions 6 and 7) is often used. Position 6
represents the degree of decomposition: L (low), M (medium) and H (high). Position 7 represents peat
type: S (Sphagnum) and C (Carex).
Outline
Making a PLS model relating the NIR spectra variables to the peat
constituents in order to:
Understand and interpret the relationship between the spectra (X) and peat
composition (Y variables).
The first two columns are correctly marked as observation numbers and names, and the first row contains
the variable names.
Mark the variables starting with Ramos to end, and from the combo box select Y Variable.
Use the Find function to select Sphagnum and Carex peat and set class 1 and 2 for them.
Click on the arrow to the right of the find selection box and choose to find in the Obs. Sec. ID:1 column.
We want to use character 7 in the ID, so we use the search string ??????C* to search for Carex
observations and set them as class 1. Do the same for Sphagnum, but change the search string to
??????S* and set the class to 2.
Click on OK to exit the Workset window. The project window will show a CM-group with the 2 models
prepared. Change the Title of the models to Carex and Sphagnum.
Analysis
PLS model of all the samples
Right click on the CM1 mark in the project window and select New as Model Group 1. Change Model
Type to PLS and click OK. In this way we get a model with all observations.
The model overview plot is updated as the model is fitted. This plot displays R2Y and Q2Y, cumulative
by component. R2Y is the fraction of the variation of Y (all the responses) explained by the model after
each component, and Q2Y is the fraction of the variation of Y that can be predicted by the model
according to the cross-validation. Values of R2Y(cum) and Q2Y(cum) close to 1.0 indicate an excellent
model.
Multivariate calibration with NIR spectra often leads to many components due to the high precision of the
data. The present model is indeed excellent and explains 88.2% of the variation of Y, with a predictive
ability (Q2) of 73.9%.
Summary: X/Y Overview
Click on Analysis | Summary | X/Y Overview | Plot to display the cumulative R2Y and Q2Y for every
response. Use the Properties page to select variable labels, and click on Save As Default Options to
always have variable names. With the exception of Bitumen, all responses have an excellent R2 and Q2.
Scores t1 vs. t2
Click on Analysis : Scores : t1 vs. t2 plot. Observation 32 lies far away in the second component,
indicating that sample 32 is different with respect to NIR spectra.
This contribution plot displays the differences, in scaled units, for all the terms in the model, between the
outlying observation 32 and the normal (average) observation, weighted by w*1 and w*2 (the importance
of the X-variables in components 1 and 2).
In the plot we see some spectral variables close to 8 standard deviations, indicating some contamination
in this sample. We shall remove sample 32.
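One simplified way to reproduce such a weighted contribution outside SIMCA is sketched below; the stand-in data and the exact weighting are assumptions (SIMCA's contribution plot has more options than this).

import numpy as np

def contribution(X, i, wstar, comps=(0, 1)):
    # Difference between observation i and the average observation, in
    # scaled units, weighted by the combined w* importance of each
    # X-variable over the chosen components.
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    w = np.sqrt((wstar[:, list(comps)] ** 2).sum(axis=1))
    return Xs[i] * w

X = np.random.default_rng(0).normal(size=(41, 19))      # stand-in for the NIR data
wstar = np.random.default_rng(1).normal(size=(19, 2))   # stand-in w* matrix
print(contribution(X, i=31, wstar=wstar).round(1))      # i=31: observation 32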
Comparing the spectra of observation 32 and 39
Mark both observations, right click and select Plot Xobs to display the spectra of these 2 observations.
To display informative labels, select in properties Obs Sec ID, start in position 6 for length 2.
Predictions
We now have two good models describing the relation between NIR spectra and the chemical composition
of peat, and they can be used to classify peat samples as Sphagnum or Carex.
In this tutorial we do not have new peat samples. However, we will use the data set and classify every
sample with respect to the two models. We first want to remove sample 32.
The model for class 1 (Carex) has 4 components and the model for class 2 (Sphagnum) has only one
component. This is probably due to variation in X (NIR) that confuses the models. Therefore
OPLS/O2PLS class models are created to separate orthogonal variation in X that has nothing to do with
variation in Y.
In order to use the 2 models in prediction we can use more of the variation in the X-matrix (NIR).
Therefore, both models are expanded with more components (7 in each, as for the total model with all
samples). Mark a model in the project window and add components using Next Component or the icon.
The project window will look like below.
Coomans' Plot
Exclude sample 32 from the Prediction set (Predictions | Specify Prediction set | Remove observation 32
from prediction set) and display the Coomans plot.
This plot displays the Distance to the model of every observation with respect to model M2 and M3, and
shows a very good separation between the Sphagnum and the Carex peat samples.
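Conceptually, the Coomans' plot is just a scatter of two DModX vectors with the two critical distances as reference lines. A sketch with stand-in numbers (the DModX values and critical limits are placeholders, e.g. computed as in the DModX sketch earlier):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
d_m2 = rng.uniform(0.2, 3.0, 40)   # stand-in DModX to the Carex model (M2)
d_m3 = rng.uniform(0.2, 3.0, 40)   # stand-in DModX to the Sphagnum model (M3)
dcrit_m2, dcrit_m3 = 1.2, 1.2      # hypothetical critical distances, D-Crit(0.05)

plt.scatter(d_m2, d_m3)
plt.axvline(dcrit_m2, color="red")
plt.axhline(dcrit_m3, color="red")
plt.xlabel("DModX to Carex model (M2)")
plt.ylabel("DModX to Sphagnum model (M3)")
plt.show()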
Summary
As a tutorial, this provides just a brief introduction to the main functionality and plots in SIMCA-P. We
recommend that you continue with your own data, maybe another tutorial, and then look in the Manual
for details. The Help system contains the same information as the Manual, but organized in a different
way.
Introduction
This example illustrates multivariate calibration using PLS, spectral filtering and OPLS.
The data set of this example was collected at Akzo Nobel, Örnsköldsvik, in Sweden. The raw material for
their cellulose derivative process is delivered to the factory in the form of cellulose sheets. Before entering
the process, the cellulose sheets are checked by a viscosity measurement, which functions as a steering
parameter for that particular batch.
In this data set NIR spectra for 180 cellulose sheets were collected after the sheets had been sent through
a grinding process. Hence the NIR spectra were measured on the cellulose raw material in powder form.
Data are divided in two parts, one used for modeling and one part for testing.
Data
The data consists of:
X: 1201 wavelengths in the VIS-NIR region
Y: Viscosity of cellulose powder.
Objective
The objective of this study is to develop calibration models with original data, filtered data and OPLS.
Analysis Outline
1. Make an ordinary PLS model relating the NIR spectra variables to the
viscosity with the original data.
2. Review and validate the calibration model with the test samples.
3. As 1 but with SNV filtered data.
4. Review and validate the calibration model with the test samples.
5. As 1 but with a 1st derivative filter (see the sketch after this list).
6. Review and validate the calibration model with the test samples.
7. Make an OPLS model.
8. Review and validate the calibration model with the test samples.
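For step 5, one common way to implement a first-derivative spectral filter is Savitzky-Golay smoothing-differentiation; the window and polynomial settings below are assumptions, and SIMCA's built-in derivative filter has its own parameters.

import numpy as np
from scipy.signal import savgol_filter

X = np.random.default_rng(0).normal(size=(180, 1201))   # stand-in for the NIR spectra
# First derivative along each spectrum; set delta to the actual
# wavelength spacing to get per-nm units.
X_d1 = savgol_filter(X, window_length=15, polyorder=2, deriv=1, axis=1)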
The 1st column is marked as primary observation ID which is OK. The 2nd column is marked as a
secondary ID which is also OK. This column will later be used to divide the data in training and test sets.
The scale of the X-axis is ordinal; to change to a scale representing nm, use Properties (right click) for
the plot and change the axis label as follows. The label interval must be set high to see anything on the
axis.
The Workset window opens where variables, observations, transformation scaling etc. can be changed.
Change the scaling of the X variables to Ctr (centered only).
If you want to do an interpretation of the calculated components it is common to just center the data. The
default scaling in SIMCA is UV so it has to be changed.
Click on Scale, mark variables 1 to 1201, select in Type, Ctr and click on Set. The X variables are now
just centered, and not scaled.
The classID column can now be used to divide the data in two parts called classes. In this way one class
can be used as model and the other as test and vice versa.
Start by indicating that we want to use the Find function for the classID column: click on the small arrow
to the right of the find box and choose to find in the classID column.
In the Find box select 1 and then press the Set button to the right of the Set Class box.
Do the same thing but choose 2 and Set. In this way we have created 2 classes.
Analysis
When you exit the workset window, SIMCA has prepared a class model group (CM1) that contains two
unfitted models for the two defined classes with 90 observations in each.
The model overview plot updates as the models are fitted. Below is a model overview for model 2 (last
model).
R2Y(cum), the fraction of the variation of Y explained by the model after 5 components, is equal to 0.723,
and Q2(cum), the fraction of the variation of Y that can be predicted by the model according to the cross-
validation, is equal to 0.579. Values of R2Y(cum) and Q2(cum) close to 1.0 indicate an excellent model.
Components 4 and 5 each describe less than 1% of the variation.
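Q2 is obtained by cross-validation: parts of the data are kept out in turn, the model is refitted, and the
kept-out Y values are predicted. A rough sketch using scikit-learn's PLS (an assumption for illustration
only; SIMCA's algorithm and its default grouping of observations differ in detail):

import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import KFold

def q2_cv(X, y, n_components=5, n_splits=7):
    # Q2 = 1 - PRESS/SS, with PRESS accumulated over the CV rounds
    press = 0.0
    for train, test in KFold(n_splits=n_splits).split(X):
        model = PLSRegression(n_components=n_components).fit(X[train], y[train])
        press += np.sum((y[test] - model.predict(X[test]).ravel()) ** 2)
    return 1.0 - press / np.sum((y - y.mean()) ** 2)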
Scores t1 vs. u1
Click on Scores: t1 vs. u1 to display the score plot. The relationship between t1 and u1 is not very good, in
particular for the cluster of samples 153, 162, 168, etc. The line in the plot is drawn using the line tool.
It is obvious that the data are not homogeneous: most of the observations are in the middle, while
others are grouped outside the main cluster.
Several samples have a distance to the model larger than the critical distance, indicating data
inhomogeneity. The labels in the plot can be set by first using the cursor to draw around the points. Release
the mouse button and the points are marked. Then use the speed button in the toolbar to choose Primary
ID.
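The distance to the model (DModX) used here is essentially the residual standard deviation of an
observation after the model has been fitted. A simplified sketch (SIMCA's normalization and
degrees-of-freedom corrections differ in detail), assuming E is the X-residual matrix and s0 is the pooled
residual SD of the training set:

import numpy as np

def dmodx(E, s0):
    # Residual SD of each observation (row of E) relative to the pooled
    # residual SD of the training set; values above the critical distance
    # flag observations that the model does not describe well
    s_obs = np.sqrt((E ** 2).sum(axis=1) / E.shape[1])
    return s_obs / s0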
It is possible to zoom in to enlarge parts of the plot. Use the magnifier tool (X/Y).
The predictions are reasonable, with an RMSEP of 138 compared to the training set RMSEE of 147. The
R2 value for the regression line is 0.7308; this is often called Q2 external.
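Both statistics are simple functions of the observed and predicted Y values of the test set. A minimal
sketch:

import numpy as np

def rmsep(y_obs, y_pred):
    # Root mean square error of prediction over the test set
    return np.sqrt(np.mean((y_obs - y_pred) ** 2))

def q2_external(y_obs, y_pred):
    # 1 - PRESS/SS on the test set, the 'Q2 external' quoted above
    press = np.sum((y_obs - y_pred) ** 2)
    ss = np.sum((y_obs - y_obs.mean()) ** 2)
    return 1.0 - press / ss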
Sometimes it is of interest to have both training and test data in the same plot. In this case use menu
Predictions: Specify Predictionset: Specify and add the training set data (class 1 in this case) to the right
list. Class 1 is already in the test set (right list), so choose class 2 in the left list. Click on one row and
then Ctrl+A to mark all. Then use the arrow to send the left observations over to the right list.
Use properties for the plot and change colors according to predictions.
SNV filtering
The above modeling indicates systematic variation in the X block that is not related to the response Y.
We will apply an SNV filter to the X block (the NIR data) to remove unwanted variation in X and see if
the prediction error becomes smaller. SNV also normalizes the observations and is often used as a filter
for spectroscopic data.
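SNV works row-wise: each spectrum is centered and scaled by its own mean and standard deviation,
which removes baseline offsets and multiplicative scatter effects. A minimal sketch, assuming X holds one
spectrum per row:

import numpy as np

def snv(X):
    # Standard normal variate: autoscale each spectrum (row) with its
    # own mean and standard deviation
    mean = X.mean(axis=1, keepdims=True)
    std = X.std(axis=1, ddof=1, keepdims=True)
    return (X - mean) / std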
Click on Dataset: Spectral Filters and select the SNV filter.
Model 2 now contains 3 significant components. R2Y(cum), the fraction of the variation of Y explained by
the model after 3 components, is equal to 0.637, and Q2(cum), the fraction of the variation of Y that can be
predicted by the model according to the cross-validation, is equal to 0.585.
Scores t1 vs. u1
Click on Scores: t1 vs. u1 to display the score plot. Using SNV seems to further highlight that some
observations, e.g. 153, 162 and 168, do not fit very well.
Loadings
Loadings for each component show which parts of the spectrum hold information. Click on Loadings:
Line plot and select to plot w* for all components (3).
Some more samples have a larger distance to the model after SNV filtering.
Observed vs. Predicted
The predictions are poor, particularly for a cluster of samples, as in the t1 vs. u1 plot. They can be labeled
by marking them, clicking on the selected item fast button, and selecting labels as Primary ID.
The prediction error is higher, with an RMSEP of 148, compared to the ordinary PLS model (without
filtering). The R2 value for the regression line is 0.690; this is often called Q2 external.
Validating Model 1
The same procedure can now be done using the 1st model, with class 2 as test data.
It is very practical to define classes when working with calibration models. In this way the data are split
so that there will be two models and two test sets.
Derivatives are calculated only on the X-data (like SNV). Select all spectral variables and all
observations.
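A 1st derivative filter is typically implemented as a Savitzky-Golay smoothing derivative along the
wavelength axis. A sketch, assuming X holds one spectrum per row (the window length and polynomial
order are illustrative choices, not SIMCA's settings):

from scipy.signal import savgol_filter

def first_derivative(X, window=15, polyorder=2):
    # 1st derivative of each spectrum, from a polynomial fitted in a
    # sliding window; differentiation removes additive baseline offsets
    return savgol_filter(X, window_length=window, polyorder=polyorder,
                         deriv=1, axis=1)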
Analysis
Autofit the class models in the same way as above.
In this case SIMCA finds two significant components for each class.
Double click in the project window on the model 2 summary line to display the details by component.
Model 2 now contains 2 significant components. R2Y(cum), the fraction of the variation of Y explained by
the model after 2 components, is equal to 0.509, and Q2(cum), the fraction of the variation of Y that can be
predicted by the model according to the cross-validation, is equal to 0.485.
Scores t1 vs. u1
Click on Scores: t1 vs. u1 to display the score plot. The result with derivatives closely resembles the SNV
filtering.
With derivative-filtered data the components seem to pick out approximately the same information,
except for the region around 1500-2000 nm where they differ.
The prediction error is higher, with an RMSEP of 179.8, compared to the ordinary PLS model (without
filtering). The R2 value for the regression line is 0.5416; this is often called Q2 external.
In this case we also see 2 main clusters and 2 smaller ones, indicating that the data are not homogeneous
or that this type of model does not fit well.
OPLS
The two last filtering approaches (SNV and derivative) did not improve the prediction error for these data.
Both filtering methods try to correct the X-data to remove unwanted variation (imperfections in the
instrument leading to baseline shifts, scattered light, etc.). These filters do not take Y into account in the
correction.
In SIMCA version 12, OPLS is implemented as a new type of PLS that seeks out variation in X that is not
correlated to variation in Y and removes that part from the X-data. This will hopefully lead to better
calibration models, focusing on the interesting correlation between X and Y.
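Conceptually, OPLS splits X into a part correlated with Y and an orthogonal part, and discards the latter
before the predictive component is computed. A minimal sketch of one orthogonal component for a single
y, following the published OPLS algorithm (an illustration, not SIMCA's internal code; X and y are
assumed centered):

import numpy as np

def opls_remove_orthogonal(X, y):
    # Predictive weight: the direction in X with maximal covariance with y
    w = X.T @ y / (y @ y)
    w /= np.linalg.norm(w)
    t = X @ w                       # predictive scores
    p = X.T @ t / (t @ t)           # loadings of X on t
    w_o = p - (w @ p) * w           # part of p orthogonal to w
    w_o /= np.linalg.norm(w_o)
    t_o = X @ w_o                   # orthogonal scores
    p_o = X.T @ t_o / (t_o @ t_o)   # orthogonal loadings
    return X - np.outer(t_o, p_o)   # X with y-orthogonal variation removed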
Create the new workset as a copy of CM1; in this way all settings from CM1 are preserved, and the only
thing that must be done is to change the model type to OPLS/OPLS2-Class.
Analysis
When you exit the workset window, SIMCA has prepared a class model group (CM2) that contains two
unfitted models for the two defined classes with 90 observations in each.
The Autofit class models window will appear automatically, so press OK.
Double click on class 2 in the CM2 group to show the details of the OPLS model.
Loadings
Loadings for each component show which parts of the spectrum hold information. Click on Loadings:
Line plot and select to plot w* for all components (1).
The prediction has an RMSEP of 134.38. This is better than for the ordinary PLS model and the filtered
models.
Validating Model 1
The same procedure can now be done using the 1st model, with class 2 as test data.
Conclusions
OSC, Wavelets, and OPLS are tools with additional features beyond ordinary PLS that make them useful.
OPLS makes the PLS model easier to interpret: only one predictive component, and an interpretable
loading plot. Wavelets compress the spectra with little loss of information and, sometimes, especially in
combination with OSC (OSC-Wavelets), even improve the predictions somewhat.
Introduction
The following example is taken from the article:
J. MacGregor and P. Nomikos, Multivariate SPC Charts for Monitoring Batch Processes, Technometrics,
Vol. 37, No. 1 (1995), pp. 41-57.
The duration of a batch was 2 hours. During this period, 10 variables were measured every 1.2 minutes,
for a total of 100 measurements. A quality variable was measured at the completion of every batch.
Data were collected on 55 batches.
Batches 40 to 42 and 50 to 55 had their quality variable outside the specification limits. The quality
variable of batches 38, 45, 46 and 49 was on the boundary.
Data
Variables
The following 10 variables were measured at equally spaced intervals during the evolution of a batch.
x1 to x3: Temperature inside the reactor
x6 and x7: Temperature inside the heating-cooling medium
x4, x8 and x9: Pressure variables
x5 and x10: Flow rates of material added to the reactor.
Objectives
1. Develop a model of the evolution of good batches (the observation level
model), and use it to monitor new batches as they are evolving, in order to
detect problems as early as possible.
2. Make a model of the whole batch based on the scores of the observation level
model, and use this model to classify the new batches as good or bad ones.
Analysis Outline
We will use 18 good batches (1800 observations) to model the evolution of good batches. This is done
by fitting a PLS model relating Y, the relative batch time, to the 10 measured variables.
This observation level model is used to monitor the evolution of the new batches, batches 30 to 33 (good
batches) and 49 to 55 (bad batches).
We will make a PCA model of the whole batch, with the unfolded scores of the observation level as X-
variables.
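Unfolding means laying each batch's score trajectories out as a single row, so that the batches become the
observations of the batch level model. A sketch, assuming scores is an array of shape (batches, time
points, components):

import numpy as np

def unfold_scores(scores):
    # (batches, time, components) -> (batches, time * components):
    # one row per batch holding its complete score trajectories
    return scores.reshape(scores.shape[0], -1)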
The second column labeled observation names contains the batch identifiers.
Both the Batch identifiers and the phase identifiers (when present) can be located in any variable
(column) in the spreadsheet.
Mark this second column and from the combo box (top of column) select batch identifiers.
In this example you do not need to define phase identifiers, as the batch process has only one phase.
The Batch page displays the list of batches in the dataset with the number of observations in each batch.
The Conditional delete allows you to delete batches with fewer observations than a selected number.
Click on OK.
With batches we are interested in summarizing the X variables, and the loadings p1 are the weights that
indicate the importance of the original X variables for t1.
Use the up/down arrow keys to shift between the scores. You can also use the property bar.
Use the up arrow to display the control chart of t2.
To display the control chart in normalized units, on the Limits and Averages tab select to remove the
average and normalize the values, then use Select Batch and Phase to show all batches, and click on
Apply.
Click on Predictions: Specify Predictionset: Dataset: Alpred to select the alpred prediction set.
Click on Predictions: Batch Control charts: Scores Plot to display the new batches in the control charts
with the control limits derived from the training set. Use the Properties page to include batches 50 to 55.
If the legend does not appear, right click in the window, choose Plot Settings: Area and check the option to
show the legend.
Use the up key to display the Control chart of t2.
In both of these control charts, batches 50 to 55 are out of the control limits in the first time period (0 -
15). Batches 50 - 55 are also out of the control limits in t1, for the last time period (90 to 100) of the
polymerization process.
Contribution plot
To use the Contribution tool, double click in the t1 control chart on one of the outlying batches, for
example batch 50, at time point 5.
Batch 49 is slightly out of the control limits around time period 55-60 (use Properties: Select Batch and
Phase to select batch 49).
The Contribution plot around time point 59 shows variable V-10 slightly lower than in the average good
batches.
Batches 50 to 55 are clearly out of the control limit for the time period 0-20.
Contribution plot
The Contribution plot for any of these batches in that time period shows again variable V4 (pressure) as
being lower than in good batches.
Analysis: Scores
Click on Analysis: Scores: t1 vs. t2
This plot, by combining the importance of the scores in the batch level model with the weights w*
derived from the observation level model, displays the overall importance of the measured variables in
the whole batch model. Here we see that all 10 variables are important (this is to be expected, as the 10
measured variables are highly correlated).
Predictions: T Predicted
We clearly see that batches 50 to 55 (with the exception of 52) are outside the Hotelling T2 ellipse and
are outliers in the second dimension.
Use properties to put labels on the points (you can also change size of the font).
Batches 50 to 55 have their distance to the model far above the control limit, and batch 49 is also above
the control limit. Clearly these batches are different from the good ones.
Prediction: Contribution: Distance to the model
Using the Contribution tool, double click on batch 50.
Conclusion
Modeling the evolution of a representative set of good batches allowed us to construct control charts to
monitor new batches during their evolution. We detected problems in the evolution of the bad batches and
understood why these batches were outside the control limits.
The model of the whole batch has allowed us to classify the new batches as good or bad and understand
why these batches had an inferior quality.
Introduction
The following example is derived from a batch digester.
Batch digesters are used in the pulp and paper industry to produce pulp from wood chips.
The batch process has 5 phases: chip, acid, cook, blowback and blow.
In the chip phase, the wood chips are fed into the digester and steamed.
In the acid phase, the chips are impregnated with an acid.
They are then cooked at high temperature and pressure during the cook phase. This is the most important
phase, as this is where the delignification happens.
In the blowback phase, the pressure is released and thereby brought back to atmospheric pressure. The
temperatures also drop.
Finally, in the blow phase, the pulp is blown out of the digester.
The duration of a batch varies between 8 and 10 hours and is, on average, around 9.4 hours in the
present data set.
27 variables (including the sampling time) were measured every 2 minutes during the batch evolution.
Different variables are meaningful in the different phases.
Data were collected on 52 batches. Of these, thirty good batches are used to build the training set model.
Data
Variables
The following variables are meaningful in the following phases:
Chip and Acid phase:
State of the acid (2 variables)
State of the vent (2 variables)
State of Steam1 (2 variables)
State of Steam2 (2 variables)
Temperature4
Pressure2
Cook phase:
Pressure1
Steam
Temperature1
Temperature2
Temperature3
Temperature4
Temperature5
Pressure2
Temperature6
Pump
Objectives
1. To develop a model of the evolution of good batches (the observation level
model), and use the model to monitor other batches as they are evolving, in
order to detect problems as early as possible.
2. Make a model of the whole batch based on the scores of the observation
level model, and use this model to classify other batches as good or bad.
Analysis Outline
We will use 30 good batches to develop the model of the evolution of good batches.
In the analysis, we will combine the chip and acid phases (they are not meaningful alone) and delete the
blow phase, which has no effect on the quality of the pulp.
We will fit 3 different PLS models relating Y, the relative batch time, to the measured variables in the 3
relevant phases (chip+acid, cook and blowback).
These observation level models are used to monitor the evolution of the other batches, in this example
those left out of the training set.
We will make a PCA model of the good batches at the batch level, with the unfolded scores of the
observation level as X-variables.
Click OK.
The import wizard opens; select to start a SIMCA-P+ Batch project and click on Next.
The second column labeled observation names contains the batch and phase identifiers.
Both the batch identifiers and the phase identifiers (when present) can be located in the same variable
(column) or in separate columns in the spreadsheet.
The batch identifiers are sequential numbers from 01 to 52 and the phases are chip, acid, cook, blbk, and
blow. Click OK.
The batch and phase IDs are now in 2 separate columns. The original column will be marked excluded
(it can be set as a secondary ID).
We now have 3 phases left: chip+acid, cook and blbk, click on Next.
The Batch page opens listing all the batches with their numbers of observations. Listed under every batch
are the phases included in the batch. In our example all the batches include all the phases.
Some of the variables will be constant for some of the phases and SIMCA warns about that. Select No in
the following messages.
Note that the Y variable, sampling time, will automatically be shifted to start at 0 for every phase and
normalized for better alignment. Normalizing the sampling time achieves linear time warping.
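In other words, every phase's time stamps are mapped onto a common 0-1 scale. A minimal sketch for
one phase of one batch:

import numpy as np

def normalize_time(t):
    # Shift the phase's time stamps to start at 0, then rescale so the
    # phase ends at 1; all batches then share the same time axis
    t = np.asarray(t, dtype=float)
    t = t - t[0]
    return t / t[-1]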
Click on the Batch Page and select the 30 good batches: 1, 4, 6-13, 16, 18, 21, 23, 25, 29, 31, 32, 34,
36-38, 40, 42, 43, 46-49 and 51.
To do this, first press Select All and Exclude. This excludes all batches. Then use the CTRL key, mark
the 30 good batches and click on Include.
Analysis
Fitting All the Class models
SIMCA opens an Autofit window with all phases present.
Click on OK.
The 3 class models are fitted and they all explain more than 80% of X.
An R2/Q2 plot will appear for each model during the calculation.
The first three components are the most important, together explaining 69% of the variation of X; t1
explains 44%, t2 16% and t3 9%.
We can see that t1 consists mainly of the first 5 temperatures and pressure 1.
The second score, t2, is primarily pressure1, the steam and temperature1.
The score t3 is again dominated by the pressures (1 and 2), together with steam, temp1 and temp6.
Individual batches can be included in this control chart; the first training batch is included by default.
More can be included in the stack of displayed batches by using the properties menu (after a right click).
Use the side arrows to move the stack of displayed batches forward or backward by one batch.
Another way is to use the properties window (right click and select properties) on the plot and make the
selection there. Select all in the left window and use the right arrow and send them to the right window.
In this case the traces of all the good batches are within the red control limits.
On the component tab, select the 2nd component using the up arrow key on the keyboard, and use
properties to switch back to the default Limits and Averages.
Display univariate Batch Control charts when needed by selecting the Variable Plot.
In the control chart of t1, with the average and 3 sigma limits computed from the good batches, we can
see batch 28 far outside the control limits.
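The limits are computed per time point from the training batches. A minimal sketch, assuming T holds
one good batch's t1 trace per row:

import numpy as np

def control_limits(T):
    # Per-time-point average of the good batches with +/-3 standard
    # deviation control limits around it
    avg = T.mean(axis=0)
    sd = T.std(axis=0, ddof=1)
    return avg - 3 * sd, avg, avg + 3 * sd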
Only batches with more than 10% of their area outside the +3/-3 sigma limits are shown.
Right click on the control chart and select properties. Select all batches (move from left to right window).
This plot displays, for every batch, the percent of the area outside the limits relative to the total area
inside the limits of the control chart.
Hence batch 28 has 40% of its area outside the control area.
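A rough sketch of this statistic for one batch, assuming t, lower and upper are arrays over the time axis:

import numpy as np

def percent_area_outside(t, lower, upper):
    # Area of the score trace outside the control limits, relative to
    # the area between the limits
    outside = np.maximum(t - upper, 0) + np.maximum(lower - t, 0)
    inside = upper - lower
    return 100.0 * outside.sum() / inside.sum()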
Go back to the batch control chart and select to show only batch 28. Mark the time points outside the 3
sigma limits and click on the action plot button in the toolbar.
Batch 28 is clearly out of the control limit for the time period 0 to 1.5 hrs.
Contribution plot
The Contribution plot for batch 28 in that time period shows that the problem is also associated with
pressure 2 and temp6 (correlated with pressure 2).
Analysis: Scores
Click on Analysis: Scores: t1 vs t2
The Contribution plot is colored by phase and shows that t1 in the cook phase, at time 5.2 hours, is lower
than the average by 6.5 standard deviations. With the Contribution tool, double click on this bar to resolve
this contribution into the original variables.
Predictions: T Predicted
Select t1 vs. t3. Batches 28 and 26 are outside the Hotelling T2 ellipse.
What is causing batch 28 to be an outlier? The problem clearly is the cook phase. Double click on one of
the scores with large deviations, for example t1 at time 1.1 hours, to resolve the contribution into original
variables.
Contribution Plot
Double click with the contribution tool on batch 33 to display the contribution plot.
The control charts for batch 33 of both pressure1 and steam confirm this.