SIMCA-P+ 12 Tutorial


Foods

Background
Data were collected to investigate the consumption pattern of a number of provisions in different
European countries. The purpose of the investigation was to examine similarities and differences between
the countries and the possible explanations.

Objective
The objective of this study is to understand how the variation in food consumption among a number of
industrialized countries is related to culture and tradition and hence find the similarities and dissimilarities
among the countries. To this end, data have been collected on 20 variables for 16 countries. The data show
the percentage of households that regularly use each of the 20 food items.

Data
The data set consists of 20 variables (the different foods) and 16 observations (the European countries).
The values are the percentages of households in each country where a particular product was found. For
the complete data table, see below. This table is a good example of how to organise your data. There are
two secondary observation identifiers, Location (geographic) and Latitude (of capital). The coding of
Location is: C = central; S = south; N = north; U = UK & Ireland; X = beneluX. The coding of Latitude
is: 1 = < 45; 2 = 45-50; 3 = 50-55; 4 = 55-60; 5 = > 60.
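The two code tables above can be kept as simple lookup tables when the data are handled outside SIMCA-P. The following is only a minimal sketch: the function name describe and the example call are hypothetical, and only the code values themselves come from the text.

```python
# Code tables copied from the Data section above.
LOCATION = {"C": "central", "S": "south", "N": "north",
            "U": "UK & Ireland", "X": "Benelux"}
LATITUDE = {1: "< 45", 2: "45-50", 3: "50-55", 4: "55-60", 5: "> 60"}


def describe(location_code, latitude_code):
    """Return a readable description of the two secondary observation IDs."""
    return f"{LOCATION[location_code]}, latitude {LATITUDE[latitude_code]}"


# Hypothetical example: a northern country whose capital lies between 55 and 60 degrees.
print(describe("N", 4))
```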

Outline
The steps to follow in SIMCA-P are:
Import the data set.
Prepare the data (Workset menu).
Fit a PC model and review the fit (Analysis menu).
Interpret the results (Analysis menu).

Define project
Start SIMCA-P and create a new project from FILE | NEW

SIMCA-P Tutorial 0BFoods 1


Select the type of data (XLS) or All Supported Files (the default) and find the data set
(FOODS_update.XLS). Data can be imported from your hard disk or from a network drive, and in several
formats, so select the appropriate format or All Supported Files. In this example the data are in an
XLS file created in Excel.
If the data set is on a floppy disk, we recommend that you first copy the file to the hard disk.
If you want to leave the current project open, remove the check mark from the box Close Current Project.
Note: The data set to import can be located in any accessible directory. It does not have to be
located where you have defined the destination directory.
When you click on Open, SIMCA-P opens the Import Wizard.
With SIMCA-P+, mark the radio button SIMCA-P normal project.

The import wizard detects that there is an empty row and asks if you want to exclude that row.

Choose Yes.



SIMCA-P has tried to interpret the data table and has made some settings. Observations and
variables must have a primary ID but can have many secondary IDs. The primary ID must be unique;
secondary IDs need not be. The IDs will be used as labels in plots.
In this case the country names (unique) are suitable as the primary observation ID, and the food names,
which are also unique, can be set as the primary variable ID.
On each row and column there is a small arrow that can be used to change settings.
Click on the arrow for the column with country names and choose Primary Observation ID. The
available settings for columns can be seen in the list. The default for variables is X.

The setting for column 2 is now Primary ID. The first column is set to exclude, which is fine. The 3rd and
4th columns (Geographic location and Capital Latitude) are not unique and will both be set as secondary
IDs.

The rest of the columns are the data (X-variables) and are not changed.



The same procedure is done for rows in the table. The first row contains numbers and the second row the
food names (unique). Set the second row to Primary Variable ID. The first row will be excluded, which is fine.

Click on Next and give the project a name and a destination directory. Missing values are indicated.

Analysis
After finishing the import wizard the primary dataset is created in SIMCA. The primary dataset is the data
from which models are created. By default the whole dataset is selected with UV scaling (unit variance).
The primary dataset itself does not change; when you want models with different observations and/or
variables, different scaling etc., those changes are made in the workset.
The primary dataset can be shown by choosing menu Dataset: Open: FOODS_update, or by using the speed
button.

Here it is possible to do several things. If you right click in the table several options are available.



When data are imported the project window opens up and will show the start for the 1st model (PCA-X
unfitted).

In this case we want to fit a model to the data, so we use menu Analysis: Autofit or the corresponding
speed button. This calculates components one at a time and checks the significance of each component
(based on cross validation). When a component is not significant the procedure stops.
A summary window opens up showing the R2 and Q2 for the significant components.

The project window is updated:

To see the details of the model, double click on the model row in the project window.

The plot with the summary of the fit of the model is displayed with R2X(cum) (fraction of the variation of
the data explained after each component) and Q2(cum) (cross validated R2X(cum)).
The summary of the fit of the model is displayed with R2X (fraction of the variation of the data explained
by each component) and cumulative R2X(cum), Q2 and Q2(cum) (cross validated R2X and R2X(cum))
as well as the eigenvalues. The food variables are, as expected, correlated, and fairly well summarized by
three new variables, the scores, explaining 65% of the variation.
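To make R2X and R2X(cum) concrete, here is a minimal sketch that computes them with NumPy for a small made-up table. It assumes UV scaling and uses a singular value decomposition; the text does not say which algorithm SIMCA-P uses internally, the function name r2x_per_component is hypothetical, and the toy data are not the FOODS table. Q2 needs cross validation and is not covered here.

```python
import numpy as np


def r2x_per_component(X):
    """Fraction of the variation of UV-scaled X explained by each principal component."""
    X = np.asarray(X, dtype=float)
    # Unit-variance (UV) scaling: centre each column and divide by its standard deviation.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # The squared singular values are the sums of squares captured by each component.
    s = np.linalg.svd(Z, compute_uv=False)
    return s**2 / np.sum(s**2)


# Toy data (hypothetical): 6 observations, 3 strongly correlated variables.
rng = np.random.default_rng(0)
latent = rng.normal(size=(6, 1))
X = np.hstack([latent + 0.1 * rng.normal(size=(6, 1)) for _ in range(3)])

r2x = r2x_per_component(X)      # R2X for each component
r2x_cum = np.cumsum(r2x)        # R2X(cum) after each component
print(np.round(r2x, 3), np.round(r2x_cum, 3))
```

Because the three toy variables are nearly copies of one latent variable, almost all of the variation ends up in the first component, which is the same effect the tutorial describes for correlated food variables.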



In total the model describes 64.8% of the variation in the data (R2X(cum)) with a Q2 of 14.4% (i.e. the
model has poor predictive ability). The first component describes 31.7% of the variation.

Scores and Loadings


Scores
To get a quick overview of the results from the model use a speed button that will create four important
plots directly.

These plots are the score plot (upper left, t1 vs. t2), the loading plot (lower left, p1 vs. p2), the DModX
plot (distance to model) and the X/Y Overview plot (showing R2 and Q2 for each variable).
The DModX plot shows that no observation is far away from the model (projection). All observations are
below the critical limit (Dcrit). The X/Y overview plot shows that some variables have relatively high
R2/Q2, indicating systematic behavior. Some have low (even negative) Q2, indicating little systematic
variation (consumption almost constant over all countries).
The ellipse in the score plot represents the Hotelling T2 with 95% confidence (see statistical appendix).
The scores t1 and t2, one vector each for components 1 and 2, are new variables computed as linear
combinations of all the original variables to provide a good summary.
The weights combining the original variables are called loadings (p1 and p2), see below.
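The relation between scores and loadings can be written as T = Z P, where Z is the scaled data table. The sketch below illustrates this with NumPy on made-up data; it assumes UV scaling and takes the loadings from a singular value decomposition, which is one standard way to obtain them and not necessarily what SIMCA-P does internally.

```python
import numpy as np

# Toy data (hypothetical): 8 observations, 4 variables.
rng = np.random.default_rng(1)
X = rng.normal(size=(8, 4))
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # UV scaling

# Loadings: the first two right singular vectors of the scaled table.
_, _, Vt = np.linalg.svd(Z, full_matrices=False)
P = Vt[:2].T                     # shape (variables, 2): columns p1 and p2

# Scores: each score is a weighted sum of the scaled variables,
# with the loadings as weights.
T = Z @ P                        # shape (observations, 2): columns t1 and t2

# The same value written out by hand for t1 of the first observation.
t1_first = sum(Z[0, k] * P[k, 0] for k in range(Z.shape[1]))
print(T[0, 0], t1_first)
```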
The score plot shows 3 groups of countries. One group with the Scandinavian countries (the North), the
second with countries from the South of Europe, and a third more diffuse with countries from Central
Europe. It seems a little odd that Austria is in the south Europe group but maybe the Tyrol region (close
to Italy) has a big impact.
To enhance the information in the plot we can use colors. Right click in the plot, select Properties and
then the Color tab. Choose to color according to the secondary ID Geographic location.


We could instead have colored according to latitude (also a secondary ID) and would have obtained about
the same coloring.

Loadings
The loadings are the weights with which the X-variables are combined to form the X-scores, t (see above).
This plot shows which variables describe the similarity and dissimilarity between countries.



Scandinavians eat crisp bread, frozen fish and vegetables, while in southern Europe people use garlic and
olive oil, and central Europeans (in particular the French) consume a lot of yogurt.
A more detailed interpretation of the loadings can be done from plots showing the loadings separately.
Use menu Analysis: Loadings: Column plot. By default p1 is shown.

Here we can see the influence of each variable on the 1st component. To inspect the second component
use the up arrow on the keyboard. The uncertainty of each loading is shown as a confidence
interval (computed by jack-knifing in the cross validation procedure).

Third Component
The cross validation procedure gives three components in the model. In the score and loading plots
(by default component 1 vs. component 2), use the keyboard arrows to change components: up and down
for the Y-axis of the plot, left and right for the X-axis.
Plot the scores (t1 vs. t3) and loadings (p1 vs. p3). The third component explains 13.8% of the variation in
the data, and mainly reflects the high consumption of tea, jam and canned soups in England and
Ireland.



Contribution
Contribution plots are a very useful tool in SIMCA for seeing differences between single observations,
between one observation and a group of observations, or between groups of observations. They show the
differences between observations expressed in the original variables (weighted by the loadings of the
model).
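The idea behind a contribution plot can be sketched in a few lines. The version below is an assumption made for illustration: it takes the difference between an observation and a reference in the scaled variables and weights it by the absolute loading of one component. The text does not give the exact weighting SIMCA uses, so the numbers will not match SIMCA's plots; the function name score_contribution and the toy data are hypothetical.

```python
import numpy as np


def score_contribution(Z, P, obs, reference_rows=None, component=0):
    """Difference between one observation and a reference, weighted by |loading|.

    Z: scaled data (observations x variables); P: loadings (variables x components).
    With reference_rows=None the reference is the average observation, which is a
    row of zeros once the data have been centred.
    """
    if reference_rows is None:
        reference = np.zeros(Z.shape[1])
    else:
        reference = Z[reference_rows].mean(axis=0)
    return (Z[obs] - reference) * np.abs(P[:, component])


# Toy data (hypothetical): 5 observations, 3 variables.
X = np.array([[9.0, 1.0, 5.0],
              [8.0, 2.0, 5.5],
              [2.0, 8.0, 5.0],
              [1.0, 9.0, 4.5],
              [5.0, 5.0, 5.0]])
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # UV scaling
_, _, Vt = np.linalg.svd(Z, full_matrices=False)
P = Vt[:2].T

vs_average = score_contribution(Z, P, obs=0)                       # one observation vs. the average
vs_group = score_contribution(Z, P, obs=0, reference_rows=[2, 3])  # one observation vs. a group
print(np.round(vs_average, 2), np.round(vs_group, 2))
```

A positive bar means the variable is higher in the chosen observation than in the reference, a negative bar that it is lower. Comparing two groups works the same way with the mean of the first group in place of the single observation.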
Contribution for one observation to center of plot
Double click on an observation in the score plot (e.g. Sweden) and the following plot appears.
The interpretation of the plot is: when you go from the calculated average country to Sweden, the
consumption of Crisp Bread, Frozen Fish and Frozen Vegetables goes up. Do not over-interpret the plot.
Look at the biggest columns.



Contribution for one observation to a group of observations
When you want to show a new contribution, click on an empty area somewhere in the score plot to release
the first choice (the markings will disappear).
To compare one country with a group of countries, click on one country (e.g. Sweden), then use the
mouse (hold down the left mouse button) to draw a line around the observations you want to compare with,
and then click on the contribution tool.
Below, Sweden is compared with the south Europe group (Italy, Portugal, Austria, Spain).

Consumption of garlic goes up and consumption of many other foods goes down.



Contribution for one group to another group of observations
To compare a group of countries with another group of countries, mark the first group and then the
other group and then click on the contribution tool. In this case start by showing the score plot t1 vs. t3,
where UK and Ireland deviate from the others. Mark all countries except UK and Ireland and then these two
countries. This leads to the following contribution plot.

Consumption of ground coffee goes down and consumption of tea and jam goes up.

Summary
In conclusion, a three-component model of the data summarizes the variation in three major latent
variables, describing the main variation of food consumption in the investigated European countries.
This example shows a simple PC modeling to get an overview of a data table. The user is encouraged to
continue to play around with the data set. Take away observations and/or variables, refit new models, and
interpret the results.



Spirits

Background
Complex liquid samples can be characterized, compared and classified with the help of a non-selective
analytical method, for instance one which takes advantage of the samples' ability to absorb visible light.
From the characterization of samples of known origin, predictive models can be built and tested with new
samples of unknown composition. In this tutorial a range of distilled spirits are investigated using
vis spectroscopy. We are grateful to Johan Trygg and colleagues at Umeå University for granting us access
to this data set.

Objective
The objective of this example is to provide an illustration of multivariate characterization based on
spectral data. To this end, spectra measured on a set of alcoholic spirits, among them whisky and cognac,
will be used. The spirits can be compared and classified by investigating if there are clusters relating to,
for example, product type or country of origin.
A growing problem in the beverage and brewing industry is fraud and plagiarism; see for example reference 1, in
which sparkling wines (champagne and cava) were differentiated using a multivariate model of their
mineral content. Chemometric methods can greatly assist in identifying incorrectly labeled or fake
products.
1) Jos, A., Moreno, I., Gonzalez, A.G., Repetto, G., and Camean, A.M., Differentiation of sparkling
wines (cava and champagne) according to their mineral content, Talanta, 63, 377-382, 2004.

Data
For each sample (spirit), the visible spectrum (200-600 nm) was acquired using a Shimadzu
spectrometer. Signal amplitude readings were taken at 0.5 nm intervals yielding 801 variables. There
were 46 unique samples plus a few replicates giving 50 observations in total. The secondary observation
ID designates country of origin and product type as follows: XXYY where XX indicates country and YY
product type. The suffix R indicates a replicated sample.
Country of Origin: USa, SCotland, IReland, CAnada, FRance, ITaly, JApan.
Product Type: BOurbon, BRandy, COgnac, WHisky, Single Malt (SM), BLended, RUm.
One mixed (MIXT) sample is also present in the data set.
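The variable count and the XXYY coding can be checked with a short script. This is a sketch under stated assumptions: the text does not show the exact label strings, so the function parse_label assumes that the replicate suffix R directly follows the four-character code and that the mixed sample is labelled MIXT. Only the code tables themselves are taken from the text.

```python
# Wavelength axis: 200 to 600 nm in 0.5 nm steps gives (600 - 200) / 0.5 + 1 = 801 points.
wavelengths = [200 + 0.5 * i for i in range(int((600 - 200) / 0.5) + 1)]

# Code tables from the Data section above.
COUNTRIES = {"US": "USA", "SC": "Scotland", "IR": "Ireland", "CA": "Canada",
             "FR": "France", "IT": "Italy", "JA": "Japan"}
PRODUCTS = {"BO": "bourbon", "BR": "brandy", "CO": "cognac", "WH": "whisky",
            "SM": "single malt", "BL": "blended", "RU": "rum"}


def parse_label(label):
    """Split an XXYY[R] secondary ID into country, product type and replicate flag."""
    code = label.strip().upper()
    if code.startswith("MIXT"):            # the one mixed sample
        return {"country": None, "product": "mixed", "replicate": False}
    country, product = code[:2], code[2:4]
    return {
        "country": COUNTRIES.get(country, country),   # unknown codes are passed through
        "product": PRODUCTS.get(product, product),
        "replicate": code[4:].endswith("R"),          # assumed position of the R suffix
    }


print(len(wavelengths))
print(parse_label("JARU"))
```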

Outline
The analysis of these data will be divided in three parts. Each part is created as a separate project in
SIMCA.

Overview: The use of PCA to get a quick overview.
Classification: How to handle classification in SIMCA.
Scaling: How to use scaling.
Combine: How to combine the three parts in one project.



Overview
The first step shows a quick way to create a PCA model and display the information in the data.

Import data
All new projects in SIMCA start by importing the data.

Start a new project in SIMCA by selecting File: New or by clicking on the New speed button.
The following window opens:

Data can be imported as a file, from an ODBC database (using MS Query) or pasted into an empty
spreadsheet. Supported file formats will be shown in the file list (see the User Guide for a more detailed
explanation about different file formats).
In this example we choose the Excel 2007 file called Spirits.xlsx (XML format).
Next a new window opens up where you can select between a normal and a batch type of project. In this
case choose the first alternative:

Click on Next and the following window opens (import wizard):



In this step of the import it is possible to manage labels for observations and/or variables. SIMCA needs a
primary ID for both (if you don't define them, SIMCA will create them automatically, as plain numbers).
Primary IDs must be unique.
In addition you can mark as many secondary IDs as you want (these don't need to be unique).
All IDs can be used as labels in plots later on.
SIMCA makes its own interpretation of the imported data matrix; if you want to make changes, use the
small arrows for each column or row.
For columns the following options are available:

The first column is chosen as the Primary ID for observations (all are unique).
It can also be marked as a secondary ID, Class ID (described later), X or Y, qualitative (X, Y), or
date and time.
Any column can be excluded.

Variables can also be defined as X/Y, qualitative (X/Y) and date/time (X/Y). In this example we only
have X variables which are the default setting.



For rows the following options are available:
The first row is chosen as the Primary ID for variables
(all are unique).
The values here are expressed in nm from the spectra
(801 variables).
It can also be marked as a secondary ID.
Any row can be excluded.

In this particular case SIMCA has made a suggestion that can be accepted directly; we don't have to change
anything.
The next step is to continue in the wizard, so press Next:

Here you give a name and where to store the project.


It is also possible to see information about observations and variables and also a map of missing values
(in this case none).
Press Finish and the data will be imported as the primary dataset in SIMCA. The primary dataset is used
to create models from. Later it is possible to import secondary data sets which can be used for testing,
prediction etc.



Primary dataset
A view of the primary dataset in SIMCA can be opened from a menu item (Dataset: Open) or from a speed
button.

This primary dataset will always be available. Later when you create different models with different
selections of observations and variables etc., you always use a copy of the original primary dataset.

Modeling
At this stage it is of interest to quickly see what type of information exists in the imported data.
SIMCA has already prepared a PCA model. At the import we did not declare any Y-variables etc., so all
variables are considered as X-variables.
The project window in SIMCA shows the prepared model:

This model is prepared so that all observations and variables are included. The scaling of the variables is
the default (UV = Unit Variance).

Select menu item Analysis: Autofit or use the speed button.


SIMCA calculates 5 components and the project window is updated:



A summary window will appear showing R2 and Q2 for the model (2 components).

99.9% of the variation in the data is explained, of which 97.2% is explained by the first two components
(normal for spectroscopic data). More detailed information about the model can be found by double
clicking on the model row in the project window.

Next step is to show the information from the PCA-model. This can easily be done using a speed button.

This will show components one and two for the scores and loadings.



The 4 plots are the score plot (upper left), the DModX (distance to model, upper right), the loading plot
(lower left) and R2, Q2 for the variables (lower right).

Score plot
A look at the score plot shows labels from the primary ID of the observations. A more informative
labeling is to use the secondary ID. This can be achieved by right clicking in the plot and choosing
Properties and then Labels:

The length and part of the label string can also be set (default start=1 length=10).
Select the secondary ID and click OK:

With the names of the observations shown, the plot is much easier to interpret. In the plot it can be seen
that the different types of spirits are clustered. To emphasize that there are groups of spirits it is
possible to use color on the labels. Right click and choose Properties: Color:



Choose to color according to the secondary ID (identifier):

Choose to use characters 1 to 4 in the name (characters 1-2 show country and 3-4 type of spirit).



Now the groups of spirits are much clearer. JARU (lower left, Japanese rum according to the coding above)
seems to be different from all the others.

Distance to model (DModX)


The distance to model (DModX) should be inspected to see how far away from the projection plane (score
plot) the observations are situated. Some of the spirits (e.g. FRCO) are a little different from the others
(according to the visible spectra).
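A simplified version of the distance to model can be computed from the residuals that remain after projecting the data onto the model plane. The sketch below is an illustration under assumptions: it uses a plain residual standard deviation per observation divided by a pooled residual standard deviation, and it omits any further correction factors SIMCA may apply, so the values are not expected to match SIMCA's DModX exactly. The function name dmodx and the toy data are hypothetical.

```python
import numpy as np


def dmodx(Z, n_components):
    """Simplified normalised distance to model for each row of the scaled matrix Z."""
    n, k = Z.shape
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    P = Vt[:n_components].T                 # loadings of the retained components
    E = Z - (Z @ P) @ P.T                   # residuals: the part of Z outside the model
    s_obs = np.sqrt((E**2).sum(axis=1) / (k - n_components))
    s_pooled = np.sqrt((E**2).sum() / ((n - n_components - 1) * (k - n_components)))
    return s_obs / s_pooled


# Toy data (hypothetical): 10 observations on a line in 4 variables, plus small noise.
rng = np.random.default_rng(2)
t = np.linspace(-3, 3, 10)
X = np.outer(t, np.ones(4)) + 0.05 * rng.normal(size=(10, 4))
X[4] += np.array([1.0, -1.0, 0.0, 0.0])     # push observation 4 off the line
Z = X - X.mean(axis=0)                      # centring only

d = dmodx(Z, n_components=1)
print(np.round(d, 2))
```

Observation 4 sits close to the centre of the score axis but far from the model, which is the situation a score plot alone cannot reveal.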

X/Y Overview
The R2-Q2 plot of the variables shows that most of the variation in the variables is used in the model. To
see the individual variables, zoom in along the X-direction of the plot using the magnifying tool: mark a
region with the mouse (press the left mouse button) and release the mouse button.



The result can be adjusted by changing the size of the scroll bar on the x-axis and dragging the
scroll bar will show different regions.

Loadings
The loading plot p1 vs. p2 (which is default) is not informative when you have this type of data. In the
next part (scaling) the loadings will be shown one at a time.

Summary
To get a quick overview of a data table, import the data into SIMCA, create a PCA model using Autofit,
present the 4 overview plots and interpret the information in the score plot, DModX plot, loading plot and
the summary plot.



Scaling
So far we have used the default scaling UV (unit variance), which should be used in situations where the
variables in the data are of different kinds (e.g. temperature, pressure, flow etc.). However, in this case
we have digitized spectra, which means that the variables are measured in the same unit. In such a case UV
scaling may not be optimal.
SIMCA-P supports a number of scaling methods. For spectral data, the most commonly used are
centering with no scaling (Ctr) and Pareto scaling (Par). Unit variance scaling (UV) will give each
variable (wavelength) a variance of one and thereby an equal chance of being expressed in the PCA
model. This will compress signal amplitude variation in spectral regions where large changes occur whilst
magnifying regions with less variation. Thus, there is a risk that the influence of noisy regions in the
spectra will become inflated. The most common option with spectral data is centering without scaling
(Ctr) in which the influence of a variable is related to its amplitude and hence regions of low amplitude
have little or no influence. A useful compromise between UV scaling and no scaling is Pareto scaling in
which regions of low-medium amplitude have more chance of influencing the analysis but only if they
represent systematic variation. This scaling is often applied to NMR and MS data.
Scaling in SIMCA is a feature of the workset. The default scaling is UV and if we want to change the
scaling we have to make new worksets.
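The three scaling options described above differ only in what each centred variable is divided by: the standard deviation for UV, its square root for Pareto, and nothing for Ctr. The following minimal sketch uses only the Python standard library; the function name scale_column and the two made-up variables are hypothetical and serve only to show the effect on a strong band and a weak baseline.

```python
from statistics import mean, stdev


def scale_column(values, method):
    """Centre one variable and scale it: 'UV' by s, 'Par' by sqrt(s), 'Ctr' not at all."""
    m, s = mean(values), stdev(values)
    divisor = {"UV": s, "Par": s ** 0.5, "Ctr": 1.0}[method]
    return [(v - m) / divisor for v in values]


# Hypothetical absorbances at two wavelengths: a strong band and a weak baseline.
strong = [1.0, 2.0, 3.0, 4.0, 5.0]
weak = [0.010, 0.012, 0.009, 0.011, 0.013]

for method in ("UV", "Par", "Ctr"):
    spread = [stdev(scale_column(col, method)) for col in (strong, weak)]
    print(method, [round(x, 4) for x in spread])
```

After UV scaling both variables have the same spread, so the weak baseline counts as much as the strong band. With Ctr the original spreads are kept, and with Par the spread becomes the square root of the original standard deviation, which lies between the two.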

Workset
From the primary dataset we can make changes to which variables and observations to use
(include/exclude, make classes, X/Y variables), transform variables, lag variables, expand variables.
The primary dataset will remain unchanged and each workset created will be a new model.
To create a new workset from scratch (full copy of the primary dataset) we use menu Workset: New.

This opens up the Workset window where several things can be changed.

Under tab Overview there is a list of the present variables and observations. This list will be updated
when we make changes under the other tabs. Missing data tolerance level can be set (if that value is
exceeded SIMCA will warn you). The model type can also be specified. In this example PCA-X is the
only alternative (all variables were defined as X at the import and no classes were defined for the observations).
Now we want to change the scaling of the variables and this is done under tab Scale.



Select all variables (click on one row and press Ctrl+A).
Use list box Type and select Par.
Press Set.
Press OK at the bottom.

Now a new model is prepared where the variables are Pareto scaled.
The project window shows the new unfitted model. Use Analysis: Autofit (or speed button) to calculate
components:

The next step is to create a model where we use only centering (Ctr) for the variables.
Right click on the model 2 line in the project window and select New as model 2.

The workset dialogue opens again. Go to the Scale tab and set the scaling to Ctr (centering).
Autofit model 3.

Now we have three models in the project. To make it clearer what we have done, we will change the title
of each model. Right click on a model row and choose Change Model Title. Set the title to UV for model 1,
Par for model 2 and Ctr for model 3.

The model with Pareto scaling has 4 components and the other two have 5 components each.
We will now investigate the effect of scaling. The four plots below show the raw data prior to and after
the three different scaling approaches.
The plots below can be created in the following way:
Raw data:
The raw data can be plotted by opening the primary dataset (menu Dataset: Open and choose Spirits, or use
the speed button). Right click somewhere in the data table and choose Create: Plot Xobs. The emphasized
line for observation 15 (JARU) is created by right clicking in the plot and selecting Plot settings: Plot
Area. Choose No. 15 and change the color to black and the width to 5.



Data with scaled variables:
Mark model 1 (UV) in the project window. Go to menu Workset: Spreadsheet. The data table with UV
scaled data opens up. Right click in the table and select Create: Plot Xobs.
Spectral variation is most pronounced between 200 and 350 nm and above 450 nm there is essentially
nothing but minor baseline variations (top left plot). One spectrum (no 15, Japanese Rum) is shown as a
thickened black line in all four plots. This is an atypical spectrum and it can be seen that it particularly
stands out after UV-scaling (lower left plot). The ability of UV-scaling to blow up noisy regions is
evident from this plot. The atypical spectrum (no 15) is less extreme following Pareto-scaling (upper right
plot) and there is still some scope for the high wavelength end to impact the analysis. Sample 15 is even
less extreme if we only center the data (lower right plot). In this case, however, the 450-600 nm region
has little or no influence and will not affect the analysis.

The three PCA models using UV scaling (model M1), Pareto scaling (M2) and centering but no scaling
(M3) are summarized below and are very similar in terms of variance explained. Cross-validation
suggests 5, 4 and 5 significant components, respectively, but for comparison purposes we forced a fifth
component into the Pareto model.

When examining the explained variances in more detail, it is apparent that only two components are
really necessary for obtaining a good overview of the data. Hence, in the following, we consider only
these two components.
The scores, loadings and DModX plots of the three models are given below (top triplet: UV scaling;
middle triplet: Pareto scaling; bottom triplet: centering without scaling). PCA based on UV scaled data
finds two samples (13 and 15) that differ from the majority of samples. These are suspected outliers. Sample
number 15 also has a very high DModX after two components. Based on the plot of the UV scaled data
shown previously, the outlying behavior of sample 15 is not surprising.
With Pareto scaling, and, to a greater extent, with centered data, the influence of samples 13 and 15 is
reduced although they are still clearly different from the rest. As far as the spectral interpretation is
concerned, centering produces the most interpretable loading spectra. The first loading resembles the
average spectrum whilst the second loading picks up additional structure between 200-225 nm and
250-325 nm. In the remainder of this exercise we will use Pareto scaling.

[Figure: three rows of plots, one row per model: Distilled Liquor.M1 (PCA-X), PCA UV; Distilled Liquor.M2
(PCA-X), PCA Par; Distilled Liquor.M3 (PCA-X), PCA Ctr. Each row shows a score plot t[1] vs. t[2] with the
Hotelling T2 (0.95) ellipse, a loading plot of p[1] and p[2] against Var ID (Primary), 200-600 nm, and a
DModX[2](Norm) plot against observation number with D-Crit(0.05) = 1.212.]



Classification
In this step the data will be imported again to start a new project which will show how to handle classes
(groups, clusters) of observations. It will also show how to use scaling.

Import data
Start a new project in SIMCA and choose the same dataset (Spirits.xlsx). The difference now is that classes
will be declared directly at the import. When the following window appears at the import, change the
setting for the second column (the former secondary ID) to Class ID.

Class identification can be made on any column in the data table. Here we want to use the first 4
characters of the name to define classes.

This column will then be called ClassID and the original secondary ID will be copied to a new column
and marked as excluded. Change that so it becomes an active secondary ID again (useful for plots).



Press Next in the import wizard and the following window will appear:

With the help of the first 4 characters in the name SIMCA has identified the classes and shows how many
observations there are in each. Here it is possible to change names and order, and even to merge classes.
For some of the classes identified from the name there is only one (1) observation. For these classes it is
impossible to create models. In this situation there are two possibilities. One is to select the classes
with only one observation and mark them as deleted. The other is to keep them. With the first alternative
the observations in classes with one observation will not be imported into the project; with the second
they will be. In this example we keep them as they are (no delete).
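The class bookkeeping described above can be reproduced outside SIMCA with a simple count. This is only a sketch: the function name class_sizes is hypothetical and the label list is made up in the XXYY style described earlier, not the real sample list.

```python
from collections import Counter


def class_sizes(labels, n_chars=4):
    """Count observations per class, where the class is the first n_chars of the label."""
    return Counter(label[:n_chars].upper() for label in labels)


# Hypothetical labels (not the real sample list); a trailing R marks a replicate.
labels = ["SCWH", "SCWH", "SCWHR", "FRCO", "FRCO", "JARU", "MIXT"]

sizes = class_sizes(labels)
# Classes with a single observation cannot be given a class model of their own.
singletons = sorted(name for name, n in sizes.items() if n == 1)
print(dict(sizes), singletons)
```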
Press Next in the wizard.
Give a new name to this SIMCA project (e.g. SpiritClassification) and press Finish in the wizard.
In SIMCA the following project window will open:



There are 11 classes in the data. Some of them have few observations and it would of course be better if
each class contained more observations (10-20), but in this case the main purpose is to show how
classification is handled in SIMCA.
SIMCA is now prepared to make models for each class. In the window above the classes have been
arranged hierarchically under CM1 (Class Model Group 1). The models have also got the name from the
class ID defined at the import.
According to what we learned in the previous part the scaling should be changed to Pareto. To change the
scaling for all prepared class models, right click on the CM1 header and chose Edit model group CM1.
Chose scaling for variables to Pareto. Click OK and all class models will be updated to Pareto scaling.
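For readers who want to see what the scaling choice does numerically, here is a minimal Python sketch. It assumes the usual definitions (unit variance scaling divides each centered variable by its standard deviation, Pareto scaling divides by the square root of the standard deviation) and uses the sample standard deviation; the exact conventions inside SIMCA are not shown in this tutorial, and the data values are made up.

```python
import math
import statistics

def uv_scale(column):
    """Mean-center a variable and divide by its standard deviation."""
    mean = statistics.mean(column)
    sd = statistics.stdev(column)  # sample standard deviation (n - 1), an assumption
    return [(v - mean) / sd for v in column]

def pareto_scale(column):
    """Mean-center a variable and divide by the square root of its standard deviation."""
    mean = statistics.mean(column)
    sd = statistics.stdev(column)
    return [(v - mean) / math.sqrt(sd) for v in column]

# Made-up values for one variable
column = [1.0, 2.0, 3.0, 4.0, 5.0]
uv = uv_scale(column)
pareto = pareto_scale(column)

print(statistics.stdev(uv))      # spread after unit variance scaling
print(statistics.stdev(pareto))  # spread after Pareto scaling
```

After Pareto scaling a variable keeps a spread equal to the square root of its original standard deviation, so large variables are down-weighted but not forced to the same variance as small ones.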

Overview model
In a classification situation you often start by creating a model containing all classes to get an overview.
This can be done in the same way as in part 1 (overview) by creating a separate project, but it is more practical
to do it directly where the classes are defined.
Mark CM1, right click and choose New as model group CM1. In the workset dialogue change model
type to PCA-X.

Press OK.
In the project window the new PCA-X model (12) is prepared with scaling = Pareto.



Use autofit to create a model. Autofit is a rule based procedure that calculates components one at a time
using cross-validation and checks if each component is significant. When all significant components are
extracted the procedure stops.
A 4 component model is calculated describing 99.4% of the variation in data. The result will be identical
to what we did in part 1 (4 significant components, in part 1 we forced the calculation of the 5th
component). The same type of plots can be shown for this model as in part 1.
A score plot of this model can be made from Analysis: Scores: Scatter Plot or with the speed button.

The following window opens up:

t1 vs. t2 is chosen.
Click on the Labels tab and change label types to secondary id for point labels.

Press OK.



To activate the legend in the plot, right-click, select Plot Settings: Plot Area, scroll down to Legend
in the list on the left and check Show legend:

The score plot is the same as in part 1. The colors in a score plot are assigned automatically when classes have
been defined. In part 1 the colors were defined from the name (secondary ID).
Four components are calculated and it is possible to shift axes in the score plot using the arrow keys on the
keyboard. The up/down keys shift the Y-axis and the left/right keys shift the X-axis.
In the interpretation, remember that component one describes almost all of the variation.

Models based on separate classes


The next step is to create separate models for each class.
Mark a class model in the project window (as above) and use the menu item Analysis: Autofit Class Models
or the corresponding speed button.



In this case we have classes containing only one observation. These cannot be modeled and can be
deactivated in the following window, which will appear.

In this window it is possible to exclude models and to set the number of components to use. Here we use
Autofit.
Press OK and SIMCA will autofit all marked classes, showing the summary plot for each.
The result is shown in the project window.

All classes with one observation are unfitted (as intended). One class (12, USBL) has only two observations,
leading to a zero-component model.
Now there are several possibilities. Each class model can be examined in the usual way (score and
loading plots, DModX etc.), but the more interesting interpretation is to see how the different observations fit
the different classes and how the classes fit each other. This is done using the Prediction menu. The first step is
to select a model to use (to begin with; later it is possible to switch model). Mark the model in the project
window (e.g. model 2). Go to the menu Prediction and choose Specify Predictionset. Here there are many
possibilities depending on which data are available. The first choice is to see how all observations fit
model 2 (SCBL); therefore Dataset is chosen.



A new window opens automatically, showing among other things:
Set (TS = test set, does not belong to class 2; WS = workset, belongs to class 2).
PModXPS is the probability that the observation belongs to class 2.
tPS1-3 are the calculated score values.
DModXPS is the predicted distance to model 2 (normalized, expressed in SD).

The suffix PS is always added to values under menu Prediction.


It is now possible to use all menu items under Prediction.
DModX is often used to determine classification. A high DModX means that the correlation structure is
different (in this case, different spirits have different spectra).

The names on the X-axis can be activated using the property Axis label (90° rotation).
We can see that USBO, FRCO, ITBR and JARU have high DModX values, much higher than the Dcrit line (red)
for the model, indicating differences compared to SCBL, which is the active model.



The score plot (based on model 2, SCBL) can also be used to see how well all observations fit
this model. In this case the Hotelling T2 ellipse is used as the criterion.

In the model based on SCBL it can be seen that USBO, FRCO and JARU do not fit the model.
Keep these two plots open and change to another model using the property toolbar for the plot.

Select model 4 and the active plot will update. To update another plot, make it active and change model
again using the same procedure.
If you want to update both plots simultaneously you have to go through the dockable window called
Favorites. Open the Favorites dockable window by pointing at the vertical tab on the left. The window will
open; then follow the steps below.
Click on the pin to lock this dockable window.

Go to the bottom of the window and create a new item called Prediction plots (as an example) by
right-clicking on Add project specific favorites here and choosing Add Folder.
Open the plots you want to see. As an example, the score and DModX plots based on model 2 shown above
will be used.



Right-click on the plots (one at a time) and choose Add to favorites.
The plots will appear under the Prediction plots folder.

Right-click on the folder name Prediction plots and choose Treat folder as item.

Select the active model (e.g. M4) and then click on the Prediction plots item just created in Favorites.
New score and DModX plots will be created to show how the data fit model 4.
Do the same for the other class models.

Observe that the above described procedure does not work for the unfitted models (3, 5, 8, 9, 10).
Under the prediction menu there are more items that can be used.

Classification list
This will show the probability for an observation to belong to the different classes.

Select the classes for which there are fitted models.

Right click in the window and select properties. Change labels to secondary ID and number format to
decimal with two decimals.



This list shows the probability that an observation belongs to a class. A cell will be marked green if the
value is above 0.1, orange if the value is between 0.05 and 0.1, and white if it is below 0.05.
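The color rule of the classification list can be written as a small function. This is a sketch of the rule as stated above, not code taken from SIMCA; how values exactly equal to 0.05 or 0.1 are treated is our assumption, and the function name is our own.

```python
def classification_color(probability):
    """Map a class membership probability to the cell color used in the list."""
    if probability > 0.1:
        return "green"
    if probability >= 0.05:  # boundary handling is an assumption
        return "orange"
    return "white"

# Made-up probabilities, for illustration only
for p in (0.62, 0.07, 0.01):
    print(p, classification_color(p))
```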

Misclassification Table
A misclassification table shows the overall classification.

Coomans' plot
A Coomans' plot can be used to compare 2 classes at a time. The plot shows DModX for 2 models. Below
is an example where model 2 (SCBL) and model 7 (FRCO) are compared.



So far we have used all observations as the prediction set, but it is possible to specify in detail which
observations to use. Go to the menu item Specify Predictionset: Specify.

The list to the right shows the present prediction set (all observations). Start by removing everything in this list.
Then right-click on the column with the primary ID in the list to the left, choose Observation ID and check
Class ID. These will then appear in the list.



This will show the class IDs.
Now we want to select observations from two classes that are well separated (SCBL and FRCO). There
are several ways to do that; one is to use Find. We want to search on class ID, so we have to
change where to search. Click on the arrow button next to Find and select Find in ClassID Column.

The next step is to enter class ID names in the search box. Start with SCBL (not case sensitive). All SCBL
observations are found and marked.

Click on the arrow between the lists and all SCBL observations will appear in the right list. Do the
same for FRCO.
Now we have created a new prediction set and we can use the different menu items under the Prediction
menu in the same way as before.
With this new prediction set we will show the Coomans' plot for these two models (SCBL and FRCO).

Here we see a very clear class separation between these two spirits.



Mineral sorting at LKAB

Background
The following example is taken from a mineral sorting plant at LKAB in Malmberget, Sweden. Research
engineer Kent Tano at LKAB was responsible for this investigation.
In this process, raw iron ore (TON_IN) is divided into finer material (<100 mm, 50% Fe) by passing through several
grinders. After grinding, the material is sorted and concentrated in several steps by magnetic separators.
The separation flow is divided in several parallel lines and there are also feedback systems to get as high
Fe concentration as possible. The concentrated material is divided into two products, one (PAR) which is
sent to a flotation process and another part (FAR, fines) which is sold as is. For both these products high
Fe content is important.
Twelve process factors were identified. Of these, three important factors were used to set up a statistical
design (RSM). The results of each experiment were measured in 6 response variables. Several
observations were collected for each design point.
The process is equipped with an ABB Master system with a SuperView 900 connected to the process data
system. Data were transferred from the ABB system to a personal computer with the SIMCA-P software
for modeling. Models were transferred back to the SuperView system for on-line monitoring (predictions,
score and loading plots) of the process. The investigation was made in 1992. The multivariate on-line
control of the process is still in operation, with very good results concerning the quality of the products.

Objective
The objective of this study is to investigate the relationship between the process variables and the 6 output
variables describing the quality of the final product.

Analysis outline
An Overview of the Responses
A PC model of the responses is made to understand:
How the responses relate to each other and to the observations.
The similarity and dissimilarity between the observations, and if there are outliers.
The explanatory power of the variables.
Relating the process conditions to the responses
Understand and interpret the relationship between the process variables and the responses.
Predict the output of new process conditions.
The steps to follow in SIMCA-P
Define the project: Import the primary data set.
Prepare the data (Workset menu).
Specify which variables are process variables (X) and which are responses (Y).
Expand the X matrix with the squares and cross terms of the 3 designed variables.
Fit the models, first PC-Y and then PLS, and review the fit (Analysis menu).
Refine models if necessary by removing outliers (Workset menu).
Use the PLS model for predictions (Prediction menu).



Data
The following is a description of variables and observations.

Variables
Data on 18 variables were collected: 12 process variables (X) and 6 responses (Y).

No.  Explanation               Abbr.     RSM
1    Total load                TON_IN    Design
2    Load of grinder 30        KR30_IN
3    Load of grinder 40        KR40_IN
4    PARmull                   PARM
5    Velocity of separator 1   HS_1      Design
6    Velocity of separator 2   HS_2      Design
7    Effect grinder 30         PKR_30
8    Effect grinder 40         PKR_40
9    Ore waste                 GBA
10   Load of separator 3       TON_S3
11   Waste from grinding       KRAV_F
12   Total waste               TOTAVF


Responses (Y)

No.  Explanation                    Abbr.
13   Amount of concentrate type 1   PAR
14   Amount of concentrate type 2   FAR
15   Distribution of type 1 and 2   r-FAR
16   Iron (Fe) in FAR               %Fe_FAR
17   Phosphor (P) in FAR            %P_FAR
18   Iron (Fe) in raw ore           %Fe_malm

Observations
A subset of 231 observations (with full representation of the Y-variables) from the total of 572
observations was used for modeling. Each observation has a name referring to the date and time when
the data were collected.



Create the project
Start SIMCA-P and import the data file from FILE: NEW
Find the data set (SOVR.XLS).
Create a normal SIMCA-P project and click on Next.

SIMCA has interpreted the data as seen above. However, we want to make some changes. The 3rd column
should be set as Secondary ID (use the arrow for the column to change it).

The 2nd column is marked as secondary ID but could be changed to a time variable (X, new in ver. 12.0).
But we also want to use this column as a secondary ID, so we have to make a copy of the column first.

Click on the Commands button and select Insert: Columns.

An empty column will appear in the spreadsheet. Copy the content of the column with Date/Time
information to the empty column and mark it as a secondary ID (now there are two identical columns
marked as secondary ID).
Use the arrow for one of these columns and change the setting to Date/Time variable. To do that you have to
specify the format so that SIMCA can parse the cells and convert them to numbers. In this case the
resolution is minutes. The expression to use is yyMMdd HH:mm.
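To illustrate what this format expression means, the sketch below parses a time stamp of the same layout in Python. The `strptime` pattern `%y%m%d %H:%M` is our translation of yyMMdd HH:mm (two-digit year, month, day, then hours and minutes), and the example time stamp is hypothetical, not taken from SOVR.XLS.

```python
from datetime import datetime

# Python equivalent of the SIMCA expression yyMMdd HH:mm (our translation)
DATE_FORMAT = "%y%m%d %H:%M"

def parse_timestamp(text):
    """Convert a 'yyMMdd HH:mm' string into a datetime with minute resolution."""
    return datetime.strptime(text, DATE_FORMAT)

# Hypothetical cell content, for illustration only
parsed = parse_timestamp("920312 14:05")
print(parsed.isoformat())
```

Once parsed, the values can be ordered and placed on a time axis, which is how the Date/Time variable is used in the line plots later in this example.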



There are Y-variables in the data, and they should be marked as Y-variables. Select the last 6 columns.

Click on any of the column arrows for these variables and choose Y-variable.

The color changes, showing that these are Y-variables. We can also see that the last 3 Y-variables
have a lot of missing data, which can be seen in the next view.
The temporary spreadsheet looks like this.

Give a name and location for the project and press Finish.

SIMCA then prepares the primary dataset. SIMCA detects that there are variables with more
than 50% missing data and therefore gives a warning. We know that the last 3 Y-variables have this
structure, so we select No to All.



In SIMCA a PLS-project is prepared.

Analysis
Workset
SIMCA-P's default workset consists of all the observations in the primary data set with all variables,
scaled to unit variance and defined as X's or Y's as specified at import.
Now we want to make some changes to the default prepared model. Right-click on model M1 in the project
window and select Edit Model 1.

The Workset dialogue opens up. Click on tab Variables.

The X- and Y-variables are correct. The Date/Time variable is marked as excluded by default, which is OK. In
this application we will not use this variable in the models, only for line plots where time is on the X-axis.



The three variables TON_IN, HS_1 and HS_2 were varied according to a statistical design (RSM)
supporting a full quadratic model. We will expand the X matrix with the squares and cross-terms of these
3 variables.
Mark TON_IN, HS_1, HS_2. Press the button Sq & Cross and the squares and cross-terms of these 3
variables are displayed in the expanded list.
To Expand the X matrix with squares and/or cross terms press Use Advanced Mode and click on
Expand.
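What the expansion adds to the X matrix can be sketched as follows. This is a minimal illustration under our own assumptions: the function names are hypothetical, the numeric values are made up, and the sketch multiplies raw values, whereas SIMCA may center and scale the variables before forming the expanded terms.

```python
from itertools import combinations

def expansion_terms(names):
    """Return the variable pairs behind the squares and cross terms."""
    squares = [(name, name) for name in names]
    crosses = list(combinations(names, 2))
    return squares + crosses

def expand_row(row, names):
    """Add square and cross-term columns to one observation (a dict)."""
    expanded = dict(row)
    for a, b in expansion_terms(names):
        expanded[f"{a}*{b}"] = row[a] * row[b]
    return expanded

design_vars = ["TON_IN", "HS_1", "HS_2"]
term_names = [f"{a}*{b}" for a, b in expansion_terms(design_vars)]

# Made-up values for one observation
example_row = {"TON_IN": 2.0, "HS_1": 3.0, "HS_2": 4.0}
expanded = expand_row(example_row, design_vars)

print(term_names)
print(expanded)
```

Three designed variables give three squares and three cross terms, so the expansion adds six columns to the X block.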

At the bottom of the form there is a list box where you can choose model type. The default in this case is
PLS (we have defined Y-variables at the import). However, before we run the PLS it is wise to extract
information from the X and Y data separately. This can be achieved using model type PCA-X and/or
PCA-Y. Use the Model type list box and select PCA-X. Click on OK to exit the workset menu.

PCA-X
When choosing PCA-X, the model type changes in the project window.

Select Autofit from the analysis menu or use speed button in the toolbar.

The cross validation procedure finds 2 components which we will plot in a score plot. Use menu
Analysis: Scores: Line plot and change the axis of the plot according to this:

Press OK.



The score line plot shows how the process has moved over time. You can put labels on the points to see
the time. Obviously there is a shift in the process towards the left in the plot. This variation is captured in
the 1st component (t1) and can be illustrated using a line plot of t1 only. Use Analysis: Scores: Line plot
again, with Date/Time on the X-axis and t1 on the Y-axis.

Here we see the same thing; t1 is low in the beginning and again after a while. To find out the reason for this we
can use the contribution plot to see how the original X-variables have changed in these time periods. Mark
the left part of the score plot t1 vs. t2 with the mouse and press the contribution tool.



This plot shows that the process is to the left in the score scatter or line plot during periods when
no material (ore) is going into the process: Ton_In is low (which also affects other process variables).
This tells us that we should exclude these time periods from the analysis.

PC of Y
We also want to get an overview of the responses to see how the Y-variables are correlated with each other.
To do that, we have to switch to a PCA-Y model. This can be done by right-clicking on model 1 in the project
window and selecting New as model 1. A new workset dialogue will appear.
Many observations do not have values for the Y-variables: samples for lab analysis
were taken only at certain time points, which leads to missing values in the Y-part of the data
table. We only want to use observations that have Y-values. Click on the Observations tab, right-click
on the first column and choose to add the secondary ID Select.

Observations marked O do not have any Y-values (missing) and observations marked S do. Start by
clicking on one observation, use Ctrl-A to mark all and press the Exclude button to the right. Using the
Find and select function we can now look for the observations marked with S. First we must choose the
Select column, which is done by clicking on the arrow to the right of the search box and choosing Find in
Select column. Type S in the search box and all observations with S in the Select column will be
highlighted. Press the Include button to the right.

Change Model type to PCA-Y at the bottom of the workset and press OK.



A new model (PCA-Y) will appear in the project window (with N=85 observations).
Use autofit to create a model. No significant component is found (no structure in the multivariate space).
Calculate 2 components manually (menu Analysis: Two first components).
The model overview plot opens.

Click on the model summary line to open a table with the summary of the fit of the model. This table
displays R2X (fraction of the variation of the data explained by each component) and cumulative
R2X(cum), as well as the eigenvalues and the Q2 and Q2(cum)(cross validated R2). The six Y's are
correlated, and are summarized by two new variables, the scores t1 and t2, explaining 71.7% of their
variation.

Scores and Loadings


Scores
Select Analysis: Scores: Line plot to display the score plot of t1 vs. t2 with a line drawn between the
points. In Label Types mark Use identifier Obs ID (primary).



The scores t1 and t2, one vector each for dimensions 1 and 2, are the new variables computed as linear
combinations of the six responses and summarizing Y.
The score plot shows that the observations cluster in different groups. Each group represents a setting of
the experimental design. The process ran for a certain time at each of these settings (design points) to
reach stability. Measurements on the process (the observations in the score plot) were recorded every
minute. No obvious outliers are present.
Loadings
Select Analysis: Loadings: Scatter Plot to display the loadings p1 vs. p2.
In Label Types mark Use Identifier Var ID (Primary) and click on Save As Default Options to always
display variable names.

The loadings are the weights with which the variables are combined to form the scores, t. The loadings, p,
for a selected PC component, represent the importance of the variables in that component and show the
correlation structure between the variables, here the responses Y.
In this plot we see that PAR, FAR and %P_FAR are positively correlated with each other and negatively correlated with
%Fe_FAR. r_FAR dominates the second component, is here negatively correlated with PAR and has only a
small correlation with the other variables in component 2. %Fe-Malm is not correlated with any of these
variables in the first two components.



Click on Analysis: Next Component, and compute a third component. Display the loadings p1 vs. p3. The
third component (explaining 22% of the variation of the data) is dominated by %Fe-Malm. In the third
component this variable has a small positive correlation to %Fe-FAR, r_FAR and FAR and little to the
others.

Summary of Overview of Responses


No outliers were detected. All of the responses participate in the model, and are correlated to each other,
with the exception of %Fe-Malm, which is only slightly correlated to three of them.

PLS MODELING
The main objective is to develop a predictive model, relating the process variables X's to the output
measurements (responses) Y. The experimental design in three of the process variables accounts for an
important part of the variation of the Y's.
Autofit
To prepare the PLS model, right-click on model 2 (PCA-Y) in the project window and select New as
model 2. In the workset dialogue change Model type to PLS. The reason we do this is that in model 2 we
have everything we want already defined (X and Y, expanded X, selection of observations which have values for Y).

Click on Analysis: Autofit, or the speed button, to fit a PLS model with cross validation.
The Model Overview Plot displays R2Y(cum), the fraction of the variation of Y (all the responses)
explained by the model after each component, and Q2(cum), the fraction of the variation of Y that can
be predicted by the model according to the cross-validation. Values of R2Y(cum) and Q2Y(cum) close
to 1.0 indicate an excellent model.



Double-click on the model summary line to display a list of the fit of the model per component.
The present model is indeed excellent and explains 83.9% of the variation of Y, with a predictive ability
(Q2) of 80.6%.

Use the tool for 4 overview plots and 4 plots will be created: score and
loading plots, distance to model in X (DModX) and the X/Y summary. These plots are explained
below.
Summary: X/Y Overview
This plot displays the cumulative R2Y and Q2Y for every response. With the exception of %Fe-FAR and
%P-FAR, all responses have an excellent R2 and Q2.

Scores t1 vs. t2
Click on Scores: Scatter plot and use default t1 vs. t2.



Here we see the effect of the experimental design in 3 of the process variables. The clusters represent
different settings in the design. For each setting the process has been run for a while.
This plot is t1 vs. t2, but we have 6 components in the model. Using the arrow keys on the keyboard we can
change the axes in the plot (t1 vs. t3 etc.). In this case that would give 15 combinations. If we want to be sure
that there are no outliers we can use a line plot of Hotelling's T2, which combines all components in one
plot. Use menu Analysis: Hotelling's T2Range: Line plot. The properties can be changed so that it covers
other ranges of components (e.g. 3-Last).
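The number of score plots mentioned above follows from choosing 2 components out of 6. A short Python check, with component labels t1 to t6 as used in this tutorial:

```python
from itertools import combinations

# Score vectors of a 6-component model
components = [f"t{i}" for i in range(1, 7)]

# Every unordered pair of components is one possible score scatter plot
pairs = list(combinations(components, 2))

print(len(pairs))
print(pairs[:3])
```

This is why a single Hotelling's T2 line plot over all components is a more practical outlier check than paging through every pair of score axes.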

Over all components there is no outlier in the projection plane.


Scores t1 vs. u1
Right click and in properties select t1 vs. u1. We have a good relationship between the first summary of
the X's (t1), and the first summary of the Y's (u1).



To indicate the inner relation in PLS we can add a regression line fitted to the observations using the
corresponding tool. The plot shows the line (slope 1, 45°) and the model for it.

Loadings
The loading plot w*c1 vs. w*c2 shows the correlation between the X- and Y-variables.

In this plot we see the first two components and can make interpretations such as that PAR and FAR are
positively correlated with several X-variables lying to the right in the plot. We can also see that, of the
expanded variables, the squared term for HS_2 (speed of magnetic separator 2) has a significant
contribution. All the others are close to the origin, meaning that their influence is low.
As in the case of the score scatter plot, we have 6 components to consider. Change the Y-axis in the plot to
component 3 and the Y-variable %Fe_malm will be strong in this component.
Remember that the loading plot is mostly used to get an overview of the correlations between variables. If
we want to see the quantitative relationships, we use the coefficients. In this case, with 6 Y-variables, there will
be 6 coefficient plots to look at (below).
DModX
Distance to model is a measure of the residuals for the observation (default standardized). In this case
there are no observations far outside the critical limit (Dcrit, 95% default) indicating no outliers.



DModY
In PLS we also have a projection on the Y side, and the residuals there should be checked using Analysis: Distance
to Model: Y block.

Here there is no critical limit, but no observation has a high value compared to the others, indicating no
outliers.
Coefficients
Regression coefficients show the quantitative relationship between the X-variables and the responses. Use
menu Analysis: Coefficients: Plot to show a plot.



Many factors have a strong positive effect. All expanded terms except Ton_in*Ton_in are small. Use the
Property bar to change responses or components.

Use list box for Y-variable and shift between these. It is also possible to use the up/down arrow on the
keyboard (select one response first).
Variable Importance
A drawback of regression coefficients is that there is one set for each Y-variable. However, there is another
measure which summarizes the influence on all Y-variables simultaneously.
Use menu Analysis: Variable Importance: Plot. This plot shows the importance of the terms in the model,
as they correlate with Y (all the responses) and approximate X.

VIP values > 0.8 are considered strong.

Prediction
Create a prediction set
To test the prediction properties of a PLS model you often have test data sets (prediction sets). There are
several ways to handle this. One is to use external data which you import as secondary datasets to
SIMCA. Another way is to split the primary data so that you create models on one part and use the other
part for prediction. In this case we use the second approach.



To select observations we can use the workset dialogue (New as model 3, use the Observations tab and
exclude observations). Another way is to do this interactively in a plot using the exclude tool (see below).
A third way, which is chosen here, is to use a dockable window. Make sure that model 3 in the project
window is marked. Activate the dockable window through menu View: Dockable Window: Observations.

Use the pin to lock this window (afterwards you can un-pin it so that it disappears to the left).
This way you can have plots with observations (score plots, DModX etc.) open and see what happens.
In the dockable Observation window, hold the Ctrl key and mark observations 140, 176, 177, 199, 200,
256, 257, 326, 327, 351, 352, 371, 372, 401, 402, 427, 428, 451, 452, 491, 492, 501, 502, 526, 527, 546,
547, 566, 567. If you have a plot open you can see that we have marked observations evenly distributed
over all observations.
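The same split can be expressed in a few lines of Python, which also makes it easy to count the selected observations. The observation numbers are those listed above; numbering the full data set 1 to 572 is our assumption, and in SIMCA only the observations with Y-values (marked S) actually remain in the workset.

```python
# Observation numbers marked for the prediction set (from the text above)
holdout_ids = [140, 176, 177, 199, 200, 256, 257, 326, 327, 351, 352,
               371, 372, 401, 402, 427, 428, 451, 452, 491, 492, 501,
               502, 526, 527, 546, 547, 566, 567]

def split_ids(all_ids, holdout):
    """Split observation numbers into a workset part and a prediction part."""
    holdout_set = set(holdout)
    workset = [i for i in all_ids if i not in holdout_set]
    prediction = [i for i in all_ids if i in holdout_set]
    return workset, prediction

# Assumption: observations are numbered consecutively from 1 to 572
all_ids = list(range(1, 573))
workset_ids, prediction_ids = split_ids(all_ids, holdout_ids)

print(len(prediction_ids), len(workset_ids))
```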
The marked observations can be seen in the active plots (e.g. score plot t1 vs. u1).

Use the exclude tool to remove these observations. Un-pin the dockable window (it disappears to
the left of the window). In the project window a new model is created in which the marked observations
are no longer included.
Use Autofit to fit the PLS model.
The model becomes less complex (4 components), but if we look at the X/Y Overview plot it is very similar to
the one where all observations are present.

We can look at this model in the usual way showing score plots, loading plots etc.
We now have an excellent relationship between t1 and u1 with no outliers.
In the first two components, PAR and FAR are positively correlated with all the load variables and
negatively correlated with r_FAR, %Fe-FAR and %Fe_Malm. The model is almost linear except for HS_2
and its squared term, which dominate the second component.



Prediction
We can now use the model to predict the outcome of the process for the prediction set observations.
First we have to choose a model to use for prediction. Mark model 4 in the project window or use the list
box in the upper left. Use menu Prediction: Specify Predictionset: Specify.

Remove all observations from the right list. In the left list, Workset Complement is selected (all observations not
in model 4). In this case we do not want observations without Y-values. Right-click in the left list and
select to show the observation label Select. Use the Find function (first click on the arrow and choose to find in the
Select column) and search for S (observations with Y-values).

29 observations with Y-values (S) are selected as the prediction set. Press OK.
Use menu Predictions: Y Predicted: Scatter plot. The observed vs. predicted plot for PAR is displayed.



For PAR and FAR (select from properties) we have excellent predictions; they are less good for the other
responses. RMSEP is shown at the bottom of the plot. Use the property bar for the plot to change the
Y-variable.
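RMSEP (root mean square error of prediction) summarizes how far the predictions are from the observed values in the prediction set. A minimal Python sketch of the usual definition follows; the numbers are made up and are not results from this example.

```python
import math

def rmsep(observed, predicted):
    """Root mean square error of prediction for one response."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(observed))

# Made-up observed and predicted values for three prediction set observations
observed = [10.0, 12.0, 14.0]
predicted = [11.0, 12.0, 13.0]

print(rmsep(observed, predicted))
```

RMSEP is expressed in the units of the response, so values are comparable between models for the same Y-variable but not between different Y-variables.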

Also look at DModXPS (under prediction menu).

No observation is far away from the model.


Sometimes it is useful to show both the data used for modeling and the data used for prediction in the same plot. To do
that we have to specify the prediction set again. Now we use the whole dataset, but as before we only use
observations marked with S. Keep the Y predicted scatter plot open and create the new prediction set (85
observations selected). Right-click in the plot and choose to set the color according to predictions.

Workset data and prediction data can now be seen in the same plot.



It is also possible to see the actual predictions for the Y-variables using menu Predictions:
Prediction list. Right-click and choose Properties to select what you want to see in the list.
To the left in the list there are labels, Set (workset/testset), PModXPS (probability of belonging to the active
model) and DModX.

To the right in the list the actual and predicted Y-variables can be seen.

Summary
This example shows that statistical design in the dominating process variables gives data with high
quality that can be used to develop good predictive process models. With multivariate analysis we extract
and display the information in the data.



Hierarchical Modeling

Background
This example deals with a polymer production plant. It is a continuous process which went out of control
at around time point 80 after a fairly successful campaign to decrease the side product (response variable
y6).

Objective
The manufacturing objective was to minimize the yield of side product (y6) and maximize product
strength (y8). The data analysis objective is to investigate the use of hierarchical modeling to:
overview the process
detect the process upset

To understand the relationship between the two most important y variables (y6= impurity, and y8= yield)
and the three steps of the process, feed (x1-x7), reactor (x8-x15), and purification and work up (x16-x25).
We shall do the following, using obs. 1-79 as a training set:
1. PLS model of X= feed (x1-x7) with y6 and y8 (Block 1)
2. PLS model of X= reactor (x8-x15) with y6 and y8 (Block 2)
3. PLS model of X= purification (x16-x25) with y6 and y8 (Block 3)
4. PCA model of less important y's (y1 to y7 not including y6) (Block 4)
5. Top level hierarchical model with scores of blocks 1-3 as X and scores
of block 4 plus y6 and y8 as Y.
The objective of Block Models 1 to 4 is to summarize the various steps of the process by scores, which are then
used as variables in the top level model.

Data
The data set contains 33 variables and 92 hourly observations. The measured variables comprise seven
controlled process variables (x1-x7), 18 intermediate process variables (x8-x25), and eight output
variables (y1-y8). The controlled process variables (x1-x7) relate to the feed of the process, variables
x8-x15 reflect reaction conditions and variables x16-x25 correspond to a purification step. As mentioned
above, y6 and y8 are the key outputs. All the data are coded so as not to reveal any proprietary
information.
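The block structure described above can be written down as a small lookup table. The following Python sketch is only an illustration outside SIMCA-P; the names `blocks` and `var_range` are assumptions made for this example.

```python
def var_range(prefix, first, last):
    """Return variable names such as x1..x7 for the given prefix."""
    return [f"{prefix}{i}" for i in range(first, last + 1)]

# Block membership as described in the Data section.
blocks = {
    "feed": var_range("x", 1, 7),            # controlled process variables
    "reactor": var_range("x", 8, 15),        # reaction conditions
    "purification": var_range("x", 16, 25),  # purification step
    "less_important_y": [v for v in var_range("y", 1, 8) if v not in ("y6", "y8")],
    "key_y": ["y6", "y8"],
}

for name, variables in blocks.items():
    print(f"{name}: {len(variables)} variables")

total_variables = sum(len(v) for v in blocks.values())
print("total:", total_variables)
```

The block sizes add up to the 33 variables of the data set.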

Outline
The steps to follow in SIMCA-P are:
Create the project by importing the data set
Generate and fit the three PLS models for X-blocks 1-3 (obs. 1-79), and
mark them as base hierarchical.
Generate and fit the PC model for block 4, and mark it as base
hierarchical.
Generate and fit the top level hierarchical model
Interpret the hierarchical model
Validate the hierarchical model with the test set (obs. 80-92)
An overview of the structure can be seen below.

Block      Content               Variables      No. of variables
Block X1   Input and feed        x1 - x7        7
Block X2   Reaction conditions   x8 - x15       8
Block X3   Purification step     x16 - x25      10
Block Y1   Less important Ys     y1-y5 & y7     6
Block Y2   Important Ys          y6 & y8        2

Analysis
Create the project
Start a new project. The data set name proc1a.dif
Start SIMCA-P and create a new project from FILE: NEW. The data set name is HI-PROC.XLS.

SIMCA has interpreted the spreadsheet correctly. Mark the variables called y1-y8 as Y-variables.
Click on Next, give the project a name and choose the destination directory. Select Finish and the project
is created. In the project window the first unfitted model is prepared as a PLS model (we have defined
Y-variables).

Overview of data
The first step is to use PCA to get an overview of the data table. To do that, right click on the model and
choose Edit Model 1. Change model Type to PCA X&Y and press OK. The model type shifts to PCA
X&Y in the project window.
Another way to do it is to mark model 1 in the project window and use menu Analysis: Change Model
Type.



A common procedure when new data are imported is just to calculate two components to be able to look
at the data. Select menu item Analysis: 2 First Components to compute two components.
Create a score and loading plot to look at the information in data.

The score plot of t2 versus t1 (below left) indicates a clear trend towards the end of the time period. The
corresponding loading plot (below right) demonstrates how the different variables contribute to the first
two principal components. The location of the two important responses (y6 and y8) is highlighted by an
increased font size. y6 is modeled by the first component and y8 mostly by the second component.
The main conclusion is that hierarchical modeling should be applied to observations 1-79 as these
represent normal process behavior. We want to model the process when it behaves within allowed
variation.
From this overview and from knowledge of the process, it would be interesting to model the process in
logical blocks. Below is a figure showing the system in blocks and the approach: calculate
base models for the blocks and finally combine the base models into a top (super) model summarizing
the system.

Base level:
Block      Content               Variables      No. of variables
Block X1   Input and feed        x1 - x7        7
Block X2   Reaction conditions   x8 - x15       8
Block X3   Purification step     x16 - x25      10
Block Y1   Less important Ys     y1-y5 & y7     6
Block Y2   Important Ys          y6 & y8        2

Top level PLS model:
Block   Content                No. of variables
X1      scores from model 1    4
X2      scores from model 2    4
X3      scores from model 3    4
Y1      scores from model 4    3
Y2      y6 & y8                2



Creating the base models for all blocks
Use menu Workset: New or right click on model 1 in the project window and select New as Model 1.
In the workset dialogue, click on the observation tab and exclude observations 80-92.

Next step is to define variable classes. Click on the variable tab. Mark variables for block 1 and check
block 1 (to the right).

Do the same for block 2 and 3 choosing the correct X- and Y variables.



Variables y6 and y8 are now marked for block 3 only, but they should belong to blocks 1, 2 and 3. Mark variables y6 and
y8 and, in the list box for blocks, check blocks 1, 2 and 3.

The last block is the rest of the Y-variables. Mark them and change variable type to X and block 4.
Change model type (bottom of workset dialogue) to PLS-Hierarchical.

Click on OK to exit the workset. The project window is updated showing a class model hierarchy CM1.



To fit these models use class fitting button or menu Analysis: Autofit class models. A new window
opens but press OK and the 4 models will be fitted.
Three of the models have only 1 component but we want to extract more components to exhaust variation
from the data. Therefore we re-fit the models and in the window that appears change the number of
components to 4 (press select all and change No. of Components to 4 and press Set).

Press OK and the models are re-fitted. In this way we have extracted 60-90% of the variation in the X-
variables. In the project window we can also see that the models have been marked with a B (hierarchical
Base model).

To make it clearer, give the models names. Right click on each model and choose Change Model
Title.

The next step is to go through the 4 models and see if there are any outliers.

Summarizing the Feed


The score plot of t1 versus t2 (below left) suggests that the first observation is actually an outlier in the
feed variables. The contribution plot for this observation (below right) indicates that the abnormal
behavior is due to x6.



The outlier is extreme and will be excluded from the model.
Right click on CM1 and select New as Model Group CM1. The workset dialogue opens; under the
observation tab exclude observation 1. When you press OK SIMCA will tell you that a new class model
structure will be created (CM2) where the models will appear unfitted (with observation one excluded).
Fit 4 components as for CM1 and put titles on the models.

In the score plot of t1 versus t2 (below left), the observations now display a much better distribution. The
corresponding loading plot (below right) indicates that y6 is modeled by the first and y8 mainly by the
second component.

Summarizing the Reaction Conditions


In the second stage, PLS was used to relate Block X2 to Block Y2 using observations 2-79 as the training
set.
There is a minor outlier on the second component (observation 73, below left) but not extreme enough to
justify its removal. The PLS loading plot (below right) suggests that y6 is related to x9, x10 & x14 and y8
to x12.



Summarizing the Purification Step
In this case, the score plot of t1 versus t2 (below left) indicates a good spread of the observations. The
two responses are modeled well, y8 mainly by the first component and y6 mainly by the second
component (below right).

Summarizing the Less Important Y-variables


The final base-level model consists of PCA of Block Y1, the less important response variables.
The score plot of t1 versus t2 (below, top left) suggests that observation 65 is an outlier. Both the
corresponding loading plot (below, top right) and the appropriate contribution plot (below, bottom left)
for this observation indicate that the main cause for the deviating behavior is variable y7, which is much
lower for observation 65 than for the average observation. Observation 65 is also strange in DModX
(below, bottom right) and consequently this sample ought to be deleted.



Right click on M5 and choose Edit Model. Exclude observation 65 and press OK. SIMCA will now
create a third CM group with an unfitted model for the less important Y variables. Fit 4 components
manually and give the model a title.

Score and loading plots of the first two components are displayed below. The observations are now much
more homogeneously distributed. Basically, the first component accounts for variables y2 and y4 and, to
a lesser extent, y1 and y7. The second component is largely related to y1 and y5.



Top level
First, mark model 1 in the project window and then use menu Workset: New.
Under the observation tab exclude observations 1, 65 and 80-92.
Under the variables tab start by excluding all variables (mark one, then press Ctrl+A on the keyboard and
then press the exclude button). Scroll down in the variable list to show the new variables that have been
added from the base models (the scores from the base models). Mark Y6, Y8 and the 4 scores from
model 10 as Y variables and score variables from models 6, 7 and 8 as X variables and press OK.
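The selection made in the workset dialogue can be summarized in a few lines of Python. This is only a bookkeeping sketch, not a SIMCA-P script; the helper `score_names` is an assumption, and four score variables per base model are assumed because four components were fitted in each.

```python
# Observations 1-92; the workset excludes the two outliers and the test period.
excluded = {1, 65} | set(range(80, 93))
workset_obs = [i for i in range(1, 93) if i not in excluded]

def score_names(model, n_components):
    """Score variable names in the $M<model>.t<component> style used in the text."""
    return [f"$M{model}.t{a}" for a in range(1, n_components + 1)]

# X: scores from the feed, reactor and purification base models.
x_vars = [name for m in (6, 7, 8) for name in score_names(m, 4)]
# Y: the two key responses plus the scores of the model of the other y's.
y_vars = ["y6", "y8"] + score_names(10, 4)

print(len(workset_obs), "observations,", len(x_vars), "X variables,", len(y_vars), "Y variables")
```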



Press autofit and SIMCA finds three significant components.

The top level PLS model yields four significant components explaining 51.7% of the Y-variance.
The two most important responses, Y6 and Y8, are modeled well (below, top right). The score plot of t1
versus t2 (below, bottom left) indicates that the process starts down to the right (high values of y6),
moves up to the left (lower Y6), and is then manipulated to give lower values of y6 (lower left quadrant).
The process then becomes unstable and lurches back to the right.
In the corresponding loading plot (below, bottom right), y6, the side product, is on the right-hand side of
the plot, positively correlated with the first component of the feed model ($M6.t1) and the second
components of both the reactor ($M7.t2) and purification models ($M8.t2). Y6 is also negatively
correlated with the first component of the model of the less important Y-variables ($M10.t1).
Meanwhile, Y8 is related to the first components of both the reactor ($M7.t1) and the purification models
($M8.t1). Y8 is also related to the second component of the model of the less important Y-variables
($M10.t2).

With the contribution tool we can investigate the patterns suggested by the loading plot above in more
detail. Simply double click on any score variable (e.g. $M6.t1) and the corresponding base-level loading
plot will open, revealing which of the original variables influence that particular score.



For example, $M7.t2 is positively correlated with Y6. Double click on this score variable and bring up the
loading plot shown below. We conclude that y6 is positively correlated with x9 and x10 and negatively
correlated with x8, x11 and x14.

This gives us zoom-in/zoom-out functionality. In the top-level loading plot, we get a feel for the
relationships between the two important y-variables and the various stages of the process (feed, reaction,
purification). In the base-level loading plot, we understand which of the original variables influence these
outputs.
Use the top-level loading plot to investigate the underlying relationships for other score variables. Just
double click on the appropriate points.
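The zoom-in step relies on the naming of the top-level score variables, which encodes the base model and the component. As a small illustration (not a SIMCA-P feature), the sketch below splits such a name into its parts; the function `parse_score_name` and the dictionary `BASE_MODELS` are assumptions, with model numbers taken from the text above.

```python
import re

# Base models as referred to in the text: M6 feed, M7 reactor, M8 purification,
# M10 less important Y-variables.
BASE_MODELS = {6: "feed", 7: "reactor", 8: "purification",
               10: "less important Y-variables"}

def parse_score_name(name):
    """Split a name such as '$M7.t2' into (block title, component number)."""
    match = re.fullmatch(r"\$M(\d+)\.t(\d+)", name)
    if match is None:
        raise ValueError(f"not a score variable name: {name!r}")
    model, component = int(match.group(1)), int(match.group(2))
    return BASE_MODELS[model], component

print(parse_score_name("$M7.t2"))
```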

Prediction
It is known that the process became unstable after time point 80 and it would be interesting to fit all data
to the top model. Mark the top model in the project window. Use menu Predictions: Specify and select the
original data table.

Under menu Predictions score plots and DModX plots can be used to see how the original data fits to the
hierarchical top model.
The DModXPS plot (below, top right), in particular, illustrates this very clearly. The contribution plot
(below, bottom left) for time point 91 (large DModXPS) shows the feed (model M6, components 1, 2 and
3) as being the main cause of the problem. Zooming in (double click on a column in the contribution plot)
on the feed in all 3 components (plot only shown for t2, below bottom right) points to variable 6, which is
much too low.



Conclusions
The hierarchical approach to multivariate analysis greatly enhances our ability to understand complex
problems. The zoom-in/zoom-out capability allows us first, to understand complex relationships in terms
of process blocks, and then, to zoom in on an individual block to resolve the details in terms of the
underlying process variables.



NIR

Background
The following example originates from a research project on peat in Sweden. Peat is formed by an
aerobic microbiological decomposition of plants followed by a slow anaerobic chemical degradation. Peat
in Sweden (northern hemisphere in general) is mainly formed from two types of plants, Sphagnum
mosses and grass of Carex type. Within the main groups there is variation among the species. Depending
on location, climate etc. there are several other plants involved in the peat forming process.
In the project many different types of chemical analyses were performed to get detailed information about
the material and to investigate differences among different peat types. Chemical analysis was performed
according to traditional methods (GC, HPLC, etc.) which often were laborious and time consuming. To
speed up the analysis of samples, Near Infrared Spectroscopy (NIR) together with multivariate calibration
was introduced. This strategy was found to work very well and after the calibration phase, samples were
analyzed in minutes instead of weeks.
In this tutorial we selected a subset of samples, which represents the typical variation of peat in Sweden.

Objective
The objective of this study is to model and predict different constituents of samples of peat directly from
their NIR spectra. 41 samples of peat, mainly of two types Sphagnum and Carex, were subjected to NIR
spectroscopy. The spectra were recorded at 19 wavelengths (19 filters) with a reflectance instrument
(log(abs)) and scatter corrected before the analysis.
For this objective, we will now develop a PLS model relating the X variables (NIR spectra) to the Y
variables (peat constituent concentrations measured by traditional analysis).

Data
Variables
Variables 1-19 represent spectra from the NIR instrument, which in this case was a 19 channel filter
instrument. Spectra are recorded as Log (Absorbance) and then scatter corrected by a MSC procedure.
Variables 20-46 represent different chemical analyses, which the NIR spectra can be calibrated against.
Variable 27 (Klason l) is Klason Lignin (rest after hydrolysis) and variable 28 is Bitumen, which
represents carbohydrates soluble in acetone.
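For readers who want to see what a scatter correction of the MSC type does, here is a minimal Python sketch. It assumes the common formulation in which each spectrum is regressed on the mean spectrum and then corrected with the fitted offset and slope; the exact procedure used for the peat data is not documented here, and the spectra below are made up.

```python
def msc(spectra):
    """Multiplicative scatter correction against the mean spectrum.

    Each spectrum x is regressed on the mean spectrum m (x ~ a + b*m)
    and replaced by (x - a) / b.
    """
    n_vars = len(spectra[0])
    mean_spec = [sum(s[k] for s in spectra) / len(spectra) for k in range(n_vars)]
    m_bar = sum(mean_spec) / n_vars
    m_ss = sum((m - m_bar) ** 2 for m in mean_spec)
    corrected = []
    for s in spectra:
        s_bar = sum(s) / n_vars
        b = sum((m - m_bar) * (v - s_bar) for m, v in zip(mean_spec, s)) / m_ss
        a = s_bar - b * m_bar
        corrected.append([(v - a) / b for v in s])
    return corrected

# Three made-up spectra that differ only by an offset and a scale factor.
base = [0.2, 0.5, 0.9, 0.4]
spectra = [
    list(base),
    [0.1 + 1.5 * v for v in base],
    [-0.05 + 0.8 * v for v in base],
]
corrected = msc(spectra)
for row in corrected:
    print([round(v, 4) for v in row])
```

Because the three example spectra differ only by offset and scale, they coincide after the correction.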

Observations
From a huge number of peat samples 41 were selected, representing the main variation of peat in Sweden.
The sample (observation) names are coded with 20 characters in all. Each position in the names carries certain
information. In the plots a sub-string of two characters (positions 6 and 7) is often used. Position 6
represents the degree of decomposition, L (low), M (medium) and H (high). Position 7 represents peat
type, S (Sphagnum) and C (Carex).

Outline
Make a PLS model relating the NIR spectra variables to the peat
constituents in order to:
Understand and interpret the relationship between the spectra (X) and peat
composition (Y variables).
Develop a separate PLS model for each type of peat (Sphagnum and Carex),
to:
1) Increase the precision of the calibration.
2) Be able to classify and predict peat types.

The steps to follow in SIMCA are:


Define the project: Import the primary data set.
Prepare the data (Workset menu).
a) Specify which variables are process variables (X) and which are
responses (Y)
b) Transform the variables
The responses are concentrations of the chemical constituents of peat and
their variation is non-linear, so a Log transformation is warranted:
log(Y + 0.1), where 0.1 is added to make sure that all values are positive before the
transformation.
c) Group the observations in 2 classes for peat type Sphagnum and Carex.
Fit the model, a PLS of all the data (Analysis menu).
Fit a PLS model for each of the peat types, Sphagnum and Carex.
Use the PLS model for predictions and classification (Prediction menu).

Create the project


Start a new project. The data set name now is NIRKHAM.XLS
Start SIMCA and create a new project from FILE: NEW.
The import wizard opens.

The first two columns are correctly marked as observation numbers and names and the first row is
variable names.
Mark the variables starting with Ramos to end, and from the combo box select Y Variable.



SIMCA-P marks these variables as Y (response) variables.
Click on Next to open the Project specification page. You can change, as desired, the destination folder,
or the project name. Click on Finish, the data set Nirkham is imported.

Prepare the data


Default Workset
SIMCA-P's default workset consists of all the observations in the primary data set with all variables,
scaled to unit variance and defined as X's or Y's as specified at import. This is the starting workset when
you select Workset | New.

Transform the variables


Mark model 1 in the project window, right click and choose Edit Model 1.
Mark all the Y variables, select Log, with C1= 1 and C2=0.1 (some of the concentrations are 0.0), and
click on Set.
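The effect of the Log transformation with C1 = 1 and C2 = 0.1 can be checked by hand. The sketch below assumes the transformation has the form log(C1*Y + C2) with base 10; the base and the example concentrations are assumptions, not values taken from NIRKHAM.XLS.

```python
import math

def log_transform(value, c1=1.0, c2=0.1):
    """Return log10(c1 * value + c2); base 10 is an assumption here."""
    return math.log10(c1 * value + c2)

# Made-up concentrations, including a zero that could not be logged directly.
concentrations = [0.0, 0.9, 9.9]
transformed = [log_transform(v) for v in concentrations]
print(transformed)
```

The constant C2 is what allows the zero concentration to be transformed at all.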



Group observations in classes:
Select the Observations tab and display the secondary ID's
Right click on the Primary ID's and select Observation label.

Use the Find function to select Sphagnum and Carex peat and set class 1 and 2 for them.

Click on the arrow to the right of the find selection box and choose to find in the Obs. Sec. ID:1 column.
We want to use character 7 in the ID so we use the following search string ??????C* to search for Carex
observations, set them as class 1. Do the same for Sphagnum but change search string to ??????S* and set
the class to 2.
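The wildcard search strings work like ordinary shell-style patterns: each ? stands for exactly one character and * for any remainder. The same selection can be sketched in Python with the standard `fnmatch` module; the sample IDs below are hypothetical and only follow the convention that position 7 holds the peat type.

```python
from fnmatch import fnmatchcase

# Hypothetical secondary IDs: position 6 = decomposition, position 7 = peat type.
sample_ids = ["AB123LS0001", "AB124HC0002", "AB125MS0003", "AB126ME0004"]

def assign_class(sample_id):
    """Return 1 for Carex, 2 for Sphagnum, None for anything else."""
    if fnmatchcase(sample_id, "??????C*"):
        return 1
    if fnmatchcase(sample_id, "??????S*"):
        return 2
    return None

classes = {sid: assign_class(sid) for sid in sample_ids}
print(classes)
```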



Change model type to PLS-Class.

Click on OK to exit the Workset window. The project window will show a CM-group with the 2 models
prepared. Change the Title of the models to Carex and Sphagnum.

Analysis
PLS model of all the samples
Right click on the CM1 mark in the project window and select New as Model Group 1. Change Model
Type to PLS and click OK. In this way we will get a model with all observations.



The project window is updated and an unfitted PLS model with all observations is created.

Autofit the model.

The model overview plot is updated as the model is fitted. This plot displays R2Y cumulative by
component and Q2 Y cumulative by component. R2 Y is the fraction of the variation of Y (all the
responses) explained by the model after each component, and Q2Y is the fraction of the variation of Y
that can be predicted by the model according to the cross-validation. Values of R2Y(cum) and Q2Y(cum)
close to 1.0 indicate an excellent model.
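To make the difference between R2Y and Q2Y concrete, the sketch below computes both for a deliberately simple case: one predictor, an ordinary least-squares line, and made-up data. Leave-one-out cross-validation is used only for brevity and is not necessarily how SIMCA-P partitions the data; the function names are assumptions.

```python
def r2(y_obs, y_est):
    """1 - SS_residual / SS_total around the mean of y_obs."""
    mean = sum(y_obs) / len(y_obs)
    ss_tot = sum((y - mean) ** 2 for y in y_obs)
    ss_res = sum((y - e) ** 2 for y, e in zip(y_obs, y_est))
    return 1.0 - ss_res / ss_tot

def fit_line(x, y):
    """Least-squares slope and intercept for one predictor."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return slope, my - slope * mx

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 5.8]

# R2Y: fit on all observations, evaluate on the same observations.
slope, intercept = fit_line(x, y)
r2y = r2(y, [slope * a + intercept for a in x])

# Q2Y: each observation is predicted by a model fitted without it.
cv_pred = []
for i in range(len(x)):
    s, b = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
    cv_pred.append(s * x[i] + b)
q2y = r2(y, cv_pred)

print(f"R2Y = {r2y:.3f}, Q2Y = {q2y:.3f}")
```

Q2Y is lower than R2Y because every prediction is made for an observation the model has not seen.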



Double click on the Model Summary line to display the corresponding list.

Multivariate calibration with NIR spectra often leads to many components due to the high precision of the
data. The present model is indeed excellent and explains 88.2% of the variation of Y, with a predictive
ability (Q2) of 73.9%.
Summary: X/Y Overview
Click on Analysis | Summary | X/Y Overview | Plot and display the cumulative R2Y and Q2Y for every
response. Use the Properties page to select variable labels and click on Save As default Options to
always have variable names. With the exception of Bitumen all responses have an excellent R2 and Q2.

Scores t1 vs. t2
Click on Analysis : Scores : t1 vs. t2 plot. Observation 32 lies far away in the second component,
indicating that sample 32 is different with respect to NIR spectra.



Contribution plot
To understand why sample 32 differs from the others, double click on observation 32 in the Scores t1 vs.
t2.

This contribution plot displays the differences, in scaled units, for all the terms in the model, between the
outlying observation 32 and the normal (or average) observation, weighted by w*1 and w*2 (the importance of
the X-variables in components 1 and 2).
In the plot we see some spectral variables close to 8 standard deviations, indicating some contamination
in this sample. We shall remove sample 32.
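The idea behind such a contribution plot can be sketched in Python. The example takes the difference between one observation and the average, expresses it in standard deviation units, and multiplies it by a weight per variable. How the two weight vectors are combined is an assumption here (the length of the pair w*1, w*2), and all numbers are made up.

```python
import math
import statistics

def score_contribution(obs, reference_rows, weights):
    """Weighted, scaled difference between one observation and the average."""
    contributions = []
    for j, w in enumerate(weights):
        column = [row[j] for row in reference_rows]
        mean = statistics.mean(column)
        sd = statistics.stdev(column)
        contributions.append(w * (obs[j] - mean) / sd)
    return contributions

# Made-up reference observations (rows) for three variables (columns).
reference_rows = [
    [1.0, 10.0, 5.0],
    [2.0, 12.0, 5.5],
    [3.0, 11.0, 4.5],
    [2.0, 11.0, 5.0],
]
suspect = [2.5, 11.5, 9.0]     # made-up deviating observation
w1 = [0.6, 0.5, 0.6]           # hypothetical weights, component 1
w2 = [0.3, -0.6, 0.5]          # hypothetical weights, component 2
weights = [math.hypot(a, b) for a, b in zip(w1, w2)]  # assumed combination

contributions = score_contribution(suspect, reference_rows, weights)
largest = max(range(len(contributions)), key=lambda j: abs(contributions[j]))
print("variable with the largest contribution:", largest + 1)
```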
Comparing the spectra of observation 32 and 39
Mark both observations, right click and select Plot Xobs to display the spectra of these 2 observations.



Scores t1 vs. u1
We have a relatively good relationship between the first summary of the X's (t1), and the first summary
of the Y's (u1), with some spread in the data.

To display informative labels, select in properties Obs Sec ID, start in position 6 for length 2.



You can now distinguish two groups of observations, S Sphagnum peat and C Carex peat and see how
low (L), medium (M) and high (H) decomposition is spread in the classes.
Scores u1 vs. u2
The projection of the samples in the Y space (traditional chemical analyses) does not show observation 32
as an outlier, as the Scores plot does. NIR spectroscopy can detect very small changes in chemical composition
(PPM level) compared to the traditional analyses, which typically have large measurement errors (3-50%).
With NIR spectroscopy one achieves better control of the samples.

Loadings w*c1 vs. w*c2


The w*'s are the weights that combine the original X variables (not their residuals in contrast to w) to
form the scores t. In the first component w* is equal to w. The w*'s are related to the correlation between
the X variables and the Y scores u. X variables with large values of w* (positive or negative) are highly
correlated with u (and thereby Y).
The c's are the weights used to combine the Y's (linearly) to form the scores u. The c's express the
correlation between the Y's and the t's (X-scores).
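How the weights w (equal to w* in the first component), the scores t and u, and the Y weights c relate to each other can be seen in a minimal NIPALS-style calculation of one PLS component. This is a textbook-style sketch in plain Python with made-up data, not SIMCA-P's implementation; the function names are assumptions.

```python
def column_center(rows):
    """Subtract the column means from every row."""
    n = len(rows)
    means = [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]
    return [[r[j] - means[j] for j in range(len(means))] for r in rows]

def first_pls_component(X, Y, n_iter=100):
    """One NIPALS PLS component for centered X (N x K) and Y (N x M)."""
    n, k, m = len(X), len(X[0]), len(Y[0])
    u = [row[0] for row in Y]                      # start from the first Y column
    for _ in range(n_iter):
        w = [sum(X[i][j] * u[i] for i in range(n)) for j in range(k)]
        norm = sum(v * v for v in w) ** 0.5
        w = [v / norm for v in w]                  # X weights, unit length
        t = [sum(X[i][j] * w[j] for j in range(k)) for i in range(n)]   # t = X w
        tt = sum(v * v for v in t)
        c = [sum(Y[i][q] * t[i] for i in range(n)) / tt for q in range(m)]
        cc = sum(v * v for v in c)
        u = [sum(Y[i][q] * c[q] for q in range(m)) / cc for i in range(n)]  # u = Y c / c'c
    return w, t, c, u

# Made-up data: 5 observations, 3 X variables, 2 Y variables.
X = column_center([[1.0, 2.0, 0.5], [2.0, 1.0, 1.5], [3.0, 4.0, 2.5],
                   [4.0, 3.0, 3.0], [5.0, 6.0, 4.5]])
Y = column_center([[1.2, 0.7], [1.9, 1.4], [3.1, 2.2], [4.2, 2.9], [4.8, 4.1]])

w, t, c, u = first_pls_component(X, Y)
print("w* (= w for component 1):", [round(v, 3) for v in w])
print("c:", [round(v, 3) for v in c])
```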



This plot shows how the different chemical compounds correlate to the different parts of the NIR spectra.
Plots displaying the loadings, one component at a time, may be more informative.
Loadings: Column plot w*c1
Click on Analysis | Loadings | Column plot w*c1. This plot shows the importance of different parts of
the NIR spectra, in the first component, to explain the variation among the constituents of the peat.

Predictions
We now have two good models describing the relation between NIR spectra and Chemical composition of
peat and they can be used to classify peat samples as Sphagnum or Carex.
In this tutorial we do not have new peat samples. However, we will use the data set and classify every
sample with respect to the two models. We first will want to remove sample 32.

Making a prediction Set


By default the prediction set is all of the primary data set.



Separate PLS models for the Sphagnum and Carex
Edit CM1 and exclude observation 32 in the workset dialogue. Press OK and the class autofit
specification opens, press OK.

The model for class 1 (Carex) has 4 components and the model for class 2 (Sphagnum) has only one
component. This is probably due to variation in X (NIR) that confuses the models. Therefore
OPLS/O2PLS class models are created to separate orthogonal variation in X that has nothing to do with
variation in Y.
In order to use the 2 models in prediction we can use more of the variation in the X-matrix (NIR).
Therefore, both models are expanded with more components (7 in each, as for the total model with all
samples). Mark a model in the project window and add components using Next Component or
the corresponding toolbar icon. The project window will look like below.

Coomans' Plot
Exclude sample 32 from the Prediction set (Predictions | Specify Prediction set | Remove observation 32
from prediction set) and display the Coomans plot.
This plot displays the Distance to the model of every observation with respect to model M2 and M3, and
shows a very good separation between the Sphagnum and the Carex peat samples.



Sample 21 (ME) is correctly classified as being neither a Sphagnum sample nor a Carex peat sample.
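The reading of a Coomans' plot can be expressed as a simple rule: an observation fits a class model if its distance to that model is below the critical distance. The Python sketch below applies this rule to made-up distances; the critical distance of 1.5 and the sample values are hypothetical and not taken from the peat models.

```python
def coomans_region(d_carex, d_sphagnum, crit_carex, crit_sphagnum):
    """Return which class model(s) an observation fits, if any."""
    fits_carex = d_carex <= crit_carex
    fits_sphagnum = d_sphagnum <= crit_sphagnum
    if fits_carex and fits_sphagnum:
        return "both"
    if fits_carex:
        return "Carex"
    if fits_sphagnum:
        return "Sphagnum"
    return "neither"

# Made-up distances: (distance to Carex model, distance to Sphagnum model).
samples = {"A": (0.8, 3.1), "B": (2.9, 0.7), "C": (2.6, 2.8)}
CRITICAL = 1.5  # hypothetical critical distance, the same for both models here

regions = {name: coomans_region(dc, ds, CRITICAL, CRITICAL)
           for name, (dc, ds) in samples.items()}
print(regions)
```

Sample C in this made-up example corresponds to the situation of sample 21: far from both models, so assigned to neither class.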

Summary
As a tutorial, this provides just a brief introduction to the main functionalities and plots in SIMCA-P. We
recommend that you continue with your own data, or maybe another tutorial, and then look in the Manual
for details. The Help system contains the same information as the Manual, but organized in a different
way.

Plots and Lists


You can display the results of SIMCA-P in numerous graphs and lists.
From the Analysis and Prediction menu, results of the active model are available as quick plots and lists.
With the menu Plot/List, you have access to the raw data and every computed value from every model.
You can even plot coefficient vectors from different models against each other.



Multivariate calibration using spectral data

Introduction
This example illustrates multivariate calibration using PLS, spectral filtering and OPLS.
The data set of this example was collected at Akzo Nobel, Örnsköldsvik, in Sweden. The raw material for
their cellulose derivative process is delivered to the factory in the form of cellulose sheets. Before entering
the process the cellulose sheets are controlled by a viscosity measurement, which functions as a steering
parameter for that particular batch.
In this data set NIR spectra for 180 cellulose sheets were collected after the sheets had been sent through
a grinding process. Hence the NIR spectra were measured on the cellulose raw material in powder form.
Data are divided in two parts, one used for modeling and one part for testing.

Data
The data consists of:
X: 1201 wavelengths in the VIS-NIR region
Y: Viscosity of cellulose powder.

Objective
The objective of this study is to develop calibration models with original data, filtered data and OPLS.

Analysis Outline
1. Make an ordinary PLS model relating the NIR spectra variables to the
viscosity with the original data.
2. Review and validate the calibration model with the test samples.
3. As 1 but with SNV filtered data.
4. Review and validate the calibration model with the test samples.
5. As 1 but with the 1st derivative used as filter.
6. Review and validate the calibration model with the test samples.
7. Make an OPLS model.
8. Review and validate the calibration model with the test samples.
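The two spectral filters named in the outline can be illustrated in a few lines of Python. SNV standardizes each spectrum by its own mean and standard deviation; for the derivative a plain first difference is used here as the simplest stand-in, and the filter in the software may well smooth the data in addition. The spectrum values are made up.

```python
import statistics

def snv(spectrum):
    """Standard normal variate: center and scale one spectrum by itself."""
    mean = statistics.mean(spectrum)
    sd = statistics.stdev(spectrum)
    return [(v - mean) / sd for v in spectrum]

def first_difference(spectrum):
    """Simple first derivative: difference between neighboring points."""
    return [b - a for a, b in zip(spectrum, spectrum[1:])]

spectrum = [0.2, 0.5, 0.9, 0.4, 0.3]               # made-up absorbances
scattered = [0.3 + 2.0 * v for v in spectrum]      # same shape, offset and scaled
shifted = [v + 0.3 for v in spectrum]              # same shape, baseline offset only

print([round(v, 3) for v in snv(spectrum)])
print([round(v, 3) for v in snv(scattered)])
print([round(v, 3) for v in first_difference(shifted)])
```

SNV removes both the offset and the scale difference, while the first difference removes only the constant baseline offset.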

The steps to follow in SIMCA-P are


Define the project: Import the primary data set.
Specify which variables are process variables (X) and which are
responses (Y).
Examine the raw data (Plot Xobs)
Prepare the data (Workset menu).
Divide the data in two classes, one for modeling and the other for
testing and vice versa.
Change scaling to Ctr (centering) for the spectral variables (X).
Fit the calibration model, and review the fit (Analysis menu).
Validate the model with the test set. (Prediction menu)

Ordinary PLS project


New project
Start a new project with the data set Cellulose.dif
Start SIMCA-P and define a new project from FILE: NEW.

The import wizard opens, select a normal SIMCA-P project.

The 1st column is marked as primary observation ID which is OK. The 2nd column is marked as a
secondary ID which is also OK. This column will later be used to divide the data in training and test sets.



The 3rd column is Viscosity which is the Y variable. Use the small arrow in the column header and mark
this column as a Y-variable.
Click on Next to open the Project specification page. You can change, as desired, the destination folder,
or the project name. Click on finish and the data set Cellulose is imported.

Plotting the Spectra


Open the spreadsheet showing the imported data. The spectra can now be inspected. To show all spectra
right click somewhere in the table and choose Create: Plot Xobs. If you want to select specific samples
(observations) mark the rows (use the control key if you want to choose more than one spectrum). If you select
one spectrum, then it is possible to use the up and down arrows on the keyboard to shift the spectrum in
the table.

All spectra are plotted together:

The scale of the X-axis is ordinal and to change to a scale representing nm use properties (right click) for
the plot and change the axis label to the following. The label interval must be set high to see anything on
the axis.



Prepare the Data
Workset
SIMCA-P's default workset consists of all the observations in the primary data set with all variables,
scaled to unit variance and defined as X's or Y's as specified at import. SIMCA has also prepared an
unfitted PLS model as default.
Edit: Model
Right click on the model and select Edit model 1:

The Workset window opens where variables, observations, transformation scaling etc. can be changed.
Change the scaling of the X variables to CTR (centered only)
If you want to do an interpretation of the calculated components it is common to just center the data. The
default scaling in SIMCA is UV so it has to be changed.
Click on Scale, mark variables 1 to 1201, select in Type, Ctr and click on Set. The X variables are now
just centered, and not scaled.
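The difference between the two scaling choices is easy to see on two made-up spectral variables, one with a large and one with a small spread. The sketch below is only an illustration of centering (Ctr) versus unit variance scaling (UV) in plain Python; the function names are assumptions.

```python
import statistics

def center(column):
    """Ctr: subtract the column mean, keep the original spread."""
    mean = statistics.mean(column)
    return [v - mean for v in column]

def unit_variance(column):
    """UV: center, then divide by the column standard deviation."""
    sd = statistics.stdev(column)
    return [v / sd for v in center(column)]

strong_band = [10.0, 30.0, 50.0, 20.0]   # made-up variable with large spread
weak_band = [1.0, 1.2, 0.9, 1.1]         # made-up variable with small spread

for name, column in (("strong", strong_band), ("weak", weak_band)):
    print(name,
          "Ctr sd =", round(statistics.stdev(center(column)), 3),
          "UV sd =", round(statistics.stdev(unit_variance(column)), 3))
```

With centering only, the strong band keeps its larger spread, so the relative size of the spectral features is preserved; with UV scaling both variables end up with the same spread.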



Create training and test set
Click on the observations tab. Right click in the left column and choose to also show the secondary ID
(class).

The classID column can now be used to divide the data in two parts called classes. In this way one class
can be used as model and the other as test and vice versa.
Start by indicating that we want to use the find function for the classID column by clicking on the small
arrow to the right of the find box and choosing to find in the classID column.

In the Find box select 1 and then press Set on the button to the right of the Set Class box.
Do the same thing but choose 2 and Set. In this way we have created 2 classes.



At the bottom of the form select Model type PLS-Class. Click on OK and exit the workset menu.

Analysis
When you exit the workset window, SIMCA has prepared a class model group (CM1) that contains two
unfitted models for the two defined classes with 90 observations in each.
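The logic of the two classes, each used once for modeling and once for testing, can be sketched as follows. The class assignment in the real file comes from the classID column; the alternating assignment below is made up purely to obtain two groups of 90 observations.

```python
from collections import defaultdict

# Hypothetical (observation ID, classID) pairs for the 180 cellulose sheets.
observations = [(obs_id, 1 if obs_id % 2 else 2) for obs_id in range(1, 181)]

by_class = defaultdict(list)
for obs_id, class_id in observations:
    by_class[class_id].append(obs_id)

# Each class is used once for modeling and once for testing.
rounds = [(1, 2), (2, 1)]
for train, test in rounds:
    print(f"model on class {train} ({len(by_class[train])} obs), "
          f"test on class {test} ({len(by_class[test])} obs)")
```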

Ordinary PLS model


Autofit class models
Click on Analysis: Autofit class models, or use the corresponding speed button.
The following window appears, press OK.

The model overview plot updates as the models are fitted. Below is a model overview for model 2 (last
model).



Double click in the project window on the model summary line to display the details by component.

R2Y(cum), the fraction of the variation of Y explained by the model after 5 components, is equal to 0.723,
and Q2(cum), the fraction of the variation of Y that can be predicted by the model according to the
cross-validation, is equal to 0.579. Values of R2Y(cum) and Q2Y(cum) close to 1.0 indicate an excellent model.
Components 4 and 5 each describe less than 1% of the variation.
Scores t1 vs. u1
Click on Scores: t1 vs. u1 to display the score plot. The relationship between t1 and u1 is not very good, in
particular for the cluster of samples 153, 162 and 168 etc. The line in the plot is drawn using the line tool.
It is obvious that the data are not homogeneous. Most of the observations are in the middle and
others are grouped outside the main cluster.

Plotting the Spectra of selected observations


Hold down the CTRL key and mark the observations around observation 168 (lower left) and observation 27
(upper right), then right-click and select Plot Xobs.

In this way it is possible to examine differences in raw data.
Loadings
Loadings for each component will show which part of the spectra holds information. Click on Loadings:
Line plot and select to plot w* for all components (5). Then remove components 4 and 5.

The X-axis labels can be changed to show the nm scale.

Distance to the Model (DModX)

Several samples have a distance to the model larger than the critical distance, indicating data
inhomogeneity. To label points in the plot, first use the cursor to draw around them. Release the mouse
button and the points are marked. Then use the speed button in the toolbar to choose Primary ID.

Observed vs. Predicted


The predictions are poor, particularly for a cluster of samples, as in the t1 vs. u1 plot.

To enlarge parts of the plot, zoom in with the magnifier tool (X/Y).

Validating the Model 2
Click on Predictions: Specify Prediction set: Class: 1.
The prediction list opens and can be closed. Click on Predictions: Y Predicted: Scatter plot.

The predictions are reasonable, with an RMSEP of 138 compared with the training set RMSEE of 147. The
R2 value for the regression line is 0.7308; this is often called Q2 external.
Sometimes it is of interest to have both training and test data in the same plot. In this case use the menu
Predictions: Specify Predictionset: Specify and add the training set data (class 2 in this case) to the right
list. Class 1 is already in the prediction set (right list), so choose to show class 2 in the left list. Click on
one row and then press Ctrl+A to mark all. Then use the arrow to send the observations in the left list over
to the right list.
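The RMSEP and the R2 of the regression line quoted above can be sketched as follows. This is a plain root-mean-square error with N in the denominator and a squared Pearson correlation; the toy values and function names are hypothetical, and SIMCA's RMSEE may apply a degrees-of-freedom correction, so the program's figures need not be reproduced exactly by this sketch.

```python
import math

def rmse(observed, predicted):
    """Root mean square error between observed and predicted responses."""
    n = len(observed)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(observed, predicted)) / n)

def squared_correlation(x, y):
    """R2 of the least-squares line through the (x, y) points."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

# Hypothetical test-set values, not taken from the tutorial data.
y_observed = [100.0, 200.0, 300.0, 400.0]
y_predicted = [110.0, 190.0, 310.0, 390.0]

rmsep = rmse(y_observed, y_predicted)
r2_line = squared_correlation(y_predicted, y_observed)
print(rmsep, round(r2_line, 4))
```

RMSEP is expressed in the units of Y, which makes it convenient for comparing models built on the same data split.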

Use properties for the plot and change colors according to predictions.

Validating the Model 1
The same procedure can now be repeated using model 1, with class 2 as test data.
Defining classes is very practical when working with calibration models: the data are split so that there
are two models and two test sets.

SNV filtering
The above modeling indicates systematic variation in the X block that is not related to the response Y.
We will apply an SNV (standard normal variate) filter to the X block (the NIR data) to remove unwanted
variation in X and see if the prediction error becomes smaller. SNV also normalizes the observations and
is often used as a filter for spectroscopic data.
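As an illustration of what an SNV filter does, the sketch below centres each spectrum to mean 0 and scales it to standard deviation 1, row by row. The function name and the toy spectra are hypothetical; this is the textbook transform, not necessarily SIMCA's exact implementation.

```python
import math

def snv(spectrum):
    """Standard normal variate: centre one spectrum to mean 0 and scale it to sd 1."""
    n = len(spectrum)
    mean = sum(spectrum) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in spectrum) / (n - 1))
    return [(v - mean) / sd for v in spectrum]

# Hypothetical absorbance values for one sample.
base = [0.10, 0.20, 0.40, 0.30, 0.15]
# The same spectral shape with a multiplicative scatter effect and a baseline offset.
shifted = [2.0 * v + 0.5 for v in base]

snv_base = snv(base)
snv_shifted = snv(shifted)
print([round(v, 3) for v in snv_base])
```

After the transform the two spectra coincide, which is why SNV removes baseline shifts and scatter differences between samples. It works on each observation separately and never looks at Y.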
Click on Dataset: Spectral Filters:

Choose SNV and click on the right arrow.

The Y variable is excluded (SNV works only on the spectral part of the data, NIR in this case). All
observations are selected; press Next.
A new project is created. Its default name is the old name with the ending _SNV.

SIMCA will automatically switch to the new project.


When testing different filters it is convenient to have several SIMCA projects open at the same time. To do
that, use the menu View: General Options: General and uncheck the item By default, close already open
projects when opening a new project. The projects then appear as tabs at the bottom left of the SIMCA
window, and you can switch between them to compare them.

Plotting the Spectra


Open the spreadsheet showing the imported data. The spectra can now be inspected. To show all spectra,
right-click somewhere in the table and choose Create: Plot Xobs.
With the SNV-filtered data we can see that the filtering has removed variation and the spectra form a
narrower band. The biggest variation in the spectra is now in the region 1500-1800 nm.

Prepare the Data


Edit the workset for the default model in the same way as for the ordinary PLS approach shown above.
Set scaling to centre and define 2 classes in the same way. Use Model type PLS-class.

Analysis
Autofit the class models in the same way as above.

SIMCA finds 3 significant components.


Double click in the project window on the model 2 summary line to display the details by component.

Model 2 now contains 3 significant components. R2Y(cum), the fraction of the variation of Y explained by
the model after 3 components, is equal to 0.637, and Q2(cum), the fraction of the variation of Y that can be
predicted by the model according to the cross-validation, is equal to 0.585.
Scores t1 vs. u1
Click on Scores: t1 vs. u1 to display the score plot. Using SNV seems to highlight further that some
observations (153, 162, 168 etc.) do not fit very well.

Loadings
Loadings for each component will show which part of the spectra holds information. Click on Loadings:
Line plot and select to plot w* for all components (3).

With SNV-filtered data, component 1 picks out the variation around 1500-1800 nm, seen earlier as the
region with the biggest variation in the raw data. From an interpretation point of view, SNV seems to have
done a good job.
Distance to the Model (DModX)

Some more samples have a larger distance to the model after SNV filtering.
Observed vs. Predicted
The predictions are poor, particularly for a cluster of samples, as in the t1 vs. u1 plot. The samples can be
labeled by marking them, clicking on the selected item speed button, and selecting Primary ID as labels.

Validating the Model 2
Click on Predictions: Specify Prediction set: Class: 1.
The prediction list opens and can be closed. Click on Predictions: Y Predicted: Scatter plot.

The prediction error is higher than for the ordinary PLS model (without filtering), with an RMSEP of 148
compared with 138. The R2 value for the regression line is 0.690; this is often called Q2 external.
Validating the Model 1
The same procedure can now be repeated using model 1, with class 2 as test data.
Defining classes is very practical when working with calibration models: the data are split so that there
are two models and two test sets.

Filtering using derivatives


Another type of filter used in calibration is to calculate the derivative of the spectra and use the
differentiated data in the modeling.
Switch back to the original project Cellulose.
Click on Dataset: Spectral Filters:

Choose Derivatives and click on the right arrow.

Derivatives are only calculated on the X-data (like SNV). Select all spectral variables and all
observations.

In this window you can choose between the 1st, 2nd and 3rd derivative. The derivative is calculated using
a Savitzky-Golay algorithm, which can be adjusted in how many points are used for the calculation and in
the distance between points (every point, every second point etc.).
Changes in order and points are shown automatically in the derivative data sub-window.
Due to an error in the implementation, the start and end of the differentiated data will be wrong (edge
effects). This will be adjusted for when models are created later.
In this case select 13 points in the sub-model and a distance between points of 1.
A new project is created. Its default name is the old name with the ending _dydx.
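The idea behind a Savitzky-Golay first derivative with 13 points can be sketched as follows. The sketch assumes a quadratic (or linear) local polynomial, for which the derivative at the window centre reduces to a weighted sum with weights i / sum(i^2); the polynomial order used by SIMCA is not stated here, so treat that choice, the function name and the toy spectrum as assumptions. Points closer than half a window to either end are left undefined, which is one way of seeing why the first and last variables need special care.

```python
def savgol_first_derivative(values, window=13, spacing=1.0):
    """First derivative from a local quadratic least-squares fit over `window` points.

    Positions within half a window of either end are returned as None.
    """
    half = window // 2
    norm = sum(i * i for i in range(-half, half + 1))  # 182 for 13 points
    result = [None] * len(values)
    for centre in range(half, len(values) - half):
        weighted = sum(i * values[centre + i] for i in range(-half, half + 1))
        result[centre] = weighted / (norm * spacing)
    return result

# Hypothetical smooth "spectrum": y = 0.01 * k^2, whose true derivative is 0.02 * k.
spectrum = [0.01 * k * k for k in range(30)]
deriv = savgol_first_derivative(spectrum)
undefined = [k for k, d in enumerate(deriv) if d is None]
print(undefined)
```

With 13 points, six positions at each end have no full window. Excluding a few more variables than that at each end, as done later when the workset is prepared, is a cautious choice.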

SIMCA will automatically switch to the new project.

Plotting the Spectra


Open the spreadsheet showing the imported data. The spectra can now be inspected. To show all spectra,
right-click somewhere in the table and choose Create: Plot Xobs.

Prepare the Data
Edit the workset for the default model in the same way as for the ordinary PLS approach shown above.
Set scaling to centre and define 2 classes in the same way. Use Model type PLS-Class.
Note:
Due to an error in the software, the calculation of the derivatives at the beginning and at the end of the
spectra is not correct. Therefore, exclude variables at the beginning and at the end so that they do not
influence the modeling.
In this case the first and last 10 variables were excluded.

Analysis
Autofit the class models in the same way as above.

In this case SIMCA finds two significant components for each class.
Double click in the project window on the model 2 summary line to display the details by component.

Model 2 now contains 2 significant components. R2Y(cum), the fraction of the variation of Y explained by
the model after 2 components, is equal to 0.509, and Q2(cum), the fraction of the variation of Y that can be
predicted by the model according to the cross-validation, is equal to 0.485.
Scores t1 vs. u1
Click on Scores: t1 vs. u1 to display the score plot. The result with derivatives closely resembles that of
the SNV filtering.

Loadings
Loadings for each component will show which part of the spectra holds information. Click on Loadings:
Line plot and select to plot w* for all components (2).

With differentiated data the components seem to pick out approximately the same information, except for
the region around 1500-2000 nm, where they differ.

Distance to the Model (DModX)

The picture shows approximately the same type of outliers as before.


Observed vs. Predicted
The predictions are poor particularly for a cluster of samples as in the t1 vs. u1 plot.

Validating the Model 2


Click on Predictions: Specify Prediction set: Class: 1.
The prediction list opens and can be closed. Click on Predictions: Y Predicted: Scatter plot.

The prediction error is higher than for the ordinary PLS model (without filtering), with an RMSEP of 179.8
compared with 138. The R2 value for the regression line is 0.5416; this is often called Q2 external.
In this case we also see 2 main clusters and 2 smaller ones, indicating that the data are not homogeneous
or that this type of model does not fit well.

Validating the Model 1
The same procedure can now be repeated using model 1, with class 2 as test data.
Defining classes is very practical when working with calibration models: the data are split so that there
are two models and two test sets.

OPLS
The last two filtering approaches (SNV and derivatives) did not improve the prediction error for these data.
Both filtering methods try to correct the X-data to remove unwanted variation (imperfections in the
instrument leading to baseline shift, scattered light etc.). These filters do not take Y into account in
these corrections.
In SIMCA version 12, OPLS is implemented as a new type of PLS that seeks out variation in X that is not
correlated to variation in Y and removes that part from the X-data. This will hopefully lead to better
calibration models, focusing on the interesting correlation between X and Y.
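How variation in X that is uncorrelated with Y can be found and removed is illustrated by the sketch below, which performs one orthogonal-component removal step in the style of the published OPLS algorithm for a single, centred y. It is an illustration under that assumption, not SIMCA's implementation; all function names and the toy data are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def x_transpose_times(X, v):
    """X'v: one value per column of X."""
    return [dot(column, v) for column in zip(*X)]

def x_times(X, w):
    """Xw: one value per row of X."""
    return [dot(row, w) for row in X]

def unit(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def remove_orthogonal_component(X, y):
    """Remove one Y-orthogonal component from centred X; return (X_filtered, t_ortho)."""
    w = unit(x_transpose_times(X, y))                      # predictive weights
    t = x_times(X, w)                                      # predictive scores
    p = [v / dot(t, t) for v in x_transpose_times(X, t)]   # predictive loadings
    wp = dot(w, p)
    # The part of the loading not aligned with w describes Y-orthogonal variation.
    w_ortho = unit([pj - wp * wj for pj, wj in zip(p, w)])
    t_ortho = x_times(X, w_ortho)
    p_ortho = [v / dot(t_ortho, t_ortho) for v in x_transpose_times(X, t_ortho)]
    X_filtered = [[x - ti * pj for x, pj in zip(row, p_ortho)]
                  for row, ti in zip(X, t_ortho)]
    return X_filtered, t_ortho

# Hypothetical centred data: o is a disturbance uncorrelated with y.
y = [-2.0, -1.0, 0.0, 1.0, 2.0]
o = [1.0, -2.0, 2.0, -2.0, 1.0]
X = [[yi + oi, oi, -yi] for yi, oi in zip(y, o)]

X_filtered, t_ortho = remove_orthogonal_component(X, y)
print([[round(v, 6) for v in row] for row in X_filtered])
```

In this toy case the filtered X has columns y, 0 and -y: the disturbance o has been removed and only the Y-related variation remains. Unlike SNV or derivatives, this correction uses Y to decide what to remove.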

Prepare the Data


Go back to (or open) the original PLS project (Cellulose). Right-click on CM1 in the project window
and select New as model group CM1.

In this way all settings from CM1 will be preserved, the only thing that must be done is to change model
type to OPLS/OPLS2-Class.

Analysis
When you exit the workset window, SIMCA has prepared a class model group (CM2) that contains two
unfitted models for the two defined classes with 90 observations in each.
The Autofit class models window will appear automatically so press OK.

Double click on class 2 in the CM2 group to show the details of the OPLS model.

R2Y(cum), the fraction of the variation of Y explained by the model after 1 component, is equal to 0.749,
and Q2(cum), the fraction of the variation of Y that can be predicted by the model according to the cross-
validation, is equal to 0.661.
An OPLS model also shows the orthogonal part of X that does not correlate to Y. In this case SIMCA
has found 5 orthogonal components. In this example we focus on the predictive properties of the
model so that we can compare it with the models created earlier.
Scores t1 vs. u1
Click on Scores: t1 vs. u1 to display the score plot. The relationship between t1 and u1 is the best so far,
even if some samples deviate as before (153, 162, 168 etc.).

Loadings
Loadings for each component will show which part of the spectra holds information. Click on Loadings:
Line plot and select to plot w* for all components (1).

The X-axis labels have been changed to show the nm scale.

Distance to the Model (DModX)

Observed vs. Predicted


In this case the model has improved the relation between X and Y even if some samples are deviating.

Validating the Model 2


Click on Predictions: Specify Prediction set: Class: 1.
The prediction list opens and can be closed. Click on Predictions: Y Predicted: Scatter plot.

The prediction has an RMSEP of 134.38. This is better than for the ordinary PLS model and the filtered
models.
Validating the Model 1
The same procedure can now be repeated using model 1, with class 2 as test data.
Defining classes is very practical when working with calibration models: the data are split so that there
are two models and two test sets.

Conclusions
OSC, Wavelets, and OPLS are tools that have some additional features beyond ordinary PLS, which makes
them useful. OPLS makes the PLS model easier to interpret: there is only one predictive component and
an interpretable loading plot. Wavelets compress the spectra with little loss of information and sometimes,
especially in combination with OSC (OSC-Wavelets), even improve the predictions somewhat.

Batch Modelling with SIMCA-P+

Introduction
The following example is taken from the article:
J. MacGregor and P. Nomikos, Multivariate SPC Charts for Monitoring Batch Processes, Technometrics,
Vol. 37, No. 1 (1995), 41-57.
The duration of a batch was 2 hours. During this period, 10 variables were measured every 1.2 minutes,
for a total of 100 measurements. A quality variable was measured at the completion of every batch.
Data were collected on 55 batches.
Batches 40 to 42 and 50 to 55 had their quality variable outside the specification limits. The quality
variable of batches 38, 45, 46 and 49 was on the boundary.

Data
Variables
The following 10 variables were measured at equally spaced intervals during the evolution of a batch.
x1 to x3: Temperatures inside the reactor
x6 and x7: Temperatures inside the heating-cooling medium
x4, x8 and x9: Pressure variables
x5 and x10: Flow rates of material added to the reactor

Objectives
1. Develop a model of the evolution of good batches (the observation level
model), and use it to monitor new batches as they are evolving, in order to
detect problems as early as possible.
2. Make a model of the whole batch based on the scores of the observation level
model, and use this model to classify the new batches as good or bad ones.

Analysis Outline
We will use 18 good batches (1800 observations) to model the evolution of good batches. This is done
by fitting a PLS model relating Y, the relative batch time, to the 10 measured variables.
This observation level model is used to monitor the evolution of the new batches: batches 30 to 33 (good
batches) and 49 to 55 (bad batches).
We will make a PCA model of the whole batch, with the unfolded scores of the observation level as X-
variables.

The steps in SIMCA-P are:


Create the observation level project and import the primary data set with the 18 good batches
Fit the observation level model, a PLS with Y, the relative batch time, and X, the 10 measured variables
(Analysis menu).
Display the control charts of the training set, (Analysis: Batch: Control Charts menu)
Import the secondary data set with the new batches
Monitor the evolution of the new batches (Prediction: Batch: Control Charts menu) and use
contribution plots to interpret the problems.

Create the whole Batch project and fit a PCA model to the data
Classify the new batches as good or bad using the distance to the model (DModX) and use contribution
plots to interpret the results

Create the observation level project


Start SIMCA-P and create a new project from File: New. The data set name is NOM18a.xls.

The import wizard opens.


Select the radio button SIMCA-P Batch project and click on Next.

The second column labeled observation names contains the batch identifiers.
Both the Batch identifiers and the phase identifiers (when present) can be located in any variable
(column) in the spreadsheet.
Mark this second column and from the combo box (top of column) select batch identifiers.
In this example you do not need to define phase identifiers, as the batch process has only one phase.

The following window opens:

Click OK and Next.

The Batch page displays the list of batches in the dataset with the number of observations in each batch.
The Conditional delete allows you to delete batches with fewer observations than a selected number.

In this example we do not use the Conditional Delete.
Click on Next to display the project specification page and then click on Finish.

The following message is displayed:

Click on OK.

Analysis
The workset M1 has been prepared with all the 10 measured variables specified as Xs and the auto-
generated variable $Time (relative batch time, normalized) specified as Y, and with all variables centered
and scaled to unit variance (UV). You are ready to fit the PLS Batch model.
Click on Autofit.

SIMCA-P finds 2 components and they explain 85% of X.


The Model window summarizes the fit of the model per component. We have an excellent model with 2
components, explaining 87% of X and 98% of Y.
Scores Line plot of t1
Click on Analysis: Scores: Line Plot: t1 to display the first summary variable t1, summarizing all the 10
variables.

All the 18 batches are within the 2 standard deviation limit.


Loadings p1
Click on Analysis: Loadings: Column: p1.

With batches we are interested in summarizing the X variables and the loadings p1 are the weights that
indicate the importance of the original Xs for t1.

We can see here that all the variables participate in forming t1 with the first 3 variables having positive
weights while the others have negative weights.

Batch Control charts (Training set)


Analysis: Batch Control Charts: Scores Plot
The Batch Control charts show how t1 and t2 vary with time, for good batches. A new good batch should
evolve in the same way and its trace should be inside the control limits.
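How such control limits can be derived from the good batches is sketched below: at each time point, the average score over the good batches and a band of plus or minus 3 standard deviations. The 3-sigma width, the function names and the toy traces are assumptions for illustration; SIMCA's exact limit calculation may differ.

```python
import math

def control_limits(traces, n_sigma=3.0):
    """traces: one list of score values per good batch, aligned to the same time points."""
    n_batches = len(traces)
    average, lower, upper = [], [], []
    for values in zip(*traces):  # all good batches at one time point
        mean = sum(values) / n_batches
        sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n_batches - 1))
        average.append(mean)
        lower.append(mean - n_sigma * sd)
        upper.append(mean + n_sigma * sd)
    return average, lower, upper

def out_of_control(trace, lower, upper):
    """Return the time indices at which a new batch leaves the control band."""
    return [k for k, v in enumerate(trace) if not lower[k] <= v <= upper[k]]

# Hypothetical t1 traces for three good batches over three time points.
good = [[0.0, 1.0, 2.0], [0.2, 1.2, 2.2], [-0.2, 0.8, 1.8]]
average, lower, upper = control_limits(good)

new_batch = [0.1, 2.0, 2.1]
flagged = out_of_control(new_batch, lower, upper)
print(flagged)
```

A new batch is then monitored simply by checking, time point by time point, whether its trace stays inside the band.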

Use the up/down arrow keys to shift between the scores. You can also use the property bar.
Use the up arrow to display the control chart of t2.

To display the control chart in normalized units, go to the Limits and Averages tab and select Remove the
average and normalize the values. Then use Select Batch and Phase to show all batches and click on
Apply.

The plot is displayed in normalized units.
Batch Control Charts DModX, Variables, Hotelling T2 and Observed vs. predicted
The plots of the distance to the model (DModX), Hotelling T2, and Observed vs. Predicted time, with
their control limits, are also important monitoring charts for new batches.
Display univariate Batch Control charts when needed.

Monitoring new batches


Import the secondary data set with the new batches
Use the menu File: Import Secondary dataset, and import the file Alpred.xls as a secondary data set.

Mark the 2nd column as the Batch IDs.


Creating a prediction set with the new batches

Click on Predictions: Specify Predictionset: Dataset: Alpred to select the alpred prediction set.

Control Charts for new batches
Predictions: Batch Control Charts | Scores

Click on Predictions: Batch Control charts: Scores Plot to display the new batches in the control charts
with the control limits derived from the training set. Use the Properties page to include batches 50 to 55.
If the legend does not appear, right-click in the window, choose Plot Settings: Area and check the option
to show the legend.
Use the up key to display the Control chart of t2.

In both of these control charts, batches 50 to 55 are out of the control limits in the first time period (0 -
15). Batches 50 - 55 are also out of the control limits in t1, for the last time period (90 to 100) of the
polymerization process.
Contribution plot
To use the Contribution tool, double click in the t1 control chart on one of the outlying batches, 50 for
example, at time point 5.

The Contribution plot clearly displays variable V-4 (pressure) as being lower than the average trace.
Control Chart of batch 49 and Contribution plot

Batch 49 is slightly out of the control limits around time period 55- 60 (use properties: Select Batch and
Phase to select batch 49).
The Contribution plot around time point 59 shows variable V-10 as slightly lower than in the average
good batch.

Prediction: Batch Control Charts: DModX

Batches 50 to 55 are clearly out of the control limit for the time period 0-20.
Contribution plot
The Contribution plot for any of these batches in that time period shows again variable V4 (pressure) as
being lower than in good batches.

The Control chart of variable 4 (pressure), opened by double-clicking on the variable in the window,
clearly shows the problem with the pressure for these 5 batches.

Creating and Modelling the batch level project


Select the menu File | Batch | Create batch level project, mark scores, and check the box Bring secondary
dataset to the batch level.
In the batch level project, each row has the data from one batch and consists of the unfolded scores, from
the observation level model, which describe the evolution of each batch.
This example has no initial conditions.
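Unfolding can be pictured with a small sketch: the score values of one batch, stored as one row per time point, are laid out side by side so that each batch becomes a single row. The ordering of the unfolded columns (all time points of t1, then all of t2) and the column names are assumptions made for illustration, as are the batch names and values.

```python
def unfold_scores(batch_scores):
    """batch_scores[k][a] is score a+1 at time point k+1; return (names, row)."""
    n_components = len(batch_scores[0])
    names, row = [], []
    for a in range(n_components):
        for k, scores_at_k in enumerate(batch_scores, start=1):
            names.append("t%d:%d" % (a + 1, k))  # hypothetical naming: score, time point
            row.append(scores_at_k[a])
    return names, row

# Hypothetical observation level scores: 2 batches, 3 time points, 2 components.
observation_level = {
    "batch_A": [[0.1, 1.0], [0.2, 0.9], [0.3, 0.8]],
    "batch_B": [[0.0, 1.1], [0.2, 1.0], [0.4, 0.7]],
}

names = unfold_scores(observation_level["batch_A"])[0]
batch_level = {batch: unfold_scores(scores)[1]
               for batch, scores in observation_level.items()}
print(names)
print(batch_level["batch_A"])
```

With 100 time points and 2 score vectors per batch, the same layout would give 200 batch level variables and one row for each of the 18 training batches.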

Analysis: Autofit
Click on Analysis: Autofit to fit a PC model. SIMCA extracts 4 components.

Analysis: Scores
Click on Analysis: Scores: t1 vs. t2

The 18 good batches span the space with no outliers.

Analysis: Batch Control Charts: Batch Variable Importance

This plot, by combining the importance of the scores in the batch level model, with the weights w*
derived from the observation level model, displays the overall importance of the measured variables in
the whole batch model. Here we see that all the 10 variables are important (this is to be expected as the 10
measured variables are highly correlated).

Predicting the quality of the new batches


In the menu Predictions: Specify Predictionset: Dataset select the data set alpred as a prediction set.
It contains the data for batches 30-33 and 49 to 55, one observation per batch, and the predicted scores of
the observation level as xs.

Predictions: T Predicted

We clearly see that batches 50 to 55 (with the exception of 52) are outside the Hotelling T2 ellipse and
are outliers in the second dimension.

Predictions: Contribution Scores for batch 51


Using the Contribution tool, double-click on batch 51.

Double click on the t2-M1:4 and the score variable is resolved with respect to original variables and
displays variable 4 (pressure) as the problem variable.

Predictions: Distance to the Model (DModX)

Use properties to put labels on the points (you can also change size of the font).
Batches 50 to 55 have a distance to the model far above the control limit, and batch 49 is also above
the control limit. Clearly these batches are different from the good ones.
Prediction: Contribution: Distance to the model
Using the Contribution tool, double-click on batch 50.

Double click on the score t2-M1:3 and the score variable is resolved with respect to original variables and
displays variable 4 (pressure) as the problem variable.

Conclusion
Modeling the evolution of a representative set of good batches allowed us to construct control charts to
monitor new batches during their evolution. We detected problems in the evolution of the bad batches and
understood why these batches were outside the control limits.
The model of the whole batch has allowed us to classify the new batches as good or bad and understand
why these batches had an inferior quality.

Modelling of a Batch Digester

Introduction
The following example is derived from a batch digester.
Batch digesters are used in the pulp and paper industry to produce pulp from wood chips.
The batch process has 5 phases: chip, acid, cook, blowback and blow.
In the chip phase, the wood chips are fed into the digester and steamed.
In the acid phase, the chips are impregnated with an acid.
They are then cooked at high temperature and pressure during the cook phase. This is the most important
phase, as this is where the delignification takes place.
In the blowback phase, the pressure is released and thereby brought back to atmospheric pressure. The
temperatures also drop.
Finally, in the blow phase, the pulp is blown out of the digester.
The duration of a batch varies between 8 and 10 hours, and on the average, is around 9.4 hours in the
present data set.
27 variables (including the sampling time) were measured every 2 minutes during the batch evolution.
Different variables are meaningful in the different phases.
Data were collected on 52 batches. Of these, thirty good batches are used to build the training set model.

Data
Variables
The following variables are meaningful in the following phases:
Chip and Acid phase:
State of the acid (2 variables)
State of the vent (2 variables)
State of Steam1 (2 variables)
State of Steam2 (2 variables)
Temperature4
Pressure2
Cook phase:
Pressure1
Steam
Temperature1
Temperature2
Temperature3
Temperature4
Temperature5
Pressure2
Temperature6
Pump

Blowback phase:
Pressure1
Temperature2
Temperature3
Temperature4
Temperature5
Relief valve
Blow1
Blow2
Pressure3
Pressure4
State of Dilution (2 variables)
Dilution flow

Objectives
1. To develop a model of the evolution of good batches (the observation level
model), and use the model to monitor other batches as they are evolving, in
order to detect problems as early as possible.
2. Make a model of the whole batch based on the scores of the observation
level model, and use this model to classify other batches as good or bad.

Analysis Outline
We will use 30 good batches to develop the model of the evolution of good batches.
In the analysis, we will combine the chip and acid phase (they are not meaningful alone) and delete the
blow phase which has no effect on the quality of the pulp.
We will fit 3 different PLS models relating Y, the relative batch time, to the measured variables in the 3
relevant phases (chip+acid, cook and blowback).
These observation level models are used to monitor the evolution of the other batches, in this example
those left out of the training set.
We will make a PCA model of the good batches at the batch level, with the unfolded scores of the
observation level as X-variables.

The steps in SIMCA-P are


Create the observation level project, import the primary data set with the 52 batches, merge phases chip
and acid and delete the blow phase.
In menu workset, select 30 specified good batches and select the variables relevant in each phase.
Fit the observation level models, one for each phase, by PLS with Y= relative batch time, and X = the
relevant variables in each phase. (Analysis menu).
Interpret the scores of the cook phase, and display the control charts of the training set (Analysis: Batch:
Control Charts menu).
Select the complement of workset (training set) and save it as a secondary data set (Menu
Prediction/Prediction Set).
Monitor the evolution of the batches left out of the training set (Prediction: Batch: Control Charts
menu) and use contribution plots to interpret the problems with some of the batches.
Create the whole Batch project and fit a PCA model to the data (Menu File/Create Batch Level
Project).
Classify the prediction set batches as good or bad using the distance to the model (DModX), and use
contribution plots to interpret the results (Menu Prediction).

Create the observation level project
Start SIMCA-P and create a new project (Menu File/New). The present data set is DIGESTER.DIF.

SIMCA shows a message about the DIF-file:

Click OK.
The import wizard opens. Select SIMCA-P+ Batch project and click on Next.

The second column labeled observation names contains the batch and phase identifiers.
Both the batch identifiers and the phase identifiers (when present) can be located in the same variable
(column) or in separate columns in the spreadsheet.

Mark this second column and, from the combo box at the top of the column, select Batch/Phase identifiers.

The following window opens:

The batch identifiers are sequential numbers from 01 to 52 and the phases are chip, acid, cook, blbk, and
blow. Click OK.
The batch and phase IDs are now in 2 separate columns. The original column is marked as excluded (it
can be set as a secondary ID).

Mark the last column with the sampling time variable and from the drop down menu, select Y Variable
(Time or Maturity) and click on Next.
The Phase page displays the list of phases in the dataset with the number of observations and batches in
each. Under every phase is the list of variables.
Using the CTRL key mark both the chip and the acid phases and click on Merge. Mark the Blow phase
and click on Delete.

We now have 3 phases left: chip+acid, cook and blbk, click on Next.
The Batch page opens listing all the batches with their numbers of observations. Listed under every batch
are the phases included in the batch. In our example all the batches include all the phases.

The Conditional delete allows you to delete batches or phases or a selected phase with fewer observations
than a selected number.

In this example we do not use the Conditional Delete.


Click on Next to display the project specification page and then click on Finish.

The following message is displayed and new variable $Time is created.

Some of the variables will be constant for some of the phases and SIMCA warns about that. Select No in
the following messages.

Specify the Workset
BM1 is an umbrella model which has been prepared with 3 unfitted models, one for every phase, with all
the measured variables specified as Xs and the relative sampling time as Y. All variables are centered and
scaled to unit variance (UV).
We need to edit BM1 to include only the relevant variables in each phase, and select the 30 good batches.
Click on Workset: Edit BM1 and select the Variables tab.
Mark variables in the list and assign them to the correct phase in the Phases list box (to the right). See
below how variables are assigned. Variables can be assigned to several phases (check several in the list).
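
As an illustration of the unit-variance (UV) scaling mentioned above, the following minimal Python sketch centers one variable and scales it to unit variance. The function name, the sample readings and the handling of constant variables are assumptions made for this example; the tutorial does not state how SIMCA treats constant variables internally.

```python
from statistics import mean, stdev

def uv_scale(column):
    """Center a variable to mean 0 and scale it to unit variance."""
    m = mean(column)
    s = stdev(column)  # sample standard deviation (n - 1 in the denominator)
    if s == 0:
        # Assumption: a constant variable carries no variation, so return zeros
        return [0.0 for _ in column]
    return [(v - m) / s for v in column]

# Hypothetical temperature readings, not taken from the digester data set
temperature = [120.0, 125.0, 130.0, 135.0, 140.0]
scaled = uv_scale(temperature)
print(scaled)
```

After scaling, every variable has the same variance and therefore the same initial weight in the model, regardless of its measurement unit.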

Note that the Y variable, sampling time, is automatically shifted to start at 0 for every phase and
normalized for better alignment. Normalizing the sampling time achieves linear time warping.
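
The shifting and normalizing of the sampling time can be pictured with a small Python sketch. It assumes that each batch's phase time is shifted to start at 0 and divided by that batch's own phase duration; the tutorial does not give the exact formula SIMCA uses, and the sampling times below are hypothetical.

```python
def normalize_phase_time(times):
    """Shift a phase's sampling times to start at 0 and rescale them to end at 1.

    Assumes the times are given in increasing order.
    """
    start, end = times[0], times[-1]
    duration = end - start
    if duration == 0:
        return [0.0 for _ in times]
    return [(t - start) / duration for t in times]

# Hypothetical cook-phase sampling times (hours) for a short and a long batch
short_batch = [2.0, 2.5, 3.0, 3.5, 4.0]
long_batch = [2.2, 3.0, 3.8, 4.6, 5.4]

print(normalize_phase_time(short_batch))
print(normalize_phase_time(long_batch))
```

Both batches end up on the same 0 to 1 time axis although their phases differ in length, which is what a linear stretching or compression of time (linear time warping) means.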
Click on the Batch page and select the 30 good batches: 1, 4, 6-13, 16, 18, 21, 23, 25, 29, 31, 32, 34, 36-38,
40, 42, 43, 46-49 and 51.
To do this, first press Select All and Exclude. This excludes all batches. Then, holding the CTRL key, mark
the 30 good batches and click on Include.
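
When marking this many batches by hand it is easy to miss one. The short Python sketch below expands the list of good batches and counts them. It is an independent check outside SIMCA, and the helper name is hypothetical.

```python
def expand_batches(spec):
    """Expand a string such as '1, 4, 6-13' into a sorted list of batch numbers."""
    batches = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            first, last = (int(p) for p in part.split("-"))
            batches.update(range(first, last + 1))
        else:
            batches.add(int(part))
    return sorted(batches)

good_spec = ("1, 4, 6-13, 16, 18, 21, 23, 25, 29, 31, 32, 34, "
             "36-38, 40, 42, 43, 46-49, 51")
good = expand_batches(good_spec)

# The data set holds batches 01 to 52; the rest form the complement
excluded = [b for b in range(1, 53) if b not in good]
print(len(good), len(excluded))  # 30 22
```

The 22 remaining batches are the ones that later form the prediction set (the complement of the workset).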

Click on OK to exit the workset.

Analysis
Fitting All the Class models
SIMCA opens an Autofit window with all phases present.

Click on OK.
The 3 class models are fitted and they all explain more than 80% of X.
An R2/Q2 plot appears for each model during calculation.

We will examine the cook phase as it is the most important.
Double click on the cook model to examine its components.

The first three components are the most important, together explaining 69% of the variation of X: t1
explains 44%, t2 16% and t3 9%.

Scores Line plot of t1, t2 and t3


Click on Scores: Line Plot: t1 to display the first summary variable t1, summarizing all the variables of
the Cook phase.

The 30 batches are all within the 3 sigma limit of t1.


Select t2 by using the up arrow on the keyboard. The 30 batches are within the 3 sigma limit of t2.
Select t3, which shows more variability: several batches have some time points above the 3 sigma
limits.

Loadings p1, p2 and p3
With batches we are interested in summarizing the X variables, and the loadings p1, p2 and p3 are the
weights that combine the original X variables to form t1, t2 and t3.
To interpret the first three scores t1, t2 and t3 (new variables summarizing all the X variables) we look at
the loadings p1, p2 and p3.
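
To make the role of the loadings concrete, the following Python sketch forms a score as the weighted sum of the scaled X variables, following the description above. The variable names, the scaled observation and the loading values are all hypothetical and are not taken from the cook model.

```python
def score(x_scaled, loading):
    """Combine the scaled X values of one observation into a single score."""
    if len(x_scaled) != len(loading):
        raise ValueError("observation and loading vector must have equal length")
    return sum(x * p for x, p in zip(x_scaled, loading))

# Hypothetical scaled observation and loading vectors (illustration only)
variables = ["temp1", "temp2", "pressure1", "steam"]
x_scaled = [1.0, 2.0, -1.0, 0.0]
p1 = [0.5, 0.5, 0.5, 0.5]
p2 = [0.5, -0.5, 0.5, -0.5]

t1 = score(x_scaled, p1)
t2 = score(x_scaled, p2)
print(t1, t2)  # 1.0 -1.0
```

A variable with a loading close to zero contributes almost nothing to the score, which is why the column plots of p1, p2 and p3 show which variables each score summarizes.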
Click on Analysis: Loadings: Column plot: p1, and then p2 and p3 (use arrow keys up/down on the
keyboard).

We can see that t1 consists mainly of the first 5 temperatures and pressure 1.
The second score t2 is primarily pressure1, the steam and temperature1.

The score t3 is again dominated by the pressures (1 and 2) with steam, temp1 and temp6.

Batch Control charts (Training set)


Analysis: Batch: Control Charts: Scores
All the batches in the training set are aligned to the same length with the same time points. Hence we can
now, at each time point, compute the average t1 with its standard deviation.
The Batch Control chart of t1 shows how this summary (the temperature trace) varies during the
evolution of the cook phase. The green line is the average t1 computed from all good batches. The red
limits are the 3 sigma limits computed from the variation of t1 around its average of all good batches.
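
The construction of the average trace and its control limits can be sketched in a few lines of Python. The score values below are hypothetical, and the sketch assumes the batches have already been aligned to the same time points, as described above.

```python
from statistics import mean, stdev

def control_limits(aligned_scores, n_sigma=3.0):
    """Compute the average trace and the +/- n_sigma limits per time point.

    aligned_scores holds one list per batch, all of the same length.
    """
    average, lower, upper = [], [], []
    for values in zip(*aligned_scores):  # one tuple of batch values per time point
        m = mean(values)
        s = stdev(values)
        average.append(m)
        lower.append(m - n_sigma * s)
        upper.append(m + n_sigma * s)
    return average, lower, upper

# Hypothetical t1 traces of three good batches at four aligned time points
t1 = [
    [0.0, 1.0, 2.0, 3.0],
    [0.2, 1.1, 2.3, 3.1],
    [-0.2, 0.9, 1.7, 2.9],
]
average, lower, upper = control_limits(t1)
print(average)
print(lower)
print(upper)
```

Here `average` corresponds to the green line and `lower` and `upper` to the red limits of the chart.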

This green line represents the fingerprint of the ideal good batch. All new good batches should evolve in
the same way and should stay inside the red control limits.

Individual batches can be included in this control chart; the first training batch is included by default.
More can be added to the stack of displayed batches by using the Properties menu (after a right click).
Use the side arrows to move the stack of displayed batches forward or backward by one batch.
Another way is to open the Properties window (right click on the plot and select Properties) and make the
selection there. Select all in the left window and use the right arrow to send them to the right window.

In this case the traces of all the good batches are within the red control limits.

To display the Control chart in Normalized units, from the Limits and Averages tab (under Properties),
select Remove the average and Normalize the values, and click on Apply.

The plot is now displayed in normalized units.
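
In normalized units, the average of the good batches is subtracted at every time point and the result is divided by the standard deviation at that time point, so the control limits become the constant lines -3 and +3. A minimal Python sketch of this transformation, with hypothetical score values:

```python
from statistics import mean, stdev

def normalize_trace(trace, good_traces):
    """Express a trace in standard deviations from the good-batch average."""
    normalized = []
    for value, reference in zip(trace, zip(*good_traces)):
        normalized.append((value - mean(reference)) / stdev(reference))
    return normalized

# Hypothetical t1 traces of three good batches at three aligned time points
good_t1 = [
    [0.0, 1.0, 2.0],
    [0.2, 1.1, 2.3],
    [-0.2, 0.9, 1.7],
]
new_trace = [0.4, 1.0, 0.2]  # hypothetical trace of a batch to be checked

z = normalize_trace(new_trace, good_t1)
outside = [i for i, v in enumerate(z) if abs(v) > 3.0]
print(z)
print(outside)  # indices of time points beyond the 3 sigma limits
```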

In the Component tab, select the 2nd component using the up arrow key on the keyboard and use Properties
to switch back to the default Limits and Averages.

Batch Control Charts DModX, Hotelling T2 and Observed vs. predicted
The plots of the distance to the model (DModX), Hotelling T2, and Observed vs. Predicted time, with
their control limits, are also important monitoring charts for new batches.
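
The two distances named above can be written down compactly. The Python sketch below is a simplified illustration, not SIMCA's exact computation: Hotelling's T2 is taken as the sum of squared scores divided by their variances, and the distance to the model as the residual standard deviation of one observation. SIMCA's own DModX may be normalized and corrected differently, and all numbers here are hypothetical.

```python
from math import sqrt

def hotelling_t2(scores, score_variances):
    """Sum of squared scores, each divided by the variance of that score."""
    return sum(t * t / v for t, v in zip(scores, score_variances))

def residual_distance(x_scaled, scores, loadings):
    """Residual standard deviation of one observation (simplified DModX).

    loadings holds one loading vector per component.
    """
    n_vars = len(x_scaled)
    n_comp = len(scores)
    reconstructed = [
        sum(scores[a] * loadings[a][k] for a in range(n_comp))
        for k in range(n_vars)
    ]
    residuals = [x - r for x, r in zip(x_scaled, reconstructed)]
    return sqrt(sum(e * e for e in residuals) / (n_vars - n_comp))

# Hypothetical two-component model of three scaled variables
loadings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
x_scaled = [1.0, 1.0, 2.0]
scores = [1.0, 1.0]
score_variances = [0.25, 1.0]

t2_value = hotelling_t2(scores, score_variances)
distance = residual_distance(x_scaled, scores, loadings)
print(t2_value, distance)  # 5.0 2.0
```

Hotelling's T2 measures how far an observation lies from the center inside the model plane, while the distance to the model measures how much of the observation the model cannot describe.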

Display univariate Batch Control charts when needed by selecting the Variable Plot.

Monitoring new batches


Creating the Prediction set: Complement of Workset
Use the menu Prediction: Specify prediction set: Specify.

Remove all batches from the prediction set (the right window), select all batches from the left window
(the Complement batches of the Training set), move them to the right window and press OK.
From the Prediction menu, save them as a Secondary data set, give it the name Pred1 and click OK.

Batch Control Chart of the Prediction set


For the cook phase, select Prediction: Batch Control Chart: Scores and use the Properties page to include
all the batches.

In the Control chart of t1, with the average and 3 sigma limits computed from the good batches, we can see
that batch 28 is far outside the control limits.

OOC plot
Right click in the Predicted Scores plot and select Out of Control Summary Plot.

Only batches with at least 10% of their area outside the +3 and -3 sigma limits are shown.

Right click on the control chart and select properties. Select all batches (move from left to right window).
This plot displays for every batch the percent of the area outside the limits relative to the total area inside
the limits of the control chart.
Hence batch 28 has 40% of its area outside the control area.
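
The out-of-control percentage described above can be approximated numerically. The Python sketch below integrates, with the trapezoid rule, the part of a trace that lies outside the limits and divides it by the area between the limits. The trace and limits are hypothetical, and SIMCA's own integration may differ in detail.

```python
def trapezoid(values, times):
    """Area under a sampled curve by the trapezoid rule."""
    return sum(
        (values[i] + values[i + 1]) / 2.0 * (times[i + 1] - times[i])
        for i in range(len(values) - 1)
    )

def out_of_control_percent(trace, lower, upper, times):
    """Area outside the limits as a percentage of the area between the limits."""
    excess = [
        max(0.0, v - u) + max(0.0, l - v)
        for v, l, u in zip(trace, lower, upper)
    ]
    band = [u - l for l, u in zip(lower, upper)]
    return 100.0 * trapezoid(excess, times) / trapezoid(band, times)

# Hypothetical trace in normalized units with constant limits at -3 and +3
times = [0.0, 1.0, 2.0, 3.0, 4.0]
lower = [-3.0] * 5
upper = [3.0] * 5
trace = [0.0, 0.0, -9.0, -9.0, 0.0]

percent = out_of_control_percent(trace, lower, upper, times)
print(percent)  # 50.0
```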
Go back to the batch control chart and select to show only batch 28. Mark the time points outside the 3
sigma limits and click on the action plot (toolbar).

The Contribution plot shows pressure1 being 6 standard deviations lower than the average batch for these
time points, and temperature2 to temperature5 as also being lower than the average at these time points.
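
The idea behind such a contribution plot can be sketched as the deviation of each variable from the good-batch average, expressed in standard deviations. The Python example below is a simplified sketch with hypothetical values; SIMCA's contributions may additionally be weighted, for example by the model loadings, which is not shown here.

```python
from statistics import mean, stdev

def contributions(batch_values, good_values):
    """Deviation of each variable from the good-batch average, in standard deviations.

    batch_values maps a variable name to the value of the batch under study.
    good_values maps a variable name to the values of the good batches
    at the same time point.
    """
    result = {}
    for name, value in batch_values.items():
        reference = good_values[name]
        result[name] = (value - mean(reference)) / stdev(reference)
    return result

# Hypothetical readings at one time point (not the digester data)
good_values = {
    "pressure1": [10.0, 11.0, 9.0],
    "temperature2": [140.0, 142.0, 138.0],
}
batch_values = {"pressure1": 4.0, "temperature2": 136.0}

for name, value in contributions(batch_values, good_values).items():
    print(name, round(value, 2))
```

A large negative bar means the variable is far below the average of the good batches at the selected time points, and a large positive bar means it is far above.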

Variable control chart


Double click on Pressure1 to display the control chart of that variable.

Prediction: Batch Control Charts: DModX


Use menu Prediction: Batch Control Charts: Distance to Model X Plot.

Batch 28 is clearly out of the control limit for the time period 0 to 1.5 hrs.
Contribution plot
The Contribution plot for batch 28 in that time period shows that the problem is also associated with
pressure 2 and temp6 (correlated with pressure 2).

Double click on pressure2 to display the control chart.

Creating and Modelling the batch level project


Select the menu File: Batch: Create batch level project, mark scores, check the box Bring secondary
dataset and select the prediction set Pred1. Click on Next, select the batch level name and click OK.
In the batch level project, each row has the data from one batch and consists of the unfolded score vectors
from the observation level models, which describe the evolution of each batch.
This example has no initial conditions.
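
The unfolding of the score vectors into one row per batch can be illustrated with a short Python sketch. The phase names follow the tutorial, but the score values, the number of time points and the column label format are hypothetical.

```python
def unfold(batch_scores):
    """Flatten the score traces of one batch into a single labeled row.

    batch_scores maps a phase name to a dict of score name -> trace,
    where each trace holds the score values at the aligned time points.
    """
    labels, row = [], []
    for phase, components in batch_scores.items():
        for component, trace in components.items():
            for i, value in enumerate(trace):
                labels.append(f"{phase}.{component}.{i}")
                row.append(value)
    return labels, row

# Hypothetical score traces for one batch (two phases shown for brevity)
batch_01 = {
    "chip+acid": {"t1": [0.1, 0.4], "t2": [0.0, -0.2]},
    "cook": {"t1": [1.0, 1.5, 2.0], "t2": [0.3, 0.2, 0.1]},
}

labels, row = unfold(batch_01)
print(len(row))    # 10
print(labels[:3])
```

Stacking one such row per batch gives the batch level data table, in which every column is one score at one aligned time point of one phase.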

Analysis: Autofit
Click on Analysis: Autofit to fit a PC model. SIMCA extracts 5 components.

Analysis: Scores
Click on Analysis: Scores: t1 vs t2

Batch 6 is slightly out of the Hotelling T2 confidence interval.


Click on batch 6 with the Contribution tool to display the contribution plot.

The Contribution plot is colored by phase, and shows that t1 in the cook phase, at time 5.2 hours, is lower
than the average by 6.5 standard deviations. With the Contribution tool, double click on this bar to resolve
this contribution into the original variables.

Temperature2, around time 5.2 hours, is lower than the average of the good batches at the same time
point.
Displaying the Control chart of temperature2, by double clicking on it, we can see that temperature2 at
time 5.36 hours is equal to 114.9 degrees and is slightly below the control limit. Temperature2 is equal to
141 degrees for the average of the good batches at this time point.

Analysis: Batch Variable Importance


Considering that the different phases have different variables, one must display the Batch Variable
Importance separately for each phase.
Select the Cook phase, as it is the most important.

This plot, by combining the importance of the scores in the batch level model, with the weights p derived
from the observation level model, displays the overall importance of the measured variables for the whole
batch model in the cook phase. Here we see that the temperatures, pressure1 and the steam dominate.

Predicting the quality of the prediction set batches


In the menu Predictions: Specify, select both the training set and the prediction set batches in Pred1.

Predictions: T Predicted

Select t1 vs. t3. Batches 28 and 26 are outside the Hotelling T2 ellipse.

Predictions: Contribution Scores for batch 28.


With the Contribution tool, double click on batch 28.

What is causing batch 28 to be an outlier? The problem clearly is the cook phase. Double click on one of
the scores with large deviations, for example t1 at time 1.1 hours, to resolve the contribution into original
variables.

The resolved contribution plot shows pressure1 as being much lower than average.
The Control chart of pressure1 confirms this and shows the problem with batch 28.

Predictions: Distance to the Model (DModX)

Batches 28, 26, 33, 50 and 52 have the largest DModXPS.

Contribution Plot
Double click with the Contribution tool on batch 33 to display the contribution plot.

The problem seems to be in t2 of the cook phase around time 0.4 hours (the beginning of that phase) and
also in the chip+acid phase.
The resolved contribution for the large t2 in the cook phase shows both the pressure1 and the steam lower
than average, probably due to the problem with the steam state.

The Control charts for batch 33 of both pressure1 and steam confirm this.

Conclusion
Modeling the evolution of a representative set of good batches allowed us to construct control charts to
monitor new batches during their evolution. We detected problems in the evolution of the bad batches and
understood why these batches were outside the control limits.
The model of the whole batch has allowed us to classify the new batches as good or bad and understand
why these batches had an inferior quality.
