02-2021 - Quant Advanced 2


Advanced QUANT2 (PLS)

Sirinnapa (Mui) Saranwong


Bruker Optik GmbH, Germany

Innovation with Integrity


Content

• PCA/PLS principle
• Component assignment
• Change path / Copy spectra
• Set test set by wet chem
• Set test set by PCA
• Remove redundant samples
• Sample statistics
• Explanation of each plot
• Explanation of the validation report
• Explanation of regression coefficients and loadings
• Routine analysis
• Quant2/Filelist

-Bruker Confidential- 2
Principles and properties of factor
analysis

• Variance analysis: 'looking for changes in the data set'

• Common statistical method for data analysis
• Different names are used in chemometrics:
- Factor analysis
- Principal Component Analysis (PCA)
• Orthogonal transformation of the data
• Enormous data compression: representation of the data set by a few latent
(new) variables

Factor analysis of spectra

Factor Analysis breaks apart the spectral data into the most
common spectral variations (factors, loadings, principal
components) and the corresponding scaling coefficients
(scores)
Spectral data matrix (n × p) = Scores (n × d) × Factors (d × p)

Data matrix: n spectra with p data points
Scores: d score values for each spectrum (d < n)
Factors: d factors with p data points (d < n)
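This decomposition can be sketched in a few lines of Python — a toy numpy illustration, not the OPUS implementation; the matrix sizes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data matrix: n spectra with p data points,
# built from d = 3 underlying spectral variations (factors)
n, p, d = 20, 100, 3
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, p))

# Factor analysis via SVD: X ~ scores @ factors
U, s, Vt = np.linalg.svd(X, full_matrices=False)
scores = U[:, :d] * s[:d]   # d score values for each spectrum (d < n)
factors = Vt[:d]            # d factors with p data points

# The few scores per spectrum reproduce the full data matrix
print(np.allclose(X, scores @ factors))  # → True
```

Here the reconstruction is exact because the simulated matrix has rank d; real spectra leave a small residual.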

Factor analysis: PCA
Without component values

5 spectra, scores on 5 factors:

Spectrum   Factor 1   Factor 2   Factor 3    Factor 4    Factor 5
1          5.216      -0.216     1.73E-02    -1.52E-02   3.17E-02
2          5.95       -0.103     4.97E-04    4.33E-02    5.65E-03
3          7.731      -0.699     3.67E-04    -1.15E-02   -2.04E-02
4          5.768      0.693      2.97E-02    -3.76E-03   -1.27E-02
5          7.13       0.441      -3.75E-02   -9.54E-03   4.46E-03

Inverse Factor analysis: PCA
Reconstruction of spectrum using all factors

Scores of the spectrum × factors of the model → spectrum:

Spectrum = 7.731 × Factor 1 + (-0.699) × Factor 2 + 3.67E-04 × Factor 3 + (-1.15E-02) × Factor 4 + (-2.04E-02) × Factor 5
Inverse Factor analysis: PCA
Reconstruction of spectrum using all factors

Spectrum = 7.731 × Factor 1 + (-0.699) × Factor 2 + ...

• In the software the spectra are not reconstructed. The spectra are represented just by the few score values (data compression), which are used in the modeling calculations.

• Spectral residuals: difference between the original spectra and the spectra reconstructed using n factors
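The spectral residual can be sketched the same way (toy numpy illustration with simulated rank-2 data plus noise; not OPUS code):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, d = 20, 100, 2
# Rank-2 spectral structure plus a little measurement noise
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, p)) + 0.01 * rng.normal(size=(n, p))

U, s, Vt = np.linalg.svd(X, full_matrices=False)
scores, factors = U[:, :d] * s[:d], Vt[:d]

# Spectral residuals: original spectra minus reconstruction from d factors
residuals = X - scores @ factors
print(residuals.shape)  # one residual spectrum per sample: (20, 100)
print(np.linalg.norm(residuals) < 0.1 * np.linalg.norm(X))  # → True
```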

Moving from factor analysis to PLS

• PLS is a factor analysis (variance analysis) taking component or property values (e.g. concentrations) into account

• For each component or property a separate set of PLS factors is calculated

• The factors are calculated based only on the spectral variance correlated with the given component or property values

• PLS can be seen as a variance analysis including a kind of regression step

• PLS is very effective at making use of correlated information and discriminating non-useful information

• Even overlapping bands and structures in the spectra can be separated

PLS Factors for components A and B

Separate sets of PLS factors (1, 2, 3) are calculated for component A and for component B.
Analysis of spectra using PCA or PLS models is based on scores and
loadings

Scores of the spectrum × factors in the model ≈ spectrum measured
(e.g. 7.731 × Factor 1 + (-0.699) × Factor 2 + ...)

• For the measured spectrum the scores are calculated according to the
factors (loadings) stored in the model.
• The scores are used for the final evaluation in the PCA model
(identification) or PLS model (quantification).

Component Assignment

• Evaluate → Assign Categories / Reference Values

Use this if you have already assigned the component in the q2 file.

Component names are case sensitive!

Use this for a new component.

Component Assignment

• Evaluate → Assign Categories / Reference Values

1. Select parameter

2. Click arrow

Component Assignment

3. Add
component
values

Important:
Use average component
values for spectra of the
same sample!

Change path

• When changing the PC or moving the spectra folder, OPUS/QUANT cannot find the spectra in the new location.

Change path

1
2

'Change base path' is used, for example, when spectra are separated into different folders by date.

COPY Spectra

• Use this when:

• Organizing files on the same PC
• Sending spectra and q2 files to Bruker for evaluation
• Creating average spectra (for example, when the NIR sampling size is
smaller than the wet chem sampling size)

COPY Spectra – Standard
CAUTION: Be careful if spectra have the same name but are in
different folders!

1 2

COPY Spectra – AVERAGE

1 2

COPY Spectra – AVERAGE
Output

AV.q2 Average spectra files

Selection of calibration and test
samples

• Calibration and test set samples should be well distributed over the entire
property range

• As many samples as possible should be used for the test set, but important
samples must be in the calibration. For big data sets the splitting is
done 50-50 or 60-40 between calibration and test set, respectively

• Required number of samples:

• feasibility study: ~ 20 samples minimum
• typical applications: ~ 50-100 samples
• complex applications: > 150 samples

Distribution of samples

Scatter plot of prediction vs. reference value: most samples fall within the typical concentration range; a 'rare sample' or outlier lies outside it.

The concentration range of the calibration should extend beyond the expected analysis range if possible.

Validation set

• Cross validation is used only when there is a limited number of samples
(feasibility test or very costly wet chem). Normally, 2-5% of the samples are
left out per step. There is no statistical benefit of full cross validation
(leave one out) over leaving 2-5% out.

• A test set is the more reliable method, but you need sufficient samples in
both the calibration and the test set. A sufficient number of samples in the
calibration set ensures a proper PLS loading calculation (normally about 10
samples per rank/factor are needed). On the other hand, a sufficient number
of samples in the test set ensures that the validation results (RMSEP, Bias,
RPD) are not over-optimistic.

• Cross validation should be used before test set selection to remove
severe outliers.

How to select Test set samples

• Remove obvious outliers by (i) looking at the spectra; (ii) using cross
validation; (iii) checking the PCA scores

• If there is only one or very few kinds of samples in the matrix, use the
component values to separate the test set. Caution: this cannot be done if
the component values of some samples are missing. Those samples must be
excluded first.

• If there are several kinds of samples in the matrix, use the PCA scores.

Automatic selection of test samples on
component values (Kennard-Stone)

Samples with the lowest and highest property values are placed in the calibration set, the next inner ones in the test set.

Automatic selection of test samples on
component values (Kennard-Stone)

The next test sample is chosen with the maximum distance from the already selected ones in all dimensions (properties). Here it is found in the middle.

Automatic selection of test samples on
component values (Kennard-Stone)

10 % test samples: the next test sample is chosen with the maximum distance from the already selected ones in all dimensions (properties), until the required percentage of test samples is reached.
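A minimal sketch of this selection on a single property (a simplified 1-D Kennard-Stone-style pick; `select_test_set` is a hypothetical helper and the exact OPUS algorithm may differ):

```python
import numpy as np

def select_test_set(values, fraction):
    """Kennard-Stone-style split on 1-D property values: min and max stay
    in the calibration set; test samples are then picked one by one with
    maximum distance to all already selected samples."""
    values = np.asarray(values, dtype=float)
    calib = {int(np.argmin(values)), int(np.argmax(values))}
    test = []
    n_test = int(round(fraction * len(values)))
    while len(test) < n_test:
        selected = list(calib) + test
        remaining = [i for i in range(len(values))
                     if i not in calib and i not in test]
        # distance of each remaining sample to its nearest selected sample
        dists = [min(abs(values[i] - values[j]) for j in selected)
                 for i in remaining]
        test.append(remaining[int(np.argmax(dists))])
    return sorted(test)

values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
print(select_test_set(values, 0.2))  # → [2, 4]
```

With 10 evenly spaced values and a 20% test fraction, the endpoints stay in the calibration and the two most "inner-but-spread" samples become the test set.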

Automatic selection of test samples on
component values (Kennard-Stone)

20 % Test samples

Automatic selection of test samples on
component values (Kennard-Stone)

50 % Test samples

Set Test set by component values

3
4

Set Test set by component values

Clear Test Set

Set Test set by PCA

2
3

Set Test set by PCA

Set Test set by PCA

1
2

Set Test set by PCA

• PCA does not consider the component values. Please adjust manually.

Remove redundant

• This is needed when you have a very large population, such as several thousand
samples, many of which carry repeated information.
• All samples must be "Calibration" samples.

1
2
3

Remove redundant samples

View of the PCA scores plot of the IV method with 7330 spectra. About 6500 spectra (blue) are quite similar.

Quant2 OPUS 7: exclude redundant
samples

Detail view of the PCA scores plot of the IV method with 7330 spectra. About 6500 spectra from one country (blue) are quite similar.

Quant2 OPUS 7: exclude redundant
samples

Test set validation with 687 spectra, and 1162 spectra in the calibration set.

RMSEP = 0.73

(Before, with 7330 spectra in the calibration set: RMSEP = 0.73)

Sample Statistics

• The repeatability of the spectra is as important as the wet chem repeatability. Some samples might have spectra with low repeatability.

Sample Statistics

• The repeatability of the spectra is as important as the wet chem repeatability. Some samples might have spectra with low repeatability.

Possible low repeatability: check the spectra first. If needed, use an average spectrum for this sample or remove it completely.

Error values for characterizing
calibration performance and validation

Calibration: RMSEE or RMSEC
Cross validation: RMSECV
Test set validation: RMSEP

• Multivariate PLS models can't be checked by the regression coefficient r2
and the slope of a regression line.
• The standard deviation of the NIR predictions from the true (reference)
values is calculated as the Root Mean Square Error of ...
• Depending on the data set used for prediction, different errors are defined.
• Another parameter is the R2 value, which should be as close as possible to
100%.

Error values for characterizing
calibration performance and validation

• Root Mean Square Error of ...

Estimation or Calibration (RMSEE / RMSEC): for the predictions of all
samples using the calibration model based on all samples

Cross Validation (RMSECV): for the predictions during the cross
validation, i.e. samples are temporarily independent

Prediction (RMSEP): for the prediction of independent samples

SECV = sqrt(RMSECV^2 − Bias^2), i.e. RMSECV^2 = SECV^2 + Bias^2
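For a small hypothetical validation set, the error values relate like this (the numbers are invented for illustration):

```python
import numpy as np

# Hypothetical reference values and NIR predictions of a test set
y_true = np.array([4.1, 5.0, 6.2, 7.1, 8.0])
y_pred = np.array([4.3, 4.9, 6.5, 7.0, 8.4])

diff = y_pred - y_true
bias = diff.mean()
rmsep = np.sqrt(np.mean(diff ** 2))                          # root mean square error
sep = np.sqrt(np.sum((diff - bias) ** 2) / (len(diff) - 1))  # bias-corrected SEP

print(f"Bias = {bias:.3f}, RMSEP = {rmsep:.3f}, SEP = {sep:.3f}")
# → Bias = 0.140, RMSEP = 0.249, SEP = 0.230
```

With M = 5 samples, RMSEP^2 = SEP^2 · (M−1)/M + Bias^2 holds exactly here; this is the degrees-of-freedom-corrected form of the relation between RMS error, bias-corrected standard error, and bias.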


RMSECV versus R2

Two scatter plots of NIR value vs. actual value with the same error but different value ranges:

Left: R2 = 90%, SEC = 0.55%        Right: R2 = 96%, SEC = 0.545%

R2 = (1 − sum_{i=1}^{M} (y_pred,i − y_true,i)^2 / sum_{i=1}^{M} (y_i − y_mean)^2) × 100

R2 is the ratio of the prediction error to the sample distribution. Therefore, adding a few extreme samples can significantly improve the R2 value.
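The effect of the sample spread on R2 can be demonstrated directly (toy data, not an OPUS calculation; the "extreme samples" are invented):

```python
import numpy as np

def r2_percent(y_true, y_pred):
    """R2 (in %) as on the slide: 1 - prediction error / sample spread."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    sse = np.sum((y_pred - y_true) ** 2)
    sst = np.sum((y_true - y_true.mean()) ** 2)
    return float((1 - sse / sst) * 100)

rng = np.random.default_rng(4)
y = rng.uniform(5.0, 6.0, size=50)          # narrow reference range
y_hat = y + rng.normal(0.0, 0.1, size=50)   # fixed prediction error
r2_narrow = r2_percent(y, y_hat)

# Adding a few extreme samples widens the spread and inflates R2,
# even though the per-sample error stays the same size
y_wide = np.concatenate([y, [2.0, 9.0]])
y_wide_hat = np.concatenate([y_hat, [2.1, 8.9]])
r2_wide = r2_percent(y_wide, y_wide_hat)

print(r2_wide > r2_narrow)  # → True
```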
RMSEP and the real error

Manager: "RMSEP = 0.5? Great! All my results are within a range of +/- 0.5. Excellent accuracy!"

Normal (Gaussian) Distribution

We always find the same number of events within the following intervals:

• +/- 1 standard deviation (1σ): 68.3%
• +/- 2 standard deviations (2σ): 95.5%
• +/- 3 standard deviations (3σ): 99.7%

RMSEP / RMSECV are identical to the standard deviation!
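This is easy to verify by simulation — the 68.3 / 95.5 / 99.7% figures are properties of the Gaussian itself:

```python
import numpy as np

rng = np.random.default_rng(5)

# If prediction errors are Gaussian with standard deviation RMSEP = 0.5,
# only about 68% of results fall within +/- 1 x RMSEP
rmsep = 0.5
errors = rng.normal(0.0, rmsep, size=100_000)

for k in (1, 2, 3):
    frac = np.mean(np.abs(errors) <= k * rmsep)
    print(f"+/- {k} x RMSEP: {frac:.1%}")
```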

Ruminant Feed – Fat: Test Set Validation

Error distribution histogram with the 68%, 95% and 99% intervals marked.

R2 and its meaning: expresses the
relation of error bar and value range

Three examples with the same error bar but an increasing value range:

R2 = 66.4%        R2 = 81.4%        R2 = 98.9%

R2 = (1 − sum_{i=1}^{M} (y_pred,i − y_true,i)^2 / sum_{i=1}^{M} (y_i − y_mean)^2) × 100

R2: Calibration of Fat in Milk

R2 and data range at nearly constant RMSEP:

n     RMSEP   R^2     Range
500   0.06    99.51   5.03
480   0.06    99.45   3.35
400   0.05    99.22   2.17
300   0.05    98.41   1.36
200   0.05    96.51   0.82
100   0.05    88.79   0.46
80    0.05    81.71   0.35
60    0.04    73.87   0.26
50    0.04    70.24   0.21
40    0.04    65.70   0.16

Statistics for the model validation

RPD = SD/SECV or RPD = SD/SEP
SD = standard deviation of the true (reference) values

RPD         Classification   Application
<1.0        very poor        not recommended
1.0 - 2.4   poor             not recommended
2.5 - 2.9   fair             rough screening
3.0 - 3.9   reasonable       screening
4.0 - 5.9   good             QC
6.0 - 7.9   very good        QA
8.0 - 10.0  excellent        any application
>10.0       superior         as good as reference
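RPD itself is a one-line calculation (the reference values and SEP below are invented for illustration):

```python
import numpy as np

def rpd(reference, sep):
    """RPD = SD of the reference values / SEP (or SECV)."""
    return float(np.std(reference, ddof=1) / sep)

# Hypothetical protein reference values and a test-set SEP
reference = np.array([18.0, 21.5, 23.0, 25.5, 28.0, 30.5, 33.0, 35.5])
print(round(rpd(reference, sep=1.3), 1))  # → 4.6, i.e. "good" / QC
```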

Mahalanobis Distance threshold

To check the real MD limit, you need to go to the calibration setup.

For cross validation results, the MD values are sometimes extreme, because the samples are outside the calibration when those values are obtained.

Mahalanobis Distance threshold

In OPUS 7 the threshold is set based on the calibration set statistics (99.9% probability). Almost all calibration spectra will be below the threshold. This is logical, because those samples belong to the calibration set.

However, if your calibration set is still small, setting the MDI factor to 2 in O/LAB or ME could be a good idea.
Regression coefficients (b-vector)

The regression coefficients show the weighting of each data point (wavenumber or wavelength) in the model.

PLS loadings (factors)

The loadings show where the spectral variance coded in each factor is located. It is important to look for noise loadings.

Robust model does not mean lowest
RMSEP

• Sunflower samples were scanned on 3 Bruker instruments

• Each sample was scanned 2 times with re-filling
• The same cup filling was measured on all instruments
• Predictions were done with 5 models obtained during the model optimization
process
• All models showed very similar calibration results but behave differently in
terms of:

• prediction repeatability between re-fills on one instrument

• prediction repeatability between the instruments

Prediction of independent samples
across instruments

Protein predictions of the independent samples plotted per instrument:

Model 1: RMSECV = 1.0, SEP = 1.3
Model 2: RMSECV = 0.99, SEP = 1.7
Model 3: RMSECV = 1.1, SEP = 1.7
Prediction of independent samples
across instruments

Protein predictions of the independent samples on MPA 1, MPA 2 and MATRIX-I:

Model 4: RMSECV = 1.1, SEP = 1.7
Model 5: RMSECV = 1.2, SEP = 2.5

Routine analysis
Methods must be validated over time

Diagram: a method set up 'today' from a calibration set (plus, in some setups, a test set) is validated 'in the future' against a series of Val Sets over time.

Val Set = dataset of independent samples

Updating of methods and data sets with
new samples (new batches, new
recipes)

Diagram: calibration set and test set at method setup, followed by a series of Val Sets during method validation; including new samples (new batches, new recipes) over time increases the robustness of the model.

Quant2/Filelist

Adding true values (reference) for
comparison with predictions

Copy/paste of true values (reference)
for comparison with predictions

Predictions overview

Prediction vs. true value (reference)
with target and regression line (blue)

Easy comparison of different models

Difference vs. true value (reference)
with bias line (blue)

Quant2/Filelist
Marking of out-of-range samples and outliers

Marking according to the indication in the table on the page 'Analysis Results':

• MD/range OK

• MD not OK (outlier)

• Out of range

• MD and range not OK

Result statistics

Troubleshooting in case of poor prediction

• Selection of suitable spectral ranges?

• Were ranges with spectral noise included in the calibration?
• Were ranges with total absorption included in the calibration?

• Selection of the correct experiment for the measurements?

• Selection of a robust Quant2 method?
• Selection of suitable data preprocessing?
• Were the property values of the calibration samples well distributed over the
selected range?

Troubleshooting in case of outliers

• Was the sample not homogenized properly?

• No temperature control with critical liquid samples?
• Probe not properly immersed?
• Was there an air bubble in the optical gap?
• Selection of the wrong method or measuring experiment?
• Measurements through vials: identical vials for calibration and
measurement?
• Comparable measuring conditions (e.g. angle of attack of the probe, ...)?

• Don't just throw away the samples after NIR or wet chem analysis.
Sometimes a revision of the measurement (either NIR or wet chem) will be
required for outliers.

• Don't throw away all the red dots when developing calibrations. The red dots
only indicate potential outliers.

Innovation with Integrity

© 2011 Bruker Corporation. All rights reserved.
www.bruker.com
