02-2021 - Quant Advanced 2


Advanced QUANT2 (PLS)

Sirinnapa (Mui) Saranwong


Bruker Optik GmbH, Germany

Innovation with Integrity


Content

• PCA/PLS principle
• Component assignment
• Change path / Copy spectra
• Set test set by wet chem
• Set test set by PCA
• Remove redundant samples
• Sample statistics
• Explanation of each plot
• Explanation of the validation report
• Explanation of regression coefficients and loadings
• Routine analysis
• Quant2/Filelist

-Bruker Confidential- 2
Principles and properties of factor
analysis

• Variance analysis: 'looking for changes in the data set'

• Common statistical method for data analysis
• Different names are used in chemometrics:
- Factor analysis
- Principal Component Analysis (PCA)
• Orthogonal transformation of the data
• Enormous data compression: representation of the data set by a few latent
(new) variables

Factor analysis of spectra

Factor Analysis breaks apart the spectral data into the most
common spectral variations (factors, loadings, principal
components) and the corresponding scaling coefficients
(scores)
Spectral data matrix (n × p) = Scores (n × d) × Factors (d × p)

Data matrix: n spectra with p data points
Scores: d score values for each spectrum (d < n)
Factors: d factors with p data points (d < n)
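This decomposition can be sketched in a few lines of Python — a toy numpy illustration, not the OPUS implementation; the matrix sizes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data matrix: n spectra with p data points,
# built from d = 3 underlying spectral variations (factors)
n, p, d = 20, 100, 3
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, p))

# Factor analysis via SVD: X ~ scores @ factors
U, s, Vt = np.linalg.svd(X, full_matrices=False)
scores = U[:, :d] * s[:d]   # d score values for each spectrum (d < n)
factors = Vt[:d]            # d factors with p data points

# The few scores per spectrum reproduce the full data matrix
print(np.allclose(X, scores @ factors))  # → True
```

Here the reconstruction is exact because the simulated matrix has rank d; real spectra leave a small residual.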

Factor analysis: PCA
Without component values

5 spectra, scores on 5 factors:

Spectrum   Factor 1   Factor 2   Factor 3    Factor 4    Factor 5
1          5.216      -0.216     1.73E-02    -1.52E-02   3.17E-02
2          5.95       -0.103     4.97E-04    4.33E-02    5.65E-03
3          7.731      -0.699     3.67E-04    -1.15E-02   -2.04E-02
4          5.768      0.693      2.97E-02    -3.76E-03   -1.27E-02
5          7.13       0.441      -3.75E-02   -9.54E-03   4.46E-03

Inverse Factor analysis: PCA
Reconstruction of spectrum using all factors

Scores of the spectrum × factors of the model → spectrum:

Spectrum = 7.731 × Factor 1 + (-0.699) × Factor 2 + 3.67E-04 × Factor 3 + (-1.15E-02) × Factor 4 + (-2.04E-02) × Factor 5
Inverse Factor analysis: PCA
Reconstruction of spectrum using all factors

Spectrum = 7.731 × Factor 1 + (-0.699) × Factor 2 + ...

• In the software the spectra are not reconstructed. The spectra are represented just by the few score values (data compression), which are used in the modeling calculations.

• Spectral residuals: difference between the original spectra and the spectra reconstructed using n factors
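The spectral residual can be sketched the same way (toy numpy illustration with simulated rank-2 data plus noise; not OPUS code):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, d = 20, 100, 2
# Rank-2 spectral structure plus a little measurement noise
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, p)) + 0.01 * rng.normal(size=(n, p))

U, s, Vt = np.linalg.svd(X, full_matrices=False)
scores, factors = U[:, :d] * s[:d], Vt[:d]

# Spectral residuals: original spectra minus reconstruction from d factors
residuals = X - scores @ factors
print(residuals.shape)  # one residual spectrum per sample: (20, 100)
print(np.linalg.norm(residuals) < 0.1 * np.linalg.norm(X))  # → True
```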

Moving from factor analysis to PLS

• PLS is a factor analysis (variance analysis) taking component or property values (e.g. concentrations) into account

• For each component or property a separate set of PLS factors is calculated

• The factors are calculated based only on the spectral variance correlated with the given component or property values

• PLS can be seen as a variance analysis including a kind of regression step

• PLS is very effective at making use of correlated information and discriminating non-useful information

• Even overlapping bands and structures in the spectra can be separated

PLS Factors for components A and B

Separate sets of PLS factors (1, 2, 3) are calculated for component A and for component B.
Analysis of spectra using PCA or PLS models is based on scores and
loadings

Scores of the spectrum × factors in the model ≈ spectrum measured
(e.g. 7.731 × Factor 1 + (-0.699) × Factor 2 + ...)

• For the measured spectrum the scores are calculated according to the
factors (loadings) stored in the model.
• The scores are used for the final evaluation in the PCA model
(identification) or PLS model (quantification).

Component Assignment

• Evaluate → Assign Categories / Reference Values

Use this if you have already assigned the component in the q2 file.

Component names are case sensitive!

Use this for a new component.

Component Assignment

• Evaluate → Assign Categories / Reference Values

1. Select parameter

2. Click arrow

Component Assignment

3. Add
component
values

Important:
Use average component
values for spectra of the
same sample!

Change path

• When changing the PC or moving the spectra folder, OPUS/QUANT cannot find the spectra in the new location.

Change path

1
2

'Change base path' is used, for example, when spectra are separated into different folders by date.

COPY Spectra

• Use this when:

• Organizing files on the same PC
• Sending spectra and q2 files to Bruker for evaluation
• Creating average spectra (for example, when the NIR sampling size is
smaller than the wet chem sampling size)

COPY Spectra – Standard
CAUTION: Be careful if spectra have the same name but are in
different folders!

1 2

COPY Spectra – AVERAGE

1 2

COPY Spectra – AVERAGE
Output

AV.q2 Average spectra files

Selection of calibration and test
samples

• Calibration and test set samples should be well distributed over the entire
property range

• As many samples as possible should be used for the test set, but important
samples must be in the calibration. For big data sets the splitting is
done 50-50 or 60-40 between calibration and test set, respectively

• Required number of samples:

• feasibility study: ~ 20 samples minimum
• typical applications: ~ 50-100 samples
• complex applications: > 150 samples

Distribution of samples

Scatter plot of prediction vs. reference value: most samples fall within the typical concentration range; a 'rare sample' or outlier lies outside it.

The concentration range of the calibration should extend beyond the expected analysis range if possible.

Validation set

• Cross validation is used only when there is a limited number of samples
(feasibility test or very costly wet chem). Normally, 2-5% of the samples are
left out per step. There is no statistical benefit of full cross validation
(leave one out) over leaving 2-5% out.

• A test set is the more reliable method, but you need sufficient samples in
both the calibration and the test set. A sufficient number of samples in the
calibration set ensures a proper PLS loading calculation (normally about 10
samples per rank/factor are needed). On the other hand, a sufficient number
of samples in the test set ensures that the validation results (RMSEP, Bias,
RPD) are not over-optimistic.

• Cross validation should be used before test set selection to remove
severe outliers.

How to select Test set samples

• Remove obvious outliers by (i) looking at the spectra; (ii) using cross
validation; (iii) checking the PCA scores

• If there is only one or very few kinds of samples in the matrix, use the
component values to separate the test set. Caution: this cannot be done if
the component values of some samples are missing. Those samples must be
excluded first.

• If there are several kinds of samples in the matrix, use the PCA scores.

Automatic selection of test samples on
component values (Kennard-Stone)

Samples with the lowest and highest property values are placed in the calibration set, the next inner ones in the test set.

Automatic selection of test samples on
component values (Kennard-Stone)

The next test sample is chosen with the maximum distance from the already selected ones in all dimensions (properties). Here it is found in the middle.

Automatic selection of test samples on
component values (Kennard-Stone)

10 % test samples: the next test sample is chosen with the maximum distance from the already selected ones in all dimensions (properties), until the required percentage of test samples is reached.
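A minimal sketch of this selection on a single property (a simplified 1-D Kennard-Stone-style pick; `select_test_set` is a hypothetical helper and the exact OPUS algorithm may differ):

```python
import numpy as np

def select_test_set(values, fraction):
    """Kennard-Stone-style split on 1-D property values: min and max stay
    in the calibration set; test samples are then picked one by one with
    maximum distance to all already selected samples."""
    values = np.asarray(values, dtype=float)
    calib = {int(np.argmin(values)), int(np.argmax(values))}
    test = []
    n_test = int(round(fraction * len(values)))
    while len(test) < n_test:
        selected = list(calib) + test
        remaining = [i for i in range(len(values))
                     if i not in calib and i not in test]
        # distance of each remaining sample to its nearest selected sample
        dists = [min(abs(values[i] - values[j]) for j in selected)
                 for i in remaining]
        test.append(remaining[int(np.argmax(dists))])
    return sorted(test)

values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
print(select_test_set(values, 0.2))  # → [2, 4]
```

With 10 evenly spaced values and a 20% test fraction, the endpoints stay in the calibration and the two most "inner-but-spread" samples become the test set.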

Automatic selection of test samples on
component values (Kennard-Stone)

20 % Test samples

Automatic selection of test samples on
component values (Kennard-Stone)

50 % Test samples

Set Test set by component values

3
4

Set Test set by component values

Clear Test Set

Set Test set by PCA

2
3

Set Test set by PCA

Set Test set by PCA

1
2

Set Test set by PCA

• PCA does not consider the component values. Please adjust manually.

Remove redundant

• This is needed when you have a very large population, such as several thousand
samples, many of which carry repeated information.
• All samples must be "Calibration" samples.

1
2
3

Remove redundant samples

View of the PCA scores plot of the IV method with 7330 spectra. About 6500 spectra (blue) are quite similar.

Quant2 OPUS 7: exclude redundant
samples

Detail view of the PCA scores plot of the IV method with 7330 spectra. About 6500 spectra from one country (blue) are quite similar.

Quant2 OPUS 7: exclude redundant
samples

Test set validation with 687 spectra, and 1162 spectra in the calibration set.

RMSEP = 0.73

(Before, with 7330 spectra in the calibration set: RMSEP = 0.73)

Sample Statistics

• The repeatability of the spectra is as important as the wet chem repeatability. Some samples might have spectra with low repeatability.

Sample Statistics

• The repeatability of the spectra is as important as the wet chem repeatability. Some samples might have spectra with low repeatability.

Possible low repeatability: check the spectra first. If needed, use an average spectrum for this sample or remove it completely.

Error values for characterizing
calibration performance and validation

Calibration: RMSEE or RMSEC
Cross validation: RMSECV
Test set validation: RMSEP

• Multivariate PLS models can't be checked by the regression coefficient r2
and the slope of a regression line.
• The standard deviation of the NIR predictions from the true (reference)
values is calculated as the Root Mean Square Error of ...
• Depending on the data set used for prediction, different errors are defined.
• Another parameter is the R2 value, which should be as close as possible to
100%.

Error values for characterizing
calibration performance and validation

• Root Mean Square Error of ...

Estimation or Calibration (RMSEE / RMSEC): for the predictions of all
samples using the calibration model based on all samples

Cross Validation (RMSECV): for the predictions during the cross
validation, i.e. samples are temporarily independent

Prediction (RMSEP): for the prediction of independent samples

SECV = sqrt(RMSECV^2 − Bias^2), i.e. RMSECV^2 = SECV^2 + Bias^2
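For a small hypothetical validation set, the error values relate like this (the numbers are invented for illustration):

```python
import numpy as np

# Hypothetical reference values and NIR predictions of a test set
y_true = np.array([4.1, 5.0, 6.2, 7.1, 8.0])
y_pred = np.array([4.3, 4.9, 6.5, 7.0, 8.4])

diff = y_pred - y_true
bias = diff.mean()
rmsep = np.sqrt(np.mean(diff ** 2))                          # root mean square error
sep = np.sqrt(np.sum((diff - bias) ** 2) / (len(diff) - 1))  # bias-corrected SEP

print(f"Bias = {bias:.3f}, RMSEP = {rmsep:.3f}, SEP = {sep:.3f}")
# → Bias = 0.140, RMSEP = 0.249, SEP = 0.230
```

With M = 5 samples, RMSEP^2 = SEP^2 · (M−1)/M + Bias^2 holds exactly here; this is the degrees-of-freedom-corrected form of the relation between RMS error, bias-corrected standard error, and bias.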


RMSECV versus R2

Two scatter plots of NIR value vs. actual value with the same error but different value ranges:

Left: R2 = 90%, SEC = 0.55%        Right: R2 = 96%, SEC = 0.545%

R2 = (1 − sum_{i=1}^{M} (y_pred,i − y_true,i)^2 / sum_{i=1}^{M} (y_i − y_mean)^2) × 100

R2 is the ratio of the prediction error to the sample distribution. Therefore, adding a few extreme samples can significantly improve the R2 value.
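The effect of the sample spread on R2 can be demonstrated directly (toy data, not an OPUS calculation; the "extreme samples" are invented):

```python
import numpy as np

def r2_percent(y_true, y_pred):
    """R2 (in %) as on the slide: 1 - prediction error / sample spread."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    sse = np.sum((y_pred - y_true) ** 2)
    sst = np.sum((y_true - y_true.mean()) ** 2)
    return float((1 - sse / sst) * 100)

rng = np.random.default_rng(4)
y = rng.uniform(5.0, 6.0, size=50)          # narrow reference range
y_hat = y + rng.normal(0.0, 0.1, size=50)   # fixed prediction error
r2_narrow = r2_percent(y, y_hat)

# Adding a few extreme samples widens the spread and inflates R2,
# even though the per-sample error stays the same size
y_wide = np.concatenate([y, [2.0, 9.0]])
y_wide_hat = np.concatenate([y_hat, [2.1, 8.9]])
r2_wide = r2_percent(y_wide, y_wide_hat)

print(r2_wide > r2_narrow)  # → True
```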
RMSEP and the real error

Manager: "RMSEP = 0.5? Great! All my results are within a range of +/- 0.5. Excellent accuracy!"

Normal (Gaussian) Distribution

We always find the same number of events within the following intervals:

• +/- 1 standard deviation (1σ): 68.3%
• +/- 2 standard deviations (2σ): 95.5%
• +/- 3 standard deviations (3σ): 99.7%

RMSEP / RMSECV are identical to the standard deviation!
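This is easy to verify by simulation — the 68.3 / 95.5 / 99.7% figures are properties of the Gaussian itself:

```python
import numpy as np

rng = np.random.default_rng(5)

# If prediction errors are Gaussian with standard deviation RMSEP = 0.5,
# only about 68% of results fall within +/- 1 x RMSEP
rmsep = 0.5
errors = rng.normal(0.0, rmsep, size=100_000)

for k in (1, 2, 3):
    frac = np.mean(np.abs(errors) <= k * rmsep)
    print(f"+/- {k} x RMSEP: {frac:.1%}")
```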

Ruminant Feed – Fat: Test Set Validation

Error distribution histogram with the 68%, 95% and 99% intervals marked.

R2 and its meaning: expresses the
relation of error bar and value range

Three examples with the same error bar but an increasing value range:

R2 = 66.4%        R2 = 81.4%        R2 = 98.9%

R2 = (1 − sum_{i=1}^{M} (y_pred,i − y_true,i)^2 / sum_{i=1}^{M} (y_i − y_mean)^2) × 100

R2: Calibration of Fat in Milk

R2 and data range at nearly constant RMSEP:

n     RMSEP   R^2     Range
500   0.06    99.51   5.03
480   0.06    99.45   3.35
400   0.05    99.22   2.17
300   0.05    98.41   1.36
200   0.05    96.51   0.82
100   0.05    88.79   0.46
80    0.05    81.71   0.35
60    0.04    73.87   0.26
50    0.04    70.24   0.21
40    0.04    65.70   0.16

Statistics for the model validation

RPD = SD/SECV or RPD = SD/SEP
SD = standard deviation of the true (reference) values

RPD         Classification   Application
<1.0        very poor        not recommended
1.0 - 2.4   poor             not recommended
2.5 - 2.9   fair             rough screening
3.0 - 3.9   reasonable       screening
4.0 - 5.9   good             QC
6.0 - 7.9   very good        QA
8.0 - 10.0  excellent        any application
>10.0       superior         as good as reference
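RPD itself is a one-line calculation (the reference values and SEP below are invented for illustration):

```python
import numpy as np

def rpd(reference, sep):
    """RPD = SD of the reference values / SEP (or SECV)."""
    return float(np.std(reference, ddof=1) / sep)

# Hypothetical protein reference values and a test-set SEP
reference = np.array([18.0, 21.5, 23.0, 25.5, 28.0, 30.5, 33.0, 35.5])
print(round(rpd(reference, sep=1.3), 1))  # → 4.6, i.e. "good" / QC
```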

Mahalanobis Distance threshold

To check the real MD limit, you need to go to the calibration setup.

For cross validation results, the MD values are sometimes extreme, because the samples are outside the calibration when those values are obtained.

Mahalanobis Distance threshold

In OPUS 7 the threshold is set based on the calibration set statistics (99.9% probability). Almost all calibration spectra will be below the threshold. This is logical, because those samples belong to the calibration set.

However, if your calibration set is still small, setting the MDI factor to 2 in O/LAB or ME could be a good idea.
Regression coefficients (b-vector)

The regression coefficients show the weighting of each data point (wavenumber or wavelength) in the model.

PLS loadings (factors)

The loadings show where the spectral variance coded in each factor is located. It is important to look for noise loadings.

Robust model does not mean lowest
RMSEP

• Sunflower samples were scanned on 3 Bruker instruments

• Each sample was scanned 2 times with re-filling
• The same cup filling was measured on all instruments
• Predictions were done with 5 models obtained during the model optimization
process
• All models showed very similar calibration results but behave differently in
terms of:

• prediction repeatability between re-fills on one instrument

• prediction repeatability between the instruments

Prediction of independent samples
across instruments

Protein predictions of the independent samples plotted per instrument:

Model 1: RMSECV = 1.0, SEP = 1.3
Model 2: RMSECV = 0.99, SEP = 1.7
Model 3: RMSECV = 1.1, SEP = 1.7
Prediction of independent samples
across instruments

Protein predictions of the independent samples on MPA 1, MPA 2 and MATRIX-I:

Model 4: RMSECV = 1.1, SEP = 1.7
Model 5: RMSECV = 1.2, SEP = 2.5

Routine analysis
Methods must be validated over time

Diagram: a method set up 'today' from a calibration set (plus, in some setups, a test set) is validated 'in the future' against a series of Val Sets over time.

Val Set = dataset of independent samples

Updating of methods and data sets with
new samples (new batches, new
recipes)

Diagram: calibration set and test set at method setup, followed by a series of Val Sets during method validation; including new samples (new batches, new recipes) over time increases the robustness of the model.

Quant2/Filelist

Adding true values (reference) for
comparison with predictions

Copy/paste of true values (reference)
for comparison with predictions

Predictions overview

Prediction vs. true value (reference)
with target and regression line (blue)

Easy comparison of different models

Difference vs. true value (reference)
with bias line (blue)

Quant2/Filelist
Marking of out-of-range samples and outliers

Marking according to the indication in the table on the page 'Analysis Results':

• MD/range OK

• MD not OK (outlier)

• Out of range

• MD and range not OK

Result statistics

Troubleshooting in case of poor prediction

• Selection of suitable spectral ranges?

• Were ranges with spectral noise included in the calibration?
• Were ranges with total absorption included in the calibration?

• Selection of the correct experiment for the measurements?

• Selection of a robust Quant2 method?
• Selection of suitable data preprocessing?
• Were the property values of the calibration samples well distributed over the
selected range?

Troubleshooting in case of outliers

• Was the sample not homogenized properly?

• No temperature control with critical liquid samples?
• Probe not properly immersed?
• Was there an air bubble in the optical gap?
• Selection of the wrong method or measuring experiment?
• Measurements through vials: identical vials for calibration and
measurement?
• Comparable measuring conditions (e.g. angle of attack of the probe, ...)?

• Don't just throw away the samples after NIR or wet chem analysis.
Sometimes a revision of the measurement (either NIR or wet chem) will be
required for outliers.

• Don't throw away all the red dots when developing calibrations. The red dots
only indicate potential outliers.

Innovation with Integrity

© 2011 Bruker Corporation. All rights reserved.
www.bruker.com
