Statistical Methods For Bioinformatics Lecture 2
Rob Jelier
Rob Jelier Statistical Methods for Bioinformatics
Statistics and the Philosophy of Science
Statistics & Philosophy of Science
Deductive reasoning
The Role of Statistics in Science
Statistics is
... a formal way to deal with uncertainty in data
finding generalizable patterns in observations
collecting, visualizing, analyzing, finding and then testing
hypotheses.
Statistical tests are used to decide if a statement/hypothesis
is supported by data
important paradigm in scientific communication
Control of data quality
Use statistical reasoning to optimally design experiments
Statistical (also Machine) Learning approaches help predict
the future
Statistical Methods for Bioinformatics: Part I
Content
Statistical Methods for Bioinformatics: Part II
Reading material
Required books:
An introduction to statistical learning, G. James, D. Witten, T.
https://www.statlearning.com/
Many of the examples and figures are from the book
Recommended reading:
An introduction to generalized linear models, Annette J
Dobson, CHAPMAN & HALL/CRC, 2002
Course rule book
Keep up!
Later lessons build on earlier lessons.
It is a lot of material; waiting until the end may cause trouble.
The course will include both Theory and Practical Skills
Later contact moments will not include a lecture: lectures are
recorded and available on Toledo.
Contact moments will be dedicated to discussing questions
and exercises.
For each class there is a reading assignment. You can submit
questions until the day before the class; these will then be
discussed during class.
The exercises will be in R. Let me know if you are unfamiliar
with R!
Evaluation
One graded assignment, which counts for 4 of 20 points (for Part II)
Exam with theoretical questions and computer exercises
Planning today
1. Statistical Modeling: survey the data
2. Statistical Modeling: choose a model
Typical case: one response variable and several explanatory
variables.
There is no perfect method for all data
A single perfect model is rare; different models can be fit with
good performance.
Which level of complexity is adequate? Avoid overly complex
models with limited benefit.
3. Statistical Modeling: Fitting parameters
The most commonly used estimation methods are maximum
likelihood and least squares.
Maximum likelihood: given the data and the choice of model,
what values of the parameters of the model make the observed
data most likely?
Least squares: find the fit for which
$S = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$ is minimal
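To make the least squares criterion concrete, here is a minimal sketch for a model with a single predictor. The function names and the toy data are illustrative assumptions, not course material; the closed-form estimates used are the standard ones for simple linear regression.

```python
def fit_least_squares(x, y):
    """Return (intercept, slope) minimizing S = sum((y_i - yhat_i)^2)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Closed-form solution for a straight line with one predictor.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def residual_sum_of_squares(x, y, intercept, slope):
    """S for a given line: the quantity that least squares minimizes."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


# Hypothetical toy data, roughly following y = 2x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_least_squares(x, y)
s_min = residual_sum_of_squares(x, y, b0, b1)
print(f"intercept={b0:.3f} slope={b1:.3f} S={s_min:.4f}")
```

Moving either parameter away from the fitted value can only increase S. For a linear model with independent Gaussian errors of constant variance, these least squares estimates coincide with the maximum likelihood estimates.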
4. Statistical Modeling: Checking the model
Thinking about modeling: a single predictor
The perennial trade-off: bias vs variance
Figure from “Understanding the Bias-Variance Tradeoff” by S. Fortmann-Roe
Bias Variance Trade-Off
Models with high bias are intuitively simple models:
restrictions on the kind of regularities that can be learned
(e.g. linear classifiers).
These models tend to underfit, i.e. not learn the relationship
between predicted (target) variables and features.
Models with high variance are those that can learn many kinds
of complex regularities
These models can learn noise in the training data, i.e.
overfitting.
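A small simulation can illustrate both ends of the trade-off. The sketch below uses k-nearest-neighbour regression as a stand-in (an assumption chosen for brevity, not a method named on the slide): k = 1 is very flexible and reproduces the training data exactly, while k equal to the training set size always predicts the overall mean. The simulated sine data and all names are hypothetical.

```python
import math
import random


def simulate(n, rng):
    """Simulated data: y = sin(x) + Gaussian noise (sd 0.3)."""
    xs = [rng.uniform(0, 2 * math.pi) for _ in range(n)]
    ys = [math.sin(x) + rng.gauss(0, 0.3) for x in xs]
    return xs, ys


def knn_predict(x0, train_x, train_y, k):
    """Average the responses of the k training points closest to x0."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x0))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / k


def mse(xs, ys, train_x, train_y, k):
    """Mean squared error of the k-NN fit on the points (xs, ys)."""
    errors = [(y - knn_predict(x, train_x, train_y, k)) ** 2
              for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)


rng = random.Random(1)
train_x, train_y = simulate(100, rng)
test_x, test_y = simulate(1000, rng)

results = {}
for k in (1, 10, 100):
    train_error = mse(train_x, train_y, train_x, train_y, k)
    test_error = mse(test_x, test_y, train_x, train_y, k)
    results[k] = (train_error, test_error)
    print(f"k={k:3d}  training MSE={train_error:.3f}  test MSE={test_error:.3f}")
```

With k = 1 the training error is zero because each training point is its own nearest neighbour, which is the signature of a high-variance model fitting noise. With k = 100 the model ignores x entirely and underfits. An intermediate k typically gives the lowest test error.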
For example: a linear or a non-linear fit?
Another example: which decision boundary in a classifier?
Figure from “Understanding the Bias-Variance Tradeoff” by S. Fortmann-Roe
Bias and variance trade-off: a crucial concept
Progress
Curse of dimensionality
The Challenges of High Dimensionality
High dimensional datasets
How do you decide which (or all) predictors you will keep in
your modeling?
The methods discussed in the 2nd and 3rd classes deal
properly with high dimensionality
Considerations for interpreting analyses of high-dimensional
datasets follow in the 3rd lecture.
Linear Models: powerful simplicity
$Y = \beta_0 + \beta_1 x_1 + \dots + \beta_m x_m + \varepsilon$
Linear Models for Essential Questions
Through a linear model you can test or evaluate the following questions:
Testing if a coefficient is relevant
The assumptions of linear regression
Potential Fit Problems
Challenges with models
Re-sampling Methods
Introduction
Single validation set
Cross Validation
Leave-one-out Cross Validation
K-fold Cross Validation
Bias-Variance Trade-off for k-fold Cross Validation
Bootstrap
Classical validation set approach
Find the set of variables that gives the lowest test (instead of
training) error rate
If we have a large data set, we can achieve this goal by
randomly splitting the data into training and validation
(testing) parts
Build models on the training part, choose model with lowest
error rate when applied to the validation data
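The procedure above can be sketched as follows. This is a minimal illustration under assumptions: the data are simulated, the two candidate models (intercept only and a straight line) are hypothetical choices, and error is measured as mean squared error because the response is continuous.

```python
import random


def fit_mean(x, y):
    """Intercept-only model: always predict the training mean."""
    mean_y = sum(y) / len(y)
    return lambda x0: mean_y


def fit_line(x, y):
    """Straight line fitted by least squares."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x0: intercept + slope * x0


def mse(model, x, y):
    return sum((yi - model(xi)) ** 2 for xi, yi in zip(x, y)) / len(x)


# Simulated data (assumption): y = 2 + 3x + Gaussian noise.
rng = random.Random(42)
x = [rng.uniform(0, 10) for _ in range(200)]
y = [2.0 + 3.0 * xi + rng.gauss(0, 1.0) for xi in x]

# Randomly split the observations into training and validation halves.
indices = list(range(len(x)))
rng.shuffle(indices)
half = len(indices) // 2
train_idx, valid_idx = indices[:half], indices[half:]
train_x, train_y = [x[i] for i in train_idx], [y[i] for i in train_idx]
valid_x, valid_y = [x[i] for i in valid_idx], [y[i] for i in valid_idx]

# Fit every candidate on the training part, score it on the validation part.
candidates = {"intercept only": fit_mean, "straight line": fit_line}
validation_mse = {}
for name, fit in candidates.items():
    model = fit(train_x, train_y)
    validation_mse[name] = mse(model, valid_x, valid_y)
    print(f"{name}: validation MSE = {validation_mse[name]:.3f}")

best = min(validation_mse, key=validation_mse.get)
print("selected model:", best)
```

The key design point is that the validation observations never influence the fitted parameters, so the validation error is an estimate of performance on unseen data.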
Validation set approach
Advantages:
Simple
Easy to implement
Disadvantages:
The validation performance estimate (e.g. Mean Squared Error,
$MSE = \frac{1}{n} \sum_{i=1}^{n} (\hat{Y}_i - Y_i)^2$) can be highly variable
Only a subset of observations are used to fit the model
(training data). Statistical methods tend to perform worse
when trained on fewer observations.
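The variability mentioned above can be seen by repeating the random split several times on the same data. The sketch below is an illustration on simulated data with hypothetical names; only the random split changes between repetitions.

```python
import random


def fit_line(x, y):
    """Least squares estimates (intercept, slope) for one predictor."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def validation_mse(x, y, rng):
    """Validation MSE of a straight-line fit for one random 50/50 split."""
    indices = list(range(len(x)))
    rng.shuffle(indices)
    half = len(indices) // 2
    train, valid = indices[:half], indices[half:]
    b0, b1 = fit_line([x[i] for i in train], [y[i] for i in train])
    return sum((y[i] - (b0 + b1 * x[i])) ** 2 for i in valid) / len(valid)


# Small simulated data set (assumption): y = 2 + 3x + Gaussian noise.
rng = random.Random(7)
x = [rng.uniform(0, 10) for _ in range(40)]
y = [2.0 + 3.0 * xi + rng.gauss(0, 2.0) for xi in x]

# Same data, same model, ten different random splits.
estimates = [validation_mse(x, y, rng) for _ in range(10)]
print("validation MSE per split:", [round(e, 2) for e in estimates])
print(f"range: {min(estimates):.2f} to {max(estimates):.2f}")
```

The spread between the smallest and largest estimate comes only from which observations happened to land in the validation part, which is the motivation for averaging over several splits in cross validation.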
Leave-One-Out Cross Validation (LOOCV)
LOOCV vs Validation set approach
k-fold Cross Validation
MSE for simulated data: true test MSE in blue, LOOCV as a black dashed line,
10-fold CV estimate in orange. Crosses indicate minimum of MSE curves.
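As a minimal sketch of how such cross-validation estimates are computed (on different, hypothetical simulated data, not the data in the figure): the observations are split into k folds, each fold is held out once while the model is fitted on the rest, and the held-out errors are averaged. Setting k equal to the number of observations gives LOOCV. All names are illustrative assumptions.

```python
import random


def fit_line(x, y):
    """Least squares estimates (intercept, slope) for one predictor."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def k_fold_mse(x, y, k, rng):
    """Average held-out MSE of a straight-line fit over k folds."""
    indices = list(range(len(x)))
    rng.shuffle(indices)
    # Fold j takes every k-th shuffled index, so fold sizes differ by at most 1.
    folds = [indices[j::k] for j in range(k)]
    fold_errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        b0, b1 = fit_line([x[i] for i in train], [y[i] for i in train])
        squared = [(y[i] - (b0 + b1 * x[i])) ** 2 for i in held_out]
        fold_errors.append(sum(squared) / len(squared))
    return sum(fold_errors) / k


# Simulated data (assumption): y = 2 + 3x + Gaussian noise with variance 1.
rng = random.Random(3)
x = [rng.uniform(0, 10) for _ in range(50)]
y = [2.0 + 3.0 * xi + rng.gauss(0, 1.0) for xi in x]

cv10 = k_fold_mse(x, y, k=10, rng=rng)
loocv = k_fold_mse(x, y, k=len(x), rng=rng)  # k = n is leave-one-out
print(f"10-fold CV MSE: {cv10:.3f}")
print(f"LOOCV MSE:      {loocv:.3f}")
```

LOOCV requires n model fits and 10-fold CV only 10, which matters when fitting is expensive. In this sketch both estimates should land near the noise variance of 1, since the fitted model has the correct form.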
Bias-Variance trade-off for CV
Bootstrap
The bootstrap is a resampling technique that samples with replacement
From a dataset with n examples:
Randomly select (with replacement) n examples and use this
set for training
The remaining examples that were not selected for training are
used for testing
The number of test examples is likely to change from fold to fold
Repeat this process for a specified number of folds (k)
The true error is estimated as the average error rate on the test
data
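The steps above can be sketched as follows. This is a minimal illustration on simulated data; the function names, the straight-line model, and the use of mean squared error as the error measure are assumptions made for the example.

```python
import random


def fit_line(x, y):
    """Least squares estimates (intercept, slope) for one predictor."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def bootstrap_error(x, y, n_rounds, rng):
    """Average test MSE over bootstrap rounds, plus the test set sizes."""
    n = len(x)
    round_errors, test_sizes = [], []
    for _ in range(n_rounds):
        # Draw n indices with replacement: this is the training set.
        sample = [rng.randrange(n) for _ in range(n)]
        chosen = set(sample)
        # Examples never drawn in this round form the test set.
        test = [i for i in range(n) if i not in chosen]
        if not test:  # practically impossible for realistic n, but be safe
            continue
        b0, b1 = fit_line([x[i] for i in sample], [y[i] for i in sample])
        squared = [(y[i] - (b0 + b1 * x[i])) ** 2 for i in test]
        round_errors.append(sum(squared) / len(squared))
        test_sizes.append(len(test))
    return sum(round_errors) / len(round_errors), test_sizes


# Simulated data (assumption): y = 2 + 3x + Gaussian noise with variance 1.
rng = random.Random(11)
x = [rng.uniform(0, 10) for _ in range(50)]
y = [2.0 + 3.0 * xi + rng.gauss(0, 1.0) for xi in x]

estimate, test_sizes = bootstrap_error(x, y, n_rounds=100, rng=rng)
print(f"bootstrap error estimate (MSE): {estimate:.3f}")
print(f"test set size: min={min(test_sizes)}, max={max(test_sizes)}")
```

The printed test set sizes show that the number of left-out examples varies between rounds. On average about 36.8% of the examples (a fraction of roughly 1/e) are left out of each bootstrap sample.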
Why Bootstrap
Can Bootstrap estimate Prediction Error?
To do:
Exercises
Lab of chapter 5
Chapter 5, exercises 1,4,5,6 & 8