
Statistical Methods for Bioinformatics

II-1: Bias and Variance trade-off, Cross-validation & Bootstrap

Rob Jelier

Statistics and the Philosophy of Science

“In so far as a scientific statement speaks about reality, it must be falsifiable; and in so far as it is not falsifiable, it does not speak about reality.” Karl Popper

Popper believed scientific theories can be tested only indirectly, by evaluating their implications.

Statistics & Philosophy of Science
Deductive reasoning

“The value for which P=0.05 is 1.96σ, or nearly 2σ; it is convenient to take this point as a limit in judging whether a deviation ought to be considered significant or not.” R.A. Fisher

Neyman & Pearson proposed to use the p-value to formalize a


decision making process. After your investigation, you either
reject the null hypothesis and accept an alternative hypothesis,
or vice versa.
The Higgs Boson discovery at 5σ

“The test statistic we use for looking at p-values is basically the likelihood ratio for the two hypotheses (H0 = Standard Model (S.M.) of Particle Physics, but no Higgs; H1 = S.M. with Higgs). A small p0 (and a reasonable p1) then implies that H1 is a better description of the data than H0. This of course does not prove that H1 is correct, but maybe Nature corresponds to some H2, which is more like H1 than it is like H0. Indeed in principle data will never prove a theory is true, but the more experimental tests it survives, the happier we are to use it – e.g. Newtonian mechanics was fine for centuries till the arrival of Relativity.” Louis Lyons, CERN

The Role of Statistics in Science

Statistics is
... a formal way to deal with uncertainty in data
finding generalizable patterns in observations
collecting, visualizing, analyzing, finding and then testing
hypotheses.
Statistical tests are used to decide if a statement/hypothesis
is supported by data
important paradigm in scientific communication
Control of data quality
Use statistical reasoning to optimally design experiments
Statistical (also Machine) Learning approaches help predict
the future

Statistical Methods for Bioinformatics: Part I
Content

Lecture 1: Linear regression and correlation
Lecture 2: Generalized linear models: Logistic Regression. Model building and model selection using the Akaike Information Criterion (AIC)
Lecture 3: Multilevel Models: Longitudinal data
Lecture 4: Multilevel Models: Cluster data
Lecture 5: Missing data

Statistical Methods for Bioinformatics: Part II

26-3: Lecture 1: Bias and Variance trade-off, Cross-validation &amp; Bootstrap
16-4: No class (conference)
23-4 &amp; 30-4: Lectures 2 &amp; 3: High Dimensionality; Ridge, LASSO, PCR and Partial Least Squares
07-5: Lecture 4: Beyond Linearity; regression splines, smoothing splines, LOESS, Generalized Additive Models; Assignment handed out
14-5: Lecture 5: Trees, Bagging and Boosting
21-5: Remaining material, Revision and Exam preparation; Assignment due

Reading material

Required book:
An Introduction to Statistical Learning, G. James, D. Witten, T. Hastie &amp; R. Tibshirani (https://www.statlearning.com/)
Many of the examples and figures are from this book.
Recommended reading:
An Introduction to Generalized Linear Models, Annette J. Dobson, Chapman &amp; Hall/CRC, 2002

Course rule book
Keep up!
Later lessons build on earlier lessons.
It is a lot of material; waiting till the end may cause trouble
The course will include both Theory and Practical Skills
Later contact moments will not have a lecture: lectures are
recorded and on Toledo.
Contact moments will be dedicated to discussing questions,
and exercises.
For each class there is a reading assignment. Up until the day before the class you can ask questions, which will then be discussed during class.
The exercises will be in R. Let me know if you are unfamiliar
with R!
Evaluation
1 graded assignment, counting for 4/20 pts (for part II)
Exam with theoretical questions and computer exercises
Planning today

A general approach for statistical modeling
Bias-Variance trade-off
Relevant for the choice of model, e.g. a non-linear vs a linear model
The challenges of high-dimensional datasets
In biology today, many datasets have observations for many variables. This causes specific challenges with modeling.
The linear model and its assumptions
Re-sampling approaches
Validation set
Cross-Validation
Bootstrap

1. Statistical Modeling: survey the data

Any analysis of data should begin with a consideration of each variable separately, to check data quality but also to understand how a model could be formulated.
1 What kind of response and explanatory variables do you have?
Binary, Categorical, Ordinal
Continuous
Proportion
Count
Time to death (survival)
2 What is the shape of the distribution? (e.g. look at histograms)
3 Do you see associations with other variables? (e.g. look at scatter plots)
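A minimal first-survey sketch in R; the data frame `d` and its column names are hypothetical:

str(d)            # variable types: numeric, factor, ordered factor, ...
summary(d)        # ranges, quartiles, missing values: a quick quality check
hist(d$response)  # shape of the response distribution
pairs(d)          # pairwise scatter plots to spot associations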

2. Statistical Modeling: choose a model
Typical case: one response variable and several explanatory
variables.
There is no perfect method for all data
A single perfect model is rare; different models can be fit with
good performance.
Which level of complexity is adequate? Avoid overly complex
models with limited benefit.

3. Statistical Modeling: Fitting parameters
The most commonly used estimation methods are maximum
likelihood and least squares.
Maximum likelihood: given the data and the choice of model,
what values of the parameters of the model make the observed
data most likely?
Least squares: find the fit for which $S = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$ is minimal
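A small R sketch, assuming a hypothetical data frame `d` with response `y` and predictor `x`; for Gaussian errors, least squares and maximum likelihood give identical estimates:

fit_ls <- lm(y ~ x, data = d)                        # least squares fit
fit_ml <- glm(y ~ x, data = d, family = gaussian())  # maximum likelihood fit
coef(fit_ls)
coef(fit_ml)  # the same coefficients for Gaussian errors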

4. Statistical Modeling: Checking the model

Have a look at the fit and the residuals (the differences between the predicted Y and the actual Y).
Is the fit good over the whole range? Evidence of non-linearity?
Do the residuals behave as expected, e.g. approximately
normal, with mean 0?
Could the model be simplified?
The law of parsimony (otherwise known as Occam’s Razor)
dictates that no more causes should be assumed than will
account for the effect.
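In R, the standard diagnostic plots cover exactly these checks; `fit` is a hypothetical lm() object:

par(mfrow = c(2, 2))
plot(fit)             # residuals vs fitted, normal QQ-plot, scale-location, leverage
mean(residuals(fit))  # should be approximately 0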

Thinking about modeling: a single predictor

How do you decide which statistical learning method to choose?
How do you choose how complex or flexible a model should be?
For example: is a variable significantly predictive in a linear regression?
For a function $\hat{f}(x)$ that estimates the relationship $f(x)$ between predictor $x$ and the response variable $Y$:
The expected squared error: $\mathrm{Err}(x) = E[(Y - \hat{f}(x))^2]$
The error has reducible and irreducible components: $\mathrm{Err}(x) = (f(x) - \hat{f}(x))^2 + \mathrm{Var}(\epsilon)$, with $\mathrm{Var}(\epsilon)$ the variance of the irreducible error. The reducible error can be split up further:
$\mathrm{Err}(x_0) = \mathrm{Var}(\hat{f}(x_0)) + \mathrm{Bias}(\hat{f}(x_0))^2 + \mathrm{Var}(\epsilon)$
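The decomposition can be checked by simulation. A minimal sketch in R, assuming a hypothetical true function f(x) = sin(x), noise with sd 0.5, and a degree-5 polynomial as the model:

set.seed(1)
f <- function(x) sin(x); x0 <- 2; sigma <- 0.5
preds <- replicate(2000, {
  x <- runif(50, 0, 2 * pi)
  y <- f(x) + rnorm(50, sd = sigma)
  predict(lm(y ~ poly(x, 5)), newdata = data.frame(x = x0))
})
c(variance = var(preds),            # Var(fhat(x0))
  bias2 = (mean(preds) - f(x0))^2,  # Bias(fhat(x0))^2
  irreducible = sigma^2)            # Var(eps); the three terms sum to approx. Err(x0)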

The perennial trade-off: bias vs variance

One wants a model that captures the regularities in training data, but also generalizes well to unseen data.

$E(y_0 - \hat{f}(x_0))^2 = \mathrm{Var}(\hat{f}(x_0)) + \mathrm{Bias}(\hat{f}(x_0))^2 + \mathrm{Var}(\epsilon)$

with Variance: $\mathrm{Var}(\hat{f}) = E((E(\hat{f}) - \hat{f})^2)$
and Bias: $\mathrm{Bias}(\hat{f})^2 = E((f - E(\hat{f}))^2)$

From “Understanding the Bias-Variance Tradeoff” by Scott Fortmann-Roe
Bias Variance Trade-Off
Models with high bias are intuitively simple models: there are restrictions on the kind of regularities that can be learned (e.g. linear classifiers).
These models tend to underfit, i.e. fail to learn the relationship between the target variable and the features.
Models with high variance are those that can learn many kinds of complex regularities.
These models can also learn noise in the training data, i.e. overfit.

For example: a linear or a non-linear fit?
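A short R illustration of the choice, with simulated data from a hypothetical non-linear truth:

set.seed(1)
x <- runif(100, 0, 2 * pi)
y <- sin(x) + rnorm(100, sd = 0.3)
plot(x, y)
abline(lm(y ~ x), col = "red")            # linear fit: high bias, misses the curvature
lines(smooth.spline(x, y), col = "blue")  # smoothing spline: flexible, lower bias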

Another example: which decision boundary in a classifier?

Another example: which decision boundary in a classifier?

Images from Abhishek Ghose’s blog
Bias and variance trade-off: a crucial concept

Returns throughout the course
How complex or flexible should the model be?
Relevant e.g. in the part on non-linear models and Random Forests
Degrees of freedom in parametric flexible models (e.g. smoothing splines) and non-parametric flexible models (e.g. kernel smoothing)

Progress

A general approach for statistical modeling
Bias-Variance trade-off
Relevant for the choice of model, e.g. a non-linear vs a linear model
The challenges of high-dimensional datasets
In biology today, many datasets have observations for many variables. This causes specific challenges with modeling.
The linear model and its assumptions
Re-sampling approaches
Validation set
Cross-Validation
Bootstrap

Curse of dimensionality

In this age of Big Data we often have many predictors and high-dimensional datasets.
High-dimensional datasets pose special challenges with respect to statistical learning: the Curse of Dimensionality.

The Challenges of High Dimensionality

When your number of dimensions is on the order of the number of observations, you have a problem fitting data: you get saturated systems, with limited fitting/generalization
a line fits every 2 points in 2D perfectly
a plane fits every 3 points in 3D perfectly
a hyperplane fits every 4 points in 4D perfectly
Also, in high dimensions the distance between observations tends to be larger (see the sketch below)
Sparse observations make it hard to make predictions
The complexity of functions of many variables can grow exponentially with D. For accurate fitting of the parameters an exponential increase in observations would be needed.
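A quick simulation sketch in R, assuming points drawn uniformly from the unit hypercube, shows how the distance to the nearest neighbour grows with the dimension D:

set.seed(1)
nn_dist <- function(D, n = 100) {
  X <- matrix(runif(n * D), n, D)  # n points in the D-dimensional unit cube
  dm <- as.matrix(dist(X)); diag(dm) <- Inf
  mean(apply(dm, 1, min))          # average nearest-neighbour distance
}
sapply(c(2, 10, 100), nn_dist)     # increases sharply with D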

Curse of dimensionality

What are the consequences of high dimensionality for bias and variance?

High dimensional datasets

How do you decide which predictors (or all of them) to keep in your modeling?
The methods discussed in the 2nd and 3rd classes deal properly with high dimensionality
Considerations for interpreting analyses of high-dimensional datasets follow in the 3rd lecture.

Linear Models: powerful simplicity

$Y = \beta_0 + \beta_1 x_1 + \ldots + \beta_m x_m + \epsilon$

Easy interpretability: $\beta_j$ is the average increase in Y when $x_j$ increases by one and the other variables are held constant.
Estimate parameters with e.g. Least Squares: $\operatorname{argmin}_{\beta_0 \ldots \beta_m} \mathrm{MSE} = \frac{1}{n}\sum_i (Y_i - \hat{Y}_i)^2$, with $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \ldots + \hat{\beta}_m x_m$
A measure of fit can be the coefficient of determination, range [0, 1], the fraction of explained variance: $R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$
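In R, for a hypothetical data frame `d` with response `y` and predictors `x1` and `x2`:

fit <- lm(y ~ x1 + x2, data = d)
summary(fit)$r.squared    # coefficient of determination R^2
mean(residuals(fit)^2)    # the training MSE minimized by least squares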

Linear Models for Essential Questions
Through a linear model you can test or evaluate the following questions:

Do two variables have a relationship? Quantify the evidence!
How strong is the relationship?
Can we distinguish between variables? Which is most
predictive?
Is the relationship linear, or is there evidence for a non-linear
effect?
Are there interactions between variables?
How well can we predict a variable knowing a single or set of
other variables?

Testing if a coefficient is relevant

Assuming normality of the error distribution:
For the estimated mean of a normal distribution we have $SE(\hat{\mu})^2 = \frac{\sigma^2}{n}$; you can test significance with a t-test and calculate 95% confidence intervals.
For $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 x_1$, we can write
$SE(\hat{\beta}_0)^2 = \sigma^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right)$ and $SE(\hat{\beta}_1)^2 = \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$, with $\sigma^2 = \mathrm{var}(\epsilon)$
The test statistic follows from $t = \frac{\hat{\beta}_i}{SE(\hat{\beta}_i)}$
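These are exactly the quantities R’s summary() reports for a fitted linear model (`fit` is a hypothetical lm() object):

summary(fit)$coefficients  # columns: Estimate, Std. Error, t value, Pr(>|t|)
confint(fit)               # 95% confidence intervals for the coefficients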

The assumptions of linear regression

1 Linearity of the relationship
Violations will cause problems when you try to make predictions
Does a transformation of the data produce a linear relationship?
2 Normality of the errors
3 Independence of the errors
Random effects can cause problems, e.g. plants grown in plots
Repeated measurements of the same subject, e.g. at different times or under different conditions
4 Homoscedasticity (equal variance over the predictions)
Affects confidence intervals

Potential Fit Problems

There are a number of possible problems that one may encounter when fitting the linear regression model. In addition to looking at the performance statistics RSE and $R^2$, one should analyze and plot the data. Graphical summaries can reveal problems with a model.
1 Non-linearity of the data
2 Dependence of the error terms
3 Non-constant variance of error terms
4 Outliers
5 High leverage points
6 Collinearity

Challenges with models

How do you compare models?
Parameter selection
Random forests vs a GLM?
How do you decide on the significance of a test, or the confidence intervals of a coefficient in a linear model, if the assumptions underlying the statistical distributions are violated?

Re-sampling Methods

Introduction
Single validation set
Cross Validation
Leave-one-out Cross Validation
K-fold Cross Validation
Bias-Variance Trade-off for k-fold Cross Validation
Bootstrap

Re-sampling Methods

Methods that draw a sample or samples from a training set and fit a model on each sample, to obtain more information about the model’s properties
Model Assessment: estimate test error rates
Model Selection: select the appropriate level of model flexibility
Can be computationally expensive, but in exceptional cases nearly free
Re-sampling methods:
Cross Validation
Bootstrapping

Classical validation set approach
Find a set of variables that gives the lowest test (instead of training) error rate
If we have a large data set, we can achieve this goal by randomly splitting the data into training and validation (testing) parts
Build models on the training part; choose the model with the lowest error rate when applied to the validation data
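A minimal sketch of the split in R, for a hypothetical data frame `d`:

set.seed(1)
train <- sample(nrow(d), nrow(d) / 2)   # random 50/50 split
fit <- lm(y ~ x, data = d[train, ])
pred <- predict(fit, newdata = d[-train, ])
mean((d$y[-train] - pred)^2)            # validation MSE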

Validation set approach

Advantages:
Simple
Easy to implement
Disadvantages:
The validation performance estimate (e.g. the Mean Squared Error, $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (\hat{Y}_i - Y_i)^2$) can be highly variable
Only a subset of observations is used to fit the model (the training data). Statistical methods tend to perform worse when trained on fewer observations.

Leave-One-Out Cross Validation (LOOCV)

Similar to the Validation Set Approach, but it tries to address the latter’s disadvantages
For each suggested model, do:
Split the data set of size n into
a training data set of size n − 1
a validation data set of size 1
Fit the model using the training data
Validate the model using the validation data, and compute the corresponding MSE
Repeat this process n times
The MSE for the model is $CV_{(n)} = \frac{1}{n}\sum_{i=1}^{n} \mathrm{MSE}_i$
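An explicit LOOCV loop in R (cv.glm() from the boot package automates this); the data frame `d` is hypothetical:

n <- nrow(d)
mse <- numeric(n)
for (i in 1:n) {
  fit <- lm(y ~ x, data = d[-i, ])             # train on n - 1 observations
  mse[i] <- (d$y[i] - predict(fit, d[i, ]))^2  # test on the left-out point
}
mean(mse)                                      # the LOOCV estimate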

LOOCV vs Validation set approach

LOOCV has less bias
almost all of the data set is used
LOOCV produces a less variable MSE estimate
the effect of the randomness of the splitting process is reduced
LOOCV can be computationally intensive
the model is fit n times
Except with least squares linear or polynomial regression: here a shortcut makes LOOCV cost the same as a single model fit. The following holds:

$CV_{(n)} = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat{y}_i}{1 - h_i} \right)^2$

with $h_i$ the leverage of data point i
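In R the shortcut needs only the single full fit; hatvalues() returns the leverages h_i:

fit <- lm(y ~ x, data = d)            # one fit on all n observations (`d` hypothetical)
h <- hatvalues(fit)
mean((residuals(fit) / (1 - h))^2)    # identical to the LOOCV estimate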

k-fold Cross Validation

We randomly divide the data set into k folds
each fold is used in turn as the validation set, with the remainder as training
The estimated error rate is simply the average MSE over the folds
Stable like LOOCV, but, among other advantages, less computationally intensive

MSE for simulated data: true test MSE in blue, LOOCV as a black dashed line,
10-fold CV estimate in orange. Crosses indicate minimum of MSE curves.
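A compact 10-fold CV sketch in R, again for a hypothetical data frame `d`:

set.seed(1)
k <- 10
fold <- sample(rep(1:k, length.out = nrow(d)))  # random fold labels
cv_mse <- sapply(1:k, function(j) {
  fit <- lm(y ~ x, data = d[fold != j, ])       # train on k - 1 folds
  mean((d$y[fold == j] - predict(fit, d[fold == j, ]))^2)
})
mean(cv_mse)                                    # k-fold CV error estimate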

Bias-Variance trade-off for CV

LOOCV has less bias than k-fold CV (k << n)
larger training sets
LOOCV tends to have higher variance than k-fold CV (k << n)
the learning sets are very similar!
Correlated learning sets inflate variance (a deep statistical truth)
LOOCV can be useful, but normally a trade-off is made: k-fold CV with k = 5 or k = 10
Empirical evidence shows this avoids excessively high bias and high variance in test error rate estimates.

Re-sampling Methods

When considering learning models, where should we apply CV?
Parameter Selection: we select the most informative parameter(s) for a given classification problem
Model Selection: once we have chosen a set of parameters, how should we estimate the true error rate of a model?
The true error rate is the classifier’s error rate when tested on the entire population, i.e. the full data-generating distribution

Bootstrap
The bootstrap is a resampling technique with replacement
From a dataset with n examples:
randomly select (with replacement) n examples and use this set for training
the remaining examples that were not selected for training are used for testing; the size of this test set varies from replicate to replicate
Repeat this process for a specified number of replicates (k)
The true error is estimated as the average error rate on the test data
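One bootstrap replicate in R, as a sketch (the data frame `d` is hypothetical):

set.seed(1)
n <- nrow(d)
idx <- sample(n, n, replace = TRUE)          # bootstrap sample of size n
oob <- setdiff(1:n, idx)                     # not selected: used for testing
fit <- lm(y ~ x, data = d[idx, ])
mean((d$y[oob] - predict(fit, d[oob, ]))^2)  # error on the held-out data
# repeat k times and average the test errors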

Why Bootstrap

Compared to basic CV, the bootstrap increases the variance that can occur in each replicate
This is desirable: it is a more realistic simulation of the real-life experiment from which our dataset was obtained
Sampling with replacement preserves the a priori probabilities of the classes throughout the random selection process
The bootstrap provides accurate measures of both the bias and variance of the estimator
The method is mostly used to assess uncertainty / standard errors / confidence intervals of estimators
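With the boot package this looks as follows: a sketch for the standard error of a regression slope, with a hypothetical data frame `d`:

library(boot)
slope <- function(data, idx) coef(lm(y ~ x, data = data[idx, ]))[2]
b <- boot(d, slope, R = 1000)   # 1000 bootstrap replicates
sd(b$t)                         # bootstrap standard error of the slope
boot.ci(b, type = "perc")       # percentile confidence interval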

Can Bootstrap estimate Prediction Error?

In cross-validation, each of the K validation folds is distinct from the other K − 1 folds used for training: there is no overlap. This is crucial for its success. (Why?)
To estimate prediction error using the bootstrap, the observations not selected by the bootstrap sample are used as a test set
But your test set varies in size, and you may see some observations in your test sets more often than others.

To do:

Reading for the current class
Chapters 2 and 5

Preparation for next class (obligatory)
Read part of chapter 6: sections 6.1 and 6.2
Watch the videos
Send in any questions the day before class

Exercises
Lab of chapter 5
Chapter 5, exercises 1, 4, 5, 6 &amp; 8

