Applied Statistics For Bioinformatics PDF
Wim P. Krijnen
Preface
The purpose of this book is to give an introduction into statistics in order
to solve some problems of bioinformatics. Statistics provides procedures to
explore and visualize data as well as to test biological hypotheses. The book
intends to be introductory in explaining and programming elementary statistical
concepts, thereby bridging the gap between high-school level and the
specialized statistical literature. After studying this book readers have a
sufficient background for Bioconductor Case Studies (Hahne et al., 2008) and
Bioinformatics and Computational Biology Solutions Using R and Bioconductor
(Gentleman et al., 2005). The theory is kept minimal and is always
illustrated by several examples with data from research in bioinformatics.
The prerequisites for following the stream of reasoning are limited to basic
high-school knowledge about functions. It may, however, help to have some
knowledge of gene expression values (Pevsner, 2003) or statistics (Bain &
Engelhardt, 1992; Ewens & Grant, 2005; Rosner, 2000; Samuels & Witmer, 2003),
and elementary programming. To support self-study, a sufficient number of
challenging exercises is given, together with an appendix with answers.
The programming language R is becoming increasingly important because
it is not only very flexible in reading, manipulating, and writing data, but
all its outcomes are directly available as objects for further programming.
R is a rapidly growing language that makes basic as well as advanced statistical
programming easy. From an educational point of view, R provides the
possibility of combining the learning of statistical concepts with mathematics,
programming, and visualization. The plots and tables produced by R can
readily be used in text-processing systems such as Emacs, LaTeX, or Word.
Chapter 1 gives a brief introduction into the basic functionalities of R.
Chapter 2 starts with univariate data visualization and the most important
descriptive statistics. Chapter 3 gives commonly used discrete and continuous
distributions to model events and the probability by which these occur. These
distributions are applied in Chapter 4 to statistically test hypotheses from
bioinformatics. For each test the statistics involved are briefly explained and
its application is illustrated by examples. In Chapter 5 linear models are
explained and applied to testing for differences between groups. It gives a basic
approach. In Chapter 6 the three phases of analysis of microarray data
(preprocessing, analysis, post processing) are briefly introduced and illustrated
by many examples bringing ideas together with R scripts and interpretation of
results. Chapter 7 starts with an intuitive approach into Euclidean distance
and explains how it can be used in two well-known types of cluster analysis to
find groups of genes. It also explains how principal components analysis can
be used to explore a large data matrix for the direction of largest variation.
Chapter 8 shows how gene expressions can be used to predict the diagnosis
of patients. Three such prediction methods are illustrated and compared.
Chapter 9 introduces a query language to download sequences efficiently and
gives various examples of computing important quantities such as alignment
scores. Chapter 10 introduces the concept of a probability transition matrix,
which is applied to the estimation of phylogenetic trees and (Hidden) Markov
Models.
R commands come after the prompt >, except when they are part
of the ongoing text. Input and output of R are given in verbatim
typewriter style. To save space, sometimes not all of the original output
from R is printed. The end of an example is indicated by a small box. In
the Portable Document Format (PDF) version there are many links to the Index,
Table of Contents, Equations, Tables, and Figures. Readers are encouraged
to copy and paste scripts from the PDF into the R system in order to study
their outcome. Apart from using the book to study the application of statistics
in bioinformatics, it can also be useful for learning statistical programming.
I would like to thank my colleagues Joop Bouman, Sven Warris and Jan
Peter Nap for their useful remarks on parts of an earlier draft. Many thanks
also go to my students for asking questions that gave hints to improve clarity.
Remarks to further improve the text are appreciated.
1. This document falls under the GNU Free Documentation License and may be used freely
for educational purposes.
Contents
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
3 Important Distributions 31
3.1 Discrete distributions . . . . . . . . . . . . . . . . . . . . . . . 31
3.1.1 Binomial distribution . . . . . . . . . . . . . . . . . . . 31
3.2 Continuous distributions . . . . . . . . . . . . . . . . . . . . . 34
3.2.1 Normal distribution . . . . . . . . . . . . . . . . . . . . 35
3.2.2 Chi-squared distribution . . . . . . . . . . . . . . . . . 37
3.2.3 T-Distribution . . . . . . . . . . . . . . . . . . . . . . . 39
3.2.4 F-Distribution . . . . . . . . . . . . . . . . . . . . . . . 40
3.2.5 Plotting a density function . . . . . . . . . . . . . . . . 41
3.3 Overview and concluding remarks . . . . . . . . . . . . . . . . 42
3.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5 Linear Models 73
5.1 Definition of linear models . . . . . . . . . . . . . . . . . . . . 74
5.2 One-way analysis of variance . . . . . . . . . . . . . . . . . . . 77
5.3 Two-way analysis of variance . . . . . . . . . . . . . . . . . . 83
5.4 Checking assumptions . . . . . . . . . . . . . . . . . . . . . . 85
5.5 Robust tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.6 Overview and concluding remarks . . . . . . . . . . . . . . . . 88
5.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
B References 257
List of Figures
8.1 ROC plot for expression values of CCND3 Cyclin D3. . . . . . 149
8.2 ROC plot for expression values of gene Gdf5. . . . . . . . . . 149
8.3 Boxplot of expression values of gene a for each leukemia class. 151
8.4 Classification tree for gene for three classes of leukemia. . . . . 151
8.5 Boxplot of expression values of gene a for each leukemia class. 154
8.6 Classification tree of expression values from gene A, B, and C
for the classification of ALL1, ALL2, and AML patients. . . . 154
8.7 Boxplot of expression values from gene CCND3 Cyclin D3 for
ALL and AML patients . . . . . . . . . . . . . . . . . . . . . 156
Chapter 1

Brief Introduction into Using R
In this and the following chapters we will illustrate many statistical ideas
by the Golub et al. (1999) data; see also Section 1.8. The golub data become
available as follows.
> library(multtest)
> data(golub)
R is object-oriented in the sense that everything consists of objects belonging
to certain classes. Type class(golub) to obtain the class of the object golub
and str(golub) to obtain its structure or content. Type objects() or ls()
to view the currently loaded objects, a list that will probably soon grow large.
To prevent conflicting definitions, it is wise to remove them all at the end of
a session by rm(list=ls()). To quit a session, type q(), or simply click on
the cross in the upper right corner of your screen.
search, Rseek, or other useful search engines. There are a number of useful
URLs with information on R.5
This sequence can also be produced by the function seq, which allows for
various sizes of steps to be chosen. For instance, in order to compute per-
centiles of a distribution we may want to generate numbers between zero and
one with step size equal to 0.1.
> seq(0,1,0.1)
[1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
The three conditions are often called levels of a factor. Each of these
levels has five repeats corresponding to the number of observations (patients)
within each level (type of disease). We shall further illustrate the idea of a
factor soon because it is very useful for purposes of visualization.
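Such a factor can be generated directly; a sketch using the gl function (three levels with five repeats each; the numeric level labels are R's default):

```r
# gl(n, k): a factor with n levels, each repeated k times
fac <- gl(3, 5)
length(fac)    # 15 observations in total
levels(fac)    # "1" "2" "3"
```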
Now we have created the object gene1 containing three gene expression
values. To compute the sum, mean, and standard deviation of the gene
expression values we use the corresponding built-in functions.
> sum(gene1)
[1] 3.75
> mean(gene1)
[1] 1.25
> sum(gene1)/3
[1] 1.25
> sd(gene1)
[1] 0.25
> sqrt(sum((gene1-mean(gene1))^2)/2)
[1] 0.25
By defining $x_1 = 1.00$, $x_2 = 1.50$, and $x_3 = 1.25$, the sum of the weights can
be expressed as $\sum_{i=1}^{n} x_i = 3.75$. The mathematical summation symbol $\sum$ is
in R language simply sum. The mean is denoted by $\bar{x} = \sum_{i=1}^{3} x_i / 3 = 1.25$
and the sample standard deviation as
$$ s = \sqrt{\sum_{i=1}^{3} (x_i - \bar{x})^2 / (3 - 1)} = 0.25. $$
Before constructing the matrix it is convenient to add the names of the rows
and the columns. To do so we construct the following list.
> rowcolnames <- list(c("gene1","gene2","gene3","gene4"),
+ c("Eric","Peter","Anna"))
After the last comma in the first line we give a carriage return for R to come
up with a new line starting with + in order to complete a command. Now we
can construct a matrix containing the expression values from our four genes,
as follows.
> gendat <- matrix(c(gene1,gene2,gene3,gene4), nrow=4, ncol=3,
+ byrow=TRUE, dimnames = rowcolnames)
Here, nrow indicates the number of rows and ncol the number of columns.
The gene vectors are placed in the matrix as rows. The names of the rows
and columns are attached by the dimnames parameter. To see the content of
the just created object gendat, we print it to the screen.
> gendat
Eric Peter Anna
gene1 1.00 1.50 1.25
gene2 1.35 1.55 1.30
gene3 -1.10 -1.50 -1.25
gene4 -1.20 -1.30 -1.00
A matrix such as gendat has two indices [i,j], the first of which refers to
rows and the second to columns.9 Thus, if you want to print the second
element of the first row to the screen, then type gendat[1,2]. If you want
to print the first row, then use gendat[1,]. For the second column, use
gendat[,2].
It may be desirable to write the data to a file for use at a later
stage or to send them to a colleague of yours. Consider the following script.
> write.table(gendat,file="D:/data/gendat.Rdata")
> gendatread <- read.table("D:/data/gendat.Rdata")
> gendatread
Eric Peter Anna
gene1 1.00 1.50 1.25
9. Indices referring to rows, columns, or elements are always between square brackets [].
> apply(gendat,2,mean)
Eric Peter Anna
0.0125 0.0625 0.0750
> apply(gendat,1,mean)
gene1 gene2 gene3 gene4
1.250000 1.400000 -1.283333 -1.166667
Thus gene2 appears first because it has the largest mean, 1.4, then gene1
with 1.25, followed by gene4 with -1.17 and, finally, gene3 with -1.28. Now
that we have collected the order numbers in the vector o, we can re-order
the whole matrix by specifying o as the row index.11
> gendat[o,]
Eric Peter Anna
gene2 1.35 1.55 1.30
gene1 1.00 1.50 1.25
gene4 -1.20 -1.30 -1.00
gene3 -1.10 -1.50 -1.25
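The construction of the vector o was elided in this excerpt; a self-contained sketch consistent with the printed output (gendat re-created from its values above):

```r
# Re-create gendat from the values printed above
gendat <- matrix(c( 1.00,  1.50,  1.25,
                    1.35,  1.55,  1.30,
                   -1.10, -1.50, -1.25,
                   -1.20, -1.30, -1.00),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(c("gene1","gene2","gene3","gene4"),
                                 c("Eric","Peter","Anna")))
# Order numbers of the row means, from large to small
o <- order(apply(gendat, 1, mean), decreasing = TRUE)
gendat[o, ]   # rows appear as gene2, gene1, gene4, gene3
```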
> gendat[c(1,2),]
Eric Peter Anna
gene1 1.00 1.50 1.25
gene2 1.35 1.55 1.30
> gendat[c("gene1","gene2"),]
Eric Peter Anna
gene1 1.00 1.50 1.25
gene2 1.35 1.55 1.30
Now we can use the evaluation of meanexprsval > 0 in terms of the values
TRUE or FALSE as a row index.
11. You can also use functions like sort or rank.
Observe that this selects genes for which the evaluation equals TRUE. This
illustrates that genes can be selected by their row index, row name or value
on a logical variable.
1.8 Application to the Golub (1999) data

The data are stored in a matrix called golub. The number of rows and
columns can be obtained by the functions nrow and ncol, respectively.
> nrow(golub)
[1] 3051
> ncol(golub)
[1] 38
12. The data are pre-processed by procedures described in Dudoit et al. (2002).
So the matrix has 3051 rows and 38 columns, see also dim(golub). Each
data element has a row and a column index. Recall that the first index refers
to rows and the second to columns. Hence, the second value from row 1042
can be printed to the screen as follows.
> golub[1042,2]
[1] 1.52405
So 1.52405 is the expression value of gene CCND3 Cyclin D3 from patient
number 2. The values of the first column can be printed to the screen by the
following.
> golub[,1]
To save space the output is not shown. We may now print the expression
values of gene CCND3 Cyclin D3 (row 1042) to the screen.
> golub[1042,]
[1] 2.10892 1.52405 1.96403 2.33597 1.85111 1.99391 2.06597 1.81649
[9] 2.17622 1.80861 2.44562 1.90496 2.76610 1.32551 2.59385 1.92776
[17] 1.10546 1.27645 1.83051 1.78352 0.45827 2.18119 2.31428 1.99927
[25] 1.36844 2.37351 1.83485 0.88941 1.45014 0.42904 0.82667 0.63637
[33] 1.02250 0.12758 -0.74333 0.73784 0.49470 1.12058
To print the expression values of gene CCND3 Cyclin D3 to the screen for
only the ALL patients, we have to refer to the first twenty-seven elements of
row 1042. A possibility to do so is the following.
> golub[1042,1:27]
However, for the work ahead it is much more convenient to construct a factor
indicating the tumor class of the patients. This will turn out to be useful,
e.g., for separating the tumor groups in various visualization procedures. The
factor will be called gol.fac and is constructed from the vector golub.cl,
as follows.
> gol.fac <- factor(golub.cl, levels=0:1, labels = c("ALL","AML"))
In the sequel this factor will be used frequently. Obviously, the labels
correspond to the two tumor classes. The evaluation of gol.fac=="ALL" returns
TRUE for the first twenty-seven values and FALSE for the remaining eleven.
This is useful as a column index for selecting the expression values of the
ALL patients. The expression values of gene CCND3 Cyclin D3 from the
ALL patients can now be printed to the screen, as follows.
> golub[1042,gol.fac=="ALL"]
For many types of computations it is very useful to combine a factor with
the apply functionality. For instance, to compute the mean gene expression
over the ALL patients for each of the genes, we may use the following.
> meanALL <- apply(golub[,gol.fac=="ALL"], 1, mean)
The row means of the expression values per patient group are computed and
stored in the object mall and maml, respectively. The absolute values of the
differences in means are computed and their order numbers (from large to
small) are stored in the vector o. Next, the names of the five genes with the
largest differences in mean are printed to the screen.
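The script itself did not survive extraction; a sketch consistent with the description, assuming the golub objects from the multtest package are loaded and that the second column of golub.gnames contains the gene names:

```r
library(multtest); data(golub)
gol.fac <- factor(golub.cl, levels = 0:1, labels = c("ALL", "AML"))
mall <- apply(golub[, gol.fac == "ALL"], 1, mean)   # row means over ALL patients
maml <- apply(golub[, gol.fac == "AML"], 1, mean)   # row means over AML patients
o <- order(abs(mall - maml), decreasing = TRUE)     # order numbers, large to small
print(golub.gnames[o[1:5], 2])                      # five genes with largest differences
```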
After saving the script under, e.g., the name meandif.R in the directory
D:\\Rscripts, it can be executed by source("D:\\Rscripts\\meandif.R").
Once the script is available in an editor it is easy to adapt it and to re-run
it.
Readers are strongly encouraged to experiment by trial and error with
writing programming scripts. To run these it is very convenient to have your
favorite word processor available and to use, for instance, the copy-and-paste
functionality.
sponding scientific articles are freely available from the web. Having these
available may further motivate readers for the computations ahead.
1.11 Exercises
1. Some questions to orient yourself.
(a) Use the function class to find the class to which the follow-
ing objects belong: golub, golub[1,1],golub.cl, golub.gnames,
apply, exp, gol.fac, plot, ALL.
(b) What is the meaning of the following abbreviations: rm, sum, prod,
seq, sd, nrow.
(c) For what purpose are the following functions useful: grep, apply,
gl, library, source, setwd, history, str.
(b) Select the expression values of the genes with standard deviation
larger than two.
(c) How many genes have this property?
(a) How many oncogenes are there in the dataset? Hint: Use grep.
(b) Find the biological names of the three oncogenes with the largest
mean expression value for the ALL patients.
(c) Do the same for the AML patients.
(d) Write the gene probe ID and the gene names of the ten genes with
largest mean gene expression value to a csv file.
7. Gene means for B1 patients. Load the ALL data from the ALL library
and use str and openVignette() for a further orientation.
Chapter 2

Data Display and Descriptive Statistics

A few essential methods are given to display and visualize data. This chapter
quickly answers questions like: How are my data distributed? How can the
frequencies of nucleotides from a gene be visualized? Are there outliers in my
data? Does the distribution of my data resemble that of a bell-shaped curve?
Are there differences between gene expression values taken from two groups of
patients?
The most important central tendencies (mean, median) are defined and
illustrated together with the most important measures of spread (standard
deviation, variance, inter quartile range, and median absolute deviation).
will be illustrated by the Zyxin gene, which plays an important role in cell
adhesion (Golub et al., 1999). The accession number (X94991.1) of one of
its variants can be found in a database like NCBI (UniGene). The code
below illustrates how to read the sequence X94991.1 of the species Homo
sapiens from GenBank and how to construct a pie chart from a frequency
table of the four nucleotides.
install.packages(c("ape"),repo="http://cran.r-project.org",dep=TRUE)
library(ape)
table(read.GenBank(c("X94991.1"),as.character=TRUE))
pie(table(read.GenBank(c("X94991.1"))))
From the resulting frequencies in Table 2.1 it seems that the nucleotides are
not equally likely. A nice way to visualize a frequency table is by plotting a
pie.
Table 2.1: Frequency table of the four nucleotides.

   A    C    G    T
 410  789  573  394
From the resulting Figure 2.2 it can be observed that the CCND3 Cyclin D3
expression values of the ALL patients tend to have larger expression values
than those of the AML patients.
2.1.3 Histogram
Another method to visualize data is by dividing the range of data values into
a number of intervals and to plot the frequency per interval as a bar. Such
a plot is called a histogram.
The function hist divides the data into 5 intervals of width 0.5;
see Figure 2.3. Observe from the latter that one value is small and the
others are more or less symmetrically distributed around the mean.
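The call that produced Figure 2.3 is not shown in this excerpt; presumably something like the following, assuming the golub data and the gol.fac factor from Chapter 1:

```r
# Histogram of the CCND3 Cyclin D3 expression values of the ALL patients
hist(golub[1042, gol.fac == "ALL"])
```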
[Figure residue: plots of golub[1042, ] against the patient index (Figure 2.1) and grouped into ALL and AML (Figure 2.2).]
2.1.4 Boxplot
It is always possible to sort n data values into increasing order x1 ≤ x2 ≤
… ≤ xn, where x1 is the smallest and x2 the second smallest, etc. Let
x0.25 be a number for which it holds that 25% of the data values x1, …, xn
are smaller. That is, 25% of the data values lie to the left of the number
x0.25, which is why it is called the first quartile or the 25th percentile.
The second quartile is the value x0.50 such that 50% of the data values are
smaller. Similarly, the third quartile or 75th percentile is the value x0.75 such
that 75% of the data is smaller. A popular method to display data is by
drawing a box between the first and the third quartile (with a bold line segment
for the median), and smaller line segments (whiskers) for the smallest and
the largest data values. Such a data display is known as a box-and-whisker
plot.
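In R such a display is produced by the boxplot function; a sketch for the CCND3 Cyclin D3 values per tumor class, assuming the golub data and gol.fac from Chapter 1:

```r
# One box-and-whisker plot per level of the factor gol.fac
boxplot(golub[1042, ] ~ gol.fac)
```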
Example 1. A vector with gene expression values can be put into increasing
order by the function sort. We shall illustrate this by the ALL expression
values of gene CCND3 Cyclin D3.
Figure 2.3: Histogram of ALL expression values of gene CCND3 Cyclin D3.
Figure 2.4: Boxplot of ALL and AML expression values of gene CCND3 Cyclin D3.
The first quartile x0.25 = 1.796, the second x0.50 = 1.928, and the third
x0.75 = 2.179. The smallest observed expression value equals x0.00 = 0.458
and the largest x1.00 = 2.77. The latter two can also be obtained by the
functions min(golub[1042, gol.fac=="ALL"]) and max(golub[1042, gol.fac=="ALL"]),
or more briefly by range(golub[1042, gol.fac=="ALL"]).
Outliers are data values lying far apart from the pattern set by the
majority of the data values. The implementation in R of the (modified)
boxplot draws such outlier points separately as small circles. A data point
x is defined as an outlier point if
$$ x < x_{0.25} - 1.5 \cdot (x_{0.75} - x_{0.25}) \quad \text{or} \quad x > x_{0.75} + 1.5 \cdot (x_{0.75} - x_{0.25}). $$
From Figure 2.4 it can be observed that there are outliers among the gene
expression values of the ALL patients. These are the smaller values 0.45827 and
1.10546, and the largest value 2.76610. The AML expression values have one
outlier with value -0.74333.
To define extreme outliers, the factor 1.5 is increased to 3.0. Note that this
is a descriptive way of defining outliers, instead of statistically testing for the
existence of an outlier.
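The outlier bounds can be computed directly from the quartiles; a sketch, again assuming the golub data and gol.fac from Chapter 1:

```r
x <- golub[1042, gol.fac == "ALL"]
q <- quantile(x, c(0.25, 0.75))
lower <- q[1] - 1.5 * IQR(x)    # x0.25 - 1.5 (x0.75 - x0.25)
upper <- q[2] + 1.5 * IQR(x)    # x0.75 + 1.5 (x0.75 - x0.25)
x[x < lower | x > upper]        # the points drawn separately by boxplot
```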
The closer the expression values lie to the line, the more likely it is that the
data are normally distributed.
Figure 2.5: Q-Q plot of ALL gene expression values of CCND3 Cyclin D3.
In the exercises below, the reader will gather more experience with the extent
to which gene expression values are normally distributed.
Thus the sample mean is simply the average of the n data values. Since it
is the sum of all data values divided by the sample size, a few extreme data
values may largely influence its size. In other words, the mean is not robust
against outliers.
The median is defined as the second quartile or the 50th percentile, and
is denoted by x0.50 . When the data are symmetrically distributed around the
mean, then the mean and the median are equal. Since extreme data values
do not influence the size of the median, it is very robust against outliers.
Robustness is important in bioinformatics because data are frequently con-
taminated by extreme or otherwise influential data values.
Note that the mean and the median do not differ much so that the distribu-
tion seems quite symmetric.
Hence, it is the average of the squared differences between the data values
and the sample mean. The sample standard deviation s is the square root
of the sample variance and may be interpreted as the distance of the data
values to the mean. The variance and the standard deviation are not robust
against outliers.
The interquartile range is defined as the difference between the third and
the first quartile, that is x0.75 - x0.25. It can be computed by the function
IQR(x). More specifically, the value IQR(x)/1.349 is a robust estimator of
the standard deviation. The median absolute deviation (MAD) is defined as
a constant times the median of the absolute deviations of the data from the
median (e.g. Jureckova & Picek, 2006, p. 63). In R it is computed by the
function mad, defined as the median of the sequence |x1 - x0.50|, …, |xn - x0.50|
multiplied by the constant 1.4826. It equals the standard deviation in case
the data come from a bell-shaped (normal) distribution (see Section 3.2.1).
Because the interquartile range and the median absolute deviation are based
on quantiles, these are robust against outliers.
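These estimators are easily compared; a sketch on the CCND3 Cyclin D3 values of the ALL patients, assuming the golub data and gol.fac from Chapter 1:

```r
x <- golub[1042, gol.fac == "ALL"]
sd(x)             # sample standard deviation, not robust
IQR(x) / 1.349    # robust estimate of the standard deviation
mad(x)            # median absolute deviation, default constant 1.4826
```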
Due to the three outliers (cf. Figure 2.4) the standard deviation is larger
than the interquartile range and the median absolute deviation. That is, the
absolute differences with respect to the median are somewhat smaller than
the root of the squared differences.
2.4 Exercises
Since the majority of the exercises are based on the Golub et al. (1999)
data, it is essential to make these available and to learn to work with it. To
stimulate self-study the answers are given at the end of the book.
(a) Compute the mean and the standard deviation for 1, 1.5, 2, 2.5, 3.
(b) Compute the mean and the standard deviation for 1, 1.5, 2, 2.5, 30.
(c) Comment on the differences.
2. Comparing normality for two genes. Consider the gene expression val-
ues in row 790 and 66 of the Golub et al. (1999) data.
(a) Produce a boxplot for the expression values of the ALL patients
and comment on the differences. Are there outliers?
(a) Determine the five genes with the largest effect size of the ALL
patients from the Golub et al. (1999) data. Comment on their
size.
(b) Invent a robust variant of the effect size and use it to answer the
previous question.
4. Plotting gene expressions "CCND3 Cyclin D3". Use the gene expres-
sions from "CCND3 Cyclin D3" of Golub et al. (1999) collected in row
1042 of the object golub from the multtest library. After using the
function plot you produce an object on which you can program.
(b) Add text to the plot to explain the meaning of the upper and
lower part of the box.
(c) Do the same for the whiskers.
(d) Export your plot to eps format.
Hint 1: Use locator() to find coordinates of the position of the plot.
Hint 2: Use xlim to make the plot somewhat wider.
Hint 3: Use arrows to add an arrow.
Hint 4: Use text to add information at a certain position.
[Figure residue: an example boxplot annotated with the labels Median and Outlier.]
Chapter 3

Important Distributions
Example 2. If two carriers of the gene for albinism marry, then each of their
children has probability 1/4 of being albino. What is the probability that
exactly one child out of three is albino? To answer this question we insert
n = 3, k = 1, and p = 0.25 into Equation (3.1) and obtain
$$ P(X = 1) = \frac{3!}{1!(3-1)!} \cdot 0.25^1 \cdot 0.75^2 = 3 \cdot 0.140625 = 0.421875. $$
An elementary manner to compute this in R is by
> choose(3,1)* 0.25^1* 0.75^2
where choose(3,1) computes the binomial coefficient. It is more efficient to
compute this by the built-in density function dbinom(k,n,p), for instance
to print the values of the probabilities.
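For instance, the hand computation above can be reproduced as follows (dbinom takes the number of successes first, then the number of trials and the success probability):

```r
dbinom(1, 3, 0.25)   # P(X = 1) = 0.421875
```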
1. For a binomially distributed variable, $np$ is the mean, $np(1-p)$ the variance, and
$\sqrt{np(1-p)}$ the standard deviation.
Changing d into p yields the so-called distribution function with the cumulative
probabilities. That is, the probability that the number of albino children is less
than or equal to two, P(X ≤ 2), is computed by pbinom(2,3,0.25). The
values of the density and distribution function are summarized in Table 3.1.
From the table we read that the probability of no albino child is 0.4218 and
the probability that all three children are albino equals 0.0156.
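The two table values quoted here can be reproduced directly:

```r
dbinom(0, 3, 0.25)   # P(no albino child) = 0.75^3 = 0.421875
dbinom(3, 3, 0.25)   # P(all three albino) = 0.25^3 = 0.015625
pbinom(2, 3, 0.25)   # P(X <= 2) = 1 - 0.015625 = 0.984375
```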
$$ P(X = 14) = \frac{22!}{14!(22-14)!} \cdot 0.7^{14} \cdot 0.3^{8} = \text{dbinom}(14, 22, 0.7) = 0.1423. $$
This is the value of the density function at 14. The probability of the event of
less than or equal to 13 purines equals the value of the distribution function
at value 13, that is
[Figures 3.1 and 3.2: the binomial density f(x) and distribution function F(x) for n = 22, p = 0.7, plotted over x from 0 to 20.]
By the first line the sequence of integers {1, 2, …, 22} is constructed and by
the second the density function is plotted, where the argument h (as in
type="h") specifies pins. From Figure 3.1 it can be observed that the largest
probabilities occur near the expectation 15.4. The graph in Figure 3.2
illustrates that the distribution function is an increasing step function, with
x on the horizontal axis and P(X ≤ x) on the vertical.
A random sample of size 1000 from the binomial distribution with n = 22
and p = 0.7 can be drawn by the command rbinom(1000,22,0.7). This
simulates the number of purines in 1000 microRNAs each with purine prob-
ability equal to 0.7 and length 22.
Figure 3.4 illustrates the value 0.16 of the distribution function at x = 1.4.
It corresponds to the area of the blue colored surface below the graph of the
density function in Figure 3.3. The probability that the expression values
are larger than 2.4 is
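The computations elided here presumably use the normal distribution function pnorm; a sketch:

```r
pnorm(1.4, mean = 1.9, sd = 0.5)       # P(X <= 1.4), about 0.16
1 - pnorm(2.4, mean = 1.9, sd = 0.5)   # P(X > 2.4), also about 0.16 by symmetry
```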
Figure 3.3: Graph of the normal density with mean 1.9 and standard deviation 0.5.
Figure 3.4: Graph of the normal distribution function with mean 1.9 and standard deviation 0.5.
[Figure residue: the chi-squared density f(x) and distribution function F(x); the shaded area below the density up to x = 8 equals 0.84.]
library(multtest); data(golub)
gol.fac <- factor(golub.cl,levels=0:1, labels= c("ALL","AML"))
x <- golub[1042,gol.fac=="ALL"]
z <- (x-1.90)/0.50
sum(z^2)
pchisq(sum(z^2),27, lower.tail=FALSE)
3.2.3 T-Distribution

The T-distribution has many useful applications for testing hypotheses about
means of gene expression values, in particular when the sample size is lower
than thirty. If the data are normally distributed, then the values of
$\sqrt{n}(\bar{x} - \mu)/s$ follow a T-distribution with $n - 1$ degrees of freedom.
The T-distribution is approximately equal to the normal distribution when the
sample size is thirty.
[Figure residue: the T density f(x) and distribution function F(x), plotted over x from -4 to 4.]
n <- 11
x <- golub[2058, gol.fac=="AML"]
From the above we know that this has a T10 distribution. The probability
that T10 is greater than 1.236324 can be computed, as follows.
This probability corresponds to the area of the blue colored surface below of
the graph of the density function in Figure 3.7. The T distribution function
with ten degrees of freedom is illustrated in Figure 3.8. The probability that
the random variable T10 is between -2 and 2 equals
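The elided computations presumably use the T distribution function pt; a sketch:

```r
1 - pt(1.236324, 10)     # P(T10 > 1.236324), the upper-tail probability
pt(2, 10) - pt(-2, 10)   # P(-2 < T10 < 2)
```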
3.2.4 F-Distribution

The F-distribution is important for testing the equality of two variances. It
can be shown that the ratio of variances from two independent sets of normally
distributed random variables follows an F-distribution. More specifically,
if the two population variances are equal ($\sigma_1^2 = \sigma_2^2$), then
$s_1^2/s_2^2$ follows an F-distribution with $n_1 - 1$, $n_2 - 1$ degrees of
freedom, where $s_1^2$ is the variance of the first set, $s_2^2$ that of the
second, and $n_1$ is the number of observations in the first and $n_2$ in the second.4
> var(golub[1042,gol.fac=="ALL"])/var(golub[1042,gol.fac=="AML"])
[1] 0.7116441
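The left-tail probability of this ratio under the F-distribution with 26 and 10 degrees of freedom (27 ALL and 11 AML patients) can be computed with pf; a sketch:

```r
pf(0.7116441, 26, 10)   # P(F <= 0.71) under equal population variances
```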
4. It is more correct to define $S_1^2/S_2^2$ for certain random variables $S_1^2$ and $S_2^2$; we shall,
however, not bother.
[Figure residue: Figures 3.9 and 3.10 show the F density f(x) and distribution function F(x); the shaded area below the density to the left of 0.71 equals 0.23.]
Figure 3.9 illustrates that this value corresponds to the area of the blue col-
ored surface below the graph of the density function. Figure 3.10 gives the
distribution function. To find the quantile x0.025 use qf(.025,26,10)=0.3861673.
This subject is taken further in Section 4.1.5.
> f<-function(x){dnorm(x,1.9,0.5)}
> plot(f,0,4,xlab="x-axis",ylab="density f(x)")
5. This subsection is solely about plotting and can be skipped without loss of continuity.
This produces the graph of the density function in Figure 3.3. The specification
0,4 defines the interval on the horizontal axis over which f is plotted.
The vertical axis is adapted automatically. We can give the surface under f,
for x running from 0 to 1.4, a nice blue color by using the following.
plot(f,0,4,xlab="x-axis",ylab="density f(x)")
x<-seq(0,1.4,0.01)
polygon(c(0,x,1.4), c(0,f(x),0), col="lightblue")
The basic idea of plotting is to start with a plot and next to add colors, text,
arrows, etc. In particular, the command polygon is used to give the surface
below the graph the color "lightblue". The polygon (surface enclosed by
many angles) is defined by the sequence of points defined as x and f(x).
Table 3.3: Density, mean, and variance of distributions used in this chapter.

Distribution   parameters      density                                                            expectation   variance
Binomial       $n, p$          $\frac{n!}{k!(n-k)!} p^k (1-p)^{n-k}$                              $np$          $np(1-p)$
Normal         $\mu, \sigma$   $\frac{1}{\sigma\sqrt{2\pi}} \exp(-\frac{1}{2}(\frac{x-\mu}{\sigma})^2)$   $\mu$   $\sigma^2$
Chi-squared    df = $m$                                                                           $m$           $2m$
3.4 Exercises

It is important to obtain some routine with the computation of probabilities
and quantiles.
1. Binomial. Let X be binomially distributed with n = 60 and p = 0.4.
Compute the following.
(a) P(X = 24), P(X ≤ 24), and P(X ≥ 30).
(b) P(20 ≤ X ≤ 30) and P(20 ≤ X).
(c) P(20 ≤ X or X ≥ 40), and P(20 ≤ X and X ≤ 10).
(d) Compute the mean and standard deviation of X.
(e) The quantiles x0.025, x0.5, and x0.975.
2. Standard Normal. Compute the following probabilities and quantiles.
(a) P(1.6 < Z < 2.3).
(b) P(Z < 1.64).
(c) P(-1.64 < Z < -1.02).
(d) P(0 < Z < 1.96).
(e) P(-1.96 < Z < 1.96).
(f) The quantiles z0.025, z0.05, z0.5, z0.95, and z0.975.
3. Normal. Compute for X distributed as N (10, 2) the following proba-
bilities and quantiles.
(a) What is the probability that the expression values are smaller than 1.2?
(b) What is the probability that the expression values are between 1.2
and 2.0?
(c) What is the probability that the expression values are between 0.8
and 2.4?
(d) Compute the exact values for the quantiles x0.025 and x0.975 .
(e) Use rnorm to draw a sample of size 1000 from the population and
compare the sample mean and standard deviation with that of the
population.
(a) Take μ = 0 and compute the t-values for the ALL gene expression values. Find the three genes with the largest absolute t-values.
(b) Compute per gene the ratio of the variances for the ALL and the
AML patients. How many are between 0.5 and 1.5?
Now plot the density of the normalized maxima, add the extreme value function f(x) from Pevsner's book, and add the density (dnorm) of the normal distribution. What do you observe?
Chapter 4

Estimation and Inference
Questions that we deal with in this chapter are related to statistically testing
biological hypotheses. Does the mean gene expression over ALL patients
differ from that over AML patients? That is, does the mean gene expression
level differ between experimental conditions? Is the mean gene expression
different from zero? To what extent are gene expression values normally
distributed? Are there outliers among a sample of gene expression values?
How can an experimental effect be defined? How can genes be selected with
respect to an experimental effect? Other important questions are: How can
it be tested whether the frequencies of nucleotide sequences of two genes are
different? How can it be tested whether outliers are present in the data?
What is the probability of a certain micro RNA to have more than a certain
number of purines?
In the foregoing chapters many population parameters were used to define
families of theoretical distributions. In any research (empirical) setting the
specific values of such parameters are unknown so that these must be esti-
mated. Once estimates are available it becomes possible to statistically test
biologically important hypotheses. The current chapter gives several basic
examples of statistical testing and some of its background. Robust type of
testing is briefly introduced as well as an outlier test.
48 CHAPTER 4. ESTIMATION AND INFERENCE
The conclusion from the test is now as follows: if the p-value is larger than
the significance level α, then H0 is not rejected, and if it is smaller than the
significance level, then H0 is rejected.
makes it likely that this gene is not directly related to leukemia. Hence,
we may hypothesize that the population mean of the ALL expression values
equals zero. Accordingly, we test H0: μ = 0 against H1: μ ≠ 0. For the sake
of illustration we shall pretend that the standard deviation σ is known to be
equal to 0.25. The z-value (= 0.001116211) can be computed as follows.
> 2*pnorm(-abs(z.value),0,1)
[1] 0.9991094
Since it is clearly larger than 0.05, we conclude that the null hypothesis of
mean equal to zero is not rejected (accepted).
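The z-value itself can be computed from its definition z = √n (x̄ − μ0)/σ; a minimal sketch with simulated stand-in data (the original example uses the 27 ALL expression values of one gene, which are not reproduced here):

```r
# Sketch of the one-sample z-test with known sigma; rnorm is a stand-in
# for the 27 ALL expression values of the example.
set.seed(123)
x <- rnorm(27, mean = 0, sd = 0.25)
mu0 <- 0; sigma <- 0.25; n <- length(x)
z.value <- sqrt(n) * (mean(x) - mu0) / sigma  # standardized sample mean
p.value <- 2 * pnorm(-abs(z.value))           # two-sided p-value
c(z.value, p.value)
```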
Note that the above procedure implies rejection of the null hypothesis
when z is highly negative or highly positive. More precisely, if z falls in the
region (−∞, z0.025] or [z0.975, ∞), then H0 is rejected. For this reason these
intervals are called rejection regions. If z falls in the interval (z0.025 , z0.975 ),
then H0 is not rejected and consequently this region is called acceptance
region. The situation is illustrated in Figure 4.1.
The interval (z0.025 , z0.975 ) is often named confidence interval, because
if the null hypothesis is true, then we are 95% confident that the observed
z-value falls in it. It is customary to rework the confidence interval into an
interval with respect to μ (Samuels & Witmer, 2003, p. 186). In particular,
the 95% confidence interval for the population mean μ is

    ( x̄ + z0.025 · σ/√n ,  x̄ + z0.975 · σ/√n ).                (4.1)
That is, we are 95% certain³ that the true mean falls in the confidence interval. Such an interval is standard output of statistical software.

³ If we would repeat the procedure sufficiently often.
[Figure 4.1: normal density with the observed z-value; acceptance and rejection regions.]
Example 2. Using the data from Example 1, the 95% confidence interval
given by Equation 4.1 can be computed as follows.⁴
> mean(x)+qnorm(c(0.025),0,1)*sigma/sqrt(n)
[1] -0.0942451
> mean(x)+qnorm(c(0.975),0,1)*sigma/sqrt(n)
[1] 0.09435251
> library(TeachingDemos)
> z.test(x,mu=0,sd=0.25)
data: x
z = 0.0011, n = 27.000, Std. Dev. = 0.250, Std. Dev. of the sample mean
= 0.048, p-value = 0.9991
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
-0.09424511 0.09435251
sample estimates:
mean of x
5.37037e-05
From the z-value, the p-value, and the confidence interval, the conclusion is
not to reject the null-hypothesis of mean equal to zero. This illustrates that
testing by either of these procedures yields equivalent conclusions.
Then 100 samples of size 25 from the N (0, 1) distribution are drawn and for
each of these the confidence interval for the population mean is computed
and represented as a line segment. Apart from sampling fluctuations, the
confidence level corresponds to the percentage of intervals containing the
true mean (colored black) and the significance level corresponds to the
percentage of intervals not containing it (colored red or blue).
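The simulation described above can be sketched as follows; the seed and the resulting coverage count are illustrative, not the ones behind the book's figure:

```r
# Draw 100 samples of size 25 from N(0,1), compute the 95% confidence
# interval for the mean in each (sigma = 1 known), and count how many
# intervals contain the true mean 0.
set.seed(1)
n <- 25
covered <- replicate(100, {
  x <- rnorm(n)
  lower <- mean(x) + qnorm(0.025) / sqrt(n)
  upper <- mean(x) + qnorm(0.975) / sqrt(n)
  lower < 0 & 0 < upper
})
sum(covered)   # close to 95, apart from sampling fluctuations
```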
[Figure: t-density with the quantiles t0.025 and t0.975 marked on the x-axis.]
so that the conclusion is not to reject the null hypothesis of mean equal to
zero.
To see whether the observed t-value belongs to the 95% confidence inter-
val, we compute
Since this interval does contain the t-value, we do not reject the hypothesis
that μ equals zero. The left boundary of the 95% confidence interval for the
population mean can be computed as follows.
> mean(x)+qt(0.025,26)*sd(x)/sqrt(n)
[1] -0.1024562
The 95% confidence interval equals (−0.1025, 0.1026). Since it contains zero,
we do not reject the null-hypothesis.
In daily practice it is much more convenient to use the built-in function
t.test. We illustrate it with the current testing problem.
> t.test(x,mu=0)
data: x
t = 0.0011, df = 26, p-value = 0.9991
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
-0.1024562 0.1025636
sample estimates:
mean of x
5.37037e-05
This yields by one command line the observed t-value, the p-value, and the
95% confidence interval for the population mean μ.
Example 1. Golub et al. (1999) argue that gene CCND3 Cyclin D3 plays
an important role with respect to discriminating ALL from AML patients.
The boxplot in Figure 2.4 suggests that the ALL population mean differs from
that of AML. The null hypothesis of equal means can be tested by the func-
tion t.test and the appropriate factor and specification var.equal=FALSE.
> t.test(golub[1042,] ~ gol.fac, var.equal=FALSE)
When the first group is an experimental group and the second a control
group, then μ1 − μ2 is the experimental effect in the population and x̄ − ȳ that
in the sample. The t-value is the experimental effect in the sample relative
to the standard error. The size of the effect is measured by the p-value in
the sense that it is smaller for larger effects.
If the two population variances are equal, then the testing procedure
simplifies considerably. This is the subject of the next paragraph.
    s²p = ((n − 1)s₁² + (m − 1)s₂²) / (n + m − 2).

Then the t-value can be formulated as

    t = ( x̄ − ȳ − (μ1 − μ2) ) / ( sp √(1/n + 1/m) ).
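The pooled-variance formulas above can be programmed directly and checked against t.test; the two simulated samples below are hypothetical stand-ins for the ALL and AML expression values:

```r
# Pooled-variance t-value computed from the formula, then compared with
# the built-in two-sample t-test with var.equal=TRUE.
set.seed(2)
x <- rnorm(27, 1.9, 0.6)   # stand-in for the ALL values
y <- rnorm(11, 0.6, 0.6)   # stand-in for the AML values
n <- length(x); m <- length(y)
sp2 <- ((n - 1) * var(x) + (m - 1) * var(y)) / (n + m - 2)  # pooled variance
t.value <- (mean(x) - mean(y)) / (sqrt(sp2) * sqrt(1/n + 1/m))
t.value
t.test(x, y, var.equal = TRUE)$statistic   # agrees with the hand computation
```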
Example 1. The null hypothesis for gene CCND3 Cyclin D3 that the
mean of the ALL patients equals that of the AML patients can be tested by
the two-sample t-test using the specification var.equal=TRUE.
sample estimates:
mean in group ALL mean in group AML
1.8938826 0.6355909
From the p-value 6.046 × 10⁻⁸, the conclusion is to reject the null hypothesis
of equal population means. Note that the p-value is slightly smaller than
that of the previous test.
Example 1. The null hypothesis for gene CCND3 Cyclin D3 that the
variance of the ALL patients equals that of the AML patients can be tested
by the built-in function var.test, as follows.
From the p-value 0.4652, the null-hypothesis of equal variances is not re-
jected.
the conclusion follows not to reject the null-hypothesis. This test can also
be conducted by the function binom.test as follows.
The p-value 0.1645 is larger than the significance level 0.05, so that the null
hypothesis is not rejected.
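A sketch of such a call with hypothetical counts (18 purines observed among 22 nucleotides, testing H0: p = 0.7); these numbers are illustrative, not the ones from the example above:

```r
# Exact binomial test of a hypothesized purine fraction p = 0.7 in a
# micro RNA of 22 nucleotides; the counts are hypothetical.
res <- binom.test(18, 22, p = 0.7)
res   # reports the p-value and a confidence interval for the proportion
```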
4.1. STATISTICAL HYPOTHESIS TESTING 59
[Figure 4.3: chi-squared density with the acceptance region to the left of 7.81 and the rejection region to the right.]
data: zyxinfreq
X-squared = 187.0674, df = 3, p-value < 2.2e-16
The package ape is loaded, the Zyxin sequence "X94991.1" is downloaded,
and the frequency table is constructed. The observed frequencies are given
as input to chisq.test which has equal probabilities as the default option.
The q-value equals X-squared and the degrees of freedom df = 3. From the
corresponding p-value, the conclusion is to reject the null hypothesis of equal
probabilities. The testing situation is illustrated in Figure 4.3, where the
red colored surface corresponds to the rejection region (7.81, ∞). Remember
from the previous chapter that the left bound of this rejection interval can
be found by qchisq(0.95, 3). The observed q = 187.0674 obviously falls
far into the right hand side of the rejection region, so that the corresponding
p-value is very close to zero.
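A sketch of the same test with hypothetical nucleotide counts (the actual zyxinfreq is obtained by downloading sequence X94991.1 via the ape package, which requires an internet connection):

```r
# Chi-squared test with equal probabilities (the default) as the null
# hypothesis; the counts below are hypothetical stand-ins for zyxinfreq.
zyxinfreq <- c(a = 410, c = 789, g = 573, t = 394)
cq <- chisq.test(zyxinfreq)   # H0: all four probabilities equal 1/4
cq
qchisq(0.95, 3)               # left bound of the rejection region, 7.81
```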
of ornamental sweet peas. A crossing of B and b yields offspring BB, Bb, and
bb with probability 0.25, 0.50, 0.25. Since Mendel could not distinguish Bb
from BB, his observations theoretically occur with probability 0.75 (BB and
Bb) and 0.25 (bb). To test the null hypothesis H0: (π1, π2) = (0.75, 0.25)
against H1: (π1, π2) ≠ (0.75, 0.25), we use the chi-squared test⁶, as follows.
> pi <- c(0.75,0.25)
> x <-c(5474, 1850)
> chisq.test(x, p=pi)
data: x
X-squared = 0.2629, df = 1, p-value = 0.6081
From the p-value 0.6081, we do not reject the null hypothesis.
data: dat
X-squared = 0.2, df = 1, p-value = 0.6547
Since the p-value is larger than the significance level, the null hypothesis of
independence is not rejected.
Suppose that for another cutoff value we obtain 8 true positives (tp), 2
false positives (fp), 8 true negatives (tn), and 2 false negatives (fn). Then
testing independence yields the following.
data: dat
X-squared = 5, df = 1, p-value = 0.02535
Since the p-value is smaller than the significance level, the null hypothesis of
independence is rejected.
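The 2-by-2 table just described can be tested as follows; note that chisq.test applies the Yates continuity correction by default for 2-by-2 tables, which reproduces the X-squared = 5 above:

```r
# Independence test on the 2-by-2 table with 8 tp, 2 fp, 2 fn, 8 tn.
dat <- matrix(c(8, 2, 2, 8), nrow = 2, byrow = TRUE,
              dimnames = list(c("positive", "negative"), c("true", "false")))
chisq.test(dat)   # X-squared = 5, df = 1, p-value = 0.02535 (Yates corrected)
```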
                significant genes   non-significant genes
Chromosome 1          100                  2000
genome                300                  6000
data: dat
p-value = 0.01912
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
1.029519 1.396922
sample estimates:
odds ratio
1.199960
Since the p-value is smaller than the significance level, the null hypothesis
of odds ratio equal to one is rejected. There are more significant oncogenes
in Chromosome 1 compared to that in the genome. Other examples of the
Fisher test will be given in Chapter 6.
Example 1. To test the hypothesis that the ALL gene expression values
of CCND3 Cyclin D3 from Golub et al. (1999) are normally distributed, the
Shapiro-Wilk test can be used as follows.
Since the p-value is greater than 0.05, the conclusion is not to reject the null
hypothesis that CCND3 Cyclin D3 expression values follow from a normal
distribution. The Anderson-Darling test is part of the nortest package which
probably needs to be installed and loaded first. Running the test on our
CCND3 Cyclin D3 gene expression values comes down to the following.
> library(nortest)
> ad.test(golub[1042,gol.fac=="ALL"])
Hence, the same conclusion is drawn as from the Shapiro-Wilk test. Note
that the p-values from both tests are somewhat low. This is in line with our
observation in Section 2.1.5, based on the Q-Q plot, that the distribution
only approximately resembles the normal. From the normality tests the
conclusion is that the differences in the left tail are not large enough to reject
the null-hypothesis that the CCND3 Cyclin D3 expression values are normally
distributed.
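A sketch of both normality tests on a numeric vector; the simulated values are a stand-in for golub[1042, gol.fac=="ALL"], and ad.test requires the nortest package:

```r
# Shapiro-Wilk (built-in) and Anderson-Darling (nortest) normality tests;
# rnorm is a stand-in for the 27 ALL expression values of CCND3 Cyclin D3.
set.seed(3)
x <- rnorm(27, mean = 1.9, sd = 0.5)
res <- shapiro.test(x)
res   # normality is not rejected if the p-value exceeds 0.05
# library(nortest); ad.test(x)   # Anderson-Darling alternative
```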
deviation s.
> library(outliers)
> grubbs.test(golub[1042, gol.fac=="ALL"])
Since the p-value is smaller than 0.05, the conclusion is to reject the null-
hypothesis of no outliers.
testing (e.g. Lehmann, 2006). To broaden our view we switch from hypotheses about means to those about distributions. An alternative hypothesis
may then be formulated as that the distribution of a first group lies to the
left of a second. To set the scene let the gene expression values of the first
group (x1 to xm) have distribution F and those of the second group (y1 to
yn) distribution G. The null hypothesis is that both distributions are equal
(H0: F = G) and the alternative that these are not; for example that the
xs are smaller (or larger) than the ys. By the two-sample Wilcoxon test the
data x1, …, xm, y1, …, yn are ranked and the rank numbers of the xs are
summed to form the statistic W after a certain correction (Lehmann, 2006).
The idea is that if the ranks of the xs are smaller than those of the ys, then the
sum is small. The distribution of the sum of ranks is known, so that a p-value
can be computed on the basis of which the null hypothesis is rejected if it is
smaller than the significance level α.
Example 1. The null hypothesis that the expression values for gene
CCND3 Cyclin D3 are equally distributed for the ALL patients and the AML
patients can be tested by the built-in function wilcox.test, as follows.
Since the p-value is much smaller than α = 0.05, the conclusion is to reject
the null-hypothesis of equal distributions.
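A minimal sketch of the two-sample Wilcoxon test with small made-up vectors; every value of the first group exceeds every value of the second, so W attains its maximum m · n = 48:

```r
# Two-sample Wilcoxon test on made-up vectors standing in for the ALL and
# AML expression values; complete separation gives the maximal W.
x <- c(2.1, 1.8, 2.4, 1.9, 2.2, 2.0, 1.7, 2.3)   # "ALL" group, m = 8
y <- c(0.5, 0.9, 0.4, 0.8, 0.6, 0.7)             # "AML" group, n = 6
wilcox.test(x, y)   # W = 48, p-value about 0.00067: H0 rejected
```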
> data(golub,package="multtest")
> gol.fac <- factor(golub.cl,levels=0:1, labels= c("ALL","AML"))
> sh <- apply(golub[,gol.fac=="ALL"], 1, function(x) shapiro.test(x)$p.value)
> sum(sh > 0.05)/nrow(golub) * 100
[1] 58.27598
Hence, according to the Shapiro-Wilk test, 58.27% of the ALL gene expression values are normally distributed (in the sense of non-rejection). For
the AML expression values this is 60.73%. It can be concluded that about
forty percent of the genes do not pass the normality test.
The p-value is extracted from the output of the t.test function and stored
in the vector pt. The logical operator & is used to select genes for which the
Wilcoxon p-value is smaller than 0.05 and the absolute difference with the
p-value from the t-test is larger than 0.2. Since there are only two such genes
we can draw the reassuring conclusion that the tests give similar results.
test is recommended.
Because the Wilcoxon p-values are based on ranks many of these are
equal for different genes, so that it is less suitable for ordering in case of
small sample size. On the other hand, it is obviously questionable whether
extremely small differences in p-values produced by the t-test contribute to
biologically relevant gene discrimination. That is, extremely small differences
should not be over-interpreted.
4.4 Exercises
1. Gene CD33. Use grep to find the index of the important gene CD33
among the list of characters golub.gnames. For each test below formulate the null hypothesis, report the p-value, and state your conclusion.
(a) Test the normality of the ALL and AML expression values.
(b) Test for the equality of variances.
(c) Test for the equality of the means by an appropriate t-test.
(d) Is the experimental effect strong?
(a) Test the normality of the expression values of the ALL patients.
(b) Test for the equality of means by an appropriate t-test.
5. Gene selection. Select the genes from the golub data with smallest
two-sample t-test values for which the ALL mean is greater than the
AML mean. Report the names of the best ten. Scan the Golub (1999)
article for genes among the ten you found and discuss their biological
function briefly.
8. Comparing two genes. Consider the gene expression values in row 790
and 66 of the Golub et al. (1999) data.
(a) Produce a boxplot for the ALL expression values and comment on
the differences. Are there outliers?
(b) Compute the mean and the median of the ALL gene expression
values for both genes. Do you observe differences between the genes?
(c) Compute three measures of spread for the ALL expression values
for both genes. Do you observe differences between the genes?
(d) Test the normality of the ALL gene expression values for both
genes by the Shapiro-Wilk and Anderson-Darling tests.
9. Normality tests for gene expression values of the Golub et al. (1999)
data. Perform the Shapiro-Wilk normality test separately for the ALL
and AML gene expression values. What percentage passed the normal-
ity test separately for the ALL and the AML gene expression values?
What percentage passes both tests?
10. Two-sample tests on gene expression values of the Golub et al. (1999)
data.
4.4. EXERCISES 71
(a) Perform the two-sample Welch t-test and report the names of the
ten genes with the smallest p-values.
(b) Perform the Wilcoxon rank test and report the names of the ten
genes with the smallest p-values.
(a) Program the two-sample t-test with equal variances and illustrate
it with the expression values of row 1024 of the Golub et al. (1999)
data.
(b) The value of W in the two-sample Wilcoxon test equals the sum
of the ranks of Group 1 minus n(n + 1)/2, where n is the number
of gene expression values in Group 1. Program this and illustrate
it with the expression values of row 1024 of the Golub et al. (1999)
data.
(c) The value of W in the two-sample Wilcoxon test equals the number of pairs with xi > yj, where xi, yj are values from Group 1 and
Group 2, respectively. Program this and illustrate it with the expression
values of row 1024 of the Golub et al. (1999) data.
Chapter 5
Linear Models
We have seen that the t-test can be used to discover genes with different
means in the population with respect to two groups of patients. If, however,
there are three groups of patients, the question arises how genes can
be selected having the largest differential expression between group means
(experimental effect). A technique making this possible is an application of
the linear model and is called analysis of variance. It is frequently applied in
bioinformatics.
The validity of the technique is based on the assumption that the gene
expression values are normally distributed and have equal variance across
groups of patients. It is of importance to investigate these assumptions be-
cause it either reassures our confidence in the conclusions or it indicates that
alternative tests should be used.
In this chapter the linear model will briefly be explained. The main focus,
however, is on application of the linear model for testing the hypothesis that
three or more group means are equal. Several illustrations of analyzing gene
expression data will be given. It will be explained how the assumptions about
normality and equal variances (homogeneity) can be investigated and what
alternatives can be used in case either of these does not hold. The somewhat
technical concepts of model matrix and contrast matrix are explained
because these are useful for several applications in the next chapter.
74 CHAPTER 5. LINEAR MODELS
    Yi = β xi + εi ,   for i = 1, …, n,

    Yi = β1 + xi β2 + εi ,   for i = 1, …, n,

so that the model part represents a straight line with intercept β1 and
slope β2. Given data points y1, …, yn and x1, …, xn, a best fitting line
through the data can easily be computed by least squares estimation of the
intercept and slope. A nice application to explore this is by the function
put.points.demo() from the TeachingDemos package. It allows points to
be added to and deleted from a plot, which interactively computes estimates
for the slope and the intercept given the data. By choosing the points more or
less on a horizontal line, the slope will be near zero. By choosing the points
nearly vertical, the slope will be large. By choosing a few gross errors in the
data it can be observed that the estimates are not robust against outliers.
In order to handle gene expression data for three or more groups of patients we need to extend the model. The idea simply is to increase the
number of weights to the number of groups k, so that we obtain the weights
β1, …, βk and the corresponding design values xi1, …, xik. The systematic part of the model consists of a weighted sum of these design values:
5.1. DEFINITION OF LINEAR MODELS 75
The design values xij for Patient i in Group j are collected in the so-called
design matrix denoted by X. In particular, the design value xij is chosen
to be equal to 1 if Patient i belongs to Group j and zero if (s)he does not.
By this choice it becomes possible to use linear model estimation for testing
hypotheses about group means. This will be illustrated by an example.
9 0 0 1
10 0 0 1
11 0 0 1
12 0 0 1
The notation y~a-1 represents a model equation, where -1 means to skip
the intercept or general constant.¹ In this situation, the weights (β1, β2, β3)
of the model specialize to the population means (μ1, μ2, μ3). The model for
the first gene expression value of Group 1 is Y1 = μ1 + ε1, for the second
expression value of Group 1 it is Y2 = μ1 + ε2, for the first member of Group
2 it is Y5 = μ2 + ε5, and for the first member of Group 3 it is Y9 = μ3 + ε9.
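The full design matrix of this example, of which only the last four rows are printed above, can be displayed with model.matrix; a short sketch:

```r
# Design matrix for twelve patients in three groups of four, without an
# intercept, matching the model formula y ~ a - 1.
a <- gl(3, 4)                  # factor: groups 1, 2, 3 with four members each
X <- model.matrix(~ a - 1)     # -1 drops the intercept
X                              # 12 rows, one indicator column per group
colSums(X)                     # 4 4 4: the group sizes
```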
Recall that population means are generally estimated by sample means.
Similarly, in the current setting, estimation of the linear model comes down
to estimation of group means for which there are one-sample t-type of tests
available (see e.g. Rao & Toutenburg, 1995; Samuels & Witmer, 2003). To
illustrate this we employ the estimation function lm and ask for a summary.
> summary(lm(y ~ a - 1))
Coefficients:
Estimate Std. Error t value Pr(>|t|)
a1 2.0000 0.4082 4.899 0.000849 ***
a2 8.0000 0.4082 19.596 1.09e-08 ***
a3 12.0000 0.4082 29.394 2.98e-10 ***
The output in the first column gives the estimated mean per group. The sec-
ond gives the standard error of each mean, the third the t-value (the estimate
divided by the standard error), and the last the corresponding p-values. From
the p-values the conclusion follows to reject the null hypotheses H0: μj = 0
for Group index j running from 1 to 3.
Using the above design matrix, the model for the gene expression values
from different groups can be written as
    Yij = μj + εij ,   where εij is distributed as N(0, σ²),

and Yij is the expression of Person i in Group j, μj the mean of Group j, and
εij the error of Person i in Group j. The error is assumed to be normally
distributed with zero mean and variance equal for different persons.
¹ See also Chapter 11 of the manual An Introduction to R.
5.2. ONE-WAY ANALYSIS OF VARIANCE 77
The above illustrates that the linear model is useful for testing hypotheses
about group means. In bioinformatics the linear model is applied to many
sets of gene expressions, so that it is of great importance to have an overall
test for the equality of means.
where g is the number of groups. The sum of squares between (SSB) is the
sum of squared deviations of the group means with respect to the total
mean, that is
    SSB = Σ_{j=1}^{g} Σ_{i=1}^{n} (ȳ_j − ȳ)² = n Σ_{j=1}^{g} (ȳ_j − ȳ)².
Example 1. Let's continue with the data from the previous example.
Recall that the data of Group 1 are 2, 3, 1, 2, those of Group 2 are 8, 7, 9,
8, and of Group 3 are 11, 12, 13, 12. The number of expression values per
group n = 4, the total number of data values N = 12, and the number of
groups g = 3.
To load the data, to construct the corresponding factor, and to compute
the group means one may use the following.
> y <- c(2,3,1,2, 8,7,9,8, 11,12,13,12)
> a <- gl(3,4)
> gm <- as.numeric(tapply(y, a, mean))
> gm
[1] 2 8 12
Thus we find that y 1 = 2, y 2 = 8, and y 3 = 12. These group means are
now collected in the vector gm. The grand mean y can be computed by
mean(y)=7.333333. An elementary manner to compute the sums of squares
between SSB is by
gm <- as.numeric(tapply(y, a, mean))
g <- 3; n <- 4; N <-12; ssb <- 0
for (j in 1:g) {ssb <- ssb + (gm[j]- mean(y))^2}
SSB <- n*ssb
This results in SSB = 202.6667. In a similar manner the sums of squares
within (SSW ) and the f -value can be computed, as follows.
> SSW <- 0
> for (j in 1:g) {SSW <- SSW + sum((y[a==j]-gm[j])^2)}
> f <- (SSB/(g-1))/(SSW/(N-g))
This results in SSW = 6 and an observed f-value equal to 152. Hence, the
overall p-value can be computed by 1 - pf(f, g-1, N-g), which equals
1.159e-07. Since this is smaller than the significance level 0.05, the conclusion
is to reject the null hypothesis of equal means.
The built-in function anova can be used to extract the so-called analysis
of variance table from an lm object.
Response: x
Df Sum Sq Mean Sq F value Pr(>F)
fact 2 202.667 101.333 152 1.159e-07 ***
Residuals 9 6.000 0.667
The estimated intercept is the mean of Group 1 (Level 1). The factLevel
2 is the difference in means between Group 2 (Level 2) and Group 1 and
factLevel 3 is the difference in means between Group 3 and Group 1. By
t-tests the null-hypothesis is tested that the mean of Group 1 is zero, the
difference in means between Group 2 and Group 1 is zero and the difference
in means between Group 3 and Group 1 is zero. That is, the null-hypotheses
are H0: μ1 = 0, H0: μ2 − μ1 = 0, and H0: μ3 − μ1 = 0. Since the p-values
that correspond to the t-values are smaller than the significance level 0.05, all
null-hypotheses are rejected. The last line of the output gives the f -value,
the degrees of freedom, and the corresponding overall p-value. The latter
equals that of ANOVA.
Before we analyze real gene expression data it seems well to give an example where the means do not differ.

Example 3. Let's sample data from the normal distribution with mean
1.9 and standard deviation 0.5, corresponding to three groups of patients that
do not possess any differences between groups.
Note that the $Pr[1] operator extracts the p-value from the list generated
by the anova function. The p-value implies the conclusion not to reject the
null-hypothesis of equal means, which is consistent with the data generation
process.
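A sketch of such a simulation (the seed and the group sizes are illustrative):

```r
# Example 3 sketched: three groups of four values from N(1.9, 0.5) with no
# group differences; extract the overall p-value from the ANOVA table.
set.seed(5)
y <- rnorm(12, mean = 1.9, sd = 0.5)
a <- gl(3, 4)
p.anova <- anova(lm(y ~ a))$Pr[1]   # $Pr partially matches "Pr(>F)"
p.anova   # usually well above 0.05 under the null hypothesis
```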
Figure 5.1: Plot of SKI-like oncogene expressions for three patient groups. Figure 5.2: Plot of Ets2 expression values for three patient groups.
From the overall p-value 1.207 × 10⁻⁷ of the f-test the conclusion follows to
reject the hypothesis of equal means. From the t-tests we conclude that the
mean of B1 differs from zero and the differences between B2 and B1 as well
as between B3 and B2 are unequal to zero. That is, the population means
of Group B1, B2, and B3 do differ.
From the overall p-value 0.4823, the conclusion is not to reject the null hypothesis of equal means. More specifically, the null-hypothesis H0: μ1 = 0
is rejected, but from the p-value 0.636 the H0: μ2 − μ1 = 0 is not rejected,
5.3. TWO-WAY ANALYSIS OF VARIANCE 83
where αi is the mean of Group i indicated by the first factor, βj the mean
of Group j indicated by the second factor, (αβ)ij the interaction effect, and
εijk the error, which is distributed according to N(0, σ²). If the means of
the groups of the first factor differ, then there is a main effect of the first
factor, which is expressed in a p-value smaller than 0.05. Similarly, in case
the means of the groups of the second factor differ, there is a main effect of
the second factor, expressed in a p-value smaller than 0.05. Two-way analysis
of variance will briefly be illustrated.
Response: exprs(ALLBm)["32069_at", ]
Df Sum Sq Mean Sq F value Pr(>F)
facB 1 1.1659 1.1659 4.5999 0.0352127 *
facmolb 1 3.2162 3.2162 12.6891 0.0006433 ***
facB:facmolb 1 1.1809 1.1809 4.6592 0.0340869 *
Residuals 75 19.0094 0.2535
First the patients with B-cell ALL are selected and the molecular biology
of the cancer is restricted to BCR/ABL or NEG. Next the factors are constructed
to group the patients. From the p-values in the analysis of variance table it
can be concluded that there are two main effects as well as an interaction effect.
One may also ask for a summary of the individual effects.
Call:
lm(formula = exprs(ALLBm)["32069_at", ] ~ facB * facmolb)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.7649 0.1073 63.026 < 2e-16 ***
facBB34 -0.5231 0.1686 -3.103 0.0027 **
facmolbNEG -0.6020 0.1458 -4.128 9.4e-05 ***
facBB34:facmolbNEG 0.5016 0.2324 2.159 0.0341 *
In bioinformatics the question often arises how many probes have significant
effects. In this case we may compute the number of probes with significant
main as well as interaction effects.
The three p-values per probe are collected in a matrix. This matrix is trans-
posed so that the columns correspond to the p-values and the rows to the
probes. Using the logical AND (&) operator and summing the TRUE values
yield 47 probes with significant main and interaction effects.
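A sketch of this construction with simulated expressions in place of the ALL data; the factors, dimensions, and counts are illustrative:

```r
# Three p-values per "probe" (two main effects and the interaction) from a
# two-factor ANOVA, collected row-wise and combined with the logical AND.
set.seed(6)
f1 <- gl(2, 10)                    # first factor: two groups of ten
f2 <- gl(2, 5, length = 20)        # second factor, crossed with the first
expr <- matrix(rnorm(50 * 20), nrow = 50)   # 50 "probes", 20 "patients"
pvals <- t(apply(expr, 1, function(y)
  anova(lm(y ~ f1 * f2))[["Pr(>F)"]][1:3]))  # main, main, interaction
sum(pvals[, 1] < 0.05 & pvals[, 2] < 0.05 & pvals[, 3] < 0.05)
```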
> data(ALL,package="ALL");library(ALL)
> ALLB123 <- ALL[,ALL$BT %in% c("B1","B2","B3")]
> y <- exprs(ALLB123)["1866_g_at",]
> shapiro.test(residuals(lm(y ~ ALLB123$BT)))
Breusch-Pagan test
From the p-value 0.01271, the conclusion follows to reject the null hypothesis
of equal variances (homoscedasticity).
> data(ALL,package="ALL");library(ALL)
5.5. ROBUST TESTS 87
From the p-value 2.717 × 10⁻⁵, the conclusion follows to reject the hypothesis
of equal means.
> data(ALL,package="ALL");library(ALL)
> ALLB123 <- ALL[,ALL$BT %in% c("B1","B2","B3")]
> y <- exprs(ALLB123)["1866_g_at",]
> kruskal.test(y ~ ALLB123$BT)
data: y by ALLB123$BT
Kruskal-Wallis chi-squared = 30.6666, df = 2, p-value = 2.192e-07
By the apply functionality the p-values can easily be computed for all
12625 gene expression values of the ALL data.
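A sketch of the apply construction, with a simulated matrix standing in for exprs(ALLB123) and a factor standing in for ALLB123$BT:

```r
# Kruskal-Wallis p-values computed row-wise over an expression matrix;
# 100 simulated "genes" over 12 "patients" in three groups of four.
set.seed(8)
fac <- gl(3, 4)
expr <- matrix(rnorm(100 * 12), nrow = 100)
kw.p <- apply(expr, 1, function(x) kruskal.test(x ~ fac)$p.value)
sum(kw.p < 0.05)   # about 5 rows expected significant by chance alone
```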
5.7 Exercises
1. Analysis of gene expressions of B-cell ALL patients.
(a) Construct a data frame containing the expression values for the
B-cell ALL patients in stage B, B1, B2, B3, B4 from the ALL data.
(b) How many patients are in each group?
(c) Test the normality of the residuals from the linear model used
for analysis of variance for all gene expression values. Collect the
p-values in a vector.
(d) Do the same for the homoscedasticity assumption.
(e) How many gene expressions are normally distributed and how
many homoscedastic? For how many do both hold?
B-cell ALL patients in stage B, B1, B2, B3, B4 from the ALL data.
3. Finding the ten best genes among gene expressions of B-cell ALL
patients. Continue with the previous data frame containing the expres-
sion values for the B-cell ALL patients in stage B, B1, B2, B3, B4 from
the ALL data.
(a) Print the p-values and the corresponding (Affymetrix) gene identifiers of the ten best genes from ANOVA.
(b) Do the same for the p-values from the Kruskal-Wallis test.
(c) Use the function intersect to find identifiers in both sets.
(a) Construct a data matrix with 10000 rows and 9 columns with data
from the normal distribution with mean zero and variance equal to
one. Such a matrix simulates gene expressions without differences
between groups (sometimes called negatives).
(b) Construct a factor for three groups each with three values.
(c) How many p-values are smaller than the significance level α = 0.05?
(d) If the p-value is smaller than the significance level, then the conclusion is that there is an experimental effect (a positive). How many
false positives do you expect and how many did you observe?
(e) Construct a matrix with 10000 rows and 9 columns with normally
distributed data with means zero, one, and two and variance equal
to one. Assume again that there are three groups, each with three
data values. This data matrix simulates gene expressions with differences between groups (sometimes called positives). Use ANOVA
and Kruskal-Wallis to find the number of significant genes (true
positives). Report the number of true positives and false negatives.
Chapter 6

Micro Array Analysis
92 CHAPTER 6. MICRO ARRAY ANALYSIS
Example 1. We will start with a built-in data set called MLL.B from the
ALLMLL package. To load it and to retrieve basic information use
> library(affy)
> data(MLL.B, package = "ALLMLL")
> MLL.B
It is very useful to print the structure of the object by str(MLL.B) and its slot names by slotNames.
> slotNames(MLL.B)
[1] "cdfName" "nrow" "ncol"
[4] "assayData" "phenoData" "featureData"
[7] "experimentData" "annotation" ".__classVersion__"
> dim(exprs(MLL.B))
[1] 506944 20
> annotation(MLL.B)
[1] "hgu133b"
> probeNames(MLL.B)[1:10]
[1] "200000_s_at" "200000_s_at" "200000_s_at" "200000_s_at" "200000_s_at"
[6] "200000_s_at" "200000_s_at" "200000_s_at" "200000_s_at" "200000_s_at"
Note that the probe names are the same as those obtained by geneNames.
The PM and MM values are collected by the functions pm and mm. To print
the PM values of the first four out of the sixteen rows of the probe with
identifier 200000_s_at we may use the following.
> pm(MLL.B,"200000_s_at")[1:4,1:3]
JD-ALD009-v5-U133B.CEL JD-ALD051-v5-U133B.CEL JD-ALD052-v5-U133B.CEL
200000_s_at1 661.5 321.5 312.5
200000_s_at2 838.8 409.3 395.3
200000_s_at3 865.3 275.5 341.3
200000_s_at4 425.8 253.5 196.8
By the function matplot a quick view of the variability of the data within and between probes can be obtained.
> matplot(pm(MLL.B,"200000_s_at"),type="l", xlab="Probe No.",
+ ylab="PM Probe intensity")
From the resulting plot in Figure 6.1 it can be observed that the variability
is substantial.
Density plots of the log of the probe values can be obtained by hist(MLL.B). From the density plot of the log intensity data in Figure 6.2 it can be seen that these are quite skewed to the right. The script to program such plots is quite brief.

Figure 6.1: Matplot of intensity values for a probe of MLL.B. Figure 6.2: Density of MLL.B data.
> MAplot(MLL.B,pairs=TRUE, plot.method= "smoothScatter")
> image(MLL.B)
Example 3. In the sequel we shall frequently work with the ALL data
from the ALL package of Bioconductor. Here the data set is briefly introduced
(see also Section 1.1) and further processing steps are illustrated. The raw
data have been jointly normalized by RMA and are available in the form of an
exprSet object. 12625 gene expression values are available from microarrays
of 128 different persons suffering from acute lymphoblastic leukemia (ALL).
A number of interesting phenotypical co-variates are available. For instance,
the ALL$mol variable has TRUE/FALSE values for each of the 128 patients
depending on whether a reciprocal translocation occurred between the long
arms of Chromosomes 9 and 22. This is causally related to chronic and acute
leukemia. One can also ask for table(ALL$BT) to obtain an overview of
the numbers of patients who are in certain phases of the disease. See also
the general help ?ALL for further information on the data or the article by
Chiaretti et al. (2004).
By this script the patients are selected with assigned molecular biology
equal to ALL1/AF4. Then ALL1 is copied in order to overwrite the expression
values in a later stage. The median and the MAD are computed per column
by the specification 2 (the column index) in the apply function. Then the first sweep function subtracts the medians from the expression values and the second divides these by the corresponding MAD. By comparing the box plots in
Figure 6.3 and 6.4 the effect of preprocessing can be observed. The medians
of the preprocessed data are equal to zero and the variation is smaller due
to the division by their MAD. Note that by box plotting a data frame a fast
overview of the distributions of columns in a data frame is obtained.
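The script referred to above does not survive in this chunk; the following is a hedged reconstruction consistent with the description (the object names ALLB and ALL1 are assumptions):

```r
library("ALL"); data(ALL)
ALLB <- ALL[, ALL$mol.biol == "ALL1/AF4"]   # select the ALL1/AF4 patients
ALL1 <- exprs(ALLB)                         # copy of the expression values
meds <- apply(ALL1, 2, median)              # per-column medians
mads <- apply(ALL1, 2, mad)                 # per-column MADs
ALL1 <- sweep(ALL1, 2, meds)                # first sweep: subtract the medians
ALL1 <- sweep(ALL1, 2, mads, FUN = "/")     # second sweep: divide by the MADs
boxplot(as.data.frame(exprs(ALLB)))         # raw values, as in Figure 6.3
boxplot(as.data.frame(ALL1))                # preprocessed values, as in Figure 6.4
```
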
[Figures 6.3 and 6.4: box plots per patient (X04006, X16004, X24005, X28028, X31007) of the expression values before and after preprocessing.]
library("genefilter")
f1 <- function(x)(IQR(x)>0.5)
f2 <- pOverA(.25, log2(100))
f3 <- function(x) (median(2^x) > 300)
f4 <- function(x) (shapiro.test(x)$p.value > 0.05)
f5 <- function(x) (sd(x)/abs(mean(x))<0.1)
f6 <- function(x) (sqrt(10)* abs(mean(x))/sd(x) > qt(0.975,9))
ff <- filterfun(f1,f2,f3,f4,f5,f6)
library("ALL"); data(ALL)
selected <- genefilter(exprs(ALL[,ALL$BT=="B"]), ff)
After running this script and using sum(selected) one obtains 317 genes
that pass the combined filter. The first function returns TRUE if the interquartile range is larger than 0.5, the second if 25% of the gene expression values are larger than 6.643856, the third if the median of the expression values taken as powers to the base two is larger than 300, the fourth if it passes the Shapiro-Wilk normality test, the fifth if the coefficient of variation is smaller than 0.1, and the sixth if the one-sample t-value is significant. Filter steps can also be applied per patient group, as the following script illustrates.
library("genefilter");library("ALL"); data(ALL)
patientB <- factor(ALL$BT %in% c("B","B1","B2","B3","B4"))
f1 <- function(x) (shapiro.test(x)$p.value > 0.05)
f2 <- function(x) (t.test(x ~ patientB)$p.value < 0.05)
sel1 <- genefilter(exprs(ALL[,patientB==TRUE]), filterfun(f1))
sel2 <- genefilter(exprs(ALL[,patientB==FALSE]), filterfun(f1))
sel3 <- genefilter(exprs(ALL), filterfun(f2))
selected <- sel1 & sel2 & sel3
ALLs <- ALL[selected,]
This gives 1817 genes which pass the three filters. For these genes it
holds that the expression values for B-cell ALL patients as well as for T-cell
ALL patients are normally distributed (in the sense of non-rejection). A
fundamental manner to visualize how the genes are divided among the filters is by a Venn diagram.
library(limma)
x <- matrix(as.integer(c(sel1,sel2,sel3)),ncol = 3,byrow=FALSE)
colnames(x) <- c("sel1","sel2","sel3")
vc <- vennCounts(x, include="both")
vennDiagram(vc)
From the resulting Venn diagram in Figure 6.5 it can be seen that 1817 genes
pass all three filters, 1780 genes pass none, 3406 genes pass the normality
tests but not the t-test filter, etc.
[Figure 6.5: Venn diagram of the numbers of genes passing the filters sel1, sel2, and sel3 (1817 pass all three, 1780 pass none).]
Observe that the contrast matrix specifies the difference between the levels
B and B1 as well as between B1 and B2. It can be implemented as follows.
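The implementation script is lost at the page break; a hedged reconstruction with limma follows, assuming a data set ALLB123 restricted to the stages B, B1, and B2 (the object names and stage selection are assumptions based on the surrounding text):

```r
library("limma"); library("ALL"); data(ALL)
ALLB123 <- ALL[, ALL$BT %in% c("B", "B1", "B2")]   # patients in stages B, B1, B2
facB123 <- factor(ALLB123$BT)                      # unused levels are dropped
design.ma <- model.matrix(~ 0 + facB123)           # design matrix
colnames(design.ma) <- c("B", "B1", "B2")
cont.ma <- makeContrasts(B-B1, B1-B2, levels = c("B", "B1", "B2"))
fit <- lmFit(exprs(ALLB123), design.ma)
fit1 <- eBayes(contrasts.fit(fit, cont.ma))        # moderated statistics
toptabcon <- topTable(fit1, number = 5, adjust.method = "fdr")
```
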
Here, we have applied a method called false discovery rate (fdr) which increases the p-values somewhat in order to reduce the number of false positives. The number of genes requested equals 5.
library("annaffy");library("hgu95av2.db")
anntable <- aafTableAnn(as.character(toptabcon$ID), "hgu95av2.db", aaf.handler())
saveHTML(anntable, "ALLB123.html", title = "B-cell 012 ALL")
> pmAbst2HTML(absts[[1]],filename="pmon1389_at.html")
The list idl contains 8 members of which only the first is printed to the
screen. By changing GOID into Ontology more specific information pertaining
to ontology is extracted. From the annotate package we may now select the
GO numbers which are related to a biological process.
> library(annotate)
> getOntology(go1389,"BP")
[1] "GO:0006508" "GO:0007267"
There are various types of evidence such as: inferred from genetic interaction
(IGI), inferred from electronic annotation (IEA), traceable author statement
(TAS), etc. Per GO identifier the type of evidence can be obtained.
> getEvidence(go1389)
GO:0004245 GO:0005886 GO:0005887 GO:0006508 GO:0007267 GO:0008237 GO:0008270
"IEA" "TAS" "TAS" "TAS" "TAS" "TAS" "IEA"
GO:0046872
"IEA"
> GOMFPARENTS$"GO:0003700"
isa isa
"GO:0003677" "GO:0030528"
> GOMFCHILDREN$"GO:0003700"
isa
"GO:0003705"
Indeed, you may end up with many genes, useful for further analysis.
We combine this with the previous filter. For this we need the annota-
tion package used in the stage of data collection. This can be obtained by
annotation(ALL). First we define a function (Gentleman et al., 2005, p. 123) to collect appropriate GO numbers from the environment GOTERM.
library("GO"); library("annotate"); library("hgu95av2.db")
GOTerm2Tag <- function(term) {
GTL <- eapply(GOTERM, function(x) {grep(term, x@Term, value=TRUE)})
Gl <- sapply(GTL, length)
names(GTL[Gl>0])
}
> GOTerm2Tag("transcriptional repressor")
[1] "GO:0016564" "GO:0016565" "GO:0016566" "GO:0017053"
The functions eapply and sapply search an environment such as GOTERM by grep for matches of a specified term. A precaution is taken to select only those names which are not empty. This gives the GO terms, which can now be translated to probe names of the ALLs data.
tran1 <- hgu95av2GO2ALLPROBES$"GO:0016564"
tran2 <- hgu95av2GO2ALLPROBES$"GO:0016566"
tran3 <- hgu95av2GO2ALLPROBES$"GO:0017053"
tran <- c(tran1,tran2,tran3)
inboth <- tran %in% row.names(exprs(ALLs))
ALLtran <- ALLs[tran[inboth],]
The GO translated probe names are intersected with the row names of the
data giving the logical variable inboth. The variable tran[inboth] gives
the ids by which genes can be selected. Next, gene ids for which inboth
equals TRUE are selected and the corresponding data are collected in the data
frame ALLtran. More information can be obtained by GOTERM$"GO:0016564". By dim(exprs(ALLtran)) it can be observed that 26 genes which passed the normality filter are related to transcriptional repression.
After selecting genes by a t-test one may wonder whether genes with significant p-values occur more often within a certain chromosome. To test for such over- or under-representation the Fisher test is very useful (see Section 4.1.7).
> fisher.test(f)
data: f
p-value = 4.332e-11
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
1.757949 2.748559
sample estimates:
odds ratio
2.206211
> chisq.test(f)
data: f
X-squared = 52.3803, df = 1, p-value = 4.573e-13
6.12 Exercises
1. Gene filtering on normality per group of B-cell ALL patients.
(a) Construct a data frame containing the expression values for the
B-cell ALL patients in stage B, B1, B2, B3, B4 from the ALL data.
(b) Construct the design matrix and an appropriate contrast matrix.
(c) Compute the twenty best genes by topTable.
(d) Collect information on the twenty best genes in an HTML page.
3. Finding a row number. Use grep to find the row number of gene
1389_at. Hint: Use row.names or featureNames.
5. Remission achieved. For the ALL data from the ALL library the patients are checked for achieving remission. The variable ALL$CR has values CR (became healthy) and REF (did not respond to therapy; remained ill).
(f) Do the Fisher test on the number of oncogenes out of the total
versus the number of significant oncogenes out of the selected.
6. Gene filtering of ALL data. The data are in the library called ALL.
The persons with T-cell leukemia who are in stages T2 and T3 can be selected by the variable ALL$BT. You may use the function table
to find the frequencies of the patient types and leukemia stages. To
answer the questions below functions from the library genefilter are
helpful.
(a) Program a gene filter step separately for T2 and T3 patients such
that only those genes pass which are normally distributed.
(b) Program a second filter step which passes only those genes with a significant p-value from the two-sample t-test.
(c) How many genes pass all filter steps?
(d) How many genes pass normality?
7. Stages of B-cell ALL in the ALL data. Use the limma package to answer
the questions below.
(a) Select the persons with B-cell leukemia who are in stages B1, B2, B3, and B4.
(b) What type of contrast matrix would you like to suggest in this
situation? Give its code.
(c) Perform analysis of variance to test the hypothesis of equal pop-
ulation means. Use the Benjamini & Hochberg (1995) (BH)
adjustment method for the false discovery rate and topTable to
report the five best genes.
(d) For how many genes is the null-hypothesis to be rejected?
(a) Download GDS486 and transform it into eset form. Here we meet
a missing data problem. A manner to solve it is as follows. Use
the function function(x) sum(is.na(x)) in apply on the rows
to count the number of missing values per row. Select the rows
without missing value to perform a two-sample t-test with the
(a) Select the patients on the covariate mol.biol with values ALL1/AF4,
BCR/ABL, and NEG.
(b) Collect the ANOVA p-values with contrast between NEG and ALL1/AF4,
and between NEG and BCR/ABL. Report the number of significant
affy IDs and the total. Hint: Re-order the columns into NEG,
ALL1/AF4, and BCR/ABL.
(c) Find the GO IDs referring to the term protein-tyrosine kinase
since it mediates many steps due to BCR/ABL translocation.
(d) Select the affy IDs corresponding to the GO IDs and report its
number and the number of significant genes.
(e) Perform the Fisher exact test for the hypothesis that the odds ratio equals one.
Chapter 7

Cluster Analysis and Trees
Given the expression values of several genes, a problem which often arises is to find genes which are similar or close. Genes whose expressions lie within a small distance of each other may have similar functions and may be potentially interesting for further research. In order to discover genes which form a group, several methods have been developed, collectively called cluster analysis. These methods are based on a distance function and an algorithm to join data points into clusters. The so-called single linkage cluster analysis is intuitively appealing and often applied in bioinformatics. By this method several clusters of genes can be discovered without specifying the number of clusters beforehand. The latter is necessary for another method, called k-means cluster analysis. Single linkage analysis produces a tree which represents similar genes as close leaves and dissimilar ones on different edges.
Another measure to investigate the similarity or dependency of pairs of gene expressions is the correlation coefficient. Various examples of applications
will be given. It prepares the way to searching a data set for directions of
large variance. That is, since gene expression data sets tend to be large,
it is of importance to have a method available which discovers important
directions in the data. A frequently used method to find such directions is
that by principal components analysis. Its basic properties will be explained
as well as how it can be applied in combination with cluster analysis.
In applications where it is difficult to formulate distributional assumptions about the statistic, it may still be of importance to construct a confidence interval.
It will be illustrated by several examples how the bootstrap can be applied to
construct 95% confidence intervals. Many examples are given to clarify the
application of cluster analysis and principal components analysis.
7.1 Distance
The concept of distance plays a crucial role in all types of cluster analysis.
For real numbers a and b a distance function d is defined as the absolute
value of their difference
    d(a, b) = |a − b| = √((a − b)²).
In other words, d(a, b) = d(b, a), the distance between a and b equals that
between b and a. Furthermore, it holds for all points c between a and b that
d(a, b) = d(a, c) + d(c, b). For all points c not between a and b, it follows that
d(a, b) < d(a, c) + d(c, b). The latter two notions can be summarized by the
so-called triangle inequality. That is, for all real c it holds that

    d(a, b) ≤ d(a, c) + d(c, b).

Directly going from a to b is shorter than via c. Finally, the distance between
two points a and b should increase as these move further apart.
For the situation where gene expression values for several patients are
available, it is of importance to define a distance for vectors of gene expres-
sions such as a = (a1 , , an ) and b = (b1 , , bn ). We shall concentrate
mainly on the Euclidian distance, which is defined as the root of the sum of
the squared differences
    d(a, b) = √( Σ_{i=1}^{n} (a_i − b_i)² ).
Since the differences are squared, it is immediate that d(a, b) = d(b, a): the distance from a to b equals that from b to a. For c = (c1, c2) = (2, 2) we have that d(a, c) = 2 and d(b, c) = √(2² + 3²) = √13. Hence,

    d(a, b) = 5 < 2 + √13 = d(a, c) + d(b, c),

so that the triangle inequality is strict. This is in line with our intuitive idea that the road directly from a to b is shorter than from a to b via c.
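As a quick check of the arithmetic, the distances can be computed in R. The coordinates a = (0, 2), b = (4, 5), and c = (2, 2) are assumptions, chosen only to be consistent with the distances quoted in the text:

```r
# a, b, c0 are assumed coordinates consistent with the worked example
a <- c(0, 2); b <- c(4, 5); c0 <- c(2, 2)
d <- function(u, v) sqrt(sum((u - v)^2))     # Euclidean distance
d(a, b)                                      # 5
d(a, c0)                                     # 2
d(b, c0)                                     # sqrt(13), about 3.61
d(a, b) < d(a, c0) + d(b, c0)                # TRUE: the inequality is strict
```
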
Example 5. Finding the ten closest genes to a given one. After selecting
certain genes it often happens that one wants to find genes which are close
to the selected ones. This can be done with the genefinder functionality
by specifying either an index or a name (consistent with the geneNames of
the exprSet). To find genes from the ALL data (Chiaretti et al., 2004) close
to the MME expression values of the probe with identifier 1389_at, we may
use the following.
library("genefilter"); library("ALL"); data(ALL)
closeto1389_at <- genefinder(ALL, "1389_at", 10, method = "euc")
closeto1389_at[[1]]$indices
round(closeto1389_at[[1]]$dists,1)
featureNames(ALL)[closeto1389_at[[1]]$indices]
The function genefinder produces a list from which the selected row numbers as well as the probe names can be extracted. If desired, these can be used for further analysis. From the output it can be observed that the gene expressions of row 2653, with probe identifier 32629_f_at, have the smallest distance (12.6) to those of 1389_at.
Hence, the distance between the two clusters is the same as that of the nearest
neighbors. The algorithm of single linkage cluster analysis starts with creat-
ing as many clusters as data points. Next, the nearest two are determined
and these are merged into one cluster. Then the next two nearest clusters
are determined and merged into one cluster. This process continues until all points belong to one cluster.
Figure 7.1: Plot of five points to be clustered. Figure 7.2: Tree of single linkage cluster analysis.
At the start each data point is seen as a separate cluster. Then the nearest two points (genes) from the Euclidean distance matrix are g1 and g2, having d(g1, g2) = 0.10. These two data points are merged into one cluster, say I = {g1, g2}. In Figure 7.2 this is illustrated by the horizontal line at height 0.10 in the tree. The other three data points g3, g4, g5 are seen as three different clusters. Next, the minimal distance between clusters can be read from the Euclidean distance matrix. Since the smallest is d(g3, g4) = 0.30, the new cluster is J = {g3, g4}, corresponding to the horizontal line at height 0.30. Now there are three clusters, I, J, and K = {g5}. From the Euclidean distance matrix it can be observed that the distance between clusters I and J is 2.19; see the corresponding horizontal line at this height. Hence, the clusters I and J are merged into one. Finally, the distance between the cluster {g1, g2, g3, g4} and the data point g5 equals d(g4, g5) = 3.36; see the corresponding horizontal line at this height.
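The five-point example can be reproduced by a hedged one-dimensional reconstruction; the coordinates below are assumptions chosen only to match the merge heights quoted in the text (0.10, 0.30, 2.19, 3.36):

```r
# Assumed coordinates reproducing the quoted distances
g <- c(g1 = 1.00, g2 = 1.10, g3 = 3.29, g4 = 3.59, g5 = 6.95)
sl.out <- hclust(dist(g, method = "euclidean"), method = "single")
sl.out$height   # merge heights: 0.10, 0.30, 2.19, 3.36
plot(sl.out)    # a tree with the same shape as Figure 7.2
```
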
Figure 7.3: Example of a tree without clusters. Figure 7.4: Three clusters with different standard deviations.
sl.out <- hclust(dist(rnorm(20,0,1), method="euclidean"), method="single")
plot(sl.out)
From the resulting tree in Figure 7.3 one might get the impression that
there are five separate clusters in the data. Note, however, that there is no
underlying data generation process which produces separate clusters from
different populations.
If, however, the data are generated by different normal distributions,
then there are different processes producing separate clusters. To illustrate
this, ten data points were sampled from the N (0, 0.1) population, ten from
N (3, 0.5), and ten from N (10, 1).
x <- c(rnorm(10,0,0.1), rnorm(10,3,0.5), rnorm(10,10,1.0))
sl.out <- hclust(dist(x, method="euclidean"), method="single")
plot(sl.out)
From the tree in Figure 7.4, it can be observed that there clearly exist three
clusters.
These examples illustrate that results from cluster analysis may very well
reveal population properties, but that some caution is indeed in order.
Figure 7.5: Plot of gene CCND3 Cyclin D3 and Zyxin expressions for ALL and AML patients. Figure 7.6: Single linkage cluster diagram from gene CCND3 Cyclin D3 and Zyxin expression values.
Example 3. Application to the Golub (1999) data. Recall that the first twenty-seven patients belong to ALL and the remaining eleven to AML, and that we found earlier that the expression values of the genes CCND3 Cyclin D3 and Zyxin differ between the patient groups ALL and AML. Figure
7.5 illustrates that the patient groups differ with respect to gene expression
values. How to produce this plot and a single linkage cluster analysis is shown
by the script below.
data(golub, package="multtest")
clusdata <- data.frame(golub[1042,],golub[2124,])
colnames(clusdata)<-c("CCND3 Cyclin D3","Zyxin")
gol.fac <- factor(golub.cl,levels=0:1, labels= c("ALL","AML"))
plot(clusdata, pch=as.numeric(gol.fac))
legend("topright",legend=c("ALL","AML"),pch=1:2)
plot(hclust(dist(clusdata, method="euclidean"), method="single"))
Figure 7.6 gives the tree from single linkage cluster analysis. Apart from
three expressions the tree shows two clusters corresponding to the two pa-
tient groups.
7.2.2 k-means
K-means cluster analysis is a popular method in bioinformatics. It is defined
by minimizing the within cluster sum of squares over K clusters. That is,
given the data points x1, …, xn the method seeks to minimize the function

    Σ_{k=1}^{K} Σ_{i∈I_k} d²(x_i, a_k),

where a_k is the center of cluster k and I_k is the set of indices of the data points assigned to cluster k.
Figure 7.7: K-means cluster analysis. Figure 7.8: Tree of single linkage cluster analysis.
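The script producing the output below is lost at the page break; a hedged reconstruction consistent with the output (fifty points around (0, 0) and fifty around (2, 2)) is:

```r
# Simulate two groups of fifty two-dimensional points each
data <- rbind(matrix(rnorm(100, 0, 0.5), ncol = 2),
              matrix(rnorm(100, 2, 0.5), ncol = 2))
cl <- kmeans(data, 2)   # k-means with two clusters
cl
```
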
Cluster means:
[,1] [,2]
1 1.87304978 2.01940342
2 0.01720177 0.07320413
Clustering vector:
[1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[38] 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Available components:
[1] "cluster" "centers" "withinss" "size"
The output of k-means cluster analysis is assigned to a list called cl. Observe
that the cluster means are fairly close to the population means (0, 0) and
(2, 2). The Clustering vector indicates to which cluster each data point
(gene) belongs and that these correspond exactly to the two populations
from which the data are sampled. The variable cl$cluster contains cluster
membership and can be used to specify the color of each data point in a plot, as follows.
> plot(data, col = cl$cluster)
> points(cl$centers, col = 1:2, pch = 8, cex=2)
The data points are plotted in red and black circles and the cluster means by
a star, see Figure 7.7. The sum of the within cluster sum of squares equals
the minimal function value obtained by the algorithm.
Before performing a k-means cluster analysis a plot from a single linkage
cluster analysis may reveal the number of clusters. If the number of clusters is
not at all clear, then it becomes questionable whether k-means is appropriate.
For cases where the number of clusters is only moderately clear, the algorithm is more likely to get stuck in a solution which is only locally optimal.
Such solutions are of limited scientific value. To cope with the danger of
suboptimal solutions one may simply run the algorithm repeatedly by using
the nstart option. Another possibility is to use rational initial starting
values for the cluster means. In particular, the sample means of potential
clusters or the hypothesized population means can be used.
> initial <- matrix(c(0,0,2,2), nrow = 2, ncol=2, byrow=TRUE)
> cl<- kmeans(data, initial, nstart = 10)
The so-called bootstrap (Efron, 1979) can be used to estimate 95% confidence intervals around cluster means. The idea is to re-sample one thousand times with replacement from the given sample and to compute quantiles for the corresponding confidence intervals.
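The call producing the output below is missing at the page break; a hedged reconstruction using the two Golub genes selected earlier is:

```r
# k-means on the CCND3 Cyclin D3 and Zyxin expression values
data(golub, package = "multtest")
data <- data.frame(golub[1042, ], golub[2124, ])
colnames(data) <- c("CCND3 Cyclin D3", "Zyxin")
cl <- kmeans(data, 2, nstart = 10)
cl
```
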
Cluster means:
CCND3 Cyclin D3 Zyxin
1 0.6355909 1.5866682
2 1.8938826 -0.2947926
Clustering vector:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
27 28 29 30 31 32 33 34 35 36 37 38
2 1 1 1 1 1 1 1 1 1 1 1
The two clusters discriminate exactly the ALL patients from the AML pa-
tients. This can also be seen from Figure 7.9, where expression values of
CCND3 Cyclin D3 are depicted on the horizontal axis and those of Zyxin
on the vertical, and the ALL patients are in red and the AML patients in
black. By the bootstrap the cluster means and their confidence intervals can
be estimated.
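A hedged sketch of such a bootstrap follows; the two-gene Golub data frame and the use of the full-data centers as starting values are assumptions:

```r
data(golub, package = "multtest")
data <- data.frame(golub[1042, ], golub[2124, ])   # CCND3 Cyclin D3 and Zyxin
cl <- kmeans(data, 2, nstart = 10)                 # k-means on the original data
n <- nrow(data); nboot <- 1000
boot.cl <- matrix(0, nrow = nboot, ncol = 4)
for (i in 1:nboot) {
  dat.star <- data[sample(1:n, replace = TRUE), ]  # re-sample with replacement
  cl.star <- kmeans(dat.star, cl$centers, nstart = 10)
  boot.cl[i, ] <- c(cl.star$centers[1, ], cl.star$centers[2, ])
}
apply(boot.cl, 2, mean)                            # bootstrap means of the centers
quantile(boot.cl[, 1], c(0.025, 0.975))            # 95% CI, first coordinate
```
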
> mean(data.frame(boot.cl))
X1 X2 X3 X4
0.6381860 1.5707477 1.8945878 -0.2989426
> quantile(boot.cl[,1],c(0.025,0.975))
2.5% 97.5%
0.2548907 0.9835898
> quantile(boot.cl[,2],c(0.025,0.975))
2.5% 97.5%
1.259608 1.800581
> quantile(boot.cl[,3],c(0.025,0.975))
2.5% 97.5%
1.692813 2.092361
> quantile(boot.cl[,4],c(0.025,0.975))
2.5% 97.5%
-0.60802142 -0.02420802
The difference between the bootstrap means and the k-means from the orig-
inal data gives an estimate of the estimation bias. It can be observed that
the bias is small. The estimation is quite precise because the 95% bootstrap
confidence intervals are fairly small.
Figure 7.9: Plot of k-means (stars) cluster analysis on CCND3 Cyclin D3 and Zyxin discriminating between ALL (red) and AML (black) patients.
The value of the correlation coefficient is always between minus one and plus
one. If the value is close to either of these values, then the variables are
linearly related in the sense that the first is a linear transformation of the
second. That is, there are constants a and b such that axi + b = yi for all
i. By the function cor.test, the null hypothesis H0: ρ = 0 can be tested against the alternative H1: ρ ≠ 0.
The value is positive which means that larger values of x occur together with
larger values of y and vice versa. This can also be observed by plot(x,y). By
> cor.test(x,y)
data: x and y
t = 4.9662, df = 36, p-value = 1.666e-05
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.3993383 0.7952115
sample estimates:
cor
0.6376217
The test is based on the normality assumption and therefore prints a t-value.
Since the corresponding p-value is very small, we reject the null hypothesis
of zero correlation. The left bound of the confidence interval falls far to the
right hand side of zero.
Observe that the 95% confidence interval is larger than that found by cor.test.
This indicates that the assumption of normality may not be completely valid
here. Since the confidence interval does not contain zero, we reject the null-
hypothesis of zero correlation.
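The bootstrap script itself does not survive in this chunk; a hedged sketch follows, where x and y stand for the two expression vectors used in cor.test above (simulated data of the same length, n = 38, are used here for illustration):

```r
# Bootstrap 95% confidence interval for the correlation coefficient
set.seed(1)
x <- rnorm(38); y <- 0.6 * x + rnorm(38, sd = 0.8)   # illustrative stand-ins
nboot <- 1000
boot.cor <- numeric(nboot)
for (i in 1:nboot) {
  idx <- sample(1:length(x), replace = TRUE)         # re-sample pairs
  boot.cor[i] <- cor(x[idx], y[idx])
}
quantile(boot.cor, c(0.025, 0.975))                  # bootstrap 95% interval
```
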
Example 5. Application to the Golub (1999) data. The ALL and AML
patients of the Golub et al. (1999) data are indicated by zero and ones of
the binary vector golub.cl. A manner to select genes is by the correlation of the expression values with this binary vector. This can be computed by using the apply functionality.
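A minimal sketch of such a computation, assuming the golub data from the multtest package (golub.gnames holds the gene names in its second column):

```r
data(golub, package = "multtest")
# Correlation of each gene's expression values with the ALL/AML indicator
corgol <- apply(golub, 1, function(x) cor(x, golub.cl))
o <- order(corgol, decreasing = TRUE)
golub.gnames[o[1:5], 2]   # the five genes most positively correlated with AML
```
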
        Var 1  Var 2
gene 1   1.63   1.22
gene 2  -0.40   0.79
gene 3   0.93   0.97
gene 4  -1.38  -1.08
gene 5  -0.17  -0.96
gene 6  -0.61  -0.93
The correlations between the two variables can be placed in a matrix R, which has ones on the diagonal and the value 0.8 elsewhere.
To illustrate a direction, let us try the linear combination k = (2, 1) applied to the sample correlation matrix R. This gives

    Rk = | 1    0.8 | | 2 |  =  | 2.8 |
         | 0.8  1   | | 1 |     | 2.6 | .
Both vectors k and Rk can be plotted in the xy-plane. The vector (2, 1)
is plotted by drawing an arrow from (0,0) to the point with x = 2 and
y = 1. This is done completely similarly for (2.8, 2.6) in Figure 7.10. It can
be observed that the two vectors (arrows) do not fall on the same line and
therefore have different directions. The crux of principal components analysis
is that a linear combination with the same direction as the weights represents the direction of maximum variation. Such is the case if Rk differs from k only by a multiplicative constant; that is, there exists a constant d such that Rk = dk. We shall determine such a constant by finding the weights vector first. To do so, observe from our correlation matrix that the sum of both rows equals 1.8. Taking k = (1, 1) yields

    Rk = | 1    0.8 | | 1 |  =  | 1.8 |  =  1.8 | 1 |  =  1.8k.
         | 0.8  1   | | 1 |     | 1.8 |         | 1 |
Taking k = (1, −1) yields

    Rk = | 1    0.8 | |  1 |  =  |  0.2 |  =  0.2 |  1 |  =  0.2k.
         | 0.8  1   | | −1 |     | −0.2 |         | −1 |
Figure 7.10: Vectors of linear combinations. Figure 7.11: First principal component with projections of data.
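The eigendecomposition whose output is printed below can be computed as follows (a sketch; R is the 2×2 correlation matrix from above):

```r
R <- matrix(c(1.0, 0.8, 0.8, 1.0), nrow = 2)  # correlation matrix
K <- eigen(R)                                  # eigenvalues and eigenvectors
print(K, digits = 2)
```
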
The output is stored as an object called K which can be printed to the screen
in two digits.
> print(K,digits=2)
$values
[1] 1.8 0.2
$vectors
[,1] [,2]
[1,] 0.71 0.71
[2,] 0.71 -0.71
The eigenvalues are assigned to K$values and the eigenvectors are the columns
of K$vectors. To compute the principal components we use the matrix mul-
tiplication operator %*%. Then the first principal component is defined as the
linear combination of the data with the first eigenvector, Z%*%K$vec[,1]. To
print the scores on the first and the second principal component one can use
the following.
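A hedged sketch of that computation; the matrix Z below restores the values of the table above, with the minus signs recovered so that each column has mean zero and the correlation equals 0.8, as the text states:

```r
# Z: the 6x2 data matrix of the example (signs restored by assumption)
Z <- matrix(c( 1.63,  1.22,
              -0.40,  0.79,
               0.93,  0.97,
              -1.38, -1.08,
              -0.17, -0.96,
              -0.61, -0.93), ncol = 2, byrow = TRUE)
K <- eigen(matrix(c(1, 0.8, 0.8, 1), nrow = 2))
Z %*% K$vectors[, 1]   # scores on the first principal component
Z %*% K$vectors[, 2]   # scores on the second principal component
```
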
To illustrate the first principal component the six data points from the Z
matrix are plotted as small circles in Figure 7.11. Gene 1, for instance, has
x coordinate 1.63 and y coordinate 1.22 and appears therefore in the right
upper corner.
A convenient manner to perform principal components analysis is by using the built-in function princomp, as follows.
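The script is missing here; a hedged sketch on the same 6×2 matrix Z as in the example above is:

```r
# Principal components analysis with princomp on the example data Z
Z <- cbind(c(1.63, -0.40, 0.93, -1.38, -0.17, -0.61),
           c(1.22,  0.79, 0.97, -1.08, -0.96, -0.93))
pca <- princomp(Z, cor = TRUE, scores = TRUE)
pca$scores     # the component scores
pca$loadings   # the eigenvectors as loadings
```
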
The scores are the component scores and the loadings from princomp are
the eigenvectors.
Example 2. Application to the Golub (1999) data. The first five eigenvalues
from the correlation matrix of golub can be printed by the following.
> eigen(cor(golub))$values[1:5]
[1] 25.4382629 2.0757158 1.2484411 1.0713373 0.7365232
Because the eigenvalues are arranged in decreasing order, the sixth to the 38th are smaller than one; for this reason these will be neglected. The first
eigenvalue is by far the largest, indicating that the persons are dependent to
a large extent. Applying the previous bootstrap methods to estimate 95%
confidence intervals for the eigenvalues we obtain the following intervals.
Figure 7.12: Scatter plot of selected genes with row labels on the first two principal components. Figure 7.13: Single linkage cluster diagram of selected gene expression values.
It can be checked that all correlations between the patients are positive.
This implies that large expression values of gene i co-vary positively with large expression values of gene j. The positivity of the correlations also implies that
the weights of the first eigenvector have the same sign, so that these can be
taken to be positive for all patients (Horn & Johnson, 1985). Unfortunately,
this is not automatic in R so that caution is in order with respect to inter-
pretation of the components. By using -eigen(cor(golub))$vec[,1:2] to
print the weights to the screen it can be observed that those that correspond
to the first component are positive. All weights of the first eigenvector are
positive and have very similar size (all are between 0.13 and 0.17). Thus the
first component is almost equal to the sum of the variables (the correlation
equals 0.9999). The weights of the second component have a very interesting
pattern. Namely, almost all of the first 27 weights are positive and the last 11
weights are negative. Thus the second component contrasts the ALL patients
with the AML patients. By contrasting ALL patients with AML patients, the
second-largest amount of variance in the data is explained. Hence,
the AML-ALL distinction is discovered by the second component, which is
in line with findings of Golub et al. (1999).
Obviously the genes with the largest expression values from the first com-
ponent can be printed. We shall, however, concentrate on the second compo-
nent because it appears to be more directly related to the research intentions
of Golub et al. (1999). The first and the last ten gene names with respect
to the values on the second component can be printed by the following.
Example 3. Biplot. A useful manner to plot both genes (cases) and patients
(variables) is the biplot, which is based on a two-dimensional approximation
of the data very similar to principal components analysis. Here, we illustrate
how it can be combined with principal components analysis.
> biplot(princomp(data,cor=TRUE),pc.biplot=TRUE,cex=0.5,expand=0.8)
The resulting plot is given by Figure 7.14. The left and bottom axis refer
to the component scores and the top and right to the patient scores, which
are scaled to unit length by the specification cor. It can be seen that the
patients are clearly divided in two groups corresponding to ALL and AML.
Example 4. Critical for S-phase. Golub et al. (1999) mention that among
genes which are useful for tumor class prediction there are genes that encode
for proteins critical for S-phase cell cycle progression such as Cyclin D3,
Op18, and MCM3. We select genes which carry CD, Op, or MCM in
their names and collect the corresponding row numbers.
This yields 110 genes. In order to select those that do have an experimental
effect, we use a two-sample t-test.
This yields 34 genes, of which the row numbers are selected in the vector
oo. In order to identify genes in directions of large variation we use the scores
on the first two principal components.
The scores on the first two principal components of the selected genes are
stored in the data frame leu. From the plotted component scores in Figure
7.12, it seems that there are several sub-clusters of genes. The genes that
belong to these clusters can be identified by hierarchical cluster analysis.
cl <- hclust(dist(leu,method="euclidean"),method="single")
plot(cl)
From the tree (dendrogram) in Figure 7.13 various clusters of genes are
apparent that also appear in Figure 7.12. 3 The ordered genes can be
obtained from the object cl as follows.
The cluster with rows 504, 313, 1756, and 893 consists of antigenes. The
genes MCM3 Minichromosome maintenance deficient (S. cerevisiae) 3 with
row numbers 2289 and 2430 appear adjacent to each other. This illustrates
that genes with similar functions may indeed be close with respect to their
gene expression values.
3 Unfortunately, some row numbers of genes are less readable because the points are very close.
7.6 Exercises
1. Cluster analysis on the Zyxin expression values of the Golub et al.
(1999) data.
(a) Produce a scatter plot of the gene expression values, using
different symbols for the two groups.
(b) Use single linkage cluster analysis to see whether the tree indi-
cates two different groups.
(c) Use k-means cluster analysis. Are the two clusters according to
the diagnosis of the patient groups?
(d) Perform a bootstrap on the cluster means. You will have to modify
the code here and there. Do the confidence intervals for the cluster
means overlap?
(a) Use genefilter to find the ten genes closest to the expression
values of CCND3 Cyclin D3. Give their probe names as well as
their biological names.
(b) Produce a combined boxplot separately for the ALL and the AML
expression values. Compare it with that on the basis of CCND3
Cyclin D3 and comment on the similarities.
(c) Compare the smallest distances with those among the Cyclin genes
computed above. What is your conclusion?
(a) Plot the data and invent a manner to find the row number of the
outlier.
(b) Remove the outlier and test the correlation coefficient again.
Compare the results to those above.
(c) Perform the bootstrap to construct a confidence interval.
(a) Select the oncogenes from the Golub data and plot the tree from
a single linkage cluster analysis.
(b) Do you observe meaningful clusters?
(c) Select the antigenes and answer the same questions.
(d) Select the receptor genes and answer the same questions.
(a) Construct an expression set with the patients with B-cell in stage
B1, B2, and B3. Compute the corresponding ANOVA p-values
of all gene expressions. Construct the expression set with the p-
values smaller than 0.001. Report the dimensionality of the data
matrix with gene expressions.
(b) Are the correlations between the patients positive?
(c) Compute the eigenvalues of the correlation matrix. Report the
largest five. Are the first three larger than one?
(d) Program a bootstrap of the largest five eigenvalues. Report the
bootstrap 95% confidence intervals and draw relevant conclusions.
(e) Plot the genes in a plot of the first two principal components.
(a) Verify that the eigenvalues of the matrices are 1.8, 0.2, 2.6,
0.2, 0.2, and 1.5, 1.5, -7.6e-17 (the last being zero up to numerical precision).
(b) How much variance represents the first component corresponding
to the second matrix?
(c) Verify that the first eigenvector of the second correlation matrix
has identical signs.
Figure 7.14: Biplot of the first two principal components (Comp.1 and Comp.2) with genes and patients.
Classification Methods
In medical settings groups of patients are often diagnosed into classes corre-
sponding to types of diseases. In bioinformatics the question arises whether
the diagnosis of a patient can be predicted from gene expression values. Related
is the question which genes play an important role in the prediction of
class membership. A similar question is the prediction of microRNAs from
values of folding energy. More generally, for objects like proteins, mRNAs,
or microRNAs it may be of importance to classify these on the basis of
certain measurements.
In this chapter you learn what recursive partitioning is and how to use it.
To evaluate the quality of prediction the fundamental concepts of sensitivity
and specificity are frequently used. These can be summarized in a single
number by the area under the receiver operating characteristic (ROC) curve. This
will be explained and illustrated. Two other methods to predict disease class
from gene expression data are the support vector machine and the neural
network. It will briefly be explained what these methods are about and how
these can be applied. A validation set will be used to evaluate the predictive
accuracy.
1 I am obliged to Sven Warris for computing the minimum energy values.
From the frequency Table 8.1, the sensitivity, the specificity, and the
predictive power can be computed in order to evaluate the quality of the
test. The sensitivity is the probability that the test is positive given that the
sequence is a microRNA (true positive). Thus
sensitivity = P (true positive) = P (test positive|microRNA) = 2973/3424 = 0.8682.
The specificity is the probability that the test is negative given that the
sequence is not a microRNA (true negative). Thus
specificity = P (true negative) = P (test negative|no microRNA) = 3391/3424 = 0.9903.
For practical applications of a test the predictive power is of crucial impor-
tance. In particular, the predictive value positive is the probability that the
sequence is a microRNA given that the test is positive. That is,
Predictive value positive = PV+ = P (microRNA|test positive) = 2973/3006 = 0.9890.
Thus when the test is positive we are 98.90% certain that the sequence is
indeed a microRNA. The predictive value negative is the probability that the
sequence is not a microRNA given that the test is negative.
Predictive value negative = PV− = P (no microRNA|test negative) = 3391/3842 = 0.8826.
Thus when the test is negative we are 88.26% certain that the sequence
is not a microRNA. From the estimated conditional probabilities it can be
concluded that the test performs quite well in discriminating microRNAs
from non-microRNAs.
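These four quantities follow directly from the 2 × 2 frequency table. The sketch below recomputes them in R from the counts implied by the text (2973 true positives, 33 false positives, 451 false negatives, 3391 true negatives); the object names are ours, not notation from Table 8.1.

```r
# Counts implied by the text: 2973 of 3006 positive tests are microRNAs,
# 3391 of 3842 negative tests are non-microRNAs.
TP <- 2973; FP <- 33; FN <- 451; TN <- 3391
sensitivity <- TP / (TP + FN)  # P(test positive | microRNA)
specificity <- TN / (TN + FP)  # P(test negative | no microRNA)
pv.pos <- TP / (TP + FP)       # predictive value positive
pv.neg <- TN / (TN + FN)       # predictive value negative
round(c(sensitivity, specificity, pv.pos, pv.neg), 4)
```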
so the corresponding true and false positives can be computed for each cutoff
value. To briefly indicate the origin of the terminology, imagine that the test
results are a characteristic received by an operator. The receiver operator
characteristic (ROC) is a curve where the false positive rates are depicted
horizontally and the true positive rates vertically. The larger the area under
the ROC curve, the better the test is because then low false positive rates
go together with large true positive rates.2 These ideas are illustrated by
several examples.
There are 25 ALL patients with expression values greater than or equal to
1.27, so that the true positive rate is 25/27=0.93. For this cutoff value there
is one false positive because one patient without ALL has a score larger than
1.27. Hence, the false positive rate is 1/11 = 0.09.
Example 2. The expression values for gene CCND3 Cyclin D3 from the
Golub et al. (1999) data are sorted in decreasing order, see Table 8.2. The
procedure to draw the ROC curve starts with cutoff point infinity. Obviously,
there are no expression values equal to infinity, so there is no patient tested
positive. Next, the cut off point 2.77 is taken and values greater than or
equal to 2.77 are tested as positive. This yields one true positive implying a
2 More detailed information can be obtained from a Wikipedia search using "ROC curve".
true positive rate of 1/27, see second row of Table 8.2. For this cutoff value
there are no negatives so that the false positive rate is zero.
Now consider cutoff point 1.52. There are 22 ALL patients with expres-
sion values greater than or equal to 1.52, so that the true positive rate is
22/27=0.81. For this cutoff value there are no false positives because all pa-
tients without ALL have expression values lower than 1.51. Hence, the false
positive rate is 0 and the true positive rate is 0.81. To indicate this there is
a vertical line drawn in the ROC curve from point (0, 0) to point (0, 0.81)
in Figure 8.1. Now consider the next cutoff point 1.45. There are 22 ALL
patients with expression values greater than or equal to 1.45, so that the
true positive rate is again 22/27=0.81. However, there is one patient without
ALL having expression value 1.45, who therefore receives a positive
test. Hence, the number of false positives increases from zero to one, which
implies a false positive rate of 1/11=0.09. In the ROC curve this is indicated
by point (0.09, 0.81) and the horizontal line from (0, 0.81) to (0.09, 0.81), see
Figure 8.1.
This process goes on (see Table 8.2) until the smallest data point -0.74 is
taken as cutoff point. For this point all patients are tested positive, so that
the false positive rate is 11/11 and the true positive rate is 27/27. This is
indicated by the end point (1, 1) in the plot at the top on the right hand side.
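The stepwise construction above can be sketched in a few lines of R. The toy expression values below are hypothetical (they are not the CCND3 values of Table 8.2); for each cutoff the true and false positive rates are computed exactly as in the walk-through, including a tied value shared by an ALL and a non-ALL patient.

```r
# Toy data: the first five values belong to "ALL" patients.
x <- c(2.8, 2.2, 1.9, 1.52, 1.45, 1.45, 1.0, 0.6)
is.all <- c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)
cutoffs <- c(Inf, sort(unique(x), decreasing = TRUE))
tpr <- sapply(cutoffs, function(t) mean(x[is.all] >= t))   # true positive rate
fpr <- sapply(cutoffs, function(t) mean(x[!is.all] >= t))  # false positive rate
cbind(cutoff = cutoffs, fpr = fpr, tpr = tpr)
plot(fpr, tpr, type = "s", xlab = "False positive rate",
     ylab = "True positive rate")
```

The last cutoff (the smallest data value) gives the end point (1, 1), just as in the text.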
Figure 8.1: ROC plot for expression values of CCND3 Cyclin D3.
Figure 8.2: ROC plot for expression values of gene Gdf5.
library(ROCR)
gol.true <- factor(golub.cl,levels=0:1,labels= c("TRUE","FALSE"))
pred <- prediction(golub[1042,], gol.true)
perf <- performance(pred, "tpr", "fpr" )
plot(perf)
It seems clear that the expression values are better in testing for ALL when
the curve is very steep in the beginning and attains its maximum value soon.
In such a case the true positive rate is large for a small false positive rate. A
manner to express the predictive accuracy of a test in a single number is by
the area under the curve. Using the function performance(pred,"auc") we
obtain that the area under the curve is 0.96, which is large. Hence, the ex-
pression values of CCND3 Cyclin D3 are suitable for discrimination between
ALL and not ALL (AML). The ROC curve for the expression values of gene
Gdf5 is given by Figure 8.2. It can be observed that the true positive rate
is much lower as one moves on the horizontal axis from the left to the right.
This corresponds to the area under the curve of 0.35, which is small. This
illustrates that genes may express large differences with respect to prediction
of the disease status of patients.
can serve as predictors to form a decision tree. For instance, if xj < t, then
patient j is AML, and otherwise if xj ≥ t, then patient j is ALL. Obviously,
the threshold value t on which the decision is based should be optimal
given the predictor. Such can be estimated by a regression tree (Breiman
et al., 1984; Chambers & Hastie, 1992; Venables, & Ripley, 2000), which is
implemented in the rpart package (Therneau & Atkinson, 1997).
A training set is used to estimate the threshold values that construct the
tree. When many predictor variables are involved, 3051 for instance, then we
have a tremendous gene (variable) selection problem. The rpart package au-
tomatically selects genes which are important for classification and neglects
others. A further problem is that of overfitting where additional nodes of a
tree are added to increase prediction accuracy. When such nodes are specific
for the training sample set, these can not be generalized to other samples
so that these are of limited scientific value. Prevention of such overfitting is
called pruning and is automatically done by the rpart function. Many basic
ideas are illustrated by an elementary example.
Figure 8.4: Classification tree with splits geneA < 0.9371 and geneA < 3.025; leaves ALL1 (10/0/0), ALL2 (0/10/0), and AML (0/0/10).
leukemia abbreviated as ALL1, ALL2, and AML. Gene A has expression val-
ues from the populations (patient groups) N (0, 0.52 ) for ALL1, N (2, 0.52 ) for
ALL2, and N (4, 0.52 ) for AML. The script below generates thirty expression
values for gene A, the patients of the three disease classes, and the estimates
of the classification tree.
set.seed(123); n<-10 ; sigma <- 0.5
fac <- factor(c(rep(1,n),rep(2,n),rep(3,n)))
levels(fac) <- c("ALL1","ALL2","AML")
geneA <- c(rnorm(10,0,sigma),rnorm(10,2,sigma),rnorm(10,4,sigma))
dat <- data.frame(fac,geneA)
library(rpart)
rp <- rpart(fac ~ geneA, method="class",data=dat)
plot(rp, branch=0,margin=0.1); text(rp, digits=3, use.n=TRUE)
From the boxplot in Figure 8.3 it can be observed that there is no overlap of
gene expressions between classes. This makes gene A an ideal predictor for
separating patients into classes. By the construction of the gene expression
values x1, ..., x30, we expect the following partition: If xi < 1, then ALL1,
if xi is in interval [1, 3], then ALL2, and if xi > 3, then AML. From the
estimated tree in Figure 8.4 it can be observed that the estimated splits are
close to our expectations: If xi < 0.9371, then ALL1, if xi is in [0.9371, 3.025],
then ALL2, and if xi > 3.025, then AML. The tree consists of three leaves
(nodes) and two splits. The prediction of patients into the three classes per-
fectly matches the true disease status.
Obviously, such an ideal gene need not exist because the expression values
overlap between the disease classes. In such a case more genes may be used
to build the classification tree.
set.seed(123)
n<-10 ; sigma <- 0.5
fac <- factor(c(rep(1,n),rep(2,n),rep(3,n)))
levels(fac) <- c("ALL1","ALL2","AML")
geneA <- c(rnorm(20,0,sigma),rnorm(10,2,sigma))
geneB <- c(rnorm(10,0,sigma),rnorm(20,2,sigma))
geneC <- c(rnorm(30,1,sigma))
dat <- data.frame(fac,geneA,geneB,geneC)
library(rpart)
rp <- rpart(fac ~ geneA + geneB + geneC, method="class",data=dat)
Note the addition in the model notation for the rpart function. It is con-
venient to collect the data in the form of a data frame.
From the boxplot in Figure 8.5 it can be seen that Gene A discriminates
well between ALL and AML, but not between ALL1 and ALL2. The expres-
sion values for Gene B discriminate well between ALL1 and ALL2, whereas
those of Gene C do not discriminate at all. The latter can also be seen from
the estimated tree in Figure 8.6, where Gene C plays no role at all. This il-
lustrates that rpart automatically selects the genes (variables) which play a
role in the classification tree. Expression values on Gene A larger than 1.025
are predicted as AML and smaller ones as ALL. Expression values on Gene
B smaller than 0.9074 are predicted as ALL1 and larger ones as ALL2. Hence,
Gene B separates well within the ALL class.
Figure 8.6: Classification tree with splits geneA < 1.025 and geneB < 0.9074; leaves ALL1 (10/0/0), ALL2 (0/10/0), and AML (0/0/10).
Note that (25 + 10)/38 · 100% = 92.10% of the ALL/AML patients are correctly
classified by gene CCND3 Cyclin D3. By the function predict(gol.rp,type="class")
the predictions from the regression tree of the patients in the two classes can
be obtained. The factor gol.fac contains the levels ALL and AML corre-
sponding to the diagnosis to be predicted. The predictor variable consists
of the expression values of gene CCND3 Cyclin D3. The output of recursive
partitioning is assigned to an object called gol.rp, a list from which fur-
ther information can be extracted by suitable functions. A summary can be
obtained as follows.
> summary(gol.rp)
Call:
26
[1] 0.03846154
The expected loss in prediction accuracy of Node number 2 is 1/26 and that
of Node number 3 is 2/12. This equals the probabilities from the class counts.
The primary splits gives the estimated threshold value. To predict the class
of the individual patients one may use the function predict, as follows.
> predict(gol.rp,type="class")
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL AML ALL ALL ALL
21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38
AML ALL ALL ALL ALL ALL ALL AML ALL AML AML AML AML AML AML AML AML AML
Levels: ALL AML
Figure: classification tree for the Golub data with the split golub[1042, ] >= 1.199; leaves ALL (25/1) and AML (2/10).
library(rpart);data(golub); library(multtest)
row.names(golub)<- paste("gene", 1:3051, sep = "")
goldata <- data.frame(t(golub[1:3051,]))
gol.fac <- factor(golub.cl,levels=0:1, labels= c("ALL","AML"))
gol.rp <- rpart(gol.fac~., data=goldata, method="class", cp=0.001)
plot(gol.rp, branch=0,margin=0.1); text(gol.rp, digits=3, use.n=TRUE)
golub.gnames[896,]
library("hgu95av2.db");library(ALL);data(ALL)
ALLB123 <- ALL[,ALL$BT %in% c("B1","B2","B3")]
pano <- apply(exprs(ALLB123), 1, function(x) anova(lm(x ~ ALLB123$BT))$Pr[1])
names <- featureNames(ALL)[pano<0.000001]
symb <- mget(names, env = hgu95av2SYMBOL)
ALLBTnames <- ALLB123[names, ]
probedat <- as.matrix(exprs(ALLBTnames))
row.names(probedat)<-unlist(symb)
The probe symbols are extracted from the hgu95av2SYMBOL environment and
used as row names to facilitate readability of the resulting tree. There are 78
patients selected and 29 probes. The recursive partitioning to find the tree
can be performed by the following script.
For instance, the sixth patient is with probability .90 in class B3 and with
probability .05 in class B1, which is the diagnosed disease state.
Figure: classification tree with splits MME < 8.395 and LSM6 < 4.192; leaves B1 (17/2/0), B2 (1/33/5), and B3 (1/1/18), shown next to the ranked probe names from rf1 (1389_at, 38032_at, 40440_at, ...).
Note the reduction in variables from twenty-nine to two in the actual
construction of the tree. In a construction like this the gene expressions
(variables) are linearly dependent in the sense that once the first gene is
selected for the first split, then highly similar ones are not selected anymore.
It can be instructive to leave out the variables selected from the data and to
redo the analysis.
A generally applied manner to evaluate an estimated model is by its pre-
dictive accuracy with respect to a future data set. When such a future data
set is not available, it is common practice to split the available data in two
parts: A training set and a validation set. Then the model is estimated from
the training set and this is used to predict the class of the patients in the
validation set. Then a confusion matrix is constructed with the frequencies
of true classes against predicted classes. Next, the misclassification rate can
be computed to evaluate the predictive accuracy. This can very well be seen
as a method to detect overfitting, where the model estimates are so data-specific
that generalization to future data sets is in danger.
The misclassification rate in the training set is 2/39 = 0.05 and in the val-
idation set is 7/39 = 0.18. Note that the differences mainly occur between
State 2 and 3. Generally the prediction of disease state from the training set
is better because the model is estimated from these data.
The same split of the data into training and validation set will be used
for other methods as well.
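The split-estimate-predict cycle can be sketched on simulated data; the sample sizes and object names below are our own and merely illustrate the procedure (with the ALL data one would split the 78 patients into two sets of 39).

```r
# Sketch: train/validation split, tree estimation, confusion matrix,
# and misclassification rate on simulated two-class data.
library(rpart)
set.seed(123)
fac  <- factor(rep(c("A", "B"), each = 30))
gene <- c(rnorm(30, 0, 1), rnorm(30, 2, 1))
dat  <- data.frame(fac, gene)
train <- sample(1:60, 30, replace = FALSE)       # half for training
fit   <- rpart(fac ~ gene, data = dat[train, ], method = "class")
pred  <- predict(fit, dat[-train, ], type = "class")
conf  <- table(pred, dat$fac[-train])            # confusion matrix
miscl <- 1 - sum(diag(conf)) / sum(conf)         # misclassification rate
miscl
```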
library(e1071)
df <- data.frame(Y = factor(ALLBTnames$BT), X =t(probedat))
Y <- factor(ALLBTnames$BT);X <- t(probedat)
svmest <- svm(X, Y, data=df, type = "C-classification", kernel = "linear")
svmpred <- predict(svmest, X, probability=TRUE)
> table(svmpred, factor(ALLBTnames$BT))
svmpred B1 B2 B3
B1 19 0 0
B2 0 36 1
B3 0 0 22
The confusion matrix shows that the misclassification rate of the three classes
of B-cell ALL is 1/78 = 0.0128, which is very small, so that the prediction is almost
perfect. Note, however, from summary(svmest) that the number of support
vectors per class equals 20, 9, and 11, for class B1, B2, and B3, respectively.
These have values for all input variables (genes) as can be obtained from
dim(svmest$SV) and the coefficient vectors dim(svmest$coefs). Hence,
the excellent prediction properties are obtained by a very large number of
estimated parameters.
The predictions of the disease states of the patients from the training set per-
fectly match the diagnosed states. The predictions, however, of the classes
of the patients from the validation set have misclassification rate 10/39=0.25
and are therefore less accurate. Hence, the parameter estimates from the
training set are sample specific and do not generalize with the same accuracy
to the validation set.
> library(nnet)
> df <- data.frame(Y = Y, X = X[, sample(ncol(X), 20)])
> nnest <- nnet(Y ~ .,data = df, size = 5, maxit = 500, decay = 0.01,
+ MaxNWts = 5000)
> pred <- predict(nnest, type = "class")
> table(pred, Y) # prints confusion matrix
Y
pred B1 B2 B3
B1 19 0 0
B2 0 36 0
B3 0 0 23
The confusion matrix shows that zero out of 78 patients are misclassified.
The predictions on the training set have misclassification rate zero and that
on the validation set 13/39=0.33.
Rather than going deeper into the details, the usefulness of generalized linear
models will be illustrated with two examples.
library(faraway)
logitmod <- glm((-golub.cl + 1) ~ golub[1042,],
family=binomial(link = "logit"))
pchisq(deviance(logitmod),df.residual(logitmod),lower=FALSE)
plot((-golub.cl + 1) ~ golub[1042,], xlim=c(-2,5), ylim = c(0,1),
xlab="CCND3 expression values ", ylab="Probability of ALL")
x <- seq(-2,5,.1)
lines(x,ilogit(-4.844124 + 4.439953*x))
pchisq(deviance(logitmod),df.residual(logitmod),lower=FALSE)
5 One may also conveniently use a factor as response variable.
> summary(logitmod)
Call:
glm(formula = (-golub.cl + 1) ~ golub[1042, ], family = binomial(link = "logit"))
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.844 1.849 -2.620 0.00880 **
golub[1042, ] 4.440 1.488 2.984 0.00284 **
From Figure 8.11 it can be seen that the logit curve fits the data fairly well.
From the summary of the output it can be seen that the estimated intercept
is -4.844 and the estimated slope is 4.440. Both coefficients are significantly
different from zero. The goodness-of-fit value of the model is computed from
the chi-square distribution and equals .99. The model fits the data well. The
predictive accuracy of the model may be obtained as follows.
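As a sketch (on simulated binary data, since the corresponding script for the CCND3 fit is not reproduced here), the fitted probabilities from predict(..., type = "response") can be thresholded at 0.5 and compared with the true classes.

```r
# Sketch: predictive accuracy of a logistic regression by thresholding
# the fitted probabilities at 0.5 (simulated data, not the Golub values).
set.seed(1)
y <- rep(0:1, each = 20)
x <- c(rnorm(20, 0), rnorm(20, 2))
fit  <- glm(y ~ x, family = binomial(link = "logit"))
prob <- predict(fit, type = "response")  # fitted P(y = 1 | x)
pred <- as.integer(prob > 0.5)
mean(pred == y)                          # proportion correctly classified
```

For the CCND3 fit the same two lines applied to logitmod give the proportion of patients whose ALL/AML status is predicted correctly.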
library(nnet);library("hgu95av2.db");library(ALL);data(ALL)
Coefficients:
(Intercept) MME LSM6 SERBP1
B2 14.36158 4.14002 -0.8494635 -5.104337
B3 -12.90584 4.94908 4.7415802 -5.655420
Std. Errors:
(Intercept) MME LSM6 SERBP1
B2 16.36959 1.367058 1.716513 2.222486
B3 17.97896 1.424526 1.744425 2.313700
Apart from the intercepts, the estimated coefficients are significantly different
from zero.
has the advantage that confidence intervals or the significance of the parameters
can be estimated. On the other hand, statistical model building
procedures may not be designed to handle large numbers of predictor variables.
Hence, the researcher needs to have some idea of which gene expressions
(s)he wants to use as predictors. Indeed, better models than those estimated
in the above example may certainly exist.
8.8 Exercises
1. Classification tree of Golub data. Use recursive partitioning in rpart
6 Some people may want to use the ade4TkGUI().
(a) Find a manner to identify an optimal gene in the Golub data
with respect to prediction of the ALL and AML patients.
(b) Explain what the code does.
(c) Use rpart to construct the classification tree with the genes that
you found. Does it have perfect predictions?
(d) Find the row number of gene Gdf5, which is supposed not to have
any relationship with leukemia. Estimate a classification tree and
report the probability of misclassification. Give explanations of
the results.
(a) Produce a sensitivity versus specificity plot for the gene expression
values of CCND3 Cyclin D3.
(b) In what sense does it resemble Figure 8.2?
(c) Compute the area under the curve for sensitivity versus specificity
curve.
(a) Construct a factor with 100 values one and two and a matrix
with predictor variables of 500 by 4 with values from the normal
distribution. Use the first four letters of the alphabet for the
column names.
(b) Use rpart to construct a recursive tree and report the misclassi-
fication rate. Comment on the results.
(c) Do the same for support vector machines.
(d) Do the same for neural networks.
(e) Think through your results and comment on these.
4. Prediction of achieved remission. For the ALL data from its ALL library
the patients are checked for achieving remission. The variable ALL$CR
has values CR (became healthy) and REF (did not respond to therapy;
remain ill).
6. Classification Tree for Ecoli. The ecoli data can be downloaded by the
following: (Hint: Copy two separated lines into one before running it.)
Table 8.2: Ordered expression values of gene CCND3 Cyclin D3, index 2
indicates ALL, 1 indicates AML, cutoff points, number of false positives,
false positive rate, number of true positives, true positive rate.
Figure 8.11: Probability of ALL against CCND3 expression values, with the fitted logit curve.
Analyzing Sequences
> library(seqinr)
> choosebank()
[1] "genbank" "embl" "emblwgs" "swissprot" "ensembl"
[6] "refseq" "nrsub" "hobacnucl" "hobacprot" "hovergendna"
[11] "hovergen" "hogenom" "hogenomdna" "hogennucl" "hogenprot"
[16] "hoverclnu" "hoverclpr" "homolens" "homolensdna" "greview"
Let's obtain the first sequence and print its first fifteen nucleotides to the
screen. 2
> getSequence(ccnd3hs$req[[1]])[1:15]
[1] "a" "t" "g" "g" "a" "g" "c" "t" "g" "c" "t" "g" "t" "g" "t"
> getTrans(ccnd3hs$req[[1]])[1:15]
[1] "M" "E" "L" "L" "C" "C" "E" "G" "T" "R" "H" "A" "P" "R" "A"
> getAnnot(ccnd3hs$req[[1]])
[1] " CDS join(1051..1248,2115..2330,5306..5465,6005..6141,"
[2] " 6593..6760)"
[3] " /gene=\"CCND3\""
[4] " /codon_start=1"
[5] " /product=\"cyclin D3\""
[6] " /protein_id=\"AAM51826.1\""
[7] " /db_xref=\"GI:21397158\""
[8] " /translation=\"MELLCCEGTRHAPRAGPDPRLLGDQRVLQSLLRLEERYVPRASY"
2 Use double brackets to extract a sequence from a list.
> table(getSequence(ccnd3hs$req[[1]]))
a c g t
162 288 267 162
This table can also be computed by the seqinr function count, which is
more general in the sense that frequencies of dinucleotides can be computed.
> count(getSequence(ccnd3hs$req[[1]]),2)
aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt
25 44 64 29 68 97 45 78 52 104 76 34 16 43 82 21
This will be quite useful in the next chapter. Indeed, changing 2 into 3 makes
it possible to count trinucleotides.
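As a base-R illustration of what such overlapping counting does (the sequence below is a toy fragment, not CCND3, and the code mimics seqinr's count(sequence, 3)):

```r
# Count overlapping trinucleotides of a toy sequence in base R.
seq <- unlist(strsplit("atggagctgctg", ""))
tri <- sapply(1:(length(seq) - 2),
              function(i) paste(seq[i:(i + 2)], collapse = ""))
table(tri)
```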
> GC(getSequence(ccnd3hs$req[[1]]))
[1] 0.6313993
> GC1(getSequence(ccnd3hs$req[[1]]))
[1] 0.6484642
> GC2(getSequence(ccnd3hs$req[[1]]))
[1] 0.4641638
> GC3(getSequence(ccnd3hs$req[[1]]))
[1] 0.78157
Hence, the G + C percentage is largest at the third codon position. It
is also possible to compute the G + C fraction in a window of length 50 nt,
say, and to plot it along the sequence.
GCperc <- double()
n <- length(ccnd3[[1]])
for (i in 1:(n - 50)) GCperc[i] <- GC(ccnd3[[1]][i:(i+50)])
plot(GCperc,type="l")
By double() we first create a vector. From Figure 9.1 it can be seen that
the G + C fraction changes drastically along a window of 50 nucleotides.
where fxy , fx , and fy are the frequencies of the (di)nucleotide xy, x, and y,
respectively. The z-score is computed by subtracting the mean and dividing
by the standard deviation (Palmeira et al., 2006). The latter is somewhat
more sensitive to over- and underrepresentation.
Example 3. Rho and z-scores. The coefficient rho and the corresponding
z-scores will be computed from the sequence with NCBI accession number
AF517525.CCND3.
> round(rho(getSequence(ccnd3hs$req[[1]])),2)
aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt
0.84 0.83 1.30 0.97 1.28 1.03 0.51 1.47 1.06 1.19 0.94 0.69 0.54 0.81 1.67 0.70
> round(zscore(getSequence(ccnd3hs$req[[1]]),modele="base"),2)
aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt
-1.08 -1.67 2.81 -0.18 2.78 0.42 -6.63 4.64 0.54 2.60 -0.80 -2.87 -3.10 -1.86 6.22 -1.98
Figure 9.1: G + C fraction (GCperc) along the sequence in a window of 50 nucleotides.
The rho value for CG is not extreme, but its z-score certainly is.
plot expressing their frequencies. Such can be useful for a first impression on
sequence similarity.
Figure: bar plots of the amino acid frequencies (Leu, Ala, Arg, ..., Stp) of the two sequences.
return(res)
}
kdath <- linform(ccnd3transl, kdc)
> print(kdath,digits=3)
[1] 0.0874 0.0962 0.0189 0.1496 0.0962 0.0874 0.0874 0.2659 0.2220
Indeed, the largest score is still much smaller than one, so the conclusion is
that there are no hydrophilic proteins among our sequences.
The data set aaindex of the seqinr library contains more than five hun-
dred sets of coefficients for computing specific quantities with respect to
proteins.
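To illustrate the idea of such coefficient sets, a property score can be computed as a linear form in the amino acid frequencies; the Python sketch below uses invented coefficients for illustration, NOT values taken from aaindex:

```python
# A protein property as a linear form in amino-acid frequencies, in the
# spirit of the aaindex coefficient sets; toy_coef holds hypothetical
# coefficients, invented for illustration only.
from collections import Counter

def linear_form(protein, coef):
    """Sum over residues of coef[aa] * relative frequency of aa."""
    freq = Counter(protein)
    n = len(protein)
    return sum(coef.get(aa, 0.0) * k / n for aa, k in freq.items())

toy_coef = {"A": 1.8, "R": -4.5, "L": 3.8}   # hypothetical coefficients
score = linear_form("ARLLA", toy_coef)
```

With real coefficient sets (such as the Kyte-Doolittle values) the same linear form yields hydropathy-type scores.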
library(seqinr)
choosebank("genbank")
query("ccnd3hs","sp=homo sapiens AND k=ccnd3@")
ccnd3 <- sapply(ccnd3hs$req, getSequence)
ccnd3nr1 <- c2s(ccnd3[[1]])
> ccnd3nr1
[1] "atggagctgctgtgttgcgaaggcacccggcacgcgccccgggccgggccggacccgcgg"...
> subseq <- "cccggg"
> countPattern(subseq, ccnd3nr1, mismatch = 0)
[1] 2
> matchPattern(subseq, ccnd3nr1, mismatch = 0)
Views on a 879-letter BString subject
Subject: atggagctgctgtgttgcgaaggcacccggcacg...actcctacagatgtcacag
Views:
start end width
[1] 38 43 6 [cccggg]
[2] 809 814 6 [cccggg]
> matchPattern(subseq, ccnd3nr1, mismatch = 1)
Views on a 879-letter BString subject
Subject: atggagctgctgtgttgcgaaggcacccggcacg...actcctacagatgtcacag
Views:
start end width
[1] 26 31 6 [cccggc]
[2] 37 42 6 [ccccgg]
[3] 38 43 6 [cccggg]
[4] 43 48 6 [gccggg]
[5] 54 59 6 [cccgcg]
[6] 119 124 6 [cccgcg]
[7] 236 241 6 [ccctgg]
[8] 303 308 6 [cctggg]
[9] 512 517 6 [cccgtg]
[10] 612 617 6 [cacggg]
[11] 642 647 6 [cctggg]
[12] 661 666 6 [tccggg]
[13] 662 667 6 [ccgggg]
[14] 808 813 6 [ccccgg]
[15] 809 814 6 [cccggg]
[16] 810 815 6 [ccgggg]
xi = 2xi-1 - 10, with initial value x1 = 1.
> x<-double();x[1]<-1
> for (i in 2:10) {x[i]<- 2*x[i-1]-10}
> x[10]
[1] -4598
Note, however, that the recursion F(i, j) = max{F(i-1, j-1) + s(xi, yj), F(i-1, j) - d, F(i, j-1) - d}
will not yet work because we have not defined any
initial values. In fact we agree to start with F(0, 0) = 0 and, due to the
gap penalties, we take F(i, 0) = -i*d for the first column and F(0, j) = -j*d
for the first row. Then the final score F(n, m) is the optimal score and the
values of the matrix F(i, j) indicate the optimal path. Informaticians
often call such a recursive scheme a dynamic programming algorithm.
For x = GAATTC and y = GATTA, with match score 2, mismatch score -1, and
gap penalty d = 2, the (mis)match scores s(i, j) are
   G  A  A  T  T  C
G  2 -1 -1 -1 -1 -1
A -1  2  2 -1 -1 -1
T -1 -1 -1  2  2 -1
T -1 -1 -1  2  2 -1
A -1  2  2 -1 -1 -1
To initialize the first row and column of the matrix F (i, j), it is convenient
to use the function seq. The purpose of the max function seems obvious.
library(seqinr)
x <- s2c("GAATTC"); y <- s2c("GATTA"); d <- 2
s <- outer(y, x, function(a, b) ifelse(a == b, 2, -1))
F <- matrix(data=NA,nrow=(length(y)+1),ncol=(length(x)+1))
rownames(F) <- c("",y); colnames(F) <- c("",x)
F[,1] <- -seq(0,length(y)*d,d); F[1,] <- -seq(0,length(x)*d,d)
for (i in 2:(nrow(F)))
for (j in 2:(ncol(F)))
{F[i,j] <- max(c(F[i-1,j-1]+s[i-1,j-1],F[i-1,j]-d,F[i,j-1]-d))}
> F
G A A T T C
0 -2 -4 -6 -8 -10 -12
G -2 2 0 -2 -4 -6 -8
A -4 0 4 2 0 -2 -4
T -6 -2 2 3 4 2 0
T -8 -4 0 1 5 6 4
A -10 -6 -2 2 3 4 5
From the lower-right corner we see that the optimal score
is indeed 5.
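The recursion that produced this matrix can be sketched compactly; the following Python version is an illustration of the same dynamic programming scheme (returning only the optimal score), not the book's R code:

```python
# The Needleman-Wunsch recursion from the text, sketched in Python:
# F(i,j) = max(F(i-1,j-1) + s(x_i, y_j), F(i-1,j) - d, F(i,j-1) - d),
# with linear gap penalty d; only the optimal score is returned.
def nw_score(x, y, match=2, mismatch=-1, d=2):
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        F[i][0] = -i * d          # first column: gap penalties
    for j in range(m + 1):
        F[0][j] = -j * d          # first row: gap penalties
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
    return F[n][m]

score = nw_score("GAATTC", "GATTA")   # -> 5, as in the matrix above
```

Keeping the whole matrix F, as the R code does, additionally allows the optimal path (the alignment itself) to be traced back.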
Optimal alignments of pairs of amino acid sequences are often considered
more relevant because these are more closely related to biological func-
tions. For this purpose we may modify the previous scheme by changing the
gap penalty d and the (mis)match scores s(i, j). In particular, we shall use
the gap penalty d = 8 and the (mis)match scores from the so-called
BLOSUM50 matrix.
A R N D C Q E G H I L K M F P S T W Y V
A 5 -2 -1 -2 -1 -1 -1 0 -2 -1 -2 -1 -1 -3 -1 1 0 -3 -2 0
R -2 7 -1 -2 -4 1 0 -3 0 -4 -3 3 -2 -3 -3 -1 -1 -3 -1 -3
N -1 -1 7 2 -2 0 0 0 1 -3 -4 0 -2 -4 -2 1 0 -4 -2 -3
D -2 -2 2 8 -4 0 2 -1 -1 -4 -4 -1 -4 -5 -1 0 -1 -5 -3 -4
C -1 -4 -2 -4 13 -3 -3 -3 -3 -2 -2 -3 -2 -2 -4 -1 -1 -5 -3 -1
Q -1 1 0 0 -3 7 2 -2 1 -3 -2 2 0 -4 -1 0 -1 -1 -1 -3
E -1 0 0 2 -3 2 6 -3 0 -4 -3 1 -2 -3 -1 -1 -1 -3 -2 -3
G 0 -3 0 -1 -3 -2 -3 8 -2 -4 -4 -2 -3 -4 -2 0 -2 -3 -3 -4
H -2 0 1 -1 -3 1 0 -2 10 -4 -3 0 -1 -1 -2 -1 -2 -3 2 -4
I -1 -4 -3 -4 -2 -3 -4 -4 -4 5 2 -3 2 0 -3 -3 -1 -3 -1 4
L -2 -3 -4 -4 -2 -2 -3 -4 -3 2 5 -3 3 1 -4 -3 -1 -2 -1 1
K -1 3 0 -1 -3 2 1 -2 0 -3 -3 6 -2 -4 -1 0 -1 -3 -2 -3
M -1 -2 -2 -4 -2 0 -2 -3 -1 2 3 -2 7 0 -3 -2 -1 -1 0 1
F -3 -3 -4 -5 -2 -4 -3 -4 -1 0 1 -4 0 8 -4 -3 -2 1 4 -1
P -1 -3 -2 -1 -4 -1 -1 -2 -2 -3 -4 -1 -3 -4 10 -1 -1 -4 -3 -3
S 1 -1 1 0 -1 0 -1 0 -1 -3 -3 0 -2 -3 -1 5 2 -4 -2 -2
T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 2 5 -3 -2 0
W -3 -3 -4 -5 -5 -1 -3 -3 -3 -3 -2 -3 -1 1 -4 -4 -3 15 2 -3
Y -2 -1 -2 -3 -3 -1 -2 -3 2 -1 -1 -2 0 4 -3 -2 -2 2 8 -1
V 0 -3 -3 -4 -1 -3 -3 -4 -4 4 1 -3 1 -1 -3 -2 0 -3 -1 5
Such a matrix can be typed in manually, or a BLOSUM matrix can be loaded from the
Biostrings package. For the sake of clarity we shall simply construct the matrix
s(i, j) without any concern about computer memory.
library(seqinr);library(Biostrings);data(BLOSUM50)
x <- s2c("HEAGAWGHEE"); y <- s2c("PAWHEAE"); s <- BLOSUM50[y,x]; d <- 8
F <- matrix(data=NA,nrow=(length(y)+1),ncol=(length(x)+1))
F[1,] <- -seq(0,80,8); F[,1] <- -seq(0,56,8)
rownames(F) <- c("",y); colnames(F) <- c("",x)
for (i in 2:(nrow(F)))
for (j in 2:(ncol(F)))
{F[i,j] <- max(c(F[i-1,j-1]+s[i-1,j-1],F[i-1,j]-d,F[i,j-1]-d))}
> F
H E A G A W G H E E
0 -8 -16 -24 -32 -40 -48 -56 -64 -72 -80
P -8 -2 -9 -17 -25 -33 -41 -49 -57 -65 -73
A -16 -10 -3 -4 -12 -20 -28 -36 -44 -52 -60
W -24 -18 -11 -6 -7 -15 -5 -13 -21 -29 -37
H -32 -14 -18 -13 -8 -9 -13 -7 -3 -11 -19
E -40 -22 -8 -16 -16 -9 -12 -15 -7 3 -5
A -48 -30 -16 -3 -11 -11 -12 -12 -15 -5 2
E -56 -38 -24 -11 -6 -12 -14 -15 -12 -9 1
Hence, from the lower-right corner we observe that the optimal score equals
one.
library(Biostrings);data(BLOSUM50)
> pairwiseAlignment(AAString("PAWHEAE"), AAString("HEAGAWGHEE"),
+ substitutionMatrix = "BLOSUM50",gapOpening = 0, gapExtension = -8,
+ scoreOnly = FALSE)
Global Pairwise Alignment
1: --P-AW-HEAE
2: HEAGAWGHE-E
Score: 1
To evaluate the significance of this score, random sequences of the same lengths
can be sampled in order to compute the maximum alignment score. This is repeated 1000 times and the
probability of optimal alignment scores greater than 1 is estimated by the
corresponding proportion.
library(seqinr);library(Biostrings);data(BLOSUM50)
randallscore <- double()
for (i in 1:1000) {
x <- c2s(sample(rownames(BLOSUM50),7, replace=TRUE))
y <- c2s(sample(rownames(BLOSUM50),10, replace=TRUE))
randallscore[i] <- pairwiseAlignment(AAString(x), AAString(y),
substitutionMatrix = "BLOSUM50",gapOpening = 0, gapExtension = -8,
scoreOnly = TRUE)
}
> sum(randallscore>1)/1000
[1] 0.003
By the option scoreOnly = TRUE only the optimal score is written to the vector
randallscore. The probability of scores larger than 1 equals 0.003, which is
small, so the alignment is stronger than expected from randomly
constructed sequences.
choosebank("genbank"); library(seqinr)
query("ccnd3hs","sp=homo sapiens AND k=ccnd3@")
ccnd3 <- sapply(ccnd3hs$req, getSequence)
ccnd3transl <- sapply(ccnd3, getTrans)
x <- c2s(ccnd3transl[[1]])
y <- c2s(ccnd3transl[[1]][50:70])
nwscore <- double() ; n <- length(ccnd3transl[[1]])
for (i in 1:(n-21))
nwscore[i] <-
pairwiseAlignment(AAString(c2s(ccnd3transl[[1]][i:(i+20)])),
AAString(y),substitutionMatrix = "BLOSUM50",gapOpening = 0,
gapExtension = -8, scoreOnly = TRUE)
> pairwiseAlignment(AAString(y), AAString(y),
substitutionMatrix = "BLOSUM50", gapOpening = 0, gapExtension = -8,
+ scoreOnly = TRUE)
[1] 152
> max(nwscore)
[1] 152
> which.max(nwscore)
[1] 50
Note that the maximum occurs when the subsequences are identical. The
value of the maximum is 152 which occurs at position 50.
9.7 Exercises
1. Writing to a FASTA file. Read, similar to the above, the ccnd3 se-
quences using the query language and write the first sequence to a file
in FASTA format. Also try to write them all to a single FASTA file.
(a) Construct two random sequences of size 100 and plot the first
against the second and the first against the first.
(b) Construct a plot of the first against the first and the first against
the first in reverse order.
(c) Download the sequences related to the species homo sapiens and
a gene name like CCND3 Cyclin D3. Construct a dotplot of
the most similar and the least similar sequences. Report your
observations.
The algorithm allows a score of zero when the alternatives are negative.
The idea is that a maximal local alignment can occur anywhere in the
matrix; the optimal local alignment is defined as the maximum over the whole
matrix. Program the Smith-Waterman algorithm and find the optimal
local alignment of the sequences PAWHEAE and HEAGAWGHEE.
Markov Models
The idea of a Markov process forms the basis of many important models in
bioinformatics such as (Hidden) Markov Models, models for sequence align-
ment, and models for phylogenetic trees. By the latter it is possible to
estimate distances between several sequences and to visualize these in a tree.
Classical matrices for sequence alignment such as BLOSUM and PAM are
constructed on the basis of a Markov process. By (Hidden) Markov Mod-
els the specific repetitive order of DNA sequences can be modeled, so that
predictions of families become possible.
In this chapter you learn what a probability transition matrix is and which
role it plays in a Markov process to construct specific sequences. Various
models for phylogenetic trees are explained in terms of the rate matrix as
well as the probability transition matrix. The basic ideas of the Hidden
Markov Model are briefly explained and illustrated by an example.1
1
This chapter is somewhat more technical in its notation with respect to e.g. conditional
probability. This is, however, inevitable for the understanding of Markov processes.
Example 1. Throwing a coin. A fair coin X attains Head and Tail with
probability 1/2. Thus we may write P (X = H) = 0.5 and P (X = T ) = 0.5.
With such a random variable there always corresponds a population as well
as a sampling scheme, which can be simulated on a computer (e.g. Press et
al., 1992).
> sample(c("H","T"),30,rep=TRUE,prob=c(0.5,0.5))
[1] "H" "H" "T" "T" "T" "H" "H" "T" "T" "H" "H" "H" "T" "T" "H" "T"
[20] "H" "T" "T" "T" "H" "T" "H" "T" "T" "T" "T"
Thus the sampled values Head and Tail correspond to the process of actu-
ally throwing with a fair coin. The function sample randomly draws thirty
times one of the values c("H","T") with replacement (rep=TRUE) and equal
probabilities (prob=c(0.5,0.5)).
For these sampling schemes it holds that each event occurs independently
of the previous ones.
A transition graph visualizes a Markov process by drawing an arrow from
state i to state j together with the value of pij. For the current example the transition
graph is given by Figure 10.1. The values 1 and 2 of the process are written
within the circles and the transition probabilities are written near the arrows.
With this transition matrix a sequence with values equal to 1 and 2 can actually be generated.
[Figure 10.1: transition graph with the states 1 and 2 in circles; the arrows are labeled with the transition probabilities 5/6, 1/6, 1/2, and 1/2.]
The function starts with sampling just once from the distribution with equal
probabilities pi0. It conveniently uses the column and row names of the
probability transition matrix for the sampling. The probabilities of going from
a or t to c or g are large, as are those of staying within c
or g. From the frequency table it can be observed that the majority of
residues are c or g.
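The kind of sampling function described here can be sketched as follows; the transition matrix below is illustrative (it favours moves into c and g, as in the text), not the one that produced the output above:

```python
# Sampling a sequence from a first-order Markov chain; a sketch of the
# kind of function the text describes, with an illustrative transition
# matrix P[i][j] = probability of moving from states[i] to states[j].
import random

def markov_chain(states, P, x1, n, rng=random):
    seq = [x1]
    for _ in range(n - 1):
        i = states.index(seq[-1])
        seq.append(rng.choices(states, weights=P[i])[0])
    return seq

states = ["a", "c", "g", "t"]
P = [[0.1, 0.4, 0.4, 0.1],     # from a
     [0.1, 0.4, 0.4, 0.1],     # from c
     [0.1, 0.4, 0.4, 0.1],     # from g
     [0.1, 0.4, 0.4, 0.1]]     # from t
seq = markov_chain(states, P, "a", 1000, random.Random(1))
```

Because every row puts probability 0.8 on c and g, most sampled residues are c or g, mirroring the frequency-table observation in the text.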
 *  A  C  D    F  H  I    L  M  N  P  R    S  T  V  Y
 2  1 75  2 5205 24 76 2260  1  2 19 26 2154  1 91 60
From the table it is clear that the F frequency is the largest among the gen-
erated amino acids.
these do not occur. This estimation procedure can easily be applied to DNA
sequences.
Using the R operator %*% for matrix multiplication, the product pi0 %*% P can be
computed as follows.
> P <- matrix(c(5/6,1/6,0.5,0.5),2,2,byrow=T)
> pi0 <- c(2/3,1/3)
> pi0 %*% P
[,1] [,2]
[1,] 0.7222222 0.2777778
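The same product can be written out without R's %*%; a pure-Python vector-matrix multiplication, using the transition matrix of this example:

```python
# The product pi0 %*% P from above, as a pure-Python vector-matrix
# multiplication (no NumPy needed for a 2-state chain).
def vec_mat(pi0, P):
    return [sum(pi0[i] * P[i][j] for i in range(len(pi0)))
            for j in range(len(P[0]))]

P = [[5/6, 1/6],
     [1/2, 1/2]]
pi1 = vec_mat([2/3, 1/3], P)   # -> approximately [0.7222, 0.2778]
```

Each entry of pi1 is a weighted sum of a column of P, with weights given by the initial distribution pi0.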
where the latter is element (1, 1) of the matrix4 P^2. In general, we have that
P(Xn = j | X0 = i) = (P^n)ij, which is element (i, j) of P^n.
4
For a brief definition of matrix multiplication, see Pevsner (2003, p.56) or wikipedia
using the search string wiki matrix multiplication.
We can now compute the transition probability matrix after five generations.
P <- matrix(c(1,0,0, 1/4,1/2,1/4,0,0,1),3,3,byrow=T)
V <- eigen(P,symmetric = FALSE)
> V$vec %*% diag(V$va)^(5) %*% solve(V$vec)
[,1] [,2] [,3]
[1,] 1.000000 0.00000 0.000000
[2,] 0.484375 0.03125 0.484375
[3,] 0.000000 0.00000 1.000000
        A     G     C     T
A   -0.60  0.20  0.20  0.20
Q = G 0.20 -0.60  0.20  0.20 .
C    0.20  0.20 -0.60  0.20
T    0.20  0.20  0.20 -0.60
Thus within a certain time period a proportion of 0.20 of the A residues changes into G,
0.20 into C, and 0.20 into T. Consequently, the total rate of change away
from A is 0.60, which accounts for the diagonal value -0.60. Given this rate matrix, we can find the probability
transition matrix P = exp(Q) by using the function expm(Q) from the
package Matrix.
library(Matrix)
Q <- 0.2 * Matrix(c(-3,1,1,1,1,-3,1,1,1,1,-3,1,1,1,1,-3),4)
rownames(Q) <- colnames(Q) <- c("A","G","C","T")
P <- as.matrix(expm(Q))
> round(P,2)
A G C T
A 0.59 0.14 0.14 0.14
G 0.14 0.59 0.14 0.14
C 0.14 0.14 0.59 0.14
T 0.14 0.14 0.14 0.59
Thus the probability that the state changes from A to A is 0.59, from A to
G is 0.14, etc.
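For this Q all off-diagonal rates equal alpha = 0.2, so exp(Q) also has the well-known JC69 closed form, which can serve as a cross-check of the expm(Q) computation above:

```python
# JC69 closed form of the matrix exponential for a rate matrix with all
# off-diagonal rates alpha: P_ii = 1/4 + 3/4 exp(-4 alpha t) and
# P_ij = 1/4 - 1/4 exp(-4 alpha t); a cross-check of expm(Q) above.
import math

def jc69_transition(alpha, t=1.0):
    e = math.exp(-4 * alpha * t)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[same if i == j else diff for j in range(4)] for i in range(4)]

P = jc69_transition(0.2)   # diagonal ~0.59, off-diagonal ~0.14
```

The values agree with the rounded expm(Q) output: about 0.59 on the diagonal and 0.14 elsewhere.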
            A     G     C     T
      A   -3α     α     α     α
QJC69 = G    α   -3α     α     α   .
      C     α     α   -3α     α
      T     α     α     α   -3α
The sum of each row of a rate matrix equals zero, so that from this require-
ment the diagonal elements of the JC69 model are equal to -3α. Further-
more, the non-diagonal substitution rates of the JC69 model all have the
same value α. That is, the rate of change from i to j equals that from j to i, so
that the rate matrix is symmetric. Also, the stationary probability of each of the
nucleotides is 1/4. This assumption, however, is unrealistic
in many cases.
Transitions are substitutions within a type of nucleotide,
thus purine to purine or pyrimidine to pyrimidine (A ↔ G or C ↔ T).
Transversions are substitutions between nucleotide types (A ↔ T, G ↔ T,
A ↔ C, and C ↔ G). In the JC69 model a transition is assumed to
happen with the same probability as a transversion. That is, it does not account
for the fact that transitions are more common than transversions. To cover
this, more general types of models were proposed by Kimura (1980, 1981),
which are commonly abbreviated as K80 and K81. In terms of the rate
matrix these models can be written as
            A       G       C       T
      A  -α-2β      α       β       β
QK80 = G     α   -α-2β      β       β   ,
      C      β      β    -α-2β      α
      T      β      β       α    -α-2β

              A          G          C          T
      A  -(α+β+γ)        α          β          γ
QK81 = G       α    -(α+β+γ)        γ          β   .
      C        β         γ     -(α+β+γ)        α
      T        γ         β          α     -(α+β+γ)
In the K80 model a change within type (transition) occurs at rate α and
between types (transversion) at rate β. In the K81 model all changes occur
at a different though symmetric rate; for instance, the rate of change A → G is α and
equals that of G → A. If α is large, then the number of transitions is large;
if both β and γ are very small, then the number of transversions is small.
A model is called nested if it is a special case of a more general model.
For instance, the K80 model is nested in the K81 model because when we
take β = γ, then we obtain the K80 model. Similarly, the JC69 model is
nested in the K80 model because if we take α = β, then we obtain the JC69
model.
Some examples of models with even more parameters are the Hasegawa,
Kishino, and Yano (1985) (HKY85) model and the General Time-Reversable
Model (GTR) model
              A       G       C       T
       A      ·      απG     βπC     βπT
QHKY85 = G   απA      ·      βπC     βπT  ,
       C    βπA     βπG      ·      απT
       T    βπA     βπG     απC      ·

              A       G       C       T
       A      ·      απG     βπC     γπT
QGTR =   G   απA      ·      δπC     επT  ,
       C    βπA     δπG      ·      ζπT
       T    γπA     επG     ζπC      ·

where πA, πG, πC, and πT are the equilibrium base frequencies and each
diagonal element (written as a dot) equals minus the sum of the remaining
elements in its row.
The distance between DNA sequences is defined on the basis of these models.
From these distances the phylogenetic tree is computed by a neighbor-joining
algorithm such that it has the smallest total branch length.
Example 2. The K81 model. To compute the rate matrix of the K81
model with α = 3/6, β = 2/6, γ = 1/6 we may use the following.
alpha <- 3/6; beta <- 2/6; gamma<- 1/6; Q <- matrix(data=NA,4,4)
Q[1,2] <- Q[2,1] <- Q[3,4] <- Q[4,3] <- alpha
Q[1,3] <- Q[3,1] <- Q[2,4] <- Q[4,2] <- beta
Q[1,4] <- Q[4,1] <- Q[2,3] <- Q[3,2] <- gamma
diag(Q) <- -(alpha + beta + gamma)
library(Matrix)
alpha <- 1/5; Q <- matrix(rep(alpha,16),4,4)
diag(Q) <- -3 * alpha
Q <- Matrix(Q)
P <- as.matrix(expm(Q))
V <- eigen(P,symmetric = FALSE)
> V$vec %*% diag(V$va)^(50) %*% solve(V$vec)
[,1] [,2] [,3] [,4]
[1,] 0.25 0.25 0.25 0.25
[2,] 0.25 0.25 0.25 0.25
[3,] 0.25 0.25 0.25 0.25
[4,] 0.25 0.25 0.25 0.25
Hence, the stationary distribution is π = (0.25, 0.25, 0.25, 0.25) (cf. Ewens
& Grant, 2005, p. 477).
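The stationary distribution can also be approximated by power iteration, repeatedly applying P to any starting distribution; a sketch below, using the JC69 closed form for the transition matrix rather than expm(Q):

```python
# Power iteration for the stationary distribution: repeatedly apply P
# to a starting distribution. P is the JC69 transition matrix with
# alpha = 1/5, built from its closed form as a stand-in for expm(Q).
import math

def step(pi, P):
    return [sum(pi[i] * P[i][j] for i in range(len(pi)))
            for j in range(len(P[0]))]

def stationary(P, iters=200):
    pi = [1.0] + [0.0] * (len(P) - 1)   # start in the first state
    for _ in range(iters):
        pi = step(pi, P)
    return pi

e = math.exp(-4 * (1/5))
P = [[0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e for j in range(4)]
     for i in range(4)]
pi = stationary(P)   # -> approximately [0.25, 0.25, 0.25, 0.25]
```

Convergence is fast because the second-largest eigenvalue of P (here exp(-0.8) ≈ 0.45) is well below one.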
d = -(3/4) log(1 - 4p/3),
where p is the proportion of different nucleotides of the two sequences. The
pairwise distances between DNA sequences can be computed by the function
dist.dna from the ape package.
> library(ape);library(seqinr)
> accnr <- paste("AJ5345",26:27,sep="")
> seqbin <- read.GenBank(accnr, species.names = TRUE, as.character = FALSE)
> dist.dna(seqbin, model = "JC69")
AJ534526
AJ534527 0.1326839
Hence, the distance is 0.133. Over a total of 1143 nucleotides there are 139
differences, so that the proportion of different nucleotides is p = 139/1143.
Inserting this into the previous distance formula gives the distance. This can
be verified as follows.
> -3/4 * log(1 - 4 * (139/1143)/3)
[1] 0.1326836
library(ape);library(seqinr)
accnr <- paste("AJ5345",26:35,sep="")
seq <- read.GenBank(accnr)
names(seq) <- attr(seq, "species")
dist <- dist.dna(seq, model = "K80")
plot(nj(dist))
Obviously, in this manner various trees can be computed and their plots
compared.
When various different models are defined, the question arises
which of these fits best to the data relative to the number of parameters
(symbols) of the model. When the models are estimated by maximum likeli-
hood, then the Akaike information criterion, AIC = -2*loglik + 2*(number
of free parameters), can be used to select models. The best model is the one
with the smallest AIC value.
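The AIC computation itself is a one-liner; as a sketch, the JC69 row of the phymltest table below can be checked by hand (values taken from the output, up to rounding):

```python
# AIC = -2 * loglik + 2 * (number of free parameters), applied to two
# rows of the phymltest-style table as a hand check.
def aic(loglik, n_free_params):
    return -2.0 * loglik + 2.0 * n_free_params

value = aic(-4605.966, 1)   # JC69 row: ~9213.93, matching the table up to rounding
```

Applying the same formula to the other rows reproduces the AIC column of the table.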
> setwd("/share/home/wim/bin")
> write.dna(seq,"seq.txt", format ="interleaved")
> out <-phymltest("seq.txt",format = "interleaved", execname ="phyml_linux")
> print(out)
nb.free.para loglik AIC
JC69 1 -4605.966 9213.931
JC69+I 2 -4425.602 8855.203
JC69+G 2 -4421.304 8846.608
JC69+I+G 3 -4421.000 8848.001
K80 2 -4423.727 8851.455
K80+I 3 -4230.539 8467.079
K80+G 3 -4224.457 8454.915
K80+I+G 4 -4223.136 8454.272
F81 4 -4514.331 9036.662
tr <- read.tree("seq.txt_phyml_tree.txt")
plot(tr[[27]])
add.scale.bar(length=0.01)
Thus the probability is 0.95 that the die is fair at time point i, given that
it is fair at time point i - 1. The probability that it will switch from fair to
unfair is 0.05. The probability that it will switch from loaded to fair is 0.10
and that it stays loaded is 0.90. With this transition matrix we can generate a
sequence of hidden states, where the values 1 and 2 indicate whether the die
is fair (1) or loaded (2). Given the fairness of the die we define the emission
matrix.
E = ( P(Oi = 1|Di = 1) · · · P(Oi = 6|Di = 1) )
    ( P(Oi = 1|Di = 2) · · · P(Oi = 6|Di = 2) )

  = ( 1/6  1/6  1/6  1/6  1/6  1/6 )
    ( 1/10 1/10 1/10 1/10 1/10 1/2 ).          (10.4)
Thus given that the die is fair, the probability of any outcome equals 1/6.
However, given that the die is unfair (loaded), the probability of outcome 6
equals 1/2 and that of any other outcome equals 1/10.
The HMM with this transition and emission matrix can be programmed.
The hidden states are sampled from a Markov chain, and the outcomes of
the die are then sampled according to the value of the hidden state (die type).
hmmdat <- function(A,E,n){
observationset <- 1:6; hiddenset <- 1:2
x <- h <- double(n)
h[1]<-1; x[1]<-sample(observationset,1,replace=T,E[h[1],])
h <- markov(hiddenset,A,n)
for(k in 1:(n-1)){x[k+1] <- sample(observationset,1,replace=T,E[h[k],])}
out <- matrix(c(x,h),nrow=n,ncol=2,byrow=FALSE)
return(out)
}
E <- matrix(c(rep(1/6,6),rep(1/10,5),1/2),2,6,byrow=T) #emission matrix
A <- matrix(c(0.95,0.05,0.1,0.9),2,2,byrow=TRUE) #transition matrix
dat <- hmmdat(A,E,100)
colnames(dat) <- c("observation","hidden_state")
rownames(dat) <- 1:100
> t(dat)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
observations 5 2 3 1 6 1 3 1 1 5 6 6 2 2 3 5 4 6 1 2 4 4 3 2 3
hidden_states 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
observations 4 3 2 4 1 6 6 6 6 6 5 5 3 6 1 6 5 2 4 1 4 2
hidden_states 1 1 1 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1
48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69
observations 5 6 5 2 3 3 1 3 3 5 6 6 2 4 5 4 6 1 6 5 2 6
hidden_states 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91
observations 1 1 4 4 1 5 6 4 3 5 4 2 6 1 3 6 5 2 2 6 6 1
hidden_states 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 1 1
92 93 94 95 96 97 98 99 100
observations 4 1 6 5 5 6 5 3 4
hidden_states 1 1 1 1 1 1 1 1 1
For each row of the matrix the maximum is taken as the best predictor of
the hidden state.
predicted state 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87
observation 5 2 6 1 1 4 4 1 5 6 4 3 5 4 2 6 1 3 6 5 2
hidden_state 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2
predicted state 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
88 89 90 91 92 93 94 95 96 97 98 99 100
observation 2 6 6 1 4 1 6 5 5 6 5 3 4
hidden_state 2 2 1 1 1 1 1 1 1 1 1 1 1
predicted state 1 1 1 1 1 1 1 1 1 1 1 1 1
The misclassification rate is 0.27, which is quite large given the fact that we
used the true transition and emission matrix. An important observation is
that after a transition of a hidden state, it takes a few values for the prediction
to change. This is caused by the recursive nature of the algorithm.
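The prediction step can be sketched as posterior decoding by the forward-backward recursions; the Python version below is an illustration under the transition and emission matrices of this example, with the initial distribution assumed uniform (the book's own code may differ in detail):

```python
# Posterior decoding for the fair/loaded die HMM by forward-backward;
# A and E are the matrices of this example, while `start` is an
# assumption (taken uniform here, not specified in the text).
A = [[0.95, 0.05], [0.10, 0.90]]          # hidden-state transitions
E = [[1/6] * 6, [0.1] * 5 + [0.5]]        # emission probabilities
start = [0.5, 0.5]                         # assumed initial distribution

def posterior_states(obs):
    """Most probable hidden state (1 = fair, 2 = loaded) per position."""
    S = range(2)
    fwd = [[start[s] * E[s][obs[0] - 1] for s in S]]
    for o in obs[1:]:
        fwd.append([sum(fwd[-1][r] * A[r][s] for r in S) * E[s][o - 1]
                    for s in S])
    bwd = [[1.0, 1.0]]
    for o in reversed(obs[1:]):
        bwd.insert(0, [sum(A[s][r] * E[r][o - 1] * bwd[0][r] for r in S)
                       for s in S])
    # per position, pick the state with the larger posterior f(s)*b(s)
    return [1 if f[0] * b[0] >= f[1] * b[1] else 2
            for f, b in zip(fwd, bwd)]

states = posterior_states([6, 6, 6, 6, 1, 2, 3, 4])
```

A run of sixes pulls the posterior toward the loaded state, and only after several ordinary outcomes does the prediction return to fair, which is exactly the lag discussed above.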
10.7 Appendix
The probability that the process is in State 1 at time point 1 can be computed
as follows.
10.9 Exercises
1. Visualize by a transition graph the following transition matrices. For
the process with four states take the names of the nucleotides in the
order A, G, T, and C.
[Four probability transition matrices, with entries in fourths, sixths, sevenths, and eighths; the entries are too garbled in this copy to reproduce.]
(a) Download the sequences AJ534526 and AJ534527. Hint: Use
as.character = TRUE in the read.GenBank function.
(b) Compute the proportion of different nucleotides.
(c) Use this proportion to verify the distances between these sequences
according to the JC69 model.
[Figure: AIC values of the fitted models, decreasing from about 9200 for JC69 and F81 to about 8200 for GTR + I + Γ.]
[Figure: phylogenetic tree of the Sylvia species: S. crassirostris, S. hortensis, S. leucomelaena, S. lugens, S. buryi, S. boehmi, S. subcaeruleum, S. layardi, and S. nisoria.]
Answers to exercises
2. gendat
(a) apply(gendat,2,sd).
(b) apply(gendat,1,sd).
(c) To order the data frame according to the gene standard deviations.
sdexprsval <- apply(gendat,1,sd)
o <- order(sdexprsval,decreasing=TRUE)
gendat[o,]
(d) gene1
(a) The standard deviation per gene can be computed by sdgol <-
apply(golub,1,sd).
(b) The gene with standard deviation larger than 0.5 can be selected
by golubsd <- golub[sdgol>0.5,].
(c) sum(sdgol>0.5) gives that the number of genes having sd larger
than 0.5 is 1498.
6. Constructing a factor.
(a) gl(2,4).
(b) gl(5,3).
(c) gl(3,5).
library(ALL); data(ALL)
meanB1 <- apply(exprs(ALL[,ALL$BT=="B1"]),1, mean)
o <- order(meanB1,decreasing=TRUE)
> meanB1[o[1:3]]
AFFX-hum_alu_at 31962_at 31957_r_at
13.41648 13.16671 13.15995
3. Effect size.
(a) The size 11 is large, because the mean is eleven times larger than
the standard deviation.
data(golub, package="multtest")
gol.fac <- factor(golub.cl,levels=0:1, labels= c("ALL","AML"))
efs <- apply(golub[,gol.fac=="ALL"],1,function(x) mean(x)/sd(x))
o <- order(efs,decreasing=TRUE)
> efs[o[1:5]]
[1] 11.138128 10.638308 9.155108 8.954115 8.695353
> golub.gnames[o[1:5],2]
[1] "YWHAZ Tyrosine 3-monooxygenase/tryptophan 5-monooxygenase activa
[2] "ZNF91 Zinc finger protein 91 (HPF7, HTF10)"
[3] "HnRNP-E2 mRNA"
[4] "54 kDa protein mRNA"
[5] "Immunophilin homolog ARA9 mRNA"
(b) The robust variant can be defined by dividing the median by the
MAD. An alternative would be to divide the median by the IQR.
This gives other best genes, indicating that some genes may
have outliers that influence the outcome.
refs <- apply(golub[,gol.fac=="ALL"],1,function(x) median(x)/mad(x))
o <- order(refs,decreasing=TRUE)
> refs[o[1:5]]
[1] 14.51217 13.57425 13.27698 13.14419 12.91608
> golub.gnames[o[1:5],2]
[1] "COX6B gene (COXG) extracted from Human DNA from overlapping chromosome 1
F25451, and R31076 containing COX6B and UPKA, genomic sequence"
[2] "AFFX-HSAC07/X00351_M_at (endogenous control)"
[3] "ATP5A1 ATP synthase, H+ transporting, mitochondrial F1 complex, alpha su
isoform 1, cardiac muscle"
[4] "ATP SYNTHASE GAMMA CHAIN, MITOCHONDRIAL PRECURSOR"
[5] "YWHAZ Tyrosine 3-monooxygenase/tryptophan 5-monooxygenase activation pro
zeta polypeptide"
locator()  # can be used to find coordinates in the plot
x11()
data(golub, package = "multtest")
gol.fac <- factor(golub.cl,levels=0:1, labels= c("ALL","AML"))
x <- golub[,1]  # expression values of the first patient
boxplot(x,xlim=c(0,4))
arrows(2.0,1.93,1.24,1.93);text(2.5,1.93,"Median")
arrows(2.0,1.1,1.24,1.1) ;text(2.5,1.1,"Outlier")
arrows(2.0,1.79,1.24,1.79);text(2.5,1.79,"first quartile")
arrows(2.0,2.17,1.24,2.17);text(2.5,2.17,"third quartile")
arrows(2.0,1.27,1.24,1.27);text(2.5,1.27,"lower whisker")
arrows(2.0,2.59,1.24,2.59);text(2.5,2.59,"upper whisker")
dev.copy2eps(device=x11,file="BoxplotWithExplanation.eps")
boxplot.stats(x, coef = 1.5, do.conf = TRUE, do.out = TRUE) #finds value
(a) The medians are all around zero, the interquartile ranges differ
only slightly, and the minimal values are all around minus 1.5. All
persons have outliers near three.
(b) The means are very close to zero. The medians are all between
-0.15383 and 0.06922, so these are also close to zero.
personmean <- apply(golub,2,mean)
personmedian <- apply(golub,2,median)
(c) The data seem preprocessed to have standard deviation equal to
one. The re-scaled IQR and MAD have slightly larger range.
> range(apply(golub,2,sd))
[1] 0.9999988 1.0000011
> range(apply(golub,2,function(x) IQR(x)/1.349))
[1] 0.9599036 1.3361527
> range(apply(golub,2,mad))
[1] 0.9590346 1.2420185
(a) Note that we need the transpose operator t to change rows into
columns. The script below will do.
data(golub, package="multtest")
gol.fac <- factor(golub.cl,levels=0:1, labels= c("ALL","AML"))
rowindex <- agrep("^oncogene",golub.gnames[,2])
oncogol <- golub[rowindex,]
oncogolub.gnames <- golub.gnames[rowindex,]
225
8. Descriptive statistics for the ALL gene expression values of the Golub
et al. (1999) data.
(a) The ranges indicate strong difference in means. The range of the
mean and of the median are similar. The bulk of the data seems
symmetric.
> range(apply(golub[,gol.fac=="ALL"],1,mean))
[1] -1.330984 3.278551
> range(apply(golub[,gol.fac=="ALL"],1,median))
[1] -1.36832 3.35455
(b) The range of the standard deviation is somewhat smaller than that
of the re-scaled IQR and MAD.
> range(apply(golub[,gol.fac=="ALL"],1,sd))
[1] 0.1336206 2.0381309
> range(apply(golub[,gol.fac=="ALL"],1,function(x) IQR(x)/1.349))
[1] 0.1153113 2.7769867
> range(apply(golub[,gol.fac=="ALL"],1,mad))
1. Binomial
2. Standard Normal.
3. Normal.
4. T-distribution.
5. F distribution.
6. Chi-squared distribution.
7. MicroRNA.
8. Zyxin.
(a) The three largest t-values 57.8, 55.2, and 47.5 are extremely large.
data(golub, package="multtest")
gol.fac <- factor(golub.cl,levels=0:1, labels= c("ALL","AML"))
tval <- apply(golub[,gol.fac=="ALL"],1,function(x) sqrt(27) * mean(x)/sd(x))
o <- order(tval,decreasing=TRUE)
tval[o[1:3]]
golub.gnames[o[1:3],2]
(b) The script below gives 2185 ratios between 0.5 and 1.5.
sdall <- apply(golub[,gol.fac=="ALL"],1, sd)
sdaml <- apply(golub[,gol.fac=="AML"],1, sd)
sdratio <- sdall/sdaml
sum( sdratio > 0.5 & sdratio < 1.5)
10. Extreme value investigation. The blue line (extreme value) fits to the
black line (density of generated extreme data) much better than the
red line (normal distribution).
library(multtest);data(golub)
gol.fac <- factor(golub.cl,levels=0:1, labels= c("ALL","AML"))
shapiro.test(golub[i,gol.fac=="ALL"])
gives p-value = 0.592, and changing ALL into AML gives p-value
= 0.2583. Hence, normality is accepted for both.
(b) var.test(golub[i,] ~ gol.fac) gives p-value = 0.1095 so equal-
ity of variances is accepted.
(c) t.test(golub[i,] ~ gol.fac, var.equal = TRUE) gives p-value = 1.773e-09,
so equality of means is rejected.
(d) Yes, t = -7.9813 is quite extreme.
4. Zyxin.
5. Gene selection.
6. Antigenes.
library(multtest); data(golub)
gol.fac <- factor(golub.cl,levels=0:1, labels= c("ALL","AML"))
pt <- apply(golub, 1, function(x) t.test(x ~ gol.fac)$p.value)
index <-agrep("^antigen",golub.gnames[,2])
golub.index<-golub[index,]
pt.index<-pt[index]
golub.gnames.index<-golub.gnames[index,]
golub.gnames.index[order(pt.index)[1:length(index)],2]
7. Genetic Model. From the output below the null hypothesis that the
probabilities are as specified is accepted.
> chisq.test(x=c(930,330,290,90),p=c(9/16,3/16,3/16,1/16))
9. Normality tests.
library(multtest);data(golub)
gol.fac <- factor(golub.cl,levels=0:1, labels= c("ALL","AML"))
allsh <- apply(golub[,gol.fac=="ALL"], 1, function(x) shapiro.test(x)$p.value)
amlsh <- apply(golub[,gol.fac=="AML"], 1, function(x) shapiro.test(x)$p.value)
> 100 * sum(allsh>0.05)/length(allsh)
[1] 58.27598
> 100 * sum(amlsh>0.05)/length(amlsh)
[1] 78.5644
> 100 * sum(allsh>0.05 & amlsh>0.05)/length(allsh)
10. Two-sample tests on gene expression values of the Golub et al. (1999)
data.
(a) data(golub,package="multtest")
gol.fac <- factor(golub.cl,levels=0:1, labels= c("ALL","AML"))
x <- golub[1042,gol.fac=="ALL"]
n <- length(x)
y <- golub[1042,gol.fac=="AML"]
m <- length(y)
t <- (mean(x)-mean(y))/sqrt(var(x)/n + var(y)/m)
v <- (var(x)/n + var(y)/m)^2/( (var(x)/n)^2/(n-1) + (var(y)/m)^2/(m-1) )
2*pt(-abs(t),v)
mean(x) - mean(y) + qt(0.025,v)* sqrt(var(x)/n + var(y)/m)
mean(x) - mean(y) + qt(0.975,v)* sqrt(var(x)/n + var(y)/m)
library(ALL); data(ALL)
ALLB <- ALL[,ALL$BT %in% c("B","B1","B2","B3","B4")]
> table(ALLB$BT)
B B1 B2 B3 B4 T T1 T2 T3 T4
5 19 36 23 12 0 0 0 0 0
There are 124 significant gene expressions from ANOVA which are not
significant on Kruskal-Wallis. There are only 38 significant gene ex-
pressions from Kruskal-Wallis which are non-significant according to
ANOVA. The tests agree on the majority of gene expressions.
3. Finding the ten best genes among gene expressions of B-cell ALL
patients.
> sort(panova)[1:10]
1914_at 1389_at 38555_at 33358_at 40268_at 3971
1.466523e-14 5.891702e-14 4.873245e-10 1.117406e-09 1.145502e-09 4.748615
40763_at 37809_at 36873_at 1866_g_at
5.256410e-09 2.155457e-08 2.402379e-08 3.997065e-08
> sort(pkw)[1:10]
1389_at 40268_at 38555_at 1866_g_at 40155_at 191
2.348192e-09 7.764046e-08 1.123068e-07 2.335279e-07 6.595926e-07 1.074525
1125_s_at 40662_g_at 38032_at 40661_at
1.346907e-06 1.384281e-06 1.475170e-06 1.719456e-06
> intersect(npanova,npkw)
[1] "1914_at" "1389_at" "38555_at" "40268_at" "1866_g_at"
Thus the number of true positives from ANOVA is 3757 and the num-
ber of false negatives is 6243. For the Kruskal-Wallis test there are
1143 true positives and 8857 false negatives. This can be improved by
increasing the number of gene expressions per group.
library("genefilter")
data(ALL, package = "ALL")
ALLB <- ALL[,ALL$BT %in% c("B1","B2","B3","B4")]
f1 <- function(x) (shapiro.test(x)$p.value > 0.05)
sel1 <- genefilter(exprs(ALL[,ALLB$BT=="B1"]), filterfun(f1))
sel2 <- genefilter(exprs(ALL[,ALLB$BT=="B2"]), filterfun(f1))
sel3 <- genefilter(exprs(ALL[,ALLB$BT=="B3"]), filterfun(f1))
sel4 <- genefilter(exprs(ALL[,ALLB$BT=="B4"]), filterfun(f1))
selected <- sel1 & sel2 & sel3 & sel4
library(limma)
x <- matrix(as.integer(c(sel2,sel3,sel4)),ncol = 3,byrow=FALSE)
colnames(x) <- c("sel2","sel3","sel4")
vc <- vennCounts(x, include="both")
vennDiagram(vc)
137 pass filter 2 but not the others
510 pass filters 2 and 3 but not 4
1019 pass filters 2 and 4 but not 3
5598 pass filters 2, 3 and 4, etc.
library("ALL"); library("limma");library("annaffy");library(hgu95av2.db)
data(ALL)
ALLB <- ALL[,ALL$BT %in% c("B1","B2","B3","B4")]
design.ma <- model.matrix(~0 + factor(ALLB$BT))
colnames(design.ma) <- c("B1","B2","B3","B4")
cont.ma <- makeContrasts(B2-B1,B3-B2,B4-B3,levels=factor(ALLB$BT))
fit <- lmFit(ALLB, design.ma)
fit1 <- contrasts.fit(fit, cont.ma)
fit1 <- eBayes(fit1)
topTable(fit1, coef=2,5,adjust.method="fdr")
tab <- topTable(fit1, coef=2, number=20, adjust.method="fdr")
anntable <- aafTableAnn(as.character(tab$ID), "hgu95av2", aaf.handler())
saveHTML(anntable, "ALLB1234.html", title = "B-cell ALL of stage 1,2,3,4")
library(ALL); data(ALL)
table(pData(ALL)$remission)
remis <- which(pData(ALL)$remission %in% c("CR","REF"))
ALLrem <- ALL[,remis]
remfac <-factor(pData(ALLrem)$remission)
pano <- apply(exprs(ALLrem),1,function(x) t.test(x ~ remfac)$p.value)
sum(pano<0.001)
> sum(pano<0.001)
[1] 45
library(hgu95av2.db)
names <- featureNames(ALLrem)[pano<.001]
ALLremsel <- ALLrem[names,]
# reconstructed step: map the selected probes to gene names
unlistednames <- unlist(mget(names, env = hgu95av2GENENAME))
> grep("p53",unlistednames)
[1] 12 21
> length(unique(unlistednames))
[1] 36
5. Remission achieved.
library(ALL); data(ALL)
ALLCRREF <- ALL[,which(ALL$CR %in% c("CR","REF"))]
pano <- apply(exprs(ALLCRREF),1,function(x) t.test(x ~ ALLCRREF$CR)$p.value)
> sum(pano<0.0001)
[1] 11
> featureNames(ALLCRREF)[pano<.0001]
[1] "1472_g_at" "1473_s_at" "1475_s_at" "1863_s_at" "34098_f_at" "36574_at"
library("hgu95av2.db")
affynames <- featureNames(ALLCRREF)[pano<.0001]
genenames <- mget(affynames, env = hgu95av2GENENAME)
APPENDIX A. ANSWERS TO EXERCISES
> grep("oncogene",genenames)
[1] 1 2 3
data: dat
p-value = 0.002047
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
2.562237 54.915642
sample estimates:
odds ratio
14.39959
library("ALL")
data("ALL")
table(ALL$BT)
ALLT23 <- ALL[,which(ALL$BT %in% c("T2","T3"))]
library(genefilter)
f1 <- function(x) (shapiro.test(x)$p.value > 0.05)
f2 <- function(x) (t.test(x ~ ALLT23$BT)$p.value < 0.05)
sel1 <- genefilter(exprs(ALLT23[,ALLT23$BT=="T2"]), filterfun(f1))
sel2 <- genefilter(exprs(ALLT23[,ALLT23$BT=="T3"]), filterfun(f1))
sel3 <- genefilter(exprs(ALLT23), filterfun(f2))
> sum(sel1 & sel2 & sel3)
[1] 905
> sum(sel1 & sel2)
[1] 9388
> sum(sel3)
[1] 1204
library("ALL")
library("limma");
allB <- ALL[,which(ALL$BT %in% c("B1","B2","B3","B4"))]
facB123 <- factor(allB$BT)
cont.ma <- makeContrasts(B2-B1,B3-B2,B4-B3, levels=facB123)
design.ma <- model.matrix(~ 0 + facB123)
colnames(design.ma) <- c("B1","B2","B3","B4")
fit <- lmFit(allB, design.ma)
fit1 <- contrasts.fit(fit, cont.ma)
fit1 <- eBayes(fit1)
> topTable(fit1, coef=2,5,adjust.method="BH")
ID logFC AveExpr t P.Value adj.P.Val B
6048 35991_at 0.5964481 4.144598 6.624128 2.578836e-09 0.0000325578 10.842989
3909 33873_at 0.5707770 7.217570 6.083524 2.891823e-08 0.0001825464 8.625253
5668 35614_at 1.7248509 5.663477 5.961231 4.946078e-08 0.0002081474 8.132884
6776 36711_at -2.3664712 7.576108 -5.759565 1.187487e-07 0.0003054110 7.329631
7978 37902_at 0.8470235 4.258491 5.742783 1.276579e-07 0.0003054110 7.263298
> sum(fit1$p.value<0.05)
[1] 4328
pval486[pval486>1]<-1
> symb
[1] "GADD45A" "DUSP4" "OAS1" "STAT1" "STAT1" "AKR1C3" "PSMB9"
[16] "TKT" "NFKB1" "SLC7A5" "CXCL2" "DLG5"
library("KEGG");library("GO");library("annaffy")
atab <- aafTableAnn(genenames, "hgu95av2", aaf.handler() )
saveHTML(atab, file="ThreeExperiments.html")
# p53 plays a role.
library(ALL)
data(ALL,package="ALL")
> sum(panova[probes]<0.05)
[1] 86
> sum(panova[probes]<1)
[1] 320
> sum(panova<0.05)
[1] 2581
> sum(panova<1)
[1] 12625
the odds ratio differs significantly from one; relatively more of the
oncogene probes have significant p-values than the other probes.
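A sketch of the corresponding test, building the 2x2 table from the counts above (86 of 320 oncogene probes significant versus 2495 of the remaining 12,305 probes):

```r
# rows: oncogene probes / other probes; columns: significant / not significant
dat <- matrix(c(86, 320 - 86, 2581 - 86, 12625 - 320 - (2581 - 86)),
              nrow = 2, byrow = TRUE)
fisher.test(dat)   # tests whether the odds ratio equals one
```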
data(golub, package="multtest")
data <- data.frame(golub[2124,])
gol.fac <- factor(golub.cl,levels=0:1, labels= c("ALL","AML"))
stripchart(golub[2124,]~gol.fac, pch=as.numeric(gol.fac))
plot(hclust(dist(clusdata,method="euclidean"),method="single"))
table(cl$cluster,gol.fac)
# reconstructed: initial cluster centers from the two group means
initial <- matrix(tapply(golub[2124,], gol.fac, mean), nrow = 2, ncol = 1)
n <- nrow(data); nboot <- 1000
boot.cl <- matrix(0, nrow = nboot, ncol = 2)
for (i in 1:nboot){
dat.star <- data[sample(1:n, replace = TRUE), , drop = FALSE]
cl <- kmeans(dat.star, initial, nstart = 10)
boot.cl[i,] <- c(cl$centers[1,], cl$centers[2,])
}
> quantile(boot.cl[,1],c(0.025,0.975))
2.5% 97.5%
-1.07569310 -0.03344292
> quantile(boot.cl[,2],c(0.025,0.975))
2.5% 97.5%
0.731493 1.784468
3. MCM3.
library(multtest);data(golub);
gol.fac <- factor(golub.cl,levels=0:1, labels= c("ALL","AML"))
o1 <- grep("oncogene",golub.gnames[,2])
plot(hclust(dist(golub[o1,],method="euclidean"),method="single"))
o2 <- grep("antigen",golub.gnames[,2])
plot(hclust(dist(golub[o2,],method="euclidean"),method="single"))
o3 <- grep("receptor",golub.gnames[,2])
plot(hclust(dist(golub[o3,],method="euclidean"),method="single"))
library(ALL); data(ALL)
ALLB <- ALL[,ALL$BT %in% c("B1","B2","B3")]
panova <- apply(exprs(ALLB), 1, function(x) anova(lm(x ~ ALLB$BT))$Pr[1])
ALLBsp <- ALLB[panova<0.001,]
> dim(exprs(ALLBsp))
[1] 499 78
> min(cor(exprs(ALLBsp)))
[1] 0.5805595
> eigen(cor(exprs(ALLBsp)))$values[1:5]
[1] 65.2016203 2.9652965 2.4781567 0.7556439 0.6040647
biplot(princomp(data,cor=TRUE),pc.biplot=T,cex=0.5,expand=0.8)
eigen(matrix(c(1,-0.8,-0.8,1),nrow=2))
eigen(matrix(c(1,0.8,0.8,0.8,1,0.8,0.8,0.8,1),nrow=3))
eigen(matrix(c(1,-0.5,-0.5,-0.5,1,-0.5,-0.5,-0.5,1),nrow=3))
> 2.6/3 * 100
[1] 86.66667
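The 86.7% is the first eigenvalue expressed as a percentage of the total variance (the trace of the correlation matrix), which can be checked directly:

```r
R <- matrix(c(1,0.8,0.8, 0.8,1,0.8, 0.8,0.8,1), nrow = 3)
ev <- eigen(R)$values     # 2.6, 0.2, 0.2
ev[1] / sum(ev) * 100     # 86.66667: variance explained by the first component
```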
> eigen(matrix(c(1,0.8,0.8,0.8,1,0.8,0.8,0.8,1),nrow=3))$vectors
[,1] [,2] [,3]
[1,] -0.5773503 0.8164966 0.0000000
[2,] -0.5773503 -0.4082483 -0.7071068
[3,] -0.5773503 -0.4082483 0.7071068
library(multtest);data(golub);
gol.fac <- factor(golub.cl,levels=0:1, labels= c("ALL","AML"))
maxgol <- apply(golub[,gol.fac=="ALL"], 1, function(x) max(x))
mingol <- apply(golub[,gol.fac=="AML"], 1, function(x) min(x))
sum(maxgol < mingol)
> which.min(maxgol - mingol)
[1] 2124
> golub.gnames[2124,]
[1] "4847" "Zyxin" "X95735_at"
> boxplot(golub[2124,] ~gol.fac)
> grep("Gdf5",golub.gnames[,2])
[1] 2058
(a) library(multtest);library(ROCR);data(golub)
golub.clchanged <- -golub.cl +1
pred <- prediction(golub[1042,], golub.clchanged)
perf <- performance(pred, "sens", "spec")
plot(perf)
(b) The function is essentially the same.
(c) Use auc as before.
library(rpart)
predictors <- matrix(rnorm(100*4,0,1),100,4)
colnames(predictors) <- letters[1:4]
groups <- gl(2,50)
simdata <- data.frame(groups,predictors)
rp<-rpart(groups ~ a + b + c + d,method="class",data=simdata)
predicted <- predict(rp,type="class")
table(predicted,groups)
plot(rp, branch=0,margin=0.1); text(rp, digits=3, use.n=TRUE)
> table(predicted,groups)
groups
predicted 1 2
1 41 12
2 9 38
library(e1071)
svmest <- svm(predictors, groups, data = simdata, type = "C-classification", kernel = "linear")
svmpred <- predict(svmest, predictors, probability=TRUE)
> table(svmpred, groups)
groups
svmpred 1 2
1 31 25
2 19 25
library(nnet)
nnest <- nnet(groups ~ ., data = simdata, size = 5, maxit = 500, decay = 0.01, MaxNWts = 1000)
pred <- predict(nnest, type = "class")
> table(pred, groups) # prints confusion matrix
groups
pred 1 2
1 45 10
2 5 40
The misclassification rate of rpart, svm, and nnet is, respectively, 21/100,
44/100, and 15/100. If we increase the number of predictors, then the
misclassification rate decreases.
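Each misclassification rate is the off-diagonal total of the corresponding confusion table; for instance, for the rpart table above:

```r
tab <- matrix(c(41, 9, 12, 38), nrow = 2)   # rpart confusion table
1 - sum(diag(tab)) / sum(tab)               # (12 + 9)/100 = 0.21
```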
rpart.pred CR REF
CR 93 1
REF 6 14
> 7/(93+1+6+14)
[1] 0.06140351
> mget(c("1840_g_at","36769_at","1472_g_at","854_at"), env = hgu95av2GENENAME)
$1840_g_at
[1] NA
$36769_at
[1] "retinoblastoma binding protein 5"
$1472_g_at
[1] "v-myb myeloblastosis viral oncogene homolog (avian)"
$854_at
[1] "B lymphoid tyrosine kinase"
library(seqinr); choosebank("genbank")
query("ccnd3hs","sp=homo sapiens AND k=ccnd3@")
ccnd3 <- sapply(ccnd3hs$req, getSequence)
x1 <- DNAStringSet(c2s(ccnd3[[1]]))
write.XStringSet(x1, file="ccnd3.fa", format="fasta", width=80)
2. Dotplot of sequences.
par(mfrow=c(1,2))
dotPlot(seq1, seq2, main = "Dot plot of different random sequences\nwsize = 1, wstep = 1, nmatch = 1")
dotPlot(seq1, seq1, main = "Dot plot of equal random sequences\nwsize = 1, wstep = 1, nmatch = 1")
par(mfrow=c(1,1))
par(mfrow=c(1,2))
dotPlot(seq1, seq2, main = "Dot plot of different random sequences\nwsize = 3, wstep = 3, nmatch = 3")
dotPlot(seq1, seq1, main = "Dot plot of equal random sequences\nwsize = 3, wstep = 3, nmatch = 3")
par(mfrow=c(1,1))
par(mfrow=c(1,2))
dotPlot(seq1, seq1, main = "Dot plot of equal random sequences\nwsize = 3, wstep = 3, nmatch = 1")
x <- c("RPLWVAPDGHIFLEAFSPVYK")
y <- c("RPLWVAPDGHIFLEAFSPVYK")
z <- c("PLWISPSDGRIILESFSPLAE")
library(seqinr); choosebank("genbank")
query("ccnd3hs","sp=homo sapiens AND k=ccnd3@")
ccnd3 <- sapply(ccnd3hs$req, getSequence)
sapply(ccnd3hs$req, getName)
ccnd3prot <- sapply(ccnd3hs$req, getTrans)
dotPlot(ccnd3prot[[1]], s2c("EEEVFPLAMN"), main = "Dot plot of two proteins")
dotPlot(ccnd3prot[[7]], ccnd3prot[[8]], main = "Dot plot of two proteins")
dotPlot(s2c(x), s2c(z), main = "Dot plot of two proteins\nwsize = 1, wstep = 1, nmatch = 1")
3. Local alignment.
library(seqinr);library(Biostrings);data(BLOSUM50)
x <- s2c("HEAGAWGHEE"); y <- s2c("PAWHEAE")
s <- BLOSUM50[y,x]; d <- 8
F <- matrix(data=NA,nrow=(length(y)+1),ncol=(length(x)+1))
F[1,] <- 0 ; F[,1] <- 0
rownames(F) <- c("",y); colnames(F) <- c("",x)
for (i in 2:(nrow(F)))
for (j in 2:(ncol(F)))
{F[i,j] <- max(c(0,F[i-1,j-1]+s[i-1,j-1],F[i-1,j]-d,F[i,j-1]-d))}
> max(F)
[1] 28
library(seqinr);library(Biostrings);data(BLOSUM50)
randallscore <- c(1,1)
for (i in 1:1000) {
x <- c2s(sample(rownames(BLOSUM50),7, replace=TRUE))
y <- c2s(sample(rownames(BLOSUM50),10, replace=TRUE))
randallscore[i] <- pairwiseAlignment(AAString(x), AAString(y),
    substitutionMatrix = "BLOSUM50", scoreOnly = TRUE)
}
5. Prochlorococcus marinus.
library(seqinr)
choosebank("genbank")
query("ccmp","AC=AE017126 OR AC=BX548174 OR AC=BX548175")
ccmpseq <- sapply(ccmp$req,getSequence)
gc <- sapply(ccmpseq, GC)
> wilcox.test(gc[1:2],gc[3:9])
> t.test(gc[1:2],gc[3:9])
6. Sequence equality.
library(seqinr)
choosebank("genbank")
query("ccnd3hs","sp=homo sapiens AND k=ccnd3@")
sapply(ccnd3hs$req,getLength)
> ccnd3prot <- sapply(ccnd3hs$req, getTrans)
> table(ccnd3prot[[1]])
* A C D E F G H I K L M N P Q R S T V W Y
1 31 12 12 21 6 14 7 10 10 41 9 1 17 16 22 19 18 15 3 8
> table(ccnd3prot[[2]])
* A C D E F G H I K L M N P Q R S T V W Y
1 30 12 12 21 6 14 7 10 10 41 9 1 17 16 22 20 18 15 3 8
# Hence, there is only one difference!
> which(!ccnd3prot[[1]]==ccnd3prot[[2]])
[1] 259
7. Conserved region.
ID XRODRMPGMNTB; BLOCK
AC PR00851A; distance from previous block=(52,131)
DE Xeroderma pigmentosum group B protein signature
BL adapted; width=21; seqs=8; 99.5%=985; strength=1287
XPB_HUMAN|P19447 ( 74) RPLWVAPDGHIFLEAFSPVYK 54
XPB_MOUSE|P49135 ( 74) RPLWVAPDGHIFLEAFSPVYK 54
P91579 ( 80) RPLYLAPDGHIFLESFSPVYK 67
XPB_DROME|Q02870 ( 84) RPLWVAPNGHVFLESFSPVYK 79
RA25_YEAST|Q00578 ( 131) PLWISPSDGRIILESFSPLAE 100
Q38861 ( 52) RPLWACADGRIFLETFSPLYK 71
O13768 ( 90) PLWINPIDGRIILEAFSPLAE 100
O00835 ( 79) RPIWVCPDGHIFLETFSAIYK 86
library(Biostrings);data(BLOSUM50)
x <- c("RPLWVAPDGHIFLEAFSPVYK")
y <- c("RPLWVAPDGHIFLEAFSPVYK")
z <- c("PLWISPSDGRIILESFSPLAE")
x == y
pairwiseAlignment(AAString(x), AAString(z), substitutionMatrix = "BLOSUM50")
> pairwiseAlignment(AAString(x), AAString(y), substitutionMatrix = "BLOSUM50")
Global Pairwise Alignment
1: RPLWVAPDGHIFLEAFSPVYK
2: RPLWVAPDGHIFLEAFSPVYK
Score: 154
> z <- c("PLWISPSDGRIILESFSPLAE")
> x == y
[1] TRUE
> pairwiseAlignment(AAString(x), AAString(z), substitutionMatrix = "BLOSUM50")
Global Pairwise Alignment
1: RPLWVAP-DGHIFLEAFSPVYK
2: -PLWISPSDGRIILESFSPLAE
Score: 85
library(seqinr); data(ec999)
ec999.uco <- lapply(ec999, uco, index="eff")
df <- as.data.frame(lapply(ec999.uco, as.vector))
row.names(df) <- names(ec999.uco[[1]])
global <- rowSums(df)
title <- "Codon usage in 999 E. coli coding sequences"
dotchart.uco(global, main = title)
library(seqinr); choosebank("genbank")
query("ccndhs","sp=homo sapiens AND k=ccnd@")
ccnd <- sapply(ccndhs$req, getSequence)
ccnd.uco <- lapply(ccnd, uco, index="eff")
df <- as.data.frame(lapply(ccnd.uco, as.vector))
row.names(df) <- names(ccnd.uco[[1]])
global <- rowSums(df)
title <- "Codon usage in ccnd3 homo sapiens coding sequences"
dotchart.uco(global, main = title)
References
Golub, G.H. & Van Loan, C.F. (1983). Matrix Computations. Baltimore:
The Johns Hopkins University Press.
Golub et al. (1999). Molecular classification of cancer: class discovery and
class prediction by gene expression monitoring. Science, 286, 531-537.
Gouy, M., Milleret, F., Mugnier, C., Jacobzone, M., Gautier,C. (1984). AC-
NUC: a nucleic acid sequence data base and analysis system. Nucl.
Acids Res., 12:121-127.
Grubbs, F.E. (1950). Sample criteria for testing outlying observations. An-
nals of Mathematical Statistics, 21, 1, 27-58.
Hahne, F., Huber, W., Gentleman, R. & Falcon, S. (2008). Bioconductor
Case Studies. New York: Springer.
Hartigan, J.A. & Wong, M.A. (1979). Algorithm AS 136: A k-means clustering
algorithm. Applied Statistics, 28, 100-108.
Horn, R.A. & Johnson, C.R. (1985). Matrix Analysis. Cambridge: Cam-
bridge University Press.
Huber, P.J. (1964). Robust estimation of a location parameter. The Annals
of Mathematical Statistics, 35, 73-101.
Huber, P. J. (1981) Robust Statistics. Wiley.
Ihaka, R. & Gentleman, R. (1996). R: a language for data analysis and graph-
ics. Journal of Computational and Graphical Statistics, 5, 299-314.
Johnson, N.L. & Kotz, S. & Kemp, A. (1992). Univariate discrete distribu-
tions. New York: John Wiley & Sons.
Jolliffe, I.T. (2002). Principal Components Analysis. New York: Springer.
Jureckova, J. & Picek, J. (2006). Robust Statistical Methods with R. New
York: Chapman & Hall.
Kyte, J. & Doolittle, R.F. (1982). A simple method for displaying the hydro-
pathic character of a protein. Journal of Molecular Biology, 157, 105-132.
Laub, M.T., McAdams, H.H., Feldblyum, T., Fraser, C.M., and Shapiro, L.
(2000). Global analysis of the genetic network controlling a bacterial
cell cycle. Science, 290, 2144-2148.
Lehmann, E.L. (1999). Elements of large sample theory. New York: Springer.
Little, R. J. A., and Rubin, D. B. (1987) Statistical Analysis with Missing
Data. New York: Wiley.
Luenberger, D.G. (1969). Optimization by vector space methods. New York:
Wiley.
Maindonald J. & Braun, J. (2003). Data analysis and Graphics Using R.
Cambridge: Cambridge University Press.
Smyth, G. K. (2004). Linear models and empirical Bayes methods for as-
sessing differential expression in microarray experiments. Statistical
Applications in Genetics and Molecular Biology, 3, No. 1, Article 3.
Smyth, G. K. (2005). Limma: linear models for microarray data. In: Bioin-
formatics and Computational Biology Solutions using R and Biocon-
ductor. R. Gentleman, V. Carey, S. Dudoit, R. Irizarry, W. Huber
(eds), Springer, New York, pages 397-420.
Tessarz, A.S., Weiler, S., Zanzinger, K., Angelisova, P., Horejsi, V., Cer-
wenka, A. (2007). Non-T cell activation linker (NTAL) negatively reg-
ulates TREM-1/DAP12-induced inflammatory cytokine production in
myeloid cells. Journal of Immunology, 178(4), 1991-1999.
Therneau, T.M. & Atkinson, E.J. (1997). An introduction to recursive par-
titioning using RPART routines. Technical report, Mayo Foundation.
Venables W.N. & Ripley B.D. (2000). S programming. New York: Springer.
Venables, W. N. & Ripley, B. D. (2002) Modern Applied Statistics with S.
Fourth edition. Springer.
Wang, Y.Y. (1971). Probabilities of the type I errors of the Welch tests
for the Behrens-Fisher problem. Journal of the American Statistical
Association, 66, 605-608.
Wichert, S., Fokianos, K., and Strimmer, K. (2004). Identifying periodically
expressed transcripts in microarray time series data. Bioinformatics,
20:5-20.
Zuker, M. & Stiegler, P. (1981). Optimal computer folding of large RNA
sequences using thermodynamics and auxiliary information. Nucleic
Acids Res., 9, 133-148.
Zuker, M. (2003). Mfold web server for nucleic acid folding and hybridization
prediction. Nucleic Acids Research, 31, 3406-3415.
Guindon, S. and Gascuel, O. (2003). A simple, fast, and accurate algorithm to
estimate large phylogenies by maximum likelihood. Systematic Biology,
52, 696-704.
Saitou, N. and Nei, M. (1987). The neighbor-joining method: a new method
for reconstructing phylogenetic trees. Molecular Biology and Evolution,
4, 406-425.