R For Bioinformatics

R is a programming language and software environment for statistical analysis and graphics. It allows users to perform data analysis, implement statistical techniques, and visualize data through its integrated development environment. This document provides an introduction to using R for bioinformatics and computational biology tasks. It outlines scenarios where R could be useful, how to use the document as a tutorial, and what the reader should expect to learn about R's statistical analysis, modeling and visualization capabilities.


Class notes for Introduction to R for Bioinformatics (BM1)

Stephen Ellner, modified by Ben Bolker, further modified by Ramon Diaz-Uriarte

October 19, 2009


1 Scenarios
You are designing an experiment: 20 plates are to be assigned (randomly) to 4
conditions. You are too young (or too old) to cut paper into pieces, place it in an
urn, etc. You want a better, faster way, especially since your next experiment will
involve 300 units, not 20.
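As a minimal sketch of this first scenario (the condition labels "A" through "D" are made-up placeholders):

```r
# Randomly assign 20 plates to 4 conditions, 5 plates each.
set.seed(1)  # only so this example is reproducible; drop for a real experiment
plates <- 1:20
conditions <- sample(rep(c("A", "B", "C", "D"), each = 5))
assignment <- data.frame(plate = plates, condition = conditions)
table(assignment$condition)  # 5 plates per condition
```

Scaling this up to 300 units is a one-line change (`1:300` and `each = 75`).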
The authors of a paper claim there is a weak relationship between levels of protein
A and growth. However, you know that some of the samples are from males and
some are from females, and you suspect the correlation is present only in males.
The authors provide the complete data.
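A hedged sketch of how you might check that suspicion, with simulated data standing in for the authors' (the column-style names proteinA, growth, and sex are invented for illustration):

```r
# Simulated stand-in: a protein/growth relationship present in males only
set.seed(2)
sex <- rep(c("M", "F"), each = 50)
proteinA <- rnorm(100)
growth <- ifelse(sex == "M", 0.8, 0) * proteinA + rnorm(100, sd = 0.5)
cor(proteinA, growth)                          # weak overall correlation
cor(proteinA[sex == "M"], growth[sex == "M"])  # much stronger in males alone
cor(proteinA[sex == "F"], growth[sex == "F"])  # near zero in females
```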
You've been working on a microarray study. For 100 subjects (50 of them with
leukemia, 50 of them healthy) you have the Cy3/Cy5 intensity ratios for 300,000
spots. You just got the email with the compressed data file. You are leaving for
home. In less than five minutes you'd like to get a quick idea of what the data look
like: maximum and minimum values for all spots, average for 5 specific control
spots (corresponding to probes 10, 23, 56, 10,004, 20,000), and a quick-and-dirty
statistical test of differences for two specific probes (corresponding to two
well-known genes, probes 7,000 and 99,000).
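As a sketch of that five-minute look, with simulated intensities standing in for the real file (1,000 spots instead of 300,000, so the probe indices are scaled down too):

```r
# Simulated stand-in for the Cy3/Cy5 ratio matrix: spots in rows, subjects in columns
set.seed(3)
nspots <- 1000                      # stand-in for the real 300,000 spots
ratios <- matrix(rnorm(nspots * 100), nrow = nspots)
group <- rep(c("leukemia", "healthy"), each = 50)
range(ratios)                       # minimum and maximum over all spots
rowMeans(ratios[c(10, 23, 56), ])   # averages for three of the control spots
t.test(ratios[7, group == "leukemia"], ratios[7, group == "healthy"])
```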
Tomorrow you'll look at the data in more detail. For a set of 20 selected probes you
will want to: a) take a look at the mean of the intensity, variance of intensity, and
the mean of the intensity in each of the two groups; b) plot the intensity vs. the age
of the subject; c) plot the log of the intensity vs. the age of the subject.
A paper describes a specific growth curve model (some non-linear function). You
would like to see what the actual curve looks like, and how much variation you get
if you modify the parameters slightly.
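For the growth-curve scenario, a sketch using a logistic curve as a stand-in for whatever model the paper actually describes (the parameter names K, n0, and r are invented here):

```r
# Hypothetical logistic growth curve, as an example of "some non-linear function"
growth <- function(x, K = 10, n0 = 1, r = 0.5) K / (1 + (K/n0 - 1) * exp(-r * x))
curve(growth(x), from = 0, to = 20, ylab = "size")
curve(growth(x, r = 0.6), add = TRUE, lty = 2)  # slightly larger growth rate
```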
For each of those problems, would you . . .

Affiliations: Stephen Ellner, Ecology and Evolutionary Biology, Cornell University; Ben Bolker, Department of Zoology, University of Florida; Ramon Diaz-Uriarte, Spanish National Cancer Centre (CNIO), and Universidad Autonoma de Madrid, Spain.
Know how to do it?
Do it quickly?
Save all the steps of what you did so that 6 months from today you know exactly
what you did, can repeat it, and apply it to new data?
This course is a quick introduction to an environment for statistical computing and
graphics that will allow you to carry out each of the above.
2 This course, this document
2.1 What you should expect from this section of the course. What I
(RDU) expect from you
This is a crash introduction to R, an environment for statistical computing and graphics.
R is a large system, and thus we will only cover the very basics. That also means that we
will not see any exciting examples where we take a bare genomic data set, analyze it, and
finish with biologically reasonable conclusions. This might make this part of the course
dry. On the other hand, you will be introduced to an extremely powerful system which is
also free (see below) and a standard in bioinformatics. The elegance and sheer power of
the R system could (should) make this part of the course exhilarating.
At the end of this part of the course, you should have an introductory understanding
of how to work with R, and will be ready to continue learning on your own at a fast pace.
These notes provide a fair amount of hand-holding, but not everything is explained,
and the amount of unexplained material increases as you go along. I expect you
to make use of the available help inside R (see section 7). You should do (for your own
good) all the exercises. I expect you to understand what every single line of code is doing.
In the exam I can ask you about any of the commands/expressions/logic we use in these
notes. NO, of course I do not expect you to memorize any example, nor do I expect you to
know all the possible arguments to plot, for example. But you should understand what
plot is for, or how to define a function that will add two numbers you give as input.
[The exact format of the exam, where you will take it, etc, is still to be defined].
2.1.1 Class mechanics and possible schedule
During the class, I will answer questions, go over whichever parts you find difficult, etc.
I am not planning any formal lectures, as the time available in class would not allow me to
cover the material. Since I do want you to master the material in these notes, you should
come to class with all your questions and doubts, and we will discuss those.
Try to come to the first class after having gone through all the notes (skip the last
section, "Additional exercises", to begin with). If that is not possible, at least cover 2/3 of
the material before the first class, and the rest before the last class.
If it turns out all of the material in these notes is really a piece of cake, then I will
lecture/go over more advanced topics (I do have lecture material available for those parts).
2.2 How to use this document
These notes contain many sample calculations. It is important to do these yourself:
type them in at your keyboard and see what happens on your screen, to get
the feel of working in R.
Exercises in the middle of a section should be done immediately when you get to
them; make sure you have them right before moving on.
These notes are based in part on course materials by former TAs Colleen Webb,
Jonathan Rowell and Daniel Fink at Cornell, Professors Lou Gross (University of Tennessee)
and Paul Fackler (NC State University), and on the book Getting Started with
Matlab by Rudra Pratap (Oxford University Press). They also draw on the documentation
supplied with R. They parallel, but go into more depth than, the chapter supplement
for the book Ecological Models and Data in R. The original title of these notes was "An
introduction to R for ecological modeling (lab 1)".
You can find many other similar introductions scattered around the web, or in the
contributed documentation section on the R web site (http://cran.r-project.org/other-docs.html).
This particular version is limited (it has similar coverage
to Sections 1 and 2 of the Introduction to R that comes with R, with selected material
from other sections) and targets biologists who are neither programmers nor statisticians.
2.3 Other files you need in addition to this one
You should have (or should get) the following files:
AnotherDataSet.txt
ChlorellaGrowth.txt
lastExample.R
regressionExample.R
RCrash.R
You can download those files from the web page of the course. Make sure that they
are downloaded properly as text.
All of the files above (except the last) are mentioned or used in this document. What
about RCrash.R? That is all the R code used in this document.
3 What is R?
R is an object-oriented scripting language that combines
a programming language called S, developed by John Chambers at Bell Labs,
that can be used for numerical simulation of deterministic and stochastic dynamic
models
an extensive set of functions for classical and modern statistical data analysis and
modeling
graphics functions for visualizing data and model output
a user interface with a few basic menus and extensive help facilities
R is an open source project, available for free download via the Web. Originally a
research project in statistical computing (Ihaka and Gentleman, 1996), it is now managed
by a development team that includes a number of well-regarded statisticians, and
is widely used by statistical researchers (and a growing number of theoretical ecologists
and ecological modelers, and bioinformaticians and computational biologists) as a platform
for making new methods available to users. The commercial implementation of
S (called S-PLUS) offers an Office-style point-and-click interface that R lacks.
For our purposes, however, the advantage of this front-end is outweighed by the fact that
R is built on a faster and much less memory-hungry implementation of S and is
easier to interface with other languages (and is free!). A standard installation of R also
includes extensive documentation, including an introductory manual (about 100 pages) and
a comprehensive reference manual (over 1000 pages). (There is a graphical front-end for
parts of R, called Rcmdr, available at the R site, but we will not be using it in this class.)
[A proselytizing note] R is free software, meaning free as in "free speech" (not free
as in "free beer"; in Spanish, free as in "libre", not free as in "gratis"). The definition of
free software is explained, for instance, in http://www.gnu.org/philosophy/free-sw.html.
Why does it matter that R is free software? For one thing, it makes
your access to it simple and easy. As well, you can play with the system and look at the
inside (you can look at the original code) and do with that code a variety of things, including
modifying it, learning from it, etc. In addition, that R is free software is, arguably,
one of the reasons for its incredible success (and, for instance, one explanation for why
there are over 1000 contributed, and free software, packages). Moreover, Bioinformatics,
as we know it, would not exist without free software (Dudoit et al., 2003). Newton, and
others before him, used the expression "standing on the shoulders of giants" when explaining
how the development of science and other intellectual pursuits builds upon past
accomplishments; in Bioinformatics (and many other fields), we are also standing on the
shoulders of millions of lines of free software.
4 Statistics, bioinformatics, computational biology in R
Some of the important functions and packages (collections of functions) for statistical
modeling and data analysis are summarized in Table 1. You can find a list of available
packages and their contents at CRAN, the main R website (http://www.cran.r-project.org:
select a mirror site near you and click on "Package sources").
For the most part, we will not be concerned with this side of R in this first course.
However, please be aware that virtually all of the statistical analysis done in Bioinformatics
can be conducted with R. Moreover, data mining (which is, according to
some authors, simply "statistics + marketing") is well covered in R: clustering (often
called "unsupervised analysis") in many of its variants (hierarchical, k-means and family,
mixture models, fuzzy, etc), bi-clustering, classification and discrimination (from
discriminant analysis to classification trees, bagging, support vector machines, etc) all
have many packages in R. Thus, tasks such as finding homogeneous subgroups in sets of
genes/subjects, identifying genes that show differential expression (with adjustment for
multiple testing), building class-prediction algorithms to separate good from bad prognosis
patients as a function of genetic profile, or identifying regions of the genome with
losses/gains of DNA (copy number alterations) can all be carried out in R out-of-the-box
(see BioConductor and CRAN).

aov, anova       Analysis of variance or deviance
lm               Linear models (regression, ANOVA, ANCOVA)
glm              Generalized linear models (e.g. logistic, Poisson regression)
gam              Generalized additive models (in package mgcv)
nls              Fit nonlinear models by least-squares
lme, nlme        Linear and nonlinear mixed-effects models (repeated measures,
                 block effects, spatial models): in package nlme
boot             Package: bootstrapping functions
splines          Package: nonparametric regression (more in packages
                 fields, KernSmooth, logspline, sm and others)
princomp, manova, lda, cancor
                 Multivariate analysis (some in package MASS; also see
                 packages vegan, ade4)
survival         Package: survival analysis
tree, rpart      Packages: tree-based regression

Table 1: A few of the functions and packages in R for statistical modeling and data analysis. There
are many more, but you will have to learn about them somewhere else.
The book Modern Applied Statistics with S by Venables and Ripley (2002) is a classic,
but is not an introductory book. You can take a look at several of the freely available
documents that introduce statistics with R or provide a broad overview of techniques. See
http://cran.r-project.org/other-docs.html; look at Verzani's Simple
R and the course by Kuhnert and Venables. A list of books is available from
http://www.r-project.org/doc/bib/R-books.html. For intro stats with R I especially
recommend the ones by Dalgaard (Introductory Statistics with R) and Everitt and
Hothorn (Statistical Analysis Using R). Doing statistics without an appreciation for linear
models is impossible; the book by John Fox is excellent (though it is really a companion
to a course that explains the theory from some other book); Faraway covers the
linear model and several extensions in two self-contained and great books. The book
by Gentleman et al., Bioinformatics and Computational Biology Solutions Using R and
Bioconductor, covers a lot of ground (clustering, classification, mass spec, annotation,
etc), but is not an intro stats book; note also that this book was published in 2005, so a lot
more is available from BioC nowadays.
5 Our rst few minutes with R
5.1 Installing R on your computer: basics
If R is already installed on your computer, you can skip this section.
The main source for R is the CRAN home page, http://cran.r-project.org.
You can get the source code, but most users will prefer a precompiled version. To get one
of these from CRAN:
Go to http://cran.r-project.org/mirrors.html and find a mirror
site that is geographically somewhat near you.
Find the appropriate page for your operating system; for Windows, go to "base"
rather than "contrib". Download the binary file (e.g. base/R-x.y.z-win32.exe
for Windows, R-x.y.z.dmg for MacOS, where x.y.z is the version number). The
binary files are large (30-60 megabytes), so you will need a fast internet
connection.
Read and follow the instructions (which are pretty much "click on the icon").
R should work well on any reasonably modern computer. Easy-to-install binaries are
available for most common hardware and operating systems (including Windows, MacOS
X, and GNU/Linux). Ben Bolker developed and ran all the code in the book with R 2.5.1
on a dual-core PC laptop running at 1.66 GHz under Ubuntu GNU/Linux 7.04. R.D-U.
used R-2.7.1 and 2.7.2 and R-2.9.2, under Debian GNU/Linux, on both a laptop and a
desktop machine.
The standard distributions of R include several packages, user-contributed suites of
add-on functions (unfortunately, the command to load a package into R is library!).
In general, you can install additional packages from within R using the Packages menu,
or the install.packages command. (More detail is given below.)
For Windows or MacOS, install R by launching the downloaded file and following
the on-screen instructions. At the end you'll have an R icon on your desktop that you can
use to launch the program. Installing versions for Linux can be just as straightforward if there
are binaries for your Linux distribution. Otherwise (or even then) you can compile from
sources; we will not cover that in class.
5.2 Starting R
[Windows Users] Just click on the icon on your desktop, or in the Start menu (if you
allowed the Setup program to make either or both of these). If you lose these shortcuts for
some reason, you can search for the executable file Rgui.exe on your hard drive, which
will probably be somewhere like Program Files\R\R-x.y.z\bin\Rgui.exe.
5.3 Stopping R
A Lebanese proverb says "when entering, always look for the exit". You can stop R from
the File menu ([Windows Users]), or you can stop it by typing q() at the command
prompt (if you type q by itself, you will get some confusing output which is actually R
trying to tell you the definition of the q function; more on this later).
When you quit, R will ask you if you want to save the workspace (that is, all of the
variables you have defined in this session); for now (and in general), say no in order to
avoid clutter.
Should an R command seem to be stuck or take longer than you're willing to wait,
click on the stop sign on the menu bar or hit the Escape key (in Unix, type Control-C).
6 Interactive calculations
[Windows Users] When you start R it opens the console window. The console has a
few basic menus at the top; check them out on your own. The console is also where you
enter commands for R to execute interactively, meaning that the command is executed and
the result is displayed as soon as you hit the Enter key. For example, at the command
prompt >, type in 2+2 and hit Enter; you will see
> 2 + 2
[1] 4
(When cutting and pasting from this document to R, don't include the text for the command
prompt (>).)
To do anything complicated, you have to store the results from calculations by assign-
ing them to variables, using = or <-. I prefer to use <-.
For example:
> a <- 2 + 2
R automatically creates the variable a and stores the result (4) in it, but it doesn't print
anything. This may seem strange, but you'll often be creating and manipulating huge sets
of data that would fill many screens, so the default is to skip printing the results. To ask
R to print the value, just type the variable name by itself at the command prompt:
> a
[1] 4
Alternatively, you could surround the expression in parentheses:
> (a <- 2 + 2)
[1] 4
and that makes the assignment AND shows you the value just assigned to a.
The [1] at the beginning of the line is just R printing an index of element numbers;
if you print a result that displays on multiple lines, R will put an index at the beginning of
each line. print(a) also works to print the value of a variable.
By default, a variable created this way is a vector, and it is numeric because we gave
R a number rather than some other type of data (e.g. a character string like "pxqr"). In
this case a is a numeric vector of length 1, which acts just like a number.
You could also type a <- 2+2; a, using a semicolon to put two or more commands
on a single line. Conversely, you can break lines anywhere that R can tell you haven't
finished your command, and R will give you a continuation prompt (+) to let you know
that it doesn't think you're finished yet: try typing

a <- 3*(4+ [Enter]
5)
to see what happens (you will sometimes see the continuation prompt when you don't
expect it, e.g. if you forget to close parentheses). If you get stuck continuing a command
you don't want (e.g. you opened the wrong parentheses), just hit the Escape key or the
stop icon in the menu bar to get out.
Variable names in R must begin with a letter, followed by letters or numbers. You can
break up long names with a period, as in very.long.variable.number.3, or an
underscore (_), but you can't use blank spaces in variable names (or at least it's not worth
the trouble). Variable names in R are case sensitive, so Abc and abc are different
variables. Make variable names long enough to remember, short enough to type. N.per.ha
or pop.density are better than x and y (too short) or available.nitrogen.per.hectare
(too long). Avoid c, l, q, t, C, D, F, I, and T, which are either built-in R functions or
hard to tell apart.
R does calculations with variables as if they were numbers. It uses +, -, *, /, and
^ for addition, subtraction, multiplication, division and exponentiation, respectively. For
example:
> x <- 5
> y <- 2
> z1 <- x * y
> z2 <- x/y
> z3 <- x^y
> z2
[1] 2.5
> z3
[1] 25
Even though R did not display the values of x and y, it remembers that it assigned
values to them. Type x; y to display the values.
You can retrieve and edit previous commands. The up-arrow key (or Control-P)
recalls previous commands to the prompt. For example, you can bring back the
second-to-last command and edit it to
> z3 <- 2 * x^y
(experiment with the other arrow keys, and the Home and End keys, too). This will
save you many hours in the long run.
Newer assignments silently overwrite previous assignments:
> z2
[1] 2.5
> z2 <- 999
> z2
[1] 999
> z2 <- "Now z2 is a sentence"
> z2
abs              absolute value
cos, sin, tan    cosine, sine, tangent of angle x in radians
exp              exponential function, e^x
log              natural (base-e) logarithm
log10            common (base-10) logarithm
sqrt             square root

Table 2: Some of the built-in mathematical functions in R. You can get a more complete list from
the Help system: ?Arithmetic for simple, ?log for logarithmic, ?sin for trigonometric, and
?Special for special functions.
[1] "Now z2 is a sentence"
You can delete a variable:
> rm(z2)
You can combine several operations in one calculation:
> A <- 3
> C <- (A + 2 * sqrt(A))/(A + 5 * sqrt(A))
> C
[1] 0.5543706
Parentheses specify the order of operations. The command
> C <- A + 2 * sqrt(A)/A + 5 * sqrt(A)
is not the same as the one above; rather, it is equivalent to
> C <- A + 2 * (sqrt(A)/A) + 5 * sqrt(A)
The default order of operations is: (1) parentheses; (2) exponentiation, or powers, (3)
multiplication and division, (4) addition and subtraction (pretty please my dear Aunt
Sally).
> b <- 12-4/2^3 gives 12 - 4/8 = 12 - 0.5 = 11.5
> b <- (12-4)/2^3 gives 8/8 = 1
> b <- -1^2 gives -(1^2) = -1
> b <- (-1)^2 gives 1
In complicated expressions it's best to use parentheses to specify explicitly what you
want, such as > b <- 12 - (4/(2^3)) or at least > b <- 12 - 4/(2^3);
a few extra sets of parentheses never hurt anything, although when you get confused it's
better to think through the order of operations than flail around adding parentheses
at random.
R also has many built-in mathematical functions that operate on variables (Table 2
shows a few).
Exercise 6.1: Using editing shortcuts wherever you can, have R compute the values
of
1. 2^7/(2^7 - 1), and compare it with (1 - 1/2^7)^(-1)
2. 1 + 0.2, 1 + 0.2 + 0.2^2/2, and 1 + 0.2 + 0.2^2/2 + 0.2^3/6, and compare
these with e^0.2 (remember that R knows exp but not e; how would you get R to tell you
the value of e?)
3. the standard normal probability density, (1/sqrt(2*pi)) * exp(-x^2/2), for
values of x = 1 and x = 2 (R knows π as pi). (You can check your answers against
the built-in function for the normal distribution; dnorm(1) and dnorm(2) should give you
the values for the standard normal for x = 1 and x = 2.)
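As a check on the last part (and an answer to the question about e), these one-liners should agree with each other:

```r
exp(1)                           # R's way of getting the value of e
(1/sqrt(2 * pi)) * exp(-1^2/2)   # standard normal density at x = 1, by hand
dnorm(1)                         # the built-in version; should match the line above
```

Note how the precedence rules from the previous section apply here: -1^2/2 is read as -(1^2)/2.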
7 The help system
R has a help system, although it is generally better for providing detail or reminding you
how to do things than for basic "how do I ...?" questions.
You can get help on any R function by entering
?functionname
in the console window (e.g., try ?sin).
The Help menu on the tool bar provides links to other documentation, including
the manuals and FAQs, and a Search facility ("Apropos" on the menu) which is
useful if you sort of maybe remember part of the name of what it is you need
help on.
Typing help.start() opens a web browser with help information.
example(cmd) will run any examples that are included in the help page for com-
mand cmd.
demo(topic) runs demonstration code on topic topic: type demo() by itself
to list all available demos
By default, R's help system only provides information about functions that are in the
base system and packages that you have loaded with library (see below).
help.search("topic") (with quotes) will list information related to topic
available in the base system or in any extra installed packages: use ?topic to see
the information, perhaps using library(pkg) to load the appropriate package
first. help.search uses fuzzy matching: for example, help.search("log")
finds 528 entries (on my particular system), including lots of functions with plot,
which includes the letters lot, which are almost like log. If you can't stand it,
you can turn this behavior off by specifying the incantation help.search("log",
agrep = FALSE) (81 results, which still include matches for logistic, myelogenous,
and phylogeny ...)
help(package="pkg") will list all the help pages for a loaded package.
RSiteSearch("topic") does a full-text search of all the R documentation
and the mailing list archives for information on topic.
Try out one or more of these aspects of the help system.
Other (on-line) help resources: Paul Johnson's R tips web page (pj.freefaculty.org/R/Rtips.html)
answers a number of "how do I ...?" questions. The R wiki
(wiki.r-project.org) is a newer venue for help information. The R tips page is
(slowly) being moved over to the wiki, which will eventually (we hope) contain a lot more
information. You can also edit the wiki and add your own pages!
Exercise 7.1: Do an "apropos" on sin via the Help menu, to see what it does. Now
enter the commands
> help.search("sin")
> apropos("sin")
and see what that does (answer: help.search pulls up all help pages that include "sin"
anywhere in their title or text; apropos just looks at the name of the function).
8 A rst interactive session: linear regression
To get a feel for working in R we'll fit a straight-line model (linear regression) to data.
Below are some data on the maximum growth rate rmax of laboratory populations of the
green alga Chlorella vulgaris as a function of light intensity (µE per m² per second).
These experiments were run during the system-design phase of the study reported by
Fussmann et al. (2000).
Light: 20, 20, 20, 20, 21, 24, 44, 60, 90, 94, 101
rmax: 1.73, 1.65, 2.02, 1.89, 2.61, 1.36, 2.37, 2.08, 2.69, 2.32, 3.67
To analyze these data in R, rst enter them as numerical vectors:
> Light <- c(20, 20, 20, 20, 21, 24, 44, 60, 90, 94,
+ 101)
> rmax <- c(1.73, 1.65, 2.02, 1.89, 2.61, 1.36, 2.37,
+ 2.08, 2.69, 2.32, 3.67)
(Don't try to enter the +, which is a continuation character as described above.) The
function c combines the individual numbers into a vector. Try recalling (with the
up-arrow) and modifying the above command to
Light <- 20,20,20,20,21,24,44,60,90,94,101
and see the error message you get: in order to create a vector of specified numbers, you
must use the c function. To repeat, error messages are harmless: the answer to "what
would happen if I ...?" is usually "try it and see!"
To see a histogram of the growth rates enter > hist(rmax), which opens a
graphics window and displays the histogram. There are many other built-in statistics
functions: for example mean(rmax) computes the mean, and sd(rmax) and
var(rmax) compute the standard deviation and variance, respectively. Play around
with these functions, and any others you can think of.

[Figure 1: Some of R's graphics parameters: pch (point type), col (point color), cex (point
size), text, lty (line type), lwd (line width). Color specification, col, also applies in many other
contexts: colors are set to a rainbow scale here. See ?par for (many more) details on graphics
parameters, and one or more of ?rgb, ?palette, or apropos("color") for more on colors.]
To see how light intensity affects algal rate of increase, type
> plot(Light, rmax)
to plot rmax (y) against Light (x). A linear regression seems reasonable. Don't close
this plot window: we'll soon be adding to it.
R's default plotting character is an open circle. Open symbols are generally better than
closed symbols for plotting because it is easier to see where they overlap, but you could
include pch=16 in the plot command if you wanted closed circles instead. Figure 1
shows several more ways to adjust the appearance of lines and points in R.
To perform linear regression we create a linear model using the lm (linear model)
function:
> fit <- lm(rmax ~ Light)
(Note that the variables are in the opposite order from the plot command, which is
plot(x,y), whereas the linear model is read as "model rmax as a function of Light".
Alternatively, you could use plot as plot(y ~ x).)
The lm command produces no output at all, but it creates fit as an object, i.e. a data
structure consisting of multiple parts, holding the results of a regression analysis with
rmax being modeled as a function of Light. Unlike most statistics packages, R rarely
produces automatic summary output from an analysis. Statistical analyses in R are done
by creating a model, and then giving additional commands to extract desired information
about the model or display results graphically.
To get a summary of the results, enter the command > summary(fit) . R sets
up model objects (more on this later) so that the function summary knows that fit
was created by lm, and produces an appropriate summary of results for an lm object:
> summary(fit)

Call:
lm(formula = rmax ~ Light)

Residuals:
     Min       1Q   Median       3Q      Max
-0.5478  -0.2607  -0.1166   0.1783   0.7431

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.580952   0.244519   6.466 0.000116 ***
Light       0.013618   0.004317   3.154 0.011654 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4583 on 9 degrees of freedom
Multiple R-squared: 0.5251,  Adjusted R-squared: 0.4723
F-statistic: 9.951 on 1 and 9 DF,  p-value: 0.01165
[If you've had (and remember) a statistics course the output will make sense to you.
The table of coefficients gives the estimated regression line as rmax = 1.58 + 0.0136 *
Light, and associated with each coefficient is the standard error of the estimate, the
t-statistic value for testing whether the coefficient is nonzero, and the p-value corresponding
to the t-statistic. Below the table, the adjusted R-squared gives the estimated fraction of
the variance explained by the regression line, and the p-value in the last line is an overall
test for significance of the model against the null hypothesis that the response variable is
independent of the predictors.]
You can add the regression line to the plot of the data with a function taking fit as
its input (if you closed the plot of the data, you will need to create it again in order to add
the regression line):
> abline(fit)
(abline, pronounced a b line, is a general-purpose function for adding lines to a plot:
you can specify horizontal or vertical lines, a slope and an intercept, or a regression model:
?abline).
You can get the coefficients by using the coef function:
> coef(fit)
(Intercept) Light
1.58095214 0.01361776
You can also interrogate fit directly. Type > names(fit) to get a list of the
components of fit, and then use the $ symbol to extract components according to their
names.
> names(fit)
[1] "coefficients" "residuals" "effects"
[4] "rank" "fitted.values" "assign"
[7] "qr" "df.residual" "xlevels"
[10] "call" "terms" "model"
[Figure 2: Graphical summary of regression analysis: the plot of rmax against Light with the
fitted regression line added.]
For more information (perhaps more than you want) about fit, use str(fit) (for
"structure"). You can get the regression coefficients this way:
> fit$coefficients
(Intercept) Light
1.58095214 0.01361776
It's good to be able to look inside R objects when necessary, but all other things being
equal you should prefer (e.g.) coef(x) to x$coefficients.
9 Editors and GUIs
Before this gets out of hand: If you only plan on using R occasionally because you
have not yet discovered you can't live without it, and you are using Windows or
Mac, just stick to the built-in editor. All the information that follows is extra information,
targeted mainly to frequent users of R and/or programmers.
While R works perfectly well out of the box, there are interfaces that can make
your R experience easier. Editors such as Tinn-R (Windows: www.sciviews.org/
Tinn-R/), Kate (Linux: kate-editor.org), R-Kward or Emacs/ESS (cross-platform:
ess.r-project.org/) will color R commands and quoted material, allow you to
submit lines or blocks of R code to an R session, and might give hints about function
arguments. The standard MacOS interface has all of these features built in. Graphical
interfaces such as JGR (cross-platform: rosuda.org/JGR/) or SciViews (Windows:
www.sciviews.org/SciViews-R/) include similar editors and have extra func-
tions such as a workspace browser for looking at all the variables you have defined. The
new SciViews-K (http://www.sciviews.org/SciViews-K/index.html) is
still in development, but should feature the best of Tinn-R (plus other things) and should
run in Linux, Macs, and Windows machines. For those used to Eclipse, there is a
plug-in designed to work with R: StatET (http://www.walware.de/goto/statet).
(All of these interfaces, which are designed to facilitate R programming, are in a different
category from Rcmdr, which tries to simplify basic statistics in R.) If you are using
Windows or Linux I would strongly recommend that, once you have tried R a little bit, you
download at least an R-aware editor and possibly a GUI to make your life easier. Links
to all of these systems, and more, can be found at http://www.r-project.org/
GUI/.
If you plan to spend a fair amount of time doing Bioinformatics, then you'll spend a fair
amount of time programming, probably using a variety of languages (R, Python, C, Perl,
Java, PHP, etc). Becoming used to a programmer-friendly editor that understands all
of the languages you use is thus worth it. Choosing an editor is a highly personal issue.
Emacs is an editor and then a lot of other things (that is what I use, for programming,
editing text, etc); if you use Emacs then Emacs + ESS is the perfect combination for you.
Those who come from the Java world might be familiar with Eclipse (and, thus, you'll
want to give StatET a try). Komodo is the basis of SciViews-K, and a popular choice
among programmers of languages such as Python and Perl. Tinn-R also has some support
for colouring languages other than R.
For this class: use whatever you want. Just do not use Word (see also below). I will
be using Emacs. I could help you with Emacs + ESS and, for simple issues, with Tinn-R.
The library and lab Windows computers are unlikely to have anything beyond a bare-bones
R; thus, under Windows, you'll have to use (unless you have your own computer)
this infrastructure. If you have your own machine and you are using Windows, you can
use any of the above editors, or just use the bare-bones approach, as explained below.
9.1 Notes for Windows Users
When you start R you'll see something like:
To get a minimal editor, go to File and click on "New script". (For those using R with
the interface in Spanish or any other language: I am using screen captures, so menu
entries should be located in similar places; otherwise, use your common sense). You'll
see a menu like:
That will open up an empty canvas, where you can start typing. Type something in
there. Before, during, or after typing you can re-organize how the windows are distributed.
If the active window (the one with the blue frame) is the Editor window, and you go to
"Windows" and then "Tile", as in
you will have your editor on the right and the R Console on the left. Do this to suit
your taste (tiling with editor on left, tiling with editor on right, cascade, whatever).
Now, from the editor you can mark code and submit it to the R Console. You can do
it from the Edit menu entry
or, much better, use a shortcut (Ctrl + R, which means: mark the text, and then
click on the Control key and the R key at the same time; for a single line, there is no
need to mark it).
The R Console will receive the code and execute it. And you'll see a figure:
As before, you can tile, cascade, re-arrange, or whatever suits your aesthetic preferences.
There is, with R, another interface, called the SDI (single document interface). The
advantage of the SDI is that you can move between windows (editor, console, graphics,
etc) using the usual Windows mechanism (Alt + Tab). To change to the SDI, open Edit,
GUI preferences (to get here, if you have multiple things under R, you need the Console
to be the active selection; it should have a blue upper border).
Then, click on SDI (instead of MDI)
and then you MUST save the changes. (Note also that, if you want, you can change
the language in which R provides messages and (some) output. For instance, if you want
it in Spanish, under "Language for menus" write "es".)
Beware that, depending on the settings of your machine, you might not be allowed to
save this file, unless you write to your own directory. If you can save, you then have to
close R and restart it. You will see something like
when you start submitting code from the Editor (YES, the editor is opened the same
way: File, New Script, etc).
10 Script files and data files
10.1 Why scripts
Modeling and complicated data analysis are often accomplished more efficiently using
scripts, which are a series of commands stored in a text file. The Windows and MacOS
versions of R both have basic script editors; you can also use Windows Notepad or
Wordpad, or a more featureful editor like PFE, Xemacs, or Tinn-R: you should not use MS
Word.
Most programs for working with models or analyzing data follow a simple pattern of
program parts:
1. Setup statements.
2. Input some data from a file or the keyboard.
3. Carry out the calculations that you want.
4. Print the results, graph them, or save them to a le.
For example, a script file might
1. Load some packages, or run another script file that creates some functions (more on
functions later).
2. Read in data from a text file.
3. Fit several statistical models to the data and compare them.
4. Graph the results, and save the graph to disk for including in your term project.
Even for relatively simple tasks, script files are useful for building up a calculation step-
by-step, making sure that each part works before adding on to it.
10.2 Paths: where are scripts and data sets located
Before you can tell R to use your script or read some data, you need to tell R where,
exactly, to find the scripts/data. Here are a couple of ways to do it:
Spell out the path, or file location, explicitly, using a single forward slash to separate
folders. For instance, to use the script regressionExample.R, and assuming
we have placed the script in the directory (folder) R that lives inside the folder
My Documents (i.e., C:\My Documents\R) we would do:
> source("c:/My Documents/R/regressionExample.R")
Change your working directory to wherever the file(s) are located. You can either
use the menus (Windows: "Change dir" in the File menu; Mac: "Misc/Change
Working Directory") or use setwd (from "set working directory"). As I have
no menus, I will use setwd.
Changing your working directory is more efficient in the long run, if you save all
the script and data files for a particular project in the same directory and switch to that
directory when you start work.
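As a minimal sketch (the folder path below is hypothetical; substitute your own project directory):

```r
## Hypothetical project directory; change it to wherever your files live
setwd("c:/My Documents/R")
getwd()   # confirm the current working directory
```

After this, R will look for scripts and data files in that directory, so you can use bare file names in source and read.table.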
If you have a shortcut defined for R on your desktop (or in the Start menu) you can
permanently change your default working directory by right-clicking on the shortcut icon,
selecting "Properties", and changing the starting directory to somewhere like (for
example) My Documents/R work.
Now, for the rest of the course, I will assume that you know where your data files are,
the scripts are, etc. Where you place them depends on what you want (and the permissions
you have in the computer you are using). You will be using either one of the two
approaches above. It is up to you. I will be using the second: I will be running R in the
very same directory where the data files and scripts are located. (You can assume that I
have, sometime in the past, issued a setwd command.)
10.3 Using a script
Open the file regressionExample.R with the R editor (or with any of the advanced
editors mentioned above). It contains some of the commands from the example above.
We could just run that file.
Quit R, start a new session, and with everything set up correctly as explained above
you should be able to do:
> source("regressionExample.R")
(again, maybe you need to specify the complete path if you are not in the same directory
where regressionExample.R is.)
Quit R again. Open regressionExample.R with the R editor again. Now, submit
the first line to the R Console. Now the second line. Now the rest of the lines.
10.4 Entering data into R
In the example above, we entered the data one by one. This is not practical, and there are
many ways to load data into R (for example, see the book by P. Spector, or the R Data
Import/Export manual http://cran.r-project.org/doc/manuals/R-data.html).
Here we will only use read.table.
Instead of entering data, as we did in the regression example, we could have a data set
with the data. Open, in any editor, the file ChlorellaGrowth.txt. See how it looks.
Now do:
> X <- read.table("ChlorellaGrowth.txt")
> Light <- X[, 1]
> rmax <- X[, 2]
In ChlorellaGrowth.txt the two variables are entered as columns of a data
matrix. Think of these as shorthand for "Light = everything in column 1 of X", and
"rmax = everything in column 2 of X" (we'll learn about working with data matrices
later).
To see a slightly different example, open AnotherDataSet.txt. Now do:
> another.data.set <- read.table("AnotherDataSet.txt",
+ header = TRUE)
> summary(another.data.set)
ID Age Sex Y
a1 :1 Min. :12.0 F:3 Min. :22.00
a2 :1 1st Qu.:13.0 M:2 1st Qu.:23.40
a3 :1 Median :14.0 Median :24.30
a7 :1 Mean :14.8 Mean :24.14
b15:1 3rd Qu.:16.0 3rd Qu.:25.00
Max. :19.0 Max. :26.00
Notice that we used the variable (column) names, and the object is not a matrix, but a
data frame (we will see this later).
10.4.1 Very large data sets
Yes, R can deal with huge data sets. You just don't want to read them with read.table.
Look at the help for scan, try database solutions, etc. (See the book by P. Spector,
or the R Data Import/Export manual http://cran.r-project.org/doc/manuals/R-data.html)
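As a rough sketch (the file name is an assumption), scan reads a plain stream of numbers with less overhead than read.table:

```r
## "bignumbers.txt" is a hypothetical file containing only numbers
x <- scan("bignumbers.txt", what = numeric())
length(x)   # how many values were read
```

Because scan does no column bookkeeping, it is usually much faster than read.table on very large plain-number files.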
10.5 A few things that can go wrong with scripts and data sets
It's vital that you save your data and script files as plain text (or sometimes comma-
separated) files. There are three things that can go wrong here: (1) if you use a web
browser to download files, be careful that it doesn't automatically append some weird
suffix to the files; (2) if your web browser has a file association (e.g. it thinks that all
files ending in .dat are Excel files), make sure to save the file as plain text, and without
any extra extensions; (3) never (never, never) use Microsoft Word to edit your data
and script files; MS Word will try very hard to get you to save them as Word (rather than
text) files, which will screw them up! If you send script files by e-mail, even if you paste
them into the message as plain text, lines will occasionally get broken in different places,
leading to confusion. Beware.
10.6 Saving tables, data, and results
How can you save data, results, etc? Saving data in matrix or tabular form is easily done
with write.table.
> write.table(another.data.set, file = "the.table.I.just.saved.txt")
Open that file in an editor of your choice.
You can also save part or all of the output from a session. You can copy and paste, or
you can use commands such as sink.
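A minimal sketch of sink (the output file name is an assumption):

```r
sink("session-output.txt")  # divert console output to a file
summary(rmax)               # this output goes to the file, not the screen
sink()                      # stop diverting; output returns to the console
```

Open session-output.txt afterwards to check what was captured.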
10.7 Saving an R session: .RData
And how can you save all you have been doing? The simplest way is to use save.image.
Please, look at the help for that command. We will use a simple example:
> save.image(file = "this.RData")
> getwd()
[1] "/home/ramon/Proyectos/UAM/BM-1/2009/Class.Notes"
Note where that file is saved (in the current working directory, which is what getwd()
tells you).
Now open another R. Go to the directory where this.RData is. And do:
> ls()
character(0)
The above tells you what you have in your working environment. There is nothing
in there, since we just started. Now, do:
> load("this.RData")
> ls()
[1] "a" "A" "another.data.set"
[4] "C" "fit" "fr"
[7] "h" "lend" "Light"
[10] "nl" "nlsp" "nv"
[13] "op" "rmax" "seg1"
[16] "seg2" "x" "X"
[19] "xpos" "xr" "y"
[22] "z1" "z3"
> Light
[1] 20 20 20 20 21 24 44 60 90 94 101
> rmax
[1] 1.73 1.65 2.02 1.89 2.61 1.36 2.37 2.08 2.69 2.32 3.67
> summary(another.data.set)
ID Age Sex Y
a1 :1 Min. :12.0 F:3 Min. :22.00
a2 :1 1st Qu.:13.0 M:2 1st Qu.:23.40
a3 :1 Median :14.0 Median :24.30
a7 :1 Mean :14.8 Mean :24.14
b15:1 3rd Qu.:16.0 3rd Qu.:25.00
Max. :19.0 Max. :26.00
So all the stuff we had before is available in the new R session. (Now that we are done
with this example, close the R session you just opened).
Now, let's try a different example. Do:
> save.image()
And now open a new R in that directory. What happens? (Try doing ls() or
summary(another.data.set)).
So be careful with this: you can end up using stuff you didn't know was there!
(The truth is that we were told what happened: did you notice the "[Previously saved
workspace restored]"?).
And, by the way, do you understand what R tries to do when it asks "Save workspace
image?"?
11 The R package system
R has many extra packages that provide extra functions. You may be able to install new
packages from a menu within R. You can always type
> install.packages("RJaCGH")
(for example this installs the RJaCGH package). You can install more than one package
at a time:
> install.packages(c("ellipse", "plotrix"))
(c stands for "concatenate", and is the command for combining multiple things into a
single object.) If the machine on which you use R is not connected to the Internet, you
can download the packages to some other medium (such as a flash drive or CD) and install
them later, using the menu or
> install.packages("plotrix", repos = NULL)
(under GNU/Linux, I had to provide the full name of the package, for instance:
> install.packages("plotrix_2.4-7.tar.gz", repos = NULL)
)
If you do not have permission to install packages in R's central directory, R will (as
of recent versions) warn you that it is installing the packages in a user-specific directory.
With some GUIs (under some of the operating systems) you can also install packages
from a menu entry. For instance, under Windows, there is an entry in the menu bar called
"Packages", which allows you to install from the Internet, change the repositories, install
from local zip files, etc.
In this course we will be using no specific packages. But most real work you
do with R will require the installation of packages. In Bioinformatics, BioConductor
(www.bioconductor.org) is a well-known source of many different packages. BioConductor
packages can be installed in several ways, and there is a semi-automated tool
that allows you to install suites of BioC packages.
12 Vectors
Vectors and matrices (1- and 2-dimensional rectangular arrays of numbers) are pre-defined
data types in R. Operations with vectors and matrices may seem a bit abstract now, but
we need them to do useful things later. Vectors' only properties are their type (or class)
and length, although they can also have an associated list of names.
We've already seen two ways to create vectors in R:
1. A command in the console window or a script file listing the values, such as
> initialsize <- c(1, 3, 5, 7, 9, 11)
2. Using read.table. (For instance,
initialsize <- read.table("c:/temp/initialdata.txt"))
A vector can then be used in calculations as if it were a number (more or less)
> finalsize <- initialsize + 1
> newsize <- sqrt(initialsize)
> finalsize
[1] 2 4 6 8 10 12
> newsize
[1] 1.000000 1.732051 2.236068 2.645751 3.000000 3.316625
Notice that R applied each operation to every entry in the vector. Similarly, commands
like initialsize-5, 2*initialsize, initialsize/10 apply subtraction,
multiplication, and division to each element of the vector. The same is true for
> initialsize^2
[1] 1 9 25 49 81 121
In R the default is to apply functions and operations to vectors in an element-by-element
(or "vectorized") manner.
12.1 Functions for creating vectors
You can use the seq function to create a set of regularly spaced values. seq's syntax is:
x <- seq(from,to,by = )
x <- seq(from,to)
x <- seq(from,to,length.out = )
The first form generates a vector (from, from+by, from+2*by, ...) with the
last entry not extending further than to. In the second form the value of by is
assumed to be 1 or -1, depending on whether from or to is larger. The third form
creates a vector with the desired endpoints and length.
Do you find the above confusing? Except for from and to, it is recommended to
always explicitly name the arguments:
> seq(2, 10, by = 2)
[1] 2 4 6 8 10
> seq(2, 10, by = 30)
[1] 2
> seq(2, 10, length.out = 30)
[1] 2.000000 2.275862 2.551724 2.827586 3.103448 3.379310
[7] 3.655172 3.931034 4.206897 4.482759 4.758621 5.034483
[13] 5.310345 5.586207 5.862069 6.137931 6.413793 6.689655
[19] 6.965517 7.241379 7.517241 7.793103 8.068966 8.344828
[25] 8.620690 8.896552 9.172414 9.448276 9.724138 10.000000
The syntax from:to is a shortcut for seq(from,to):
> 1:8
[1] 1 2 3 4 5 6 7 8
Exercise 12.1 Use seq to create the vector v <- (1 5 9 13), and to create a
vector going from 1 to 5 in increments of 0.2.
You can use rep to create a constant vector such as (1 1 1 1); the basic syntax is
rep(values,lengths). For example,
> rep(3, 5)
[1] 3 3 3 3 3
creates a vector in which the value 3 is repeated 5 times. rep will repeat a whole vector
multiple times
> rep(1:3, 3)
[1] 1 2 3 1 2 3 1 2 3
or will repeat each of the elements in a vector a given number of times:
> rep(1:3, each = 3)
[1] 1 1 1 2 2 2 3 3 3
Even more flexibly, you can repeat each element in the vector a different number of
times:
> rep(c(3, 4), c(2, 5))
[1] 3 3 4 4 4 4 4
The value 3 was repeated 2 times, followed by the value 4 repeated 5 times. rep can be
a little bit mind-blowing as you get started, but it will turn out to be useful.
Table 3 lists some of the main functions for creating and working with vectors.
12.2 Vector indexing
You will often want to extract a specic entry or other part of a vector. This procedure is
called vector indexing, and uses square brackets ([]):
> z <- seq(20, 70, by = 10)
> z
[1] 20 30 40 50 60 70
> z[3]
[1] 40
(how would you use seq to construct z?) z[3] extracts the third item, or element,
in the vector z. You can also access a block of elements using the functions for vector
construction, e.g.
seq(from,to,by=1)        Vector of evenly spaced values, default increment = 1
seq(from,to,length.out)  Vector of evenly spaced values, specified length
c(u,v,...)               Combine a set of numbers and/or vectors into a single vector
rep(a,b)                 Create vector by repeating elements of a by amounts in b
as.vector(x)             Convert an object of some other type to a vector
hist(v)                  Histogram plot of values in v
mean(v),var(v),sd(v)     Estimate of population mean, variance, standard deviation
                         based on data values in v
cor(v,w)                 Correlation between two vectors
Table 3: Some important R functions for creating and working with vectors. Many of these have
other optional arguments; use the help system (e.g. ?cor) for more information. The statistical
functions such as var regard the values as samples from a population and compute an estimate of
the population statistic; for example sd(1:3) = 1.
> z[2:5]
[1] 30 40 50 60
extracts the second through fifth elements.
What happens if you enter v <- z[seq(1,5, by = 2)] ? Try it and see, and
make sure you understand what happened.
You can extract irregularly spaced elements of a vector. For example
> z[c(1, 2, 5)]
[1] 20 30 60
You can drop (or exclude) a particular set of elements, rather than taking a particular
set: you can do this with negative indices. For example, z[-1] extracts all but the first
element of a vector.
> z[-1]
[1] 30 40 50 60 70
> z[-c(2, 6)]
[1] 20 40 50 60
You can also use indexing to set specic values within a vector. For example,
> z[1] <- 12
changes the value of the first entry in z while leaving all the rest alone, and
> z[c(1, 3, 5)] <- c(22, 33, 196)
changes the first, third, and fifth values.
Exercise 12.2 Write a one-line command to extract a vector consisting of the second,
first, and third elements of z in that order.
Exercise 12.3 Write a script file that computes values of y = (x-1)/(x+1) for x = 1, 2, ..., 10,
and plots y versus x with the points plotted and connected by a line (hint: in ?plot,
search for type).
x < y less than
x > y greater than
x <= y less than or equal to
x >= y greater than or equal to
x == y equal to
x != y not equal to
Table 4: Some comparison operators in R. Use ?Comparison to learn more.
12.3 Logical operators
Logical operators return a value of TRUE or FALSE. For example, try:
> a <- 1
> b <- 3
> c <- a < b
> d <- (a > b)
> c
[1] TRUE
> d
[1] FALSE
The parentheses around (a>b) are optional but make the code easier to read. One special
case where you do need parentheses (or spaces) is when you make comparisons with
negative values. If you write a<-1, you will NOT be comparing a with -1, but
assigning 1 to a. Remember that <- (representing a left-pointing arrow) is equivalent
to = in R. To compare a with -1, you can do a < -1, or more safely a<(-1).
When we compare two vectors or matrices of the same size, or compare a number
with a vector or matrix, comparisons are done element-by-element. For example,
> x <- 1:5
> b <- (x <= 3)
> b
[1] TRUE TRUE TRUE FALSE FALSE
So if x and y are vectors, then (x==y) will return a vector of values giving the element-
by-element comparisons. If you want to know whether x and y are identical vectors, use
identical(x,y) which returns a single value: TRUE if each entry in x equals the
corresponding entry in y, otherwise FALSE. You can use ?Logical to read more about
logical operators. Note the difference between = and ==: can you figure out what
happened in the following cautionary tale?
> a <- 1:3
> b <- 2:4
> a == b
[1] FALSE FALSE FALSE
> a = b
> a == b
[1] TRUE TRUE TRUE
> identical(a, b)
[1] TRUE
> d <- 3
> identical(a, d)
[1] FALSE
> a == d
[1] FALSE TRUE FALSE
Exclamation points ! are used in R to mean "not"; != (not !==) means "not equal
to".
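For instance, a quick sketch:

```r
!TRUE     # negation: FALSE
!(3 > 5)  # TRUE
1 != 2    # "not equal to": TRUE
```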
R also does arithmetic on logical values, treating TRUE as 1 and FALSE as 0. So
sum(b) returns the value 3, telling us that three entries of x satisfied the condition
(x<=3). This is useful for (e.g.) seeing how many of the elements of a vector are larger
than a cutoff value. We get back to this later (section 12.4.1).
Build more complicated conditions by using logical operators to combine comparisons:
! Negation
&, && AND
|, || OR
OR is non-exclusive, meaning that x|y is true if either x or y or both are true. For
example, try
> a <- c(1, 2, 3, 4)
> b <- c(1, 1, 5, 5)
> (a < b) & (a > 3)
[1] FALSE FALSE FALSE TRUE
> (a < b) | (a > 3)
[1] FALSE FALSE TRUE TRUE
and make sure you understand what happened (if it's confusing, try breaking up the
expression and looking at the results of a<b and a>3 separately first). The two forms of
AND and OR differ in how they handle vectors. The shorter one does element-by-element
comparisons; the longer one only looks at the first element in each vector.
If you need exclusive OR, there is xor, which operates element-wise.
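A quick sketch of xor:

```r
x <- c(TRUE, TRUE, FALSE, FALSE)
y <- c(TRUE, FALSE, TRUE, FALSE)
xor(x, y)  # TRUE only when exactly one of each pair is TRUE
# [1] FALSE  TRUE  TRUE FALSE
```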
12.4 Vector indexing II
We can also use logical vectors (lists of TRUE and FALSE values) to pick elements out
of vectors. This is important, e.g., for subsetting data.
As a simple example, we might want to focus on just the low-light values of rmax in
the Chlorella example:
> X <- read.table("ChlorellaGrowth.txt")
> Light <- X[, 1]
> rmax <- X[, 2]
> lowLight <- Light[Light < 50]
> lowLightrmax <- rmax[Light < 50]
> lowLight
[1] 20 20 20 20 21 24 44
> lowLightrmax
[1] 1.73 1.65 2.02 1.89 2.61 1.36 2.37
What is really happening here (think about it for a minute) is that Light<50 generates
a logical vector the same length as Light (TRUE TRUE TRUE ...) which is
then used to select the appropriate values.
If you want the positions at which Light is lower than 50, you could say
(1:length(Light))[Light<50], but you can also use a built-in function:
which(Light<50). If you wanted the position at which the maximum value of Light
occurs, you could say which(Light==max(Light)). (This normally results in a
vector of length 1; when could it give a longer vector?) There is even a built-in command
for this specific task, which.max (although which.max always returns just the
first position at which the maximum occurs).
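For instance, with the Light vector from the Chlorella example:

```r
Light <- c(20, 20, 20, 20, 21, 24, 44, 60, 90, 94, 101)
which(Light == max(Light))  # every position where the maximum occurs
which.max(Light)            # only the first such position
# both return 11 here, since the maximum (101) occurs only once
```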
12.4.1 Logical values as 0, 1
In R (as in many other languages) we can use logical values as if they were numeric: we
can treat TRUE as 1 and FALSE as 0. (Note that, going the other way, any nonzero
number is treated as TRUE).
This can be very handy to find out how many elements fulfill a condition.
> vv <- c(1, 3, 10, 2, 9, 5, 4, 6:8)
How many elements are smaller than 5 in vv?
> length(which(vv < 5))
[1] 4
which is operating on a logical vector, not on vv directly, and length is counting
the length of the output from which.
> vv < 5
[1] TRUE TRUE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE
> vv.2 <- (vv < 5)
> vv.2
[1] TRUE TRUE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE
> which(vv.2)
[1] 1 2 4 7
> vv.3 <- which(vv.2)
> vv.3
[1] 1 2 4 7
> length(vv.3)
[1] 4
Do you know what the output from which is?
Alternatively, you can do:
> length(vv[vv < 5])
[1] 4
Instead of going through which, we just directly extract the relevant elements of vv,
and count how many there are. Implicitly, we are creating a new (temporary) vector, that
holds only the elements in vv that are smaller than 5, and we are counting the length of
that temporary vector.
> vv[vv < 5]
[1] 1 3 2 4
But the following can be at times easier to understand (or to use)
> sum(vv < 5)
[1] 4
Why did that work? What is vv < 5 returning?
12.5 Vectors: names of elements
As I mentioned in passing above, vectors can have names associated with their elements:
if they do, you can also extract elements by name (use names to find out the names).
> x <- c(first = 7, second = 5, third = 2)
> names(x)
[1] "first" "second" "third"
> x["first"]
first
7
> x[c("third", "first")]
third first
2 7
12.6 A few exercises with vectors: make sure you do these!
Exercise 12.4: What would happen if instead of setting lowLight you replaced Light
by saying Light <- Light[Light<50], and then
rmax <- rmax[Light<50]? Why would that be wrong? Try it with some temporary
variables: for example, set Light2 <- Light and rmax2 <- rmax and then
play with Light2 and rmax2 so you don't mess up your working variables. Now, work
out what happened...
(If you make a mistake and overwrite Light or rmax, use source again on the file
regressionExample.R, or load the data, etc.)
Just in case, let's make sure things are OK again:
> Light <- c(20, 20, 20, 20, 21, 24, 44, 60, 90, 94,
+ 101)
> rmax <- c(1.73, 1.65, 2.02, 1.89, 2.61, 1.36, 2.37,
+ 2.08, 2.69, 2.32, 3.67)
Exercise 12.5 :
We can also combine logical operators (making sure to use the element-by-element &
and | versions of AND and OR):
> Light[(Light < 50) & (rmax <= 2)]
[1] 20 20 20 24
> rmax[(Light < 50) & (rmax <= 2)]
[1] 1.73 1.65 1.89 1.36
How many elements in Light (or rmax) fulfill the condition
Light<50 & rmax <= 2.0? Do it in at least two different ways: one using
length and another using sum over a logical vector(s).
Exercise 12.6 runif(n) is a function (more on it soon) that generates a vector of n
random, uniformly distributed numbers between 0 and 1. Create a vector of 20 numbers,
then select the subset of those numbers that is less than the mean.
Exercise 12.7 Find the positions of the elements that are less than the mean of the
vector you just created (e.g. if your vector were (0.1 0.9 0.7 0.3) the answer
would be (1 4)).
Exercise 12.8 : Using the vv vector above, return a new vector with elements larger
than 8 excluded. Do it at least once using negative indices. Please note that I do not want
a reordering of the new vector (i.e., do not return the trivial solution obtained by 1:8).
13 Interlude: comparing floats
Comparing very similar numeric values can be tricky: rounding can happen, and some
numbers cannot be represented exactly in binary (computer) notation. By default R
displays 7 significant digits (options("digits")). For example:
> x <- 1.999999
> x
[1] 1.999999
> x - 2
[1] -1e-06
> x <- 1.9999999999999
> x
[1] 2
> x - 2
[1] -9.992007e-14
All the digits are still there, in the second case, but they are not shown. Also note that
x-2 is not exactly -1e-13; this is unavoidable.
Why is the above unavoidable? Because of the way computers represent numbers. We
cannot get into details, but see the following example, from the R FAQ (question 7.31):
7.31 Why doesn't R think these numbers are equal?
The only numbers that can be represented exactly in R's numeric type are
integers and fractions whose denominator is a power of 2. Other numbers have
to be rounded to (typically) 53 binary digits accuracy. As a result, two floating
point numbers will not reliably be equal unless they have been computed
by the same algorithm, and not always even then. For example
R> a <- sqrt(2)
R> a * a == 2
[1] FALSE
R> a * a - 2
[1] 4.440892e-16
The take-home message: be extremely suspicious whenever you see an equality comparison
of two floating-point numbers; that is unlikely to do what you want.
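A safer idiom (sketched here) is to compare within a tolerance, e.g. via all.equal:

```r
a <- sqrt(2)
a * a == 2                   # FALSE: exact comparison fails
isTRUE(all.equal(a * a, 2))  # TRUE: equality within a small tolerance
abs(a * a - 2) < 1e-9        # TRUE: the same idea done by hand
```

(all.equal returns TRUE or a description of the difference, which is why it is wrapped in isTRUE when used as a test.)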
14 Factors
Factors are a special type of vectors. We need them to differentiate between a vector
of characters and a vector that represents categorical variables. The vector char.vec
<- c("abc", "de", "fghi") contains several character strings. Now, suppose
we have a study where we record the sex of participants. When we analyze the data we
want R to know that this is a categorical variable, where each label represents a possible
value of the category:
> Sex.version1 <- factor(c("Female", "Female", "Female",
+ "Male", "Male"))
> Sex.version2 <- factor(c("XX", "XX", "XX", "XY",
+ "XY"))
> Sex.version3 <- factor(c("Feminine", "Feminine",
+ "Feminine", "Masculine", "Masculine"))
> Sex.version4 <- factor(c("fe", "fe", "fe", "ma",
+ "ma"))
We want all those codifications of the sex of five subjects to yield the same results
of analysis, regardless of what, exactly, the labels say. Each set of labels might have
its pros/cons (e.g., the third is probably coding gender, not sex; the last is too cryptic;
the second works only for some species; etc). Regardless of the labels, the key thing to
notice is that the first three subjects are of the same type, and the last two subjects are of
a different type.
Recognizing factors is essential when dealing with variables that look like legitimate
numbers:
> postal.code <- c(28001, 28001, 28016, 28430, 28460)
> somey <- c(10, 20, 30, 40, 50)
> summary(aov(somey ~ postal.code))
Df Sum Sq Mean Sq F value Pr(>F)
postal.code  1 782.53  782.53  10.795 0.04623 *
Residuals    3 217.47   72.49
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The above is doing something silly: it is fitting a linear regression, because it is taking
postal.code as a legitimate numeric value. But we know that there is no sense in which
28009 and 28016 (two districts in Madrid) are 7 units apart whereas 28430 and 28410
35
are 20 units apart (two nearby villages north of Madrid), nor do we expect to nd linear
relationships with (the number of the) postal code itself.
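One fix, sketched here with the same made-up postal codes and response: convert the variable to a factor before fitting, so that aov treats it as a categorical grouping (here with 4 levels, hence 3 degrees of freedom) rather than as a number:

```r
postal.code <- c(28001, 28001, 28016, 28430, 28460)
somey <- c(10, 20, 30, 40, 50)
postal.f <- factor(postal.code)   ## 4 levels: 28001, 28016, 28430, 28460
nlevels(postal.f)                 ## 4
## aov now fits one mean per postal code (3 df), not a slope
summary(aov(somey ~ postal.f))
```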
Sometimes, when reading data, a variable will be converted to a factor even though it is really a numeric variable. How do we turn it back into the original set of numbers?
This does not work:
> x <- c(34, 89, 1000)
> y <- factor(x)
> y
[1] 34 89 1000
Levels: 34 89 1000
> as.numeric(y)
[1] 1 2 3
> y
[1] 34 89 1000
Levels: 34 89 1000
Note that the values have been re-codified. An easy way to do the conversion is (you should understand what is happening here):
> as.numeric(as.character(y))
[1] 34 89 1000
> as.character(y)
[1] "34" "89" "1000"
15 A simple hypothesis test: the t-test
Suppose we have 50 patients, 30 of which have colon cancer and 20 of which have lung cancer. And we have expression data for one gene (say GenA). We would like to know whether the (mean) expression of GenA is the same or not between the two groups. A t-test is well suited for this case. Here, we will use simulated data.
Simulated data? Generating simulated data is extremely useful for testing many procedures, emulating specific processes, etc.
15.1 Generating random numbers
R offers (pseudo)random number generators from which we can obtain random variates
for many different distributions. For instance, do
> help(rnorm)
> help(runif)
> help(rpois)
(i.e., look at the help for those functions).
In fact, those numbers are generated using an algorithm. This comes from Wikipedia (http://en.wikipedia.org/wiki/Pseudorandom_number_generator):
A pseudorandom number generator (PRNG) is an algorithm for generating
a sequence of numbers that approximates the properties of random numbers.
The sequence is not truly random in that it is completely determined by a
relatively small set of initial values, called the PRNG's state.
And
A PRNG can be started from an arbitrary starting state, using a seed state. It
will always produce the same sequence thereafter when initialized with that
state.
Now, since we all have different machines, the actual outcome of doing, say
> rnorm(5)
[1] -1.71405268 0.06374096 0.66582525 -2.22407809 -1.45897032
is likely to differ.
What can we do to get the exact same random numbers? The quote from Wikipedia just told us: we will set the seed of the random number generator to force the generator to produce the same sequence of random numbers on all computers:
> set.seed(2)
(setting the seed to 2 has no particular meaning; we could have used another integer; what matters is that we all use the same seed). So let's obtain 50 independent random samples from a normal distribution and create a vector for the identifiers (the labels) of the subjects.
> GenA <- rnorm(50)
> round(GenA, 3)
[1] -0.897 0.185 1.588 -1.130 -0.080 0.132 0.708 -0.240
[9] 1.984 -0.139 0.418 0.982 -0.393 -1.040 1.782 -2.311
[17] 0.879 0.036 1.013 0.432 2.091 -1.200 1.590 1.955
[25] 0.005 -2.452 0.477 -0.597 0.792 0.290 0.739 0.319
[33] 1.076 -0.284 -0.777 -0.596 -1.726 -0.903 -0.559 -0.247
[41] -0.384 -1.959 -0.842 1.904 0.622 1.991 -0.305 -0.091
[49] -0.184 -1.199
> (Type <- factor(c(rep("Colon", 30), rep("Lung", 20))))
[1] Colon Colon Colon Colon Colon Colon Colon Colon Colon Colon
[11] Colon Colon Colon Colon Colon Colon Colon Colon Colon Colon
[21] Colon Colon Colon Colon Colon Colon Colon Colon Colon Colon
[31] Lung Lung Lung Lung Lung Lung Lung Lung Lung Lung
[41] Lung Lung Lung Lung Lung Lung Lung Lung Lung Lung
Levels: Colon Lung
Let's do a t-test:
> t.test(GenA ~ Type)
Welch Two Sample t-test
data: GenA by Type
t = 1.2592, df = 44.05, p-value = 0.2146
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.2394925 1.0371620
sample estimates:
mean in group Colon mean in group Lung
0.2286718 -0.1701629
As should be the case (we know it, because we generated the data), the means of the two groups are very similar, and there are no significant differences between the groups.
Let us now generate data where there really are differences between the groups, and
repeat the test.
> (GenB <- c(rep(-1, 30), rep(2, 20)) + rnorm(50))
[1] -1.83828715 1.06630136 -1.56224705 0.27571551 -2.04757263
[6] -2.96587824 -1.32297109 -0.06413747 0.13922980 0.67161877
[11] -2.78824221 1.03124252 -1.70314433 -0.84183524 -0.49376520
[16] -1.81999511 -2.99884699 -1.47929259 -0.91582010 -1.89548661
[21] -1.92127567 -0.66955050 -1.14166081 -0.56515224 -1.05372263
[26] -1.90711038 0.30351223 -0.22821022 0.05252560 -2.41003834
[31] 2.99598459 0.30423510 1.46662786 0.62773055 -0.20791978
[36] 3.82212252 1.34660659 1.71531878 1.61305040 2.38669497
[41] 3.60039085 3.68115496 0.81639361 0.64154275 0.48732921
[46] 0.74689510 3.95935708 2.00764587 1.15738480 1.39883989
> t.test(GenB ~ Type)
Welch Two Sample t-test
data: GenB by Type
t = -7.8114, df = 37.764, p-value = 2.106e-09
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-3.481516 -2.048162
sample estimates:
mean in group Colon mean in group Lung
-1.036470 1.728369
Now we find a large and significant difference between the two groups.
16 Matrices
16.1 Creating matrices
Like vectors, you can create matrices by using read.table to read in values from a data file. (Actually, this creates a data frame, which is almost the same as a matrix; see below.) You can also create matrices of numbers by creating a vector of the matrix entries, and then reshaping them to the desired number of rows and columns using the function matrix. For example
> X <- matrix(1:6, nrow = 2, ncol = 3)
> X
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
takes the values 1 to 6 and reshapes them into a 2 by 3 matrix.
By default R loads the values into the matrix column-wise (this is probably counterintuitive since we're used to reading tables row-wise). Use the optional argument byrow to change this behavior. For example:
> A <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)
> A
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
[3,] 7 8 9
R will recycle through entries in the data vector, if necessary, to fill a matrix of the specified size. So for example
> matrix(1, nrow = 10, ncol = 10)
creates a 10 × 10 matrix of ones.
R will also collapse a matrix to behave like a vector whenever it makes sense: for
example
> sum(X)
[1] 21
Exercise 16.1 Use a command of the form X <- matrix(v, nrow=2, ncol=4), where v is a data vector, to create the following matrix X:
[,1] [,2] [,3] [,4]
[1,] 1 1 1 1
[2,] 2 2 2 2
Exercise 16.2 And what if we want the matrix shown below?
matrix(v,nrow=m,ncol=n)   m × n matrix using the values in v
t(A)                      transpose (exchange rows and columns) of matrix A
dim(X)                    dimensions of matrix X. dim(X)[1] = # rows,
                          dim(X)[2] = # columns
data.entry(A)             call up a spreadsheet-like interface to edit the values in A
diag(v,n)                 diagonal n × n matrix with v on the diagonal, 0 elsewhere
                          (v is 1 by default, so diag(n) gives an n × n identity matrix)
cbind(a,b,c,...)          combine compatible objects by attaching them along columns
rbind(a,b,c,...)          combine compatible objects by attaching them along rows
as.matrix(x)              convert an object of some other type to a matrix, if possible
Table 5: Some important functions for creating and working with matrices. Many of these have additional optional arguments; use the help system for full details.
> matrix(rep(1:2, 4), nrow = 2, byrow = TRUE)
[,1] [,2] [,3] [,4]
[1,] 1 2 1 2
[2,] 1 2 1 2
Finally, you can use the data.entry function. This function can only edit existing
matrices, but for example (try this now!!)
> AA <- matrix(0, nrow = 3, ncol = 4)
> data.entry(AA)
will create AA as a 3 × 4 matrix, and then call up a spreadsheet-like interface in which
you can change the values to whatever you need.
16.2 Matrix indexing
Matrix indexing is like vector indexing except that you have to specify both the row and
column, or range of rows and columns. For example z <- A[2,3] sets z equal to 6,
which is the (2nd row, 3rd column) entry of the matrix A that you recently created, and
> (A <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE))
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
[3,] 7 8 9
> A[2, 2:3]
[1] 5 6
> B <- A[2:3, 1:2]
> B
[,1] [,2]
[1,] 4 5
[2,] 7 8
There is an easy shortcut to extract entire rows or columns: leave out the limits, leaving a blank before or after the comma.
> first.row <- A[1, ]
> first.row
[1] 1 2 3
> second.column <- A[, 2]
> second.column
[1] 2 5 8
(What does A[,] do?)
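A related trick, not shown above but worth knowing (this sketch redefines the 3 × 3 matrix A from the start of this section): as with vectors, negative indices drop rows or columns.

```r
A <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)
A[-1, ]        ## all rows except the first
A[, -c(2, 3)]  ## drop columns 2 and 3 (a vector, not a matrix; see
               ## the later section on dropping dimensions)
A[-1, -1]      ## the 2 x 2 lower-right block: rows 2:3, columns 2:3
```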
As with vectors, indexing also works in reverse for assigning values to matrix entries.
For example,
> A[1, 1] <- 12
> A
[,1] [,2] [,3]
[1,] 12 2 3
[2,] 4 5 6
[3,] 7 8 9
You can do the same with blocks, rows, or columns, for example
> A[1, ] <- c(2, 4, 5)
> A
[,1] [,2] [,3]
[1,] 2 4 5
[2,] 4 5 6
[3,] 7 8 9
If you use which on a matrix, R will normally treat the matrix as a vector, so for example which(A == 8) will give the answer 6 (figure out why). However, which does have an option that will treat its argument as a matrix:
> which(A == 8, arr.ind = TRUE)
row col
[1,] 3 2
16.3 cbind and rbind
If their sizes match, you can combine vectors to form matrices, and stick matrices together
with vectors or other matrices. cbind ("column bind") and rbind ("row bind") are the functions to use.
cbind binds together columns of two objects. One thing it can do is put vectors
together to form a matrix:
> C <- cbind(1:3, 4:6, 5:7)
> C
[,1] [,2] [,3]
[1,] 1 4 5
[2,] 2 5 6
[3,] 3 6 7
Remember that R interprets vectors as row or column vectors according to what you're doing with them. Here it treats them as column vectors so that columns exist to be bound together. On the other hand,
> D <- rbind(1:3, 4:6)
> D
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
treats them as rows. Now we have two matrices that can be combined.
Exercise 16.3 Verify that rbind(C,D) works, cbind(C,C) works, but cbind(C,D) doesn't. Why not?
17 Other structures: Lists and data frames
17.1 Lists
While vectors and matrices may seem pretty familiar, lists are probably new to you. Vectors and matrices have to contain elements that are all the same type; lists in R can contain anything: vectors, matrices, other lists, etc. Indexing is a little different too: use [[ ]] to extract an element of a list by number or name, or $ to extract an element by name (only). Given a list like this:
> L <- list(A = x, B = y, C = c("a", "b", "c"))
Then L$A, L[["A"]], and L[[1]] will all grab the first element of the list.
A more complex list, which includes another list inside:
> (listA <- list(one.vector = 1:10, hello = "Hola",
+ one.matrix = matrix(rnorm(20), ncol = 5), another.list = list(a = 5,
+ b = factor(c("male", "female", "female")))))
$one.vector
[1] 1 2 3 4 5 6 7 8 9 10
$hello
[1] "Hola"
$one.matrix
[,1] [,2] [,3] [,4] [,5]
[1,] 1.0744594 -0.8621983 -0.4213736 0.4718595 1.2309537
[2,] 0.2605978 2.0480403 -0.3508344 1.3589398 1.1471368
[3,] -0.3142720 0.9399201 -1.0273806 0.5641686 0.1065980
[4,] -0.7496301 2.0086871 -0.2505191 0.4559801 -0.7833167
$another.list
$another.list$a
[1] 5
$another.list$b
[1] male female female
Levels: female male
Note that many functions in R return lists.
17.2 Data frames
Data frames are the solution to the problem that vectors and matrices (which might seem
to be the most natural way to store data) can only contain a single type of data, but most
data sets have several different kinds of variables for each observation. Data frames are
a hybrid of lists and vectors; internally, they are a list of vectors that may be of different
types but must all be the same length, but you can do most of the same things with them
(e.g., extracting a subset of rows) that you can do with matrices. You can index them either the way you would index a list, using [[ ]] or $ (where each variable is a different item in the list), or the way you would index a matrix. Use as.matrix if
you have a data frame (where all variables are the same type) that you really want to be
a matrix, e.g. if you need to transpose it (use as.data.frame to go the other way).
data.matrix might be more appropriate to convert a data frame into a matrix if you
have variables of different types.
Above, we ended up with a data frame when we read some data. Do you remember
where? How did the object look?
> (another.data.set <- read.table("AnotherDataSet.txt",
+ header = TRUE))
ID Age Sex Y
1 a1 12 M 23.4
2 a2 14 M 25.0
3 a3 13 F 22.0
4 a7 16 F 26.0
5 b15 19 F 24.3
> data.matrix(another.data.set)
ID Age Sex Y
[1,] 1 12 2 23.4
[2,] 2 14 2 25.0
[3,] 3 13 1 22.0
[4,] 4 16 1 26.0
[5,] 5 19 1 24.3
> as.matrix(another.data.set)
ID Age Sex Y
[1,] "a1" "12" "M" "23.4"
[2,] "a2" "14" "M" "25.0"
[3,] "a3" "13" "F" "22.0"
[4,] "a7" "16" "F" "26.0"
[5,] "b15" "19" "F" "24.3"
18 R programming
R is a programming language. Comprehensive coverage is given in the books by Chambers (Software for Data Analysis: Programming with R), Gentleman (R Programming for Bioinformatics), and Venables and Ripley (S Programming); see details in http://www.r-project.org/doc/bib/R-books.html. Chapters 9 and 10 of An Introduction to R (http://cran.r-project.org/doc/manuals/R-intro.html) provide a fast but thorough introduction to the main concepts of programming in R.
It is important to emphasize that the ease of combining programming with canned statistical procedures gives R a definite advantage over other languages in Bioinformatics (and explains its fast adoption).
18.1 Flow control
R contains the usual constructions for flow control and conditional execution: if, ifelse, for, while, repeat, switch, break. Before continuing, please note that for loops are rarely the best solution: the apply family of functions (see below) is often a better approach.
A few examples follow. Make sure you understand them.
for iterates over sets. The elements need not be numbers:
> names.of.friends <- c("Ana", "Rebeca", "Marta", "Quique",
+ "Virgilio")
> for (friend in names.of.friends) {
+ cat("\n I should call", friend, "\n")
+ }
I should call Ana
I should call Rebeca
I should call Marta
I should call Quique
I should call Virgilio
but they can be numbers:
> plot(c(0, 10), c(0, 10), xlab = "", ylab = "")
> for (i in 1:10) abline(h = i, lty = i, lwd = 2)
(note that in the example above I could get away without using braces {} after
the for because the complete expression is only one command, abline).
while keeps repeating a set of instructions as long as a condition holds (it stops as soon as the condition fails). In this example, the condition is actually the conjunction of two conditions. Make sure you understand what is happening.
> x <- y <- 0
> iteration <- 1
> while ((x < 10) & (y < 2)) {
+ cat(" ... iteration", iteration, "\n")
+ x <- x + runif(1)
+ y <- y + rnorm(1)
+ iteration <- iteration + 1
+ }
... iteration 1
... iteration 2
... iteration 3
... iteration 4
... iteration 5
... iteration 6
... iteration 7
... iteration 8
... iteration 9
... iteration 10
... iteration 11
> x
[1] 5.077824
> y
[1] 2.108366
while is often combined with break to bail out of a loop as soon as something interesting happens (and that something is detected with an if). A common approach is to set the loop to continue forever (I won't show the output, but please do try it, and understand it).
> iteration <- 0
> while (TRUE) {
+ iteration <- iteration + 1
+ cat(" ... iteration", iteration, "\n")
+ x <- rnorm(1, mean = 5)
+ y <- rnorm(1, mean = 7)
+ z <- x * y
+ if (z < 15)
+ break
+ }
Did you notice that iteration <- iteration + 1 is now in a different place,
and we initialize it with a value of 0?
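repeat and switch appeared in the list of flow-control constructs above but were not demonstrated. Two toy sketches (the function center and its data are made up for illustration):

```r
## repeat is an explicit infinite loop: you must break out yourself
i <- 0
repeat {
    i <- i + 1
    if (i >= 3) break
}
i  ## 3

## switch selects one of several branches by name
center <- function(x, type = "mean") {
    switch(type,
           mean   = mean(x),
           median = median(x),
           stop("unknown type"))
}
center(c(1, 2, 9))            ## 4
center(c(1, 2, 9), "median")  ## 2
```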
18.2 Dening your own functions
As the Introduction to R manual says (http://cran.r-project.org/doc/
manuals/R-intro.html#Writing-your-own-functions)
(...) learning to write useful functions is one of the main ways to make your
use of R comfortable and productive.
It should be emphasized that most of the functions supplied as part of the R
system, such as mean(), var(), postscript() and so on, are themselves written
in R and thus do not differ materially from user written functions.
Here we only cover the very basics. See the Introduction to R for more details, and
the books above for even more coverage.
You can define a function by doing
the.name.of.my.function <- function(arg1, arg2, arg3, ...) {
#what my function does
}
In the above, you substitute the.name.of.my.function with, well, the name of your function, and arg1, arg2, arg3 with the arguments of the function. Then, you write the R code in the place where it says #what my function does. (By the way, # is the sign that marks the beginning of a comment.) The ... refers to other arguments passed on to functions called inside the main function (we won't get into this).
For example
> multByTwo <- function(x) {
+ z <- 2 * x
+ return(z)
+ }
> a <- 3
> multByTwo(a)
[1] 6
> multByTwo(45)
[1] 90
Another example
> plotAndSummary <- function(x) {
+ plot(x)
+ print(summary(x))
+ }
> x <- rnorm(50)
(I won't be showing the output of plotAndSummary, but make sure you do it, and understand what is going on).
> plotAndSummary(x)
> plotAndSummary(runif(24))
Using more arguments, one of them default:
> plotAndLm <- function(x, y, title = "A figure") {
+ lm1 <- lm(y ~ x)
+ cat("\n Printing the summary of x\n")
+ print(summary(x))
+ cat("\n Printing the summary of y\n")
+ print(summary(y))
+ cat("\n Printing the summary of the linear regression\n")
+ print(summary(lm1))
+ plot(y ~ x, main = title)
+ abline(lm1)
+ return(lm1)
+ }
> x <- 1:20
> y <- 5 + 3 * x + rnorm(20, sd = 3)
(Again, I am not showing the output here. But make sure you understand what is
happening!)
> plotAndLm(x, y)
> plotAndLm(x, y, title = "A user specified title")
> output1 <- plotAndLm(x, y, title = "A user specified title")
Make sure you understand the difference between
> plotAndLm(x, y)
and
> out2 <- plotAndLm(x, y)
(hint: in the last call, you are assigning something to out2. Look at the last line of
the function plotAndLm). Not all functions return something, and many functions do not plot or print anything either. You decide what your functions do, print, plot, etc.
And, as we will see below, functions are often defined on the fly.
19 The apply family
One of the great strengths of R is operating over whole vectors, arrays, lists, etc. Some
available functions are: apply, lapply, sapply, tapply, mapply. Please
look at the help for these functions. Here we will show some examples, and you should
understand what is happening:
> (Z <- matrix(c(1, 27, 23, 13), nrow = 2))
[,1] [,2]
[1,] 1 23
[2,] 27 13
> apply(Z, 1, median)
[1] 12 20
> apply(Z, 2, median)
[1] 14 18
> apply(Z, 2, min)
[1] 1 13
For those of you who have programmed before: using the apply family is often much, much, much more efficient (and elegant, and easy to understand) than using explicit loops.
With lists we will use lapply. For example, let's look at the first element of each of the components of the list (see how we define a function on the fly):
> (listA <- list(one.vector = 1:10, hello = "Hola",
+ one.matrix = matrix(rnorm(20), ncol = 5), another.list = list(a = 5,
+ b = factor(c("male", "female", "female")))))
$one.vector
[1] 1 2 3 4 5 6 7 8 9 10
$hello
[1] "Hola"
$one.matrix
[,1] [,2] [,3] [,4] [,5]
[1,] 0.5923207 -0.7384038 -1.0733357 -0.6598970 0.4314560
[2,] -0.1540484 -1.9271588 -0.7559848 -0.6138839 -0.4005427
[3,] -1.8517709 -1.3494545 0.3523818 0.4264668 -1.9643583
[4,] 0.6406335 -0.7560543 -0.4299626 2.1331745 0.8796542
$another.list
$another.list$a
[1] 5
$another.list$b
[1] male female female
Levels: female male
> lapply(listA, function(x) x[[1]])
$one.vector
[1] 1
$hello
[1] "Hola"
$one.matrix
[1] 0.5923207
$another.list
[1] 5
When we have data that can be used to stratify or select other data, we often use
tapply:
> (one.dataframe <- data.frame(age = c(12, 13, 16,
+ 25, 28), sex = factor(c("male", "female", "female",
+ "male", "male"))))
age sex
1 12 male
2 13 female
3 16 female
4 25 male
5 28 male
> tapply(one.dataframe$age, one.dataframe$sex, mean)
female male
14.50000 21.66667
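sapply and mapply were listed at the start of this section but not demonstrated. A small sketch (toy data, not from the notes): sapply works like lapply but simplifies the result to a vector or matrix when possible, and mapply walks over several vectors in parallel:

```r
## sapply simplifies the result when possible:
sapply(list(a = 1:3, b = 4:9), length)  ## named vector: a = 3, b = 6
## mapply applies the function to the 1st elements, then the 2nd, etc.:
mapply(function(x, y) x + y, 1:3, c(10, 20, 30))  ## 11 22 33
```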
20 Matrices (II)
20.1 Dropping dimensions
Look at the different outputs of the selection operation:
> (E <- matrix(1:9, nrow = 3))
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
> E[, 1]
[1] 1 2 3
> E[, 1, drop = FALSE]
[,1]
[1,] 1
[2,] 2
[3,] 3
> E[1, ]
[1] 1 4 7
> E[1, , drop = FALSE]
[,1] [,2] [,3]
[1,] 1 4 7
Unless we use drop = FALSE, if we select only one row or one column, the result is not a matrix, but a vector [1]. But sometimes we do need them to remain matrices. That is often the case in many matrix operations, and also when using apply and related functions.
Suppose we select automatically (with some procedure) a set of rows that interest us
in the matrix.
> rows.of.interest <- c(1, 3)
We can do
> apply(E[rows.of.interest, ], 1, median)
[1] 4 6
Now, imagine that in a particular case, rows.of.interest only has one element:
> rows.of.interest <- c(3)
> apply(E[rows.of.interest, ], 1, median)
What is the error message suggesting?
But we can make sure our procedure does not crash by using drop = FALSE:
> apply(E[rows.of.interest, , drop = FALSE], 1, median)
[1] 6
The situation above can be even more confusing with some matrix operations.
To summarize: when you select only a single column or a single row of an array, think about whether the output should be a vector or a matrix.

[1] Row vector? Column vector? That is a longer discussion than warranted here; nothing with two dimensions, anyway.
21 An example that brings a few things together
This should not be mysterious now (you might want to look at the help for hist and
order). We want to reproduce a fairly common analysis that is done in genomics; here
we will simulate the data. The steps are:
1. Generate a random data set (samples in columns, variables or genes in rows); there are 50 subjects and 500 genes.
2. Of the 50 subjects, the first 20 belong to one group and the remaining 30 to another (in the code below the groups are labeled "sano" and "enfermo": healthy and diseased).
3. For each gene (variable, row) do a t-test.
4. Find out how many p-values are below 0.05, and order the p-values. Plot the p-values.
> data <- matrix(rnorm(50 * 500), ncol = 50)
> clase <- factor(c(rep("sano", 20), rep("enfermo",
+ 30)))
Do we know what the output from a t-test looks like? What do we want to select?
Let's play a little bit:
> tmp <- t.test(data[1, ] ~ clase)
> tmp
Welch Two Sample t-test
data: data[1, ] by clase
t = 0.5638, df = 37.635, p-value = 0.5762
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.4723414 0.8368225
sample estimates:
mean in group enfermo mean in group sano
-0.2456990 -0.4279395
> attributes(tmp)
$names
[1] "statistic" "parameter" "p.value" "conf.int"
[5] "estimate" "null.value" "alternative" "method"
[9] "data.name"
$class
[1] "htest"
> tmp$p.value
[1] 0.5762442
OK, so now we know what to select (note: I do NOT show the results of the computations here!):
> resultado <- apply(data, 1, function(x) t.test(x ~
+ clase)$p.value)
> hist(resultado)
> order(resultado)
> which(resultado < 0.05)
> sum(resultado < 0.05)
[1] 29
Now, repeat all of this by running the script:
> source("lastExample.R")
> sum(resultado < 0.05)
[1] 18
Please look at lastExample.R, and understand what happened. You need to repeat the sum(resultado < 0.05). You could edit lastExample.R to explicitly print results. Or you can use source("lastExample.R", echo = TRUE).
Please, understand what is happening.
22 Error messages
The best way to learn to use R is to use it. As explained before, mistakes are harmless, so you should play and experiment. However, there are two key attitudes that will make your learning a lot faster: first, using the help system, and second, paying attention to the error messages. Yes, the error messages are written in English, not some weird, unintelligible language. Sometimes they are a little bit cryptic, but more often than not, if you read them carefully, you will see how to approach the problem to fix the mistake, or will realize that what you typed makes no sense.
Let's look at a few. These are not representative or common or anything like it. But you should read them, understand them, and think about how to take corrective action (or realize that I was trying to do something silly).
> apply(F, 1, mean)
> log("23")
> rnorm("a")
> lug(23)
> rnorm(23, 1, 1, 1, 34)
> x <- 1:10
> y <- 11:21
> plot(x, y)
> lm(y ~ x)
> z <- 1:10
> t.test(x ~ z)
23 Additional exercises
Exercise 23.1 Make a copy of regressionExample.R under a new name, and modify the
copy so that it does linear regression of growth rate on the natural log of light intensity,
LogLight <- log(Light), and plots the data appropriately. Use the arguments
xlab and ylab to the plot command to appropriately label the axes.
Exercise 23.2 The axes in plots are scaled automatically, but the outcome is not always ideal (e.g. if you want several graphs with exactly the same axis limits). You can use the xlim and ylim arguments in plot to control the limits:
plot(x, y, xlim=c(x1,x2), [other stuff])
will draw the graph with the x-axis running from x1 to x2, and using ylim=c(y1,y2) within the plot command will do the same for the y-axis.
Create a plot of growth rate versus light intensity with the x-axis running from 0 to
120 and the y-axis running from 1 to 4.
Exercise 23.3 You can place several graphs within a single figure by using the par function (short for "parameter") to adjust the layout of the plot. For example, the command
> par(mfrow = c(m, n))
divides the plotting area into m rows and n columns. As R draws a series of graphs, it places them along the top row from left to right, then along the next row, and so on. mfcol=c(m,n) has the same effect except that R draws successive graphs down the first column, then down the second column, and so on.
Use mfrow=c(2,1) to create graphs of growth rate as a function of Light, and of log(growth rate) as a function of log(Light) in the same figure. Do the same again, using mfrow=c(1,2).
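A minimal sketch of the layout mechanics with simulated data (not the Light data from the exercises; x, y, and the panel titles are made up):

```r
## Two panels, one above the other
par(mfrow = c(2, 1))
x <- 1:50
y <- exp(0.05 * x + rnorm(50, sd = 0.1))
plot(x, y, main = "Raw scale")       ## drawn in the top panel
plot(x, log(y), main = "Log scale")  ## drawn in the bottom panel
par(mfrow = c(1, 1))                 ## restore the default single panel
```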
References
Dudoit, S., R. C. Gentleman, and J. Quackenbush (2003). Open source software for the analysis of microarray data. Biotechniques Suppl, 45-51.
Fussmann, G., S. P. Ellner, K. W. Shertzer, and N. G. Hairston (2000). Crossing the Hopf bifurcation in a live predator-prey system. Science 290, 1358-1360.
Ihaka, R. and R. Gentleman (1996). R: A language for data analysis and graphics. Journal of Computational and Graphical Statistics 5, 299-314.
Venables, W. and B. Ripley (2002). Modern Applied Statistics with S. Springer.