R Textbook Full
Version 2
Sheri Sanders
Copyright © 2020 Sheri Sanders
Self-published
Licensed under the Creative Commons Attribution-NonCommercial 3.0 Unported License (the "License"). You may not use this file except in compliance with the License. You may obtain a copy of the License at http://creativecommons.org/licenses/by-nc/3.0. Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Image sources:
Cover and fill: https://pixabay.com/en/structure-math-biology-network-1473346/
Chapter 1: https://pixabay.com/en/digital-abstract-binary-code-1742687/
Chapter 2: https://pixabay.com/en/dna-biology-medicine-gene-163466/
Chapter 3: Google Watercolor image from IU’s CIB
Chapter 4: https://en.wikipedia.org/wiki/Genetic_history_of_Europe/media/File:PCA_of_the_combined_autosomal_SNP_data_of_Jewish_and_Eurasians.png
Chapter 5: https://libreshot.com/binary-code/
Chapter 6: Data from Lab
Chapter 7: https://en.wikipedia.org/wiki/Genetic_history_of_Europe/media/File:PCA_of_the_combined_autosomal_SNP_data_of_Jewish_and_Eurasians.png
4 R Lab 2: Ordination in R
4.1 Introduction to PCA
4.1.1 Eigenvalues and Eigenvectors
4.2 A Simple PCA using Vegan
4.2.1 Data Clean Up
4.2.2 Compute the Principal Components
4.2.3 Plotting the PCA
4.2.4 Data Exploration
4.3 Principal Coordinate Analysis
4.3.1 Distance calculations
4.3.2 Computing the components
4.3.3 Graphing the PCoA
4.3.4 More on Vegan
4.4 General Notes
4.4.1 A note on functions
4.5 Wrap up and back to the biology
8 Answers to Labs
8.1 Lab 1
8.2 Lab 2
8.3 Lab 3
8.4 Alternative R Lab 2
1. Using and Manipulating Basic R Data Types
2. Use RStudio
A very useful program to look into is RStudio. RStudio is an integrated development environment (IDE) for R that makes viewing graphs, etc. much easier. It installs on your machine but requires you to first have R installed (see 1). I tend to use this version for everything I do.
3. Use RStudio online
This is what we will do for this course. RStudio Server has the advantage of having a really nice interface to work with, especially when you are learning the language. The software can be run on a virtual machine (VM), meaning I can set up all the libraries, etc. that you will need. I can provide you all with exactly the same hardware, software versions, etc. regardless of what is on your laptop, making everyone's lives easier. I can also sign in as you and view your data, even if you are taking the course from a distance – which is super helpful. You can also use the publicly available image (on XSEDE Jetstream) to run the same set up whenever you'd like.
4. R command line
Because everything is more fun on the command line. R is pre-installed on most clusters, and is usually available as a module.
A nice feature of RStudio Server is that there is a terminal tab in the bottom left-hand section. You can access the terminal and write any command you normally would in Unix without pulling up a separate terminal or signing in!
We are also going to be using files throughout these lessons, which are all available via the ncgas
website - https://ncgas.org/R%20for%20Biologists%20Workshop.php. A zip file is available with all
files separated by chapter. Each chapter’s text is included as well.
1.2 Getting Started in R
Where am I?
While you would use pwd (print working directory) to determine your present working directory in Unix, R has a similar concept of "working directories" – where it expects the files you are referring to directly to be found.
getwd() – reports what the working directory is currently (i.e.: pwd for R)
setwd(dir) – allows you to define a new working directory (i.e.: cd dir for R)
getwd();
setwd("~");
In RStudio, you can also set your working directory in the GUI. To do this, navigate to where you
would like to be within the "Files" tab on the lower right panel. Then, click "More" on the menu bar for
that panel. This will bring up a drop down menu with the ability to select "Set as working directory". If
you are running into issues with your files loading, this is the number one solution!
Navigating to a new folder in the File pane in the RStudio GUI. You can navigate by clicking on
folders as normal, and the file path is added to the bar below "New Folder"
! Note that R expects the files that you read in to be in the working directory, so if you stored
your file somewhere else, you will have to either move your working directory or list the
route to the file (i.e. /home/guest_user/dengue.fasta).
Using the Upload button (purple box) on the RStudio VMs. Note that whatever is shown in your
current path (in this case, Home) is where the file will upload - not to your working directory!
1.3 R is a Language
As I mentioned before, R is a language, and it helps to think of it this way. There are concrete(ish) data that are akin to nouns – they are things – numbers, characters, variables. We call these data types in R. There are also actions that go along with these data, akin to verbs – what things do – sort, subset, define. We call these functions. How they are used and how they are written are the grammar and syntax of the language (Video 1.3). Just like any other language course, let's start with some basic nouns...
Data types in R
R has many data types, and we'll discuss how to add more. However, you can get by pretty easily by starting with the five main data types that are built into R:
scalars - "variables", which are usually of common types (string, BOOL, int, float, etc.)
vectors - ordered sets of scalars that are all the same data type
matrix - a vector of vectors, a grid of scalars - all the same data type!
dataframe - a "table", like a matrix, but columns can be of different data types
lists - vectors with names instead of numeric indices, much like hashes in other languages or dictionaries in Python
There is one other, less common data type built into R that we won't discuss:
arrays - multidimensional matrices
For reference, common biological data map onto these types:
sequences - vectors
observation data - data frames
counts - matrices
samfiles - data frames
vcfs - data frames
database tables - usually data frames
1.4 Working with Scalars
Thought Questions
Also, spacing does not matter! In bash, spaces serve as the indicator of a chunk of information. In R, however, spaces aren't used for this purpose – punctuation is. In bash:
myint=6 #works
myint = 6 #fails
but in R:
myfloat = 5.2;
myfloat=5.2;
are identical to the system. This is because the name of the variable and its content are one chunk of information in bash – so no spaces. In R, it is looking for the punctuation – in this case "=" – to understand what you are telling it.
How does R know what kind of data you have in a variable? Are all things treated the same?
Well, we cannot add and multiply letters, so there is clearly some "data typing". If you look closely at the commands we used, there are some clear indicators that R uses – strings use "", integers only contain [0-9]*, floats contain [0-9]*.[0-9]*, and Boolean values are in all caps. For reference, bash just assumes everything is a string, which is why math is weird in bash.
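For instance, a minimal sketch of those indicators in action (the variable names here are just examples):

```r
mystring = "ACTG";  # quotes -> character (string)
myint = 6;          # digits only -> numeric
myfloat = 5.2;      # decimal point -> numeric (float)
mybool = TRUE;      # all-caps TRUE/FALSE -> logical (Boolean)

class(mystring);    # "character"
class(mybool);      # "logical"
```

class() reports the data type R has assigned, which is handy when a command complains about the wrong type of input.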
Functions
Okay, that is super basic and won't get you far. You need to be able to do things (verbs!) with your variables! So, let's look at what functions look like in R, using printing to screen as our first example. There are lots of ways to print things in R; one that may be super familiar to you is the cat() function – it stands for concatenate, just like in Unix – and it does the same thing! It simply prints to the console what is in that variable ("noun") (Video 1.4). Another is simply print().
cat("Hi", name, "\n");
print(name);
! While this seems pretty basic, there's actually quite a bit to talk about here!!
First, you will notice the parentheses. If you see parentheses, it’s a function. Anything within those
parentheses are inputs into the function. Again, this is why white space doesn’t matter in R. In bash,
functions look like this:
cat $myint
blast --query $query --db $db
A lone string is assumed to be a command or program in bash, and has to be a separate chunk – i.e.
"cat" or "blast". Then each information chunk has to be served up separated by spaces so it can parse
what is a chunk of information. In R, this is handled by commas and parentheses.
print(name);
print( name );
print ( name );
Usually, you should try to be consistent, since this is a community-written language. Generally, keep the leading "(" with the function name at the very least!
Another thing you may notice from the print command is that things are printed to the terminal by default. Beware, R likes to print a lot (which can be annoying). For example, any value that is calculated but not stored somewhere (pushed into a variable) gets printed to the terminal. We will get more into this next chapter.
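A quick sketch of that behavior:

```r
5 + 2;       # not stored anywhere, so R prints 7 to the terminal
x = 5 + 2;   # stored in x - nothing is printed
print(x);    # prints 7 explicitly
```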
Imagine a car. While everyone will have a slightly different mental image, there will be common
aspects to everyone’s idea of a car. It will have certain characteristics – four wheels, doors, windshield,
transmission, color, model, year, etc. It will also have certain basic functions – drive forward, go in
reverse, park.
Imagine a dog. Same thing – everyone will have slightly different mental images of a dog – but they
will all have similar characteristics and functions (run, drool, etc). Some things are shared – like both
dogs and cars have colors. But other functions and characteristics are not necessarily shared between
dogs and cars. . . you wouldn’t expect to be able to put a dog in drive or a car to drool. You wouldn’t
expect a transmission type of a dog or a breed of car. This is because they are different classes of
objects.
This is how R and many other languages work – they have defined classes of data – scalars, matrices,
vectors – that all have specific characteristics and functions that are built into that class. Every object of
that class type has those features, while other classes may not. So R must know what class of object a
piece of data is before it can do anything with it.
Thought Questions
?cat
One reason I love RStudio is it automatically gives you a heads up on what is expected and
prints out the information in its own little frame.
Also, just like in bash, there are defaults to the functions. For example, the cat function is as follows:
cat(..., file = "", sep = " ", fill = FALSE, labels = NULL, append = FALSE)
Any time you see something defined, such as sep = " " – it’s a default! You can skip
this part of the call if you want, which is why our first cat() call was so much shorter than all of
these options!
This may seem like a trivial point – but it is critical! ?function is going to help you read any code that you run across and also use new functions, look for options, and generally function within R. It is the dictionary to help you in learning the new language. Also, this lets you see the difference between a function and a variable immediately – if you don't see (), even empty ones like q(), it is not a function; it is a variable! Again, this is immensely useful in reading other people's code!!
WHEW! That was a lot of information coming from a simple print command! However, much of
this applies to the language as a whole – think of it as grammar!
! Now that we are using ()'s and will be getting more complex with other brackets, you may run into a common problem - what happens when your prompt gives you a "+" instead of a ">"? If you see the "+", it means R is looking for the rest of the command - meaning you likely forgot a closing quote, a closing parenthesis, etc. If you want to get back to the main prompt (>), hit esc. The + is there if you want to write multiple lines of code, which we will do much later!
Anatomy of a vector
Let’s think of a string of DNA (primer) as an easy example:
ATCGCCCTG
The order of these nucleotides matters, right? So we can assign them numbers, as we can always expect them to be in the same order for this primer (Video 1.5):
ATCGCCCTG
123456789
Notice I started with 1. This is unusual among computational languages, but R is 1-indexed.
Creation of Vectors
So how do we replicate this in R? We define a vector:
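The code block itself is not reproduced in this extract; a minimal sketch using c(), which combines values into a vector:

```r
# Each nucleotide becomes one ordered element of the vector
primer = c("A","T","C","G","C","C","C","T","G");
primer[1];       # "A" - positions start at 1
length(primer);  # 9
```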
Thought Questions
DNA isn’t the only time you’ll use vectors. Sometimes you want to create your own, for example
using numbers:
We can also define a new, ordered vector of numbers using the seq function:
! The # indicates a comment, which is ignored by the computer. These are helpful to use throughout
code!
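A sketch of what seq() can look like (the values here are arbitrary):

```r
nums = seq(1, 9);            # 1 2 3 4 5 6 7 8 9
evens = seq(2, 10, by = 2);  # by sets the step size: 2 4 6 8 10
```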
Or read a vector in from a file (read the documentation on this one if you use it!), vect.txt:
file:
12356
6
7
scan(file="file.txt");
vect = scan(file="~/vect.txt");
print(vect);
Thought Questions
file "primer.txt":
ACTGCCCTG
Thought Questions
Okay I got it working...but this doesn’t work for sequences that don’t have spaces
You can import the data, then split it, then convert it to a vector but this gets complicated
(remember - each verb is an additional function). We’ll come back to this!!
Manipulating Vectors
Vectors assume a numbered order – it's part of the vector classification. So you can easily grab a specific nucleotide by requesting its number in line:
myprimer[1]; #prints A
myprimer[2]; #prints C
Thought Questions
Functions of Vectors
Vectors have defined lengths as well, as they are ordered sets:
This is handy for checking imported files or anything you create – VERY highly recommended! Another REALLY useful function for checking almost all data types is summary(). Try summary(vect) and see what you get.
You can also combine functions together, such as using length and seq together:
print(list);
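A sketch of those checks (variable names are examples; note that list is also the name of a built-in function, so a neutral name like idx avoids confusion):

```r
vect = c(1, 2, 3, 5, 6);
length(vect);    # 5
summary(vect);   # min, quartiles, mean, and max of the vector

idx = seq(1, length(vect));  # one index per element: 1 2 3 4 5
print(idx);
```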
You can also name the indices in a list, which can come in handy when you want a vector, but the
number based indices aren’t really logical. For example, if you have several sequences (which don’t
have an inherent order), remembering the index of any individual sequence isn’t really intuitive. Also,
in this case, we are not using vectors to represent the DNA for the sake of simplicity. We’ll get back to
that soon.
seqs = c("AGTGAGGGA","AGTGAGGCA","AGTGAGGCA","AGTGAGGCC","AGTGAGGCT");
names(seqs) = c("Dmel_AX39", "Dmel_AX43", "Dmel_CC09", "Dmel_CC83", "Dmel_LM20");
print(seqs);
$Dmel_AX39
"AGTGAGGGA"
$Dmel_AX43
"AGTGAGGCA"
$Dmel_CC09
"AGTGAGGCA"
$Dmel_CC83
"AGTGAGGCC"
$Dmel_LM20
"AGTGAGGCT"
Notice that the names all start with $ now? It is R's way of designating a subset of a list. This allows for more logical grabbing of an individual item:
print(seqs["Dmel_AX39"]);
[1] "AGTGAGGGA"
You can also pull them by numerical index, just like a vector:
print(seqs[3]);
But it doesn’t require you to know what the order is (which is helpful in large lists!). In fact, lists act a
lot like vectors:
#add an element:
seqs[6]="AGTGAGGCA";
names(seqs)[6] = "Dmel_TG40"; #remember names is a vector
#or
seqs["Dmel_TG40"]="AGTGAGGCA";
1 6 11 16
2 7 12 17
3 8 13 18
4 9 14 19
5 10 15 20
We can see that this data also has inherent order – this time in two dimensions:
I labeled the columns and rows in the same way R refers to them:
matrix[1,1] is 1
matrix[1,] is all of row 1
matrix[,1] is all of col 1
1.7.2 Creation
There are a couple of ways to generate matrices – let's look at a few. Let's first generate the matrix above:
We can also fill a matrix by rows instead of columns, using an option called byrow:
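A sketch of both calls, recreating the 5-row matrix of 1 through 20 shown above:

```r
a = matrix(1:20, nrow = 5);               # default: fills column by column
b = matrix(1:20, nrow = 5, byrow = TRUE); # fills row by row instead
a[1, ];  # first row: 1 6 11 16
b[1, ];  # first row: 1 2 3 4
```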
Thought Questions
a = as.matrix(read.table("matrix.dat"));
print(a);
Note, just as we did above with seq and length, you can use a function inside a function – as.matrix makes sure the data is read in as a matrix and not a dataframe (the default for read.table).
Thought Questions
! The standard separator for read.table is white space. To read a csv file (comma-separated values), you can use read.csv, or in your read.table function you can set sep = ",".
Since you are reading in files, it is ALWAYS a good idea to check the data.
You can also click on the variable name in the Environment tab in the top right pane. This will
bring up the variable information in the top left pane. If the matrix has row names or column names,
they will now be listed.
Thought Questions
Given the anatomy of a matrix, how would you get the following from matrix a:
Print the 3rd row?
4th column?
Element at row 3, col 4?
Grab just rows 1-4 and col 2-5?
mydata = read.table("BlastResults.txt");
but this will lack column names and row names! If there is no header line and you want to name
columns at import, you can:
We can also set colnames and rownames after the fact, using:
However, if the first line is a header describing the columns – as is the case in ours:
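The calls referenced above are not reproduced in this extract; a sketch of the common variants (the file contents and column names below are invented for illustration, so the block is self-contained):

```r
# Write a tiny demo file so the sketch can run on its own
writeLines(c("query subject evalue bitscore",
             "gene1 hitA 1e-40 200",
             "gene2 hitB 1e-10 80"), "BlastResults.txt");

# First line is a header: tell read.table to use it
mydata = read.table("BlastResults.txt", header = TRUE);

# If there were no header, you could name the columns at import:
# mydata = read.table("BlastResults.txt",
#                     col.names = c("query", "subject", "evalue", "bitscore"));

# ...or set them after the fact:
colnames(mydata) = c("query", "subject", "evalue", "bitscore");
```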
You can click on the file in the top right frame to see the file that was loaded. You can also check
the file with:
Subsetting dataframes can be done exactly the same way as with a matrix, but it can also be done by leveraging the names of the columns and rows!
Extracting columns:
bitscore = mydata["bitscore"];
bitscore = mydata[c("bitscore", "evalue")];
Extracting rows:
goodhits = subset(mydata, evalue < 1e-35); #evalue cannot have ""s
nrow(mydata);
nrow(goodhits);
Extracting matches:
#Define a set of genes:
gene_names = rownames(mydata)[1:2];
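The matching line itself is not shown in this extract; a sketch using %in%, which tests membership (one reasonable way to do it, not necessarily the author's exact call):

```r
# (uses mydata and gene_names defined above)
matches = mydata[rownames(mydata) %in% gene_names, ];
nrow(matches);
```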
So this line prints the number of rows that contain information on the genes listed in gene_names!
To download a package:
#format: install.packages("packagename");
install.packages(c("maptools","mapdata")); #We will need these for Day 3!
! Note: You will see a LOT of red text scrolling through your terminal - this is perfectly normal, even though it may look like an error. This is just the installation occurring! You will see the ">" return when it is done!
Load a package:
library(packagename);
However, if you are doing this on a cluster, the lack of root access will make things fail if we don't direct the program to a location we have access to! In this example, I use my home directory on Indiana University's cluster - but the file path will have to be a different, existing path on your cluster!
install.packages("ggplot2", lib="/N/u/ss93/R/library/");
library(ggplot2, lib.loc="/N/u/ss93/R/library/");
Packages only need to be installed once, but they need to be loaded every time you have a new R instance. This is much easier if you are working on your own machine, or in RStudio.
When you are in the terminal or working with R on a cluster, you can (and should) define a variable called R_LIBS_USER. This is often defined as "~/R/library", but you can put your libraries anywhere. This becomes your default location for installing packages when you don't have permission to write to the main repository (which is VERY common on servers) – it also becomes the default location where R will look for packages if they aren't in the system-wide installation. I recommend putting this in your .bashrc so that you don't have to define it every time!
This is worth Googling before trying to install anything on a cluster!
function; - typing a function name with NO parentheses will print the code for the function:
cat;
Another option is to look at vignettes. These are built-in examples that come with many packages and are required by Bioconductor. To pull up the list of vignettes available in a package:
vignette(package="name of package");
vignette(package="annotate"); #This will bring up a list of all the vignettes in "annotate"
vignette("name of vignette");
vignette("GOusage"); #This will bring up the documentation on how to use this feature!
The purpose of this activity is to get you working in R and getting a feel for some of the syntax. This is
definitely a case of "the more you put in, the more you get out". Think about the syntax you are using,
and what it’s doing. It’s all well and good to be able to type the commands and answer some questions,
but thinking about commands and options and how vectors work will help you get more familiar with
R.
If we wanted to work with these sequences, grouping the information together is really helpful, and since the information isn't in any real order (no real reason for annotation to come before origin), let's make this a list. We also want to be able to name the information for what it is ("sequence","origin","annotation"), meaning we'll use the names() function from Chapter 1.
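The construction code is not reproduced in this extract; a sketch matching the printed output below, using list() and names() as described:

```r
Dmel_AX39 = list("AGTGAGGGA", "primer1", "colony1");
names(Dmel_AX39) = c("seq", "annot", "origin");
```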
print(Dmel_AX39);
$seq
"AGTGAGGGA"
$annot
"primer1"
$origin
"colony1"
Great! We now have labeled parts of that first sequence. Notice the $ before each name of the list - reminding you to use Dmel_AX39$seq to refer to the sequence variable, not just seq. Remember, each sample will have a $seq entry!
samples = list(sample1,sample2,sample3,sample4,sample5);
names(samples) = c("Dmel_AX39", "Dmel_AX43", "Dmel_CC09", "Dmel_CC83", "Dmel_LM20");
print(samples["Dmel_LM20"]);
$Dmel_LM20
$Dmel_LM20$seq
"AGTGAGGCT"
$Dmel_LM20$annot
"primer1"
$Dmel_LM20$origin
"colony18"
Again, notice that each sequence has a $ before it. Since this is a list of lists, there are now two sets of
$’s - one for the sequence name (samples list) and one for the sequence information list for that sample
(individual lists). So the sequence information here is stored in samples$Dmel_AX39$seq.
That’s a lot of code for just five sequences. And we didn’t even get to the point of dealing with flat file
input or splitting the sequence into a handy vector. Wouldn’t it be nice if there were a package that
does all this sequence handling for you? There is! Let’s load it!
To use functions from the SeqinR package, we first need to install the SeqinR package.
#load package
library("seqinr");
dengue = read.fasta("dengue.fasta");
! Note that R expects the files that you read in to be in the working directory, so if you stored
your file somewhere else, you will have to either move your working directory or route to
the file (i.e. /home/guest_user/dengue.fasta).
Look at your new variable – you should always check values you read in. When we read out the
new variable, dengue, we see that it has the sequence, but also some "attr"s. These are attributes of the
sequence (which is an "object"). In this case, it has a name (from the header in the file), an annotation
(first line of the sequence), and a class (type of data). Since the function assumes there might be more
than one sequence in the fasta file, it makes a list of the sequences - even if the list only contains one
sample in this case. So, just like before, we can see the NC_001477.1 sample in the dengue list by:
dengue$NC_001477.1;
Again, the $ is the means by which we grab individual entries in a list (dengue being the list of
sequences). You can type the name of the object, type "$" and the list of entries will come up. Since
dengue.fasta only had one fasta entry, you will only see one option.
If we wanted to get the sequence only, we can easily find how to do this by Googling "get sequence only in seqinr". This is a general format for finding functions – Googling (what you want to do) in (r if you don't know the package name, or the package name if you do!). Google will likely return getSequence as the first hit. Look at that function.
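A sketch of the call (assuming the dengue object loaded above; getSequence() is seqinr's accessor for the bare sequence):

```r
library(seqinr);
dengueseq = getSequence(dengue$NC_001477.1);  # the sequence as a vector of bases
```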
Problem 2.1 How would you output the length of the sequence in dengueseq? Store the output as
seqlength.
table(dengueseq);
Alternatively, you can find the value of the element of the table in column "g" by typing:
table(dengueseq)["g"];
Problem 2.3 How would you manually calculate GC content given the above information? Can you
do it using variables instead of the raw numbers? Store this value as GCcontent.
Problem 2.4 Can you find a function in the seqinr package that would do this for you? Where do you
look?
Problem 2.5 Use the help options to count the two letter words in the sequence. Save the output as
Count2words.
ρ(xy) = f(xy) / (f(x) · f(y))
where f(xy) and f(x) are the frequencies (percent occurrence) of the DNA words xy and x in the DNA sequence under study. For example, the value of ρ for the DNA word "ta" can be calculated as:
ρ(ta) = f(ta) / (f(t) · f(a))
where f(ta), f(t) and f(a) are the frequencies of the DNA words "ta", "t" and "a" in the DNA sequence.
The idea behind the ρ statistic is that, if a DNA sequence had a frequency f(x) of a 1-nucleotide DNA word x, and a frequency f(y) of a 1-nucleotide DNA word y, then we expect the frequency of the 2-nucleotide DNA word xy to be f(x) · f(y). That is, the frequencies of the 2-nucleotide DNA words in a sequence are expected to be equal to the products of the frequencies of the two nucleotides that compose them. If this were true, then ρ would be equal to 1 (the null hypothesis of random). If we find that ρ is much greater than 1 for a particular 2-nucleotide word in a sequence, it indicates that the 2-nucleotide word is much more common in that sequence than expected (i.e., it is over-represented). Likewise, if it is below 1, then that 2-nucleotide word is less common than expected by random.
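A sketch of computing ρ("ta") by hand with seqinr's count() (assuming dengueseq from above; freq = TRUE asks count() for frequencies rather than raw counts):

```r
library(seqinr);

f1 = count(dengueseq, 1, freq = TRUE);  # single-base frequencies
f2 = count(dengueseq, 2, freq = TRUE);  # two-letter word frequencies

rho_ta = f2["ta"] / (f1["t"] * f1["a"]);
rho_ta;  # near 1 means "ta" occurs about as often as chance predicts
```

seqinr also provides rho() and zscore() functions that compute these statistics directly.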
Note that if the ratio of the observed to expected frequency of a particular DNA word is very low or
very high, then we would suspect that there is a statistical under-representation or over-representation
of that DNA word. However, to be sure that this over- or under-representation is statistically significant,
we would need to do a statistical test.
In this case, we can use a z-score (standard score, a measure of deviation from the mean). See https://en.wikipedia.org/wiki/Standard_score if you are not familiar with z-scores. Basically, the z-score is the number of standard deviations the observation falls away from the average. This should agree with the rho value, but will give you a statistical backing of HOW under- or over-represented a "word" is. If the z-score is 0, the measure is identical to the mean; negative numbers mean the measure is under-represented (below average), and positive numbers mean the measure is over-represented (above the average). In this case, let's use the "base" model, which looks at the individual bases, rather than codons.
Problem 2.8 Using the command zscore, are the "gc" or "cg" words significantly over- or under-represented?
Thought Questions
One additional useful function for graphing is abline(). This function adds straight lines (from an intercept and slope, or from a fitted model) to an existing plot, so you can add a regression line to a plot.
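A minimal sketch (the data here are invented for illustration):

```r
x = c(1, 2, 3, 4, 5);
y = c(2.1, 3.9, 6.2, 8.1, 9.8);

plot(x, y);
abline(lm(y ~ x));       # add the fitted regression line
abline(h = 6, lty = 2);  # a dashed horizontal reference line at y = 6
```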
We can also use explicit plotting functions like hist(). To make a histogram of a column, you can use the hist() function and point it to the variable, and the portion of that variable (by using $).
3.2 Mapping
There is a lot you can do in making figures in R. We are going to explore the matter of location
data further, mainly because it’s intuitive and demonstrates the key concepts. There are a couple of
useful libraries to load ahead of time, so let’s start with that:
These are a great starting point for exploring mapping and include the Google maps library (ggmap), which we will use to make plots of GIS data. It also automatically loads ggplot2, which is a very common plotting library.
Since it will come up a lot in this section, let’s get a reminder of some REALLY useful syntax:
This is a means of telling R that you want to give it a vector of items to be used for one variable, i.e. xlim=c(0,10). xlim is one variable, but you want to pass it two integers here, so we make both integers into one vector - one variable holds the one vector. We couldn't do xlim = 0, 10, as one variable cannot hold two discrete items. Also, that "," means "second parameter" in R!
Okay, now that that’s fresh in our brains, let’s do some graphing!
data(wrld_simpl);
I said this data was pre-loaded, but where is it really coming from? We can find out using the
documentation:
?data;
Turns out this function is a pre-installed "utils" library in R that loads specified datasets, including
data that may come from other packages. Useful to know!
Let’s get back to making our first plot. We can try our first plotting function and see what happens:
plot(wrld_simpl);
Oo.. pretty! Let's "zoom in" - almost everything in mapping is based on latitude and longitude. Unfortunately, we usually think of these as lat, long - but graphing programs want them as x and y. So, to zoom in to 130W to 60W by 25N to 53N, we have to do the following:
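The call itself is not reproduced in this extract; a sketch (west longitudes are negative, and longitude goes in xlim while latitude goes in ylim):

```r
plot(wrld_simpl, xlim = c(-130, -60), ylim = c(25, 53));
```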
Thought Questions
But since we don’t always know the exact lat and long of where we want to "zoom" to, setting
variables that we can quickly change before rerunning the same plot function saves on a lot of
scrolling around. Also, it is a bit easier to read!
Since we know that plot works here, and xlim and ylim work, let’s try the other common parameters
– color and title. I’ll also throw in a useful one that isn’t listed in the help – bg. See if you can figure
out what it does!
So pch changes the shapes. . . we can Google "pch options in r" and get this link
(http://www.endmemo.com/program/R/pchsymbols.php) to see some options. I like 19, but you can
use whatever you’d like!
We can also draw lines with the points function by setting type = "l". Let's plot the borders of the state of Colorado because they're easy to draw. You will have to give the function FIVE sets of coordinates... why?
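A sketch of one answer (Colorado is roughly the box 37-41N by 102-109W; the fifth coordinate repeats the first so the line closes the rectangle):

```r
# Add to the existing plot of the USA
points(c(-109, -102, -102, -109, -109),  # longitudes (x)
       c(  37,   37,   41,   41,   37),  # latitudes (y)
       type = "l");
```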
Just like the pch options, you can also get a list of the color options by Googling "col options in r" to get this link (http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf) – R has a LOT of pre-named colors for you to use!
! GADM stands for global administrative boundaries here - you can also use ’climate’, etc for
different data types. Feel free to check it out using ?getData!
Hmm, that looks like it would fit in our world map... they’re both listed as similar data types in our
Environment tab, so let’s try it!
plot(wrld_simpl);
A quick ?plot search gives us the option to look at the "plot" function in two packages – one is in
raster, which is what we used to load the USA data. Let’s try that first – which is great, because it tells
us there is an add=TRUE option! Let’s try that:
plot(USA, add=TRUE);
Very cool. We can use country-specific data to fill in more information on the world map if we need it. We can do the same to add provinces to Canada:
But what are we actually plotting in these cases? Let’s look at the data:
USA;
We can see this variable is of "class: SpatialPolygonsDataFrame". . . what does that mean?
Remember that classes are types of things, with defined specific functions and characteristics. This makes
sense for spatial polygon data (map shapes), as they would share specific characteristics (boundaries)
that could be kind of complex to hand code each time. They also have functions that would be useful
across all spatial data, like plot handling.
A bit down in the output for USA, we can see that the names of the states are listed in NAME_1, so
let’s use that to extract states. Since these names are a part of an object, we can refer to each state in
this specific object (of the SpatialPolygonsDataFrame class) by using USA$NAME_1:
For example, Indiana:
IN = USA[USA$NAME_1=="Indiana",];
plot(IN,col="white");
Thought Questions
Let’s do the same for a couple other states that surround Indiana: Plot IL, OH, MI, and KY –
remember to "add" them
IL = USA[USA$NAME_1=="Illinois",];
plot(IL,col="grey", add=TRUE);
OH = USA[USA$NAME_1=="Ohio",];
plot(OH,col="grey", add=TRUE);
MI = USA[USA$NAME_1=="Michigan",];
plot(MI,col="grey", add=TRUE);
KY = USA[USA$NAME_1=="Kentucky",];
plot(KY,col="grey", add=TRUE);
We can go one further - we can add cities, with information about the cities. Let’s plot points in
Toledo, OH; Notre Dame, IN; and Bloomington, IN.
Most of the time, we want to add some sort of information into the points – make their size indicate
the population, for example. In order to do this, we need to build our own data frame to add that
information. Let's do this with variables to make it more readable. All of this we've more or less seen before:
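A sketch of the data frame we might build. The coordinates here are approximate and the populations are illustrative round numbers – look up the real values:

```r
# City coordinates (approximate) and populations (illustrative only)
name = c("Toledo", "Notre Dame", "Bloomington");
lat  = c(41.66, 41.70, 39.17);
long = c(-83.58, -86.24, -86.53);
pop  = c(270000, 7000, 85000);

# Combine the vectors into one data frame, one row per city
df = data.frame(name = name, lat = lat, long = long, pop = pop);
```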
However, if you plot raw population, you can see there is a LARGE difference (The actual town
of Notre Dame is very tiny). The dot sizes would end up covering a large portion of the map – so
let’s scale them. You can do this by dividing the square root of the population column (df$pop) by the
maximum population value (in this case Toledo):
scaling = sqrt(df$pop)/max(df$pop);
Thought Questions
By dividing by the maximum number, we are representing all of the values as a per-
centage of our largest, thereby scaling the data to our own values. Technically, you could divide
by sqrt(max(df$pop)) to scale to the maximum square-root-transformed population size. However,
as you will see next, we have to adjust these values anyway, so this additional sqrt() calculation
doesn't really help anything.
Now we can plot our points. Note that cex is the sizing parameter in points!:
Well, the scaling made the dots very very small. Let’s increase them all by a factor of 5000.
Much better!
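Putting the scaling and plotting together, the calls might look like this – a self-contained sketch with illustrative coordinates and populations standing in for the real values:

```r
# Stand-in city data (coordinates approximate, populations illustrative)
df = data.frame(lat  = c(41.66, 41.70, 39.17),
                long = c(-83.58, -86.24, -86.53),
                pop  = c(270000, 7000, 85000));

# An empty plotting region standing in for the state map
plot(NULL, xlim = c(-88, -82), ylim = c(37, 43),
     xlab = "Longitude", ylab = "Latitude");

# cex controls point size; the raw scaling values are tiny, so multiply
# by a large factor to make the dots visible
scaling = sqrt(df$pop) / max(df$pop);
points(x = df$long, y = df$lat, col = "red", pch = 19, cex = scaling * 5000);
```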
The distemper titer dataset used in this section was pulled from the USGS, and then I added randomized
GIS coordinates to the samples to demonstrate mapping. As a reminder, this data is available on the
ncgas website: https://ncgas.org/R%20for%20Biologists%20Workshop.php.
Let’s map Wyoming, Montana, and Idaho to start. Since we aren’t centering on any one state, let’s
use the USA map, and crop the x and y lims:
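A sketch of the cropped map, assuming USA was loaded with getData() as before; the limits are approximate values that bracket Wyoming, Montana, and Idaho:

```r
# Crop the USA map to roughly the WY/MT/ID region (approximate limits)
plot(USA, xlim = c(-117, -104), ylim = c(41, 49));
```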
Thought Questions
Now we can plot the points with this dataframe. Here we use df$long, etc., but you could just as easily
use df[,1] to pull the first column. Using the $ is consistent with what we did above with plotting states,
and it also removes the requirement of remembering the column order. However, use what makes sense to you!
scaling = sqrt(distemper$Titer)/max(distemper$Titer);
points(x=distemper$Long, y=distemper$Lat, col="blue", pch=19, cex=scaling*50);
Finally, it is beneficial to install the latest version of this package (in theory). It's a bit annoying to
keep doing, but since this is attached to the cloud service, it is worth guaranteeing it will work (really,
this is a major issue with this package - if it doesn't work, start here). This is an example of how to load
packages from GitHub - which is becoming more common (though it is more likely to be problematic
than loading through Bioconductor or CRAN).
if(!requireNamespace("devtools")) install.packages("devtools");
devtools::install_github("dkahle/ggmap", ref = "tidyup", force=TRUE);
Then you have to give it that key you copied from Google Cloud:
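This is done with ggmap's register_google() function – a sketch, with a placeholder where your own key goes:

```r
# Paste your own key from the Google Cloud Console in place of the
# placeholder string below.
library(ggmap);
register_google(key = "YOUR-API-KEY-HERE");
```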
You only have to register once, but you should set your key each time to guarantee it will work. This
brings us to actually using the package, and to yet another means of grabbing existing maps found in
this package – get_map.
Now, Google likes to keep changing how this works, so I cannot guarantee that this code will work
by the time you read it. It's a starting point - but you may have to check Google's resources to get it
working.
?get_map;
Google = get_map(location = c(-110,44), zoom = 6, maptype = "satellite");
! NOTE: sometimes this has errors due to being pulled from Google’s servers – just wait a second
and retry the command. If you keep getting errors (because something likely changed), check out
https://developers.google.com/maps/documentation/maps-static/error-messages.
ggmap(Google);
With this package, you have to pick a center point, and then work with zoom to get the map you
want. The package is nice enough to print out the lat and long on the axes, so it’s pretty quick to figure
out if you are within the range you want.
We cannot add points to it the same way as before. . . We have to use ggmaps’ syntax, but much of
it should look familiar:
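A sketch of what this looks like, assuming the Google map object from get_map() above and a data frame df with long, lat, and pop columns:

```r
# ggmap builds on ggplot2, so layers are added with + geom_point()
# rather than by calling points()
ggmap(Google) +
  geom_point(data = df, aes(x = long, y = lat, size = pop), color = "red");
```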
Common issues with getting ggmap working are almost all Cloud Console based:
1. Maps API is not activated - it has to be linked to your project.
2. You didn't add your API key - make sure you ran register_google().
3. Check to make sure you have billing enabled. If you have had a free account for more than a year,
billing requires being re-enabled. If you click on the name of your project on the top menu bar, you
should see a section labeled "Billing" on the left - make sure there is a valid date range there. If
not, you will have to set up billing again (this changes, so you will have to google it).
Sometimes you will have to click around the Cloud Console to get it all set up. Unfortunately, this
does change so googling doesn’t always help.
Sometimes it really pays off to check out the options in a new function!!
4. R Lab 2: Ordination in R
Often after you get your data, you will do some form of visualization - just like we did with the GIS
data. This allows you to explore your data for biases, patterns, etc. that might lead to further filtering
or analyses/questions. It also allows you to show your audience the lack of bias, the support for your
findings, etc. after you have completed your filtering. Both GIS mapping of sample locations and PCA
to demonstrate important factors in your data are common figures to include.
Principal Component Analysis (PCA) is a useful technique for exploratory data analysis, allowing
you to better visualize the variation present in a dataset with many variables. It is particularly helpful in
the case of "wide" datasets, where you have many variables for each sample. For example, it is very
common to use these plots to visualize microbiome data or identify important ecological factors in
niche partitioning. It is also used frequently in machine learning to reduce the dimensionality of the
data and to determine which factors explain the most data without "overfitting" the machine learning
models to the training data.
Since this is so useful, let's start with PCA in R. In a PCA, math is done to combine measurements,
each weighted differently, to create individual principal components that explain as much of the
variation in the data as possible.
So for example, look at the graph below of car characteristics:
First, we can see the individual characteristics for the cars in the center, such as gear, mpg, etc. What
these are specifically is not important at the moment - just know these are characteristics measured.
When plotted, these characteristics are all spreading the data in different directions and by different
magnitudes (shown by the length of the line – though these are largely similar). Each arrow (a) has an
x (a_x) and a y (a_y) component:
Components of eigenvectors
For instance, mpg has a really small y component and a relatively large x component. This
means that mpg contributed much more to PC1 (x) than to PC2. qsec, on the other hand, has relatively
even x and y components, meaning it contributed similarly to PC1 and PC2. Both PC1 and PC2 are
composed of some weighted value of each measurement (mpg was higher weighted than gear in PC1,
for example). Since mpg and cyl have a large influence on PC1, which explains 62.8% of the variance,
these two measures are likely defining much of the variance in the data.
Another aspect of the data that becomes evident in PCA plotting is correlation – when several
variables all contribute strongly to the same component, they are likely correlated. Mpg and cyl both
contribute largely to PC1, and are correlated – cars with more cylinders consume more gas.
Thought Questions
So wait, there are possibly more eigenvalues and eigenvectors to be found in one dataset?
That’s correct! The number of eigenvalues and eigenvectors that are produced is equal to the
number of dimensions the dataset has. In the example that you saw above, there were 9 variables,
so the dataset was nine-dimensional (hard to plot!). That means that there are nine eigenvectors
and eigenvalues. The highest two explained 85.9% of the variance, and they were plotted. The
others are available to be viewed in R, but are not really that informative in this case.
We can also re-frame a dataset in terms of these other eigenvectors and eigenvalues without
changing the underlying information. For example, you may see PC2 vs. PC3 in plots. Re-framing a
dataset in terms of a different set of eigenvalues and eigenvectors does not change the data itself;
you're just looking at it from a different angle (in nine-dimensional space...), which may represent the
data better.
Three dimensional object in three different two dimensional projections, taken from
https://en.wikipedia.org/wiki/Ambigram/media/File:3d-ambigram.jpg
The shape of the object (data) does not change, but when you compress it into two dimensions, it
looks different from different angles. The same applies here, but in more dimensions!
Now that you’ve seen some of the theory behind PCA, you’re ready to see all of it in action!
Microbiome datasets contain information about the microbial community that is present within or on
the host organism. The test dataset includes the taxonomic profiles of 18 human gut microbiome sam-
ples. The taxonomic profiles were generated using a program called FOCUS (http://edwards.sdsu.edu/FOCUS/)
as part of a larger workflow (described here: https://blogs.iu.edu/ncgas/2019/09/24/taxa-annotation-
for-shotgun-sequenced-metagenomic-data). The output table format includes a column of all the taxa
present, followed by the abundance of each taxon per sample:
Let's load this data in (again, data can be found on the following website: https://ncgas.org/R%20for%20Biologists%20Workshop.php)
and get a feel for what we are working with, doing any data clean up we will need to do. Get used
to doing this explore-and-clean step, as pretty much all analyses start with a data clean up stage before
doing any sort of analysis. In this example, we will use the Phylum level data, but the same analysis
could be done on any taxonomic level.
Okay, now that we have our data loaded, let’s take a look at what we have. First, since we are doing
a PCA, we need to make sure our data is not categorical. PCA will only work on numerical data, so
let’s look at the summary of the Data table.
summary(Data);
summary(Meta);
Hmm... that’s all categorical data. We want to be able to analyze the data together, so let’s merge
the data into a second data table. However, if you look at the tables, Data has the samples as column
names and Meta has the samples as row names. We will have to rotate one of these tables. Since we
want to keep the samples as rows for the PCA, we will have to transpose the Data table.
Problem 4.1 Can you find the transpose function? What is the default output of transpose?
Problem 4.2 Transpose Data into a dataframe called Data_transposed.
Our data is now in the same format, with the samples as the row names, but they aren’t in the same
order. Let’s sort both dataframes by their row names (in default order), which will allow us to merge
the two dataframes more easily. If you google "sorting dataframe by row.names in R", you will be
pointed to the order() function. If we try this with just the row names from Data_transposed:
rows_ordered = order(row.names(Data_transposed));
We get a vector containing a new order of row indices. If you recall, a dataframe can be subset
using a list of rows. We can also reorder columns with the same trick!
Data_transposed[rows_ordered, ];
#remember, listing nothing after the , keeps all of the columns!
We can do all of this in one line for each of the two dataframes:
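A sketch of the one-line pattern, shown on a small toy dataframe so it runs on its own:

```r
# A toy dataframe with out-of-order row names
toy = data.frame(x = 1:3, y = 4:6, row.names = c("S3", "S1", "S2"));

# Reorder the rows by their names in one line: order() gives the row
# indices, and the empty spot after the comma keeps all columns
toy = toy[order(row.names(toy)), ];
```

Applied to our data, the same pattern gives `Data_transposed = Data_transposed[order(row.names(Data_transposed)), ];` and likewise for Meta.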
In addition to the information that we pulled from FOCUS, we may have information on the various
studies that you’d like to add in, such as test populations. Let’s make a vector of experimental categories
(entirely made up for the purposes of this exercise):
groups = c("adult","adult","adult","adult","adult","adult","adult","adult","adult","adult",
"elderly","adult","adult","adult","elderly","baby","baby","baby");
Now that we have all our data in order, we’re almost ready to merge our Data_transposed and Meta
tables, together with our custom data, into a dataframe, which we will call Data_with_metadata.
cbind() is a function we can use to bind tables together in a way that preserves columns, but if you
look at the documentation (which you should!), you will see that it doesn't sort the tables, it simply
glues them together:
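A small demonstration of that gluing behavior, on toy tables – this is why we ordered the rows first:

```r
# cbind() joins tables by position, without sorting or matching row names
a = data.frame(x = 1:2, row.names = c("S1", "S2"));
b = data.frame(y = c("A", "B"), row.names = c("S1", "S2"));
combined = cbind(a, b);
```

In the lab, the combined table is built the same way, something like `Data_with_metadata = cbind(Data_transposed, Meta, groups);`.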
library(vegan);
The first step to a PCA is to compute the principal components. In vegan, the most basic PCA func-
tion is rda(). This is a slightly different type of PCA than you might see outside of the vegan package (i.e.
using prcomp() from the stats package). vegan uses a singular value decomposition (SVD) based PCA,
whereas the prcomp() version is a spectral/eigenvalue based PCA. Vegan specifically uses a redundancy
analysis PCA – hence the rda() function name. See http://www.flutterbys.com.au/stats/tut/tut14.2.html
for more details.
Problem 4.3 Look at the rda() function and compute the principal components of the Data_transposed
(remember this is the dataframe without categorical data), and save the results as a variable called "pca"
The easiest way to start exploring the results is to plot them, so we’ll start there.
?biplot;
Okay, that seems pretty similar to plot... but if you read the Arguments section carefully, you will
see that the input biplot.default is looking for is a two-column matrix... that is definitely not what we
just generated. However, notice how there was a ".default" after biplot. I wonder what other options
there are for biplot...
Problem 4.4 Start typing "?biplot." into the command window, and look through the options it gives
in the dropdown. Which of these do you think R will use given our current data?
Now we have relevant information on how to use biplot! The one kind of unclear part of this
function is the display = c("sites", "species") bit. This was kind of poorly named, since that assumes a
very specific use of this function. However, these are just referring to the two aspects of the biplot -
"sites" are your samples (rows), and "species" are your variables (columns, in this case Phyla).
So, we want to plot our pca variable, display both "sites" and "species", and let’s only use points as
text will make it hard to see the plot.
Great! We have something that looks like we might expect a PCA to look like! Before we think too
much about the vectors, let’s explore how our points are grouped.
Just like how plot() had assumptions based on our data type, points() does as well. Unfortunately,
it's not as easy to find the documentation on it inside of vegan. If points() gives you trouble, a tip is
to try using the same basic aspects of the associated plotting function, and then add in the optional
parameters for the point formatting. In this case, points() inherits the (pca, display=c(...)) part of
biplot:
Thought Questions
factor(groups);
If you want to assign a color to each of the unique factors, you can do:
col=factor(groups);
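Under the hood, a factor maps each unique value to an integer, and plotting functions use those integers as color numbers. A quick base R demonstration:

```r
# factor() sorts the unique values alphabetically to form the levels:
# adult = 1, baby = 2, elderly = 3
groups = c("adult", "adult", "elderly", "baby");
f = factor(groups);

# The integer codes are what col = factor(groups) hands to the plot
as.integer(f);  # returns 1 1 3 2
```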
Problem 4.5 Do you see any pattern based on just the platform data?
Problem 4.6 Look at some of the other metadata in Data_with_metadata, and see if any show a
technical bias. Also look at groups, and see if there is a pattern there (this would be actual experimental
effect).
Those groups really look like they explain something! Let's dig into that! First, we'll narrow down
the phyla to ones that significantly explain the data. We can do this with the envfit() function:
fit=envfit(pca, Data_transposed);
This function does a correlation analysis and provides a p-value for the fit of the environmental
factors (hence the name). We can plot them by:
plot(fit, col="red",p.max=0.05);
Great! We have samples plotted that aren't being impacted by technical biases, we have a good
experimental effect being visualized, and we even know which Phyla are driving that effect. But we're
still missing how much of the variance in our data is being explained.
To do that, we can revisit summary() and look at what it says about our pca variable:
summary(pca);
So, again, using what we learned about general plot() options from last chapter, let’s add those
numbers to our x and y labels. Don’t forget, each time you replot, you have to redo your points as well:
groupnames=levels(Data_with_metadata$groups);
#levels() is similar to factor(), but only gives the vector of unique names
legend("topright",
col = 1:3,
lty = 1,
legend = groupnames);
Thought Questions
Problem 4.7 How could you tell which distance options you have in vegdist()? Let’s use the vegdist()
function, and produce a Euclidean distance matrix of our data, called "dist_matrix".
This function is also based on a stats function called cmdscale(), which is similar except that it
doesn't weight the values and handles negatives differently. If we don't supply weights, wcmdscale()
gives the same result as cmdscale(), but cmdscale() itself doesn't allow for vegan's features. So we'll
use wcmdscale().
While annoying, at least you now know!
Let’s do a WCMDS/PCoA on our new distance matrix, with eigenvalues returned.
Problem 4.8 Can you figure out how to do this from the help section? Call your new variable pcoa
summary(pcoa);
You should see two major things here: First, you should see a "points" attribute – which holds all
the points to be plotted on the dimensions. You should also see a "GOF" attribute – which we will talk
about shortly.
! IF YOU DO NOT SEE A POINTS AND GOF LISTING – revisit the options in wcmdscale() and
make sure you tell it to return eigenvalues!!
Ok. . . that’s pretty simple. But we’re missing a lot here, like labels, group coloring, and the nice
arrows that were so helpful in our interpretation. First, let’s handle labeling those points using the
ordilabel() function. Many functions in vegan start with ordi-, so if you are looking for something, start
typing ordi into the help for a full list! Use help to investigate these commands before just typing them
in!
ordilabel(pcoa, labels=rownames(Data_transposed));
points(pcoa$points, pch=20, col=c("black", "red", "green")[Data_with_metadata$groups]);
legend(-50, 30, legend=c("adult","baby","elderly"), col=1:3,pch=20);
Okay! Looking better! Now for those useful arrows! Vegan has a wonderful function called envfit()
- "The function fits environmental vectors or factors onto an ordination." That sounds useful!
fit=envfit(pcoa,Data_with_metadata);
We can now add this to our graph (and only plot significant values!):
Great! Now let's revisit how to fit ellipses onto our data to see trends. Vegan has a function called
ordiellipse() for just this occasion:
Notice that the last one fails... it’s because the ehull calculation needs more than two points to work.
So let’s try a different kind - standard error - which can be computed on only two values.
Problem 4.9 Take a crack at this yourself! Remake the graph, and add standard error ellipses.
So, notice that this has no variance listed on the "dimensions". This is a result of the nature of how
PCoA is done – the analysis is only given a distance matrix, not a set of the original values. When
reporting a PCoA, a goodness of fit value is usually reported. Since we saw "GOF" is an attribute of
pcoa, we can list it just as we've listed other attributes – with $!
pcoa$GOF;
Problem 4.10 Try proving this to yourself by re-plotting the PCoA using cmdscale() rather than
wcmdscale().
# To plot stress:
stressplot(mds);
# this plots the relationship between the original matrix and the nMDS matrix - it should be linear (hence the R2 value)!
Some additional vegan resources:
https://rpubs.com/dillmcfarlan/R_microbiotaSOP
https://www.fromthebottomoftheheap.net/2012/04/11/customising-vegans-ordination-plots/
https://cran.r-project.org/web/packages/vegan/vignettes/FAQ-vegan.html
• rda() is the vegan package function (shown above) used for constrained ordinations. Species and
site scores are re-scaled according to Legendre, P. and Legendre, L. (1998) Numerical Ecology.
2nd English ed. Elsevier
• prcomp() is a singular value decomposition of the variance/co-variance matrix, where variance
is calculated using N-1 and, by default, data are not scaled. Similar to rda(), but not using vegan.
• princomp() is an eigen analysis of the correlation or variance/co-variance matrix, where
variance is calculated differently (with N). It has no option for scaling data (you need to do this
beforehand), but there is an option, cor, that specifies whether you use the correlation or co-variance matrix.
• capscale() is a vegan package function to do Constrained Analysis of Principal Coordinates
(CAP), which is an ordination method similar to Redundancy Analysis (rda), but it allows non-
Euclidean dissimilarity indices, such as Manhattan or Bray–Curtis distance.
If you don’t know when to use what, it’s a great chance to talk to a statistician! Just know there are
different ways with different applications and you should consider it when doing your analyses!
vector=seq(1,10,1);
print(vector[1:4]);
Great! Now that you know that works, put that code in the source window. You can now rerun this
code line by line or as a whole. To run just one line, you can click the end of the line, then the button
that has a box with a green arrow. To run multiple lines, highlight what you’d like to run, and then
click the same button. The output will still print out to the console, but it is much easier to rerun and
troubleshoot.
You can save any R code on a cluster terminal by clicking the blue floppy disk icon in the menu bar
of the source window. I’m saving this example as demo.R. ".R" is a typical extension for R code, but
you can technically make it whatever you’d like. Sometimes you will see .RScript as well.
56 Chapter 5. Writing Custom Scripts
To run this on a terminal, we can click the terminal tab at the top of our console. If you type "ls",
you can see a list of all the files you have saved to your virtual machine that is running RStudio. demo.R
should show up.
To run it, type:
Rscript demo.R
Notice, this only outputs what you explicitly print in the script!
If you were on the cluster, you could also do the following (just as before):
Rscript ~/demo.R;
but notice – this will not allow you to change things as you go. To do that, you’d have to edit the file
and rerun it, and repeat. This is why we focus on running things in RStudio to start – it is much much
easier to develop code in the RStudio than on the cluster.
Thought Questions
?count
count(seq, wordsize, start = 0, by = 1, freq = FALSE, alphabet = s2c("acgt"), frame = start)
R functions can take a number of required parameters ("options") and/or a number of optional
parameters, which are listed in the help. Remember, anything that is not followed by an = has no
predefined value. So in the case of count, you need to specify the sequence and the wordsize every
time – there is no default. However, start, by, freq, alphabet, and frame can all be ignored if you
want to use the default values – which are all listed after the equal sign. To make things easy to read,
it is generally recommended to put all the required options up front, followed by the optional parameters.
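The same required-vs-optional pattern can be seen with a base R function like round(): in ?round, x has no default and must always be supplied, while digits defaults to 0 and can be omitted:

```r
# x is required; digits is optional with a default of 0
round(3.14159);             # uses the default, digits = 0
round(3.14159, digits = 2); # overrides the default
```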
1) A name
Function and variable names can have a "." as part of the name in R, however, in most languages "." has
other meanings. I wouldn’t get used to naming functions with "."s in them as a matter of consistency.
3) Code
This is the most flexible part, but a couple of guidelines: A function should encapsulate a single "idea"
- it shouldn’t be too long, or try to do too much. It should be a single "verb", not an entire complex
sentence.
Only assume the existence of data given to it as a parameter ("option") - don’t refer to any outside
variables; they may not exist! You should be able to copy and paste an entire function to use in another
program.
Make sure to write comments to document what functions do, what parameters are, and what they
return. Since these functions won’t have a help file, you will need to put the information in the code!
Comment lines start with # in R.
To define a function, the first line looks a lot like the results you look for in ?cat, etc.:
The only difference is the "= function(...)". This is similar to how we have defined matrices, vectors,
etc. It is called a "constructor" function – it constructs what it is labeled! In this case, it makes a
function. Now, when you hit enter, you will see that R is expecting more (gives you the "+" instead of a
">"). It wants code!
So let's give it some. Functions are wrapped in "{}" - R's punctuation-based marker for the code
equivalent of a "paragraph". It groups all the code going into this function together.
Thought Questions
myResult = myFunction(1,2);
Now our function, myFunction, can be called in several different ways (even if all the parameters
aren’t used):
1) name: table_summary
2) parameters: Let's plan to give the function a file, and have it default to giving the average of each
column, but optionally also give histograms in a pdf (default named out.pdf) for each column as well.
(Don't worry if you don't know how to do this quite yet!)
Often it is easier to get some of the code down before starting to build the actual function.
Let's start a new R script window and make a general plan. This isn't quite pseudocode yet, but
it will help us get there. Basically, we want to define a goal (report averages across the dataframe and
plot) and start sketching in code that will get us closer to that goal. For instance, we will need to load
in a file and calculate the averages for each column. That’s a good start and it’s not too far outside of
things we’ve already done! Let’s get that down into the script window:
#load file
#calculate averages for each column
We know we want to hand our function the file name, so let’s plan ahead a bit for that. When
writing functions, it’s easier to start by making vague names for specific content that you are using to
build the function. For example, we’re going to be summarizing a table of wetland ecological data. So
let’s store the specific file name in a variable called "file":
file = "Wetland Data.txt";
#load file
#calculate averages for each column
This may seem kind of dumb right now, but trust me, it will make more sense in a bit. Let’s load
that file in and look at it. This file does have a header, and the row names are in the first column:
#load file
table = read.table(file=file, header=TRUE, row.names=1);
print(table);
Okay, the table seems to load in a sane way, we can see that all columns are numerical, so we can
get averages and build a histogram for each. Keep in mind though, this might cause problems if we
use our function on non-numerical data - such as the BlastResults.txt we’ve used in the past. That has
text-based categorical data in the first three columns. So you might want to add a comment to the code,
to remind yourself for later. Also, let’s remove that table print out, as we don’t need to see it every
time!:
#Input file must be all numerical - subset tables with categorical data before running!
file = "Wetland Data.txt";
#load file
table = read.table(file=file, header=TRUE, row.names=1);
Great - we’ll need to talk about a handy tool to do the column averages - for loops!
vector = c(0,1,2,3,4);
for (element in vector) #NOTE you can name the elements whatever you want!
{
cat("Element is", element, "\n");
}
This is pretty handy - it stops us from having to manually print each element. Notice that the loop
is defined in one line, then the code that it is looping through is contained in "{}"s. These are usually
set on their own lines, and the code inside them is indented. This is good practice for readability and
ease of troubleshooting (missing brackets are a common issue!).
But this is a vector, which only has one dimension. R is smart about figuring out what elements
you are referring to, but it isn't clairvoyant! R can figure out that the discrete elements in a vector are
the individual values stored in the indices - but that's because a vector is only one-dimensional. It
cannot predict whether you want row or column data in a matrix or dataframe, so the syntax changes
just a bit. Let's look at that:
It is pretty common to see "i" as the place holder in for loops (I’ve been told it stands for "iterator",
but you can use any letter). If you are dealing with two for loops, you may see "j" as well. But in this
case, i is just standing in place of the column number as the for loop loops over each column from 1
through the number of columns in matrix!
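A sketch of looping over the columns of a matrix: 1:ncol(matrix) generates the column numbers, and i stands in for each one in turn:

```r
# A small 2-row, 3-column matrix to loop over
m = matrix(1:6, nrow = 2, ncol = 3);

# i takes the values 1, 2, 3 - one per column
for (i in 1:ncol(m))
{
    cat("Column", i, "contains", m[, i], "\n");
}
```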
Let’s give this a whirl on our Wetland data, calculating the average for each column in our dataframe.
Back in our original script, let’s comment in the basic format of a for loop:
#Input file must be all numerical - subset tables with categorical data before running!
file = "Wetland Data.txt";
#load file
table = read.table(file=file, header=TRUE, row.names=1);
Looping over dataframes is very similar to looping over matrices, so we can use the same syntax.
Let’s print i just to make sure we see it running through the whole range of columns in our dataframe:
#Input file must be all numerical - subset tables with categorical data before running!
file = "Wetland Data.txt";
#load file
table = read.table(file=file, header=TRUE, row.names=1);
Great! Now let's get that mean calculated! Hm... I wonder how I can find out how to perform an
average... specifically a mean... Oh, look, there's a super easy function listed in the help docs! Let's
use mean() to calculate the average and print it. Also, let's remove the pointless printing of i:
#Input file must be all numerical - subset tables with categorical data before running!
file = "Wetland Data.txt";
#load file
table = read.table(file=file, header=TRUE, row.names=1);
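The loop itself can be sketched like this – shown here on a small stand-in dataframe so it runs on its own; in the script it loops over the table loaded from "Wetland Data.txt":

```r
# Stand-in for the wetland table (in the script, this comes from read.table)
table = data.frame(pH = c(6.5, 7.0, 7.5), depth = c(10, 20, 30));

# Loop over the columns, printing the mean of each
for (i in 1:ncol(table))
{
    cat("The average of", colnames(table)[i], "is", mean(table[, i]), "\n");
}
```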
#Input file must be all numerical - subset tables with categorical data before running!
file = "Wetland Data.txt";
#load file
table = read.table(file=file, header=TRUE, row.names=1);
Okay! Now we have some code working, let’s wrap it in a function before we go any further!
1) name: table_summary
2) parameters: Let's plan to give the function a file, and have it default to giving the average of each
column, but optionally also give histograms in a pdf (default named out.pdf) for each column as well.
(Don't worry if you don't know how to do this quite yet!)
So for our function we can follow the general format of the above. We need to include all of the
parameters described above:
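Putting those pieces together, the empty shell of the function might look like this (hist and out are the parameter names we'll use, with the defaults described above):

```r
# file has no default (must be supplied); hist and out have defaults
table_summary = function(file, hist=FALSE, out="out.pdf") {
  # code will go here
}
```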
Since we already have much of our code, we can build this function pretty quickly! However, we don't want to include the line where we define the file - that's going to be different every time we call the function. We simply have to plop the rest of the code into the function and indent:
table_summary = function(file, hist=FALSE, out="out.pdf") {
  #Input file must be all numerical - subset tables with categorical data before running!
  #load file
  table = read.table(file=file, header=TRUE, row.names=1);
  for (i in 1:ncol(table)) {
    #calculate average and print it
    avg = mean(table[,i]);
    print(avg);
  }
}
Hm.. nothing happens! Remember, you are still just defining the function – you need to call it to
have it actually do anything!
table_summary("Wetland Data.txt");
or
table_summary(file);
Thought Questions
Why don’t I have to declare ’out’ and ’hist’ if they are parameters?
They are predefined in our function - just like in any function help document, if the parameter is
followed by an "=" and some value, it has a default!
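A quick sketch with a toy function shows how defaults work:

```r
# 'greeting' has a default, so it is optional when calling
greet = function(name, greeting="Hello") {
  cat(greeting, name, "\n");
}
greet("world");                    # uses the default
greet("world", greeting="Howdy");  # overrides the default
```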
Congrats! You have written a function! Notice how little effort it took to turn working code into a function? The catch is that you have to think a little bit ahead - building our code around the more general "file" variable rather than a hard-coded name made this process much smoother. We'll see more of that in the lab, but if you forget to do this, you have to go back through the code and make it more flexible - which isn't as convenient.
Now, to illustrate a couple of useful concepts in R (namely if statements and how to write to a file),
let’s add a bit more functionality to our function.
5.6.3 If statements
Just like for loops, if statements are a very common logic structure in coding languages. Again, the syntax will change between languages, but the basic concept is the same - if something is true, then run the following code.
Let’s open another script and explore if statements:
var=11;
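A sketch of the full structure (the cutoffs here are arbitrary, just to show the syntax):

```r
var = 11;   # same variable as above
if (var > 10) {
  print("var is greater than 10");
} else if (var == 10) {
  print("var is exactly 10");
} else {
  print("var is less than 10");
}
```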
Note that the "else if" and "else" are on the same line as the closing bracket, but not contained
within the preceding clause. Also note else if can be a number of different variations in different
languages (elseif, else if, elsif, etc.).
Other than having to probably refresh your mind on what the syntax is, if statements are pretty intuitive. Let's add an if statement into our table summary function to handle what to do if we want to have histograms. Since we want these per column, it might seem like a good idea to add this code inside of our for loop, but that causes issues when we go to print the histograms to a file. So we'll put it outside the for loop:
#Input file must be all numerical - subset tables with categorical data before running!
table_summary = function(file, hist=FALSE, out="out.pdf") {
  #load file
  table = read.table(file=file, header=TRUE, row.names=1);
  for (i in 1:ncol(table)) {
    #calculate average
    avg = mean(table[,i]);
    #print average
    name = colnames(table[i]);
    cat(name);
    print(avg);
  }
  if (hist == TRUE) {
    print("histograms will go here!");
  }
}
table_summary("Wetland Data.txt");
table_summary("Wetland Data.txt", hist=TRUE);
You should see the extra output only if hist is TRUE! Now let's get that graphing part done!
But we can also use other plotting functions, such as hist(), which is of particular interest for this function. We can define breaks if we want to, but we can also let the function figure out what makes sense:
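For example, with some made-up values standing in for one column of our table:

```r
vals = rnorm(100);        # made-up data for illustration
hist(vals);               # hist() picks the breaks for us
hist(vals, breaks=20);    # or we can suggest how many breaks to use
```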
Let’s add the code for histograms to our function! We have to add a for loop to loop through the
columns inside of our if statement:
#Input file must be all numerical - subset tables with categorical data before running!
table_summary = function(file, hist=FALSE, out="out.pdf") {
  #load file
  table = read.table(file=file, header=TRUE, row.names=1);
  for (i in 1:ncol(table)) {
    #calculate average
    avg = mean(table[,i]);
    #print average
    name = colnames(table[i]);
    cat(name);
    print(avg);
  }
  if (hist == TRUE) {
    #plot a histogram for each column
    for (i in 1:ncol(table)) {
      hist(table[,i], main=colnames(table[i]));
    }
  }
}
table_summary("Wetland Data.txt");
table_summary("Wetland Data.txt", hist=TRUE);
See all the histograms popping up in the side panel? You can scroll through them with the arrows, or save them to a file individually. But... we want to have these write to a file.
But sink( ) will not redirect graphic output, which we are interested in doing for our hist function!
To redirect graphic output use one of the following functions. You will have to use dev.off() to return
output to the terminal – if you forget, the file will be garbled.
Function Output to
pdf("mygraph.pdf") pdf file
win.metafile("mygraph.wmf") windows metafile
png("mygraph.png") png file
jpeg("mygraph.jpg") jpeg file
bmp("mygraph.bmp") bmp file
postscript("mygraph.ps") postscript file
Just like with sink(), we have to turn off the device to make sure we return our output to the console. To do this we use dev.off(). This is critical for functions where you are producing a non-text format, as the lack of an End of File (EOF) marker can cause the output to be unreadable!
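The general pattern for any of the devices above looks like this ("mygraph.pdf" is just a placeholder name):

```r
pdf("mygraph.pdf");    # open the device: graphics now go to the file
hist(rnorm(100));      # any plot made here lands in the pdf
dev.off();             # close the device so the file is complete and readable
```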
Use a full path in the file name to save the graph outside of the current working directory. Let’s
output our histograms to a pdf that we can name with the out parameter:
#Input file must be all numerical - subset tables with categorical data before running!
table_summary = function(file, hist=FALSE, out="out.pdf") {
  #load file
  table = read.table(file=file, header=TRUE, row.names=1);
  for (i in 1:ncol(table)) {
    #calculate average
    avg = mean(table[,i]);
    #print average
    name = colnames(table[i]);
    cat(name);
    print(avg);
  }
  if (hist == TRUE) {
    #open the pdf, plot every histogram, then close the file
    pdf(out);
    for (i in 1:ncol(table)) {
      hist(table[,i], main=colnames(table[i]));
    }
    dev.off();
  }
}
table_summary("Wetland Data.txt");
table_summary("Wetland Data.txt", hist=TRUE);
Now look in the files tab – you have a new pdf that you can view! This should be saved as "out.pdf", but we could define a different name by using:
table_summary("Wetland Data.txt", hist=TRUE, out="my_summary.pdf");
! Note that we put the pdf() and dev.off() outside the for loop. If we don't, the file will open and close each time the loop executes, and it will over-write our output file with only a single graph!
This is also why we had to put the hist if statement outside of the first for loop - there would be no way to open and close the file just once if we put the if statement inside it.
Before you save your function, you should remove the lines before and after the function. This gives you a file that is solely the function, meaning you could load it in R, move it to another RStudio installation on a different computer, or load it via the terminal!
You can also easily change things – you can change mean to median, or highest, or lowest. You can
use any file you’d like – try it with both example files. You could have it graph bar graphs, etc. It’s very
flexible!
Also, this is a good primer on how to build code – make a logical plan, and build and test in pieces!!
That’s the basics of R – you can now handle default data types, know how to find more libraries to
add more data types, get and read the help, and write real R code! While this may not all sink in right
away, you can use these notes as a reminder at any time. It takes time, and you will be googling your
way through it for years, but at least you can now read and get help!
The next chapter will help you practice making your own function with a little less hand-holding!
6. R Lab 3: Building a Sliding Window Analysis
When we have data that is sequential in some way - a DNA sequence, environmental values over a
transect, climate over time - we can visualize trends with a simple plot. However, looking at the values
individually leads to very messy plots with lots of noise. So, one thing we can do to smooth out the
noise is to look at blocks of the data at a time, and get an average value of the data for that "window".
Then we can slide to the next block of data ("window"), and calculate the value for that subset of the
data. Then we can graph the values for these windows, allowing the trends to be clearer since we are
averaging out some of the noise.
Another aspect of these plots that is really useful is that with a block of data, we can calculate an
expected value based on the average, and compare it to the value we are actually observing.
For example, epigenetic studies are increasingly interested in calculating the observed amount of CG DNA words (sometimes called CpG for the phosphate in between the bases) in relation to the expected amount of CG DNA words in a sequence. Methylated Cs in a CpG motif will spontaneously convert into thymine, leading to a degradation in the number of CG DNA words in the sequence. If there is no methylation in the species (this is true of fruit flies, for instance!), there will not be a degradation in the number of CG DNA words, and the expected number will be similar to the observed number. This is a
really quick and cheap way to determine if a target sequence (genome) has this kind of methylation
without having to pay for the special chemistry to sequence the methylated C’s!
Let’s investigate the observed CG / expected CG (which we will call CpGoe) value for the dengue
sequence: First, we have to reload our dengue sequence and review how to get DNA word frequencies:
library(seqinr);
dengue = read.fasta("dengue.fasta");
#grab only the sequence as a vector
dengueseq = getSequence(dengue$NC_001477.1);
We calculated a very similar metric to CpGoe in the first lab when we calculated rho!
Problem 6.1 Review what you did to calculate rho and adapt it to calculate CpGoe!
This should give you a value of 0.452 or 45.2%. Compared to published genomes where we know
methylation is occurring, this indicates....
While this is useful as a general metric of overall possible methylation, we really want to see how
this metric of possible methylation changes over the length of a sequence... So, let’s make a sliding
window plot of the CpGoe!
First let’s start in a new R script window. Let’s make a generic name for our sequence and a variable
to hold our window size as we know we will have to make this flexible in our future function:
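Something like the following, assuming the dengue loading script above has been run (1789 is just the window size we'll use here):

```r
sequence = dengueseq;   # generic name, so the code generalizes to any sequence
windowsize = 1789;      # how many bases each window covers
```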
6.2 Building the function
Problem 6.2 Make a window from the first base through the 1789th base in the sequence, calling the
variable "window".
Problem 6.3 Calculate the rest of the windows manually to get practice using ranges of vectors. How
do you know when to stop?
Well, that was fun. How about we automate that so we never have to have that level of fun again?
First we want to find the pattern in how we calculated the windows. It will look a bit like this:
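Written out by hand (for our windowsize of 1789), the windows follow this pattern:

```r
window = sequence[1:1789];      # first window
window = sequence[1790:3578];   # second window
window = sequence[3579:5367];   # third window
# ...and so on, until the ranges run past the end of the sequence
```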
Let’s make a vector of starting points for these windows, which should be easy since we know the
starting point (1), the ending point (one less than the length of the sequence), and the increment (1789).
Hmm, that sounds a LOT like a job for seq()!
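As a refresher, seq() builds a vector from a start, an end, and an increment (toy numbers here):

```r
# from 1 to 20 in steps of 5
seq(1, 20, by=5);   # 1 6 11 16
```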
Thought Questions
Problem 6.4 Create a vector (call it startpoints) of start points using seq()! Don’t hardcode the length
of our sequence - we do eventually want to make this usable for other sequences as well!
Great start. Now let’s build a loop that calculates that window for each startpoint. Hm... for.. that
sounds like a for loop! We’ll want to loop through each of the startpoints from 1 through the end of the
list:
for (i in 1:length(startpoints)) {
#window = sequence[(start of window):(start value + windowsize)]
}
for (i in 1:length(startpoints)) {
window = sequence[startpoints[i]:(startpoints[i] + windowsize)];
}
for (i in 1:length(startpoints)) {
window = sequence[startpoints[i]:(startpoints[i] + windowsize)];
fg=count(window,1)["g"]/sum(count(window,1));
fc=count(window,1)["c"]/sum(count(window,1));
fcg=count(window,2)["cg"]/sum(count(window,2));
CpGoe = fcg/(fc*fg);
print(CpGoe);
}
Interesting stuff. Before we move on to plotting, this seems like a good time to convert this into a
function before things get too messy. Luckily, we were smart and planned ahead - we generalized some
variables ahead of time, so now function conversion will be easy!
Your R script should look a bit like this thus far (it's a good time to add some comments too!):
library(seqinr);
#load sequence
dengue = read.fasta("dengue.fasta");
#grab only the sequence as a vector
dengueseq = getSequence(dengue$NC_001477.1);
#generic names so we can turn this into a function later
sequence = dengueseq;
windowsize = 1789;
#start point of each window
startpoints = seq(1, length(sequence)-1, windowsize);
for (i in 1:length(startpoints)) {
  #grab the window
  window = sequence[startpoints[i]:(startpoints[i] + windowsize)];
  #calculate observed/expected CpG for the window
  fg=count(window,1)["g"]/sum(count(window,1));
  fc=count(window,1)["c"]/sum(count(window,1));
  fcg=count(window,2)["cg"]/sum(count(window,2));
  CpGoe = fcg/(fc*fg);
  print(CpGoe);
}
Not all of your code will be in the function - what needs to stay outside of the function so that you
can pass the function the sequence and windowsize variables? What will you need to add to the script
to run the function?
CpGbyRange(sequence, windowsize);
or
CpGbyRange(dengueseq, 1789);
Thought Questions
Let’s use a vector to store the values of CpGoe. It should be the same length as startpoints, as there
are the same number of startpoints as there are windows and values for each window.
Problem 6.6 Add a vector (called vectCpG) to the function with the same length as startpoints.
Now we’ll need to fill that vector with CpGoe values. Since we’re looping through startpoints,
we can use that same loop to fill the vectCpG. In fact, we can just use i, since the index of startpoints
should match the index of the vector we want to fill with that value.
Problem 6.7 Store CpGoe in the i index of vectCpG, after you print the value!
Now we can plot those values! This is actually pretty easy - just add a plot command after that for
loop to plot startpoints vs vectCpG, using both a line and points.
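With some made-up values standing in for startpoints and vectCpG, that plot command looks like (type="b" means both lines and points):

```r
# made-up values purely for illustration
startpoints = c(1, 1790, 3579, 5368, 7157, 8946);
vectCpG = c(0.50, 0.40, 0.45, 0.52, 0.38, 0.47);
plot(startpoints, vectCpG, type="b", xlab="position", ylab="CpGoe");
```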
After all of that, we have a plot, with some substantial troughs and peaks. These would be
interesting to compare against the genes in the area. It can also be helpful in planning epigenetic
projects when you don’t have methylation data quite yet, since CpGoe can be used as a surrogate value.
(See https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5915941/, the paper that inspired this lab).
Can you read the following function and see where those changes occurred? This was not something I changed in one go – it was slow changes, testing along the way. I included it for completeness, and for your reference in thinking about how to do this! Also, look how far you've come!!
library(seqinr);
#load sequence
dengue = read.fasta("dengue.fasta");
#grab only the sequence as a vector
dengueseq = getSequence(dengue$NC_001477.1);
sequence = dengueseq;
windowsize = 1789;

CpGbyRange = function(sequence, windowsize, step=windowsize) {
  #step controls how far apart the window start points are
  startpoints = seq(1, length(sequence)-1, step);
  vectCpG = vector(length=length(startpoints));
  for (i in 1:length(startpoints)) {
    window = sequence[startpoints[i]:(startpoints[i] + windowsize)];
    fg=count(window,1)["g"]/sum(count(window,1));
    fc=count(window,1)["c"]/sum(count(window,1));
    fcg=count(window,2)["cg"]/sum(count(window,2));
    CpGoe = fcg/(fc*fg);
    print(CpGoe);
    vectCpG[i]=CpGoe;
  }
  #plot the values for each window
  plot(startpoints, vectCpG, type="b");
}
CpGbyRange(sequence, windowsize, 100);
CpGbyRange(sequence, windowsize, 200);
6.3 Final Comments
! This will cause weird values in the last window. We could fix that with another bit of work, but I
think you get the point by now!
This lab is based on the following tutorials:
• https://www.datacamp.com/community/tutorials/pca-analysis-r
• https://stat.ethz.ch/pipermail/r-sig-ecology/2011-May/002125.html
• https://ourcodingclub.github.io/2018/05/04/ordination.html
• https://rpubs.com/dillmcfarlan/R_microbiotaSOP
but with a lot of added detail and information to dovetail into our course better.
Principal Component Analysis (PCA) is a useful technique for exploratory data analysis, allowing
you to better visualize the variation present in a dataset with many variables. It is particularly helpful in
the case of "wide" datasets, where you have many variables for each sample. For example, it is very
common to use these plots to visualize microbiome data or identify important ecological factors in
niche partitioning.
So let’s start with PCA in R.
78 Chapter 7. Alternative R Lab 2: Ordination in R with ggbiplot
PCA is particularly handy when you’re working with "wide" datasets. By "wide" we mean data that has
a lot of characteristics (measurements) for each sample (row). These are usually in tables, which are
wide due to the number of columns (measurements). But why is PCA a good tool to use with "wide"
data?
Part of the problem with this data format is that it is very difficult to plot the data in its raw form – there are too many dimensions. The purpose of PCA is to visualize trends in the data by collapsing a large number of measures into a small number of dimensions, which hopefully helps identify which samples are similar to one another and which are largely different.
This isn't a statistics course, so I'll keep the explanation of how this is done brief (see the bottom for some more in-depth resources). Basically, math is done to combine the measurements, each weighted differently, to create individual principal components that explain as much of the variation in the data as possible.
So for example, look at the graph below of car characteristics:
First, we can see the individual characteristics for the cars in the center, such as gear, mpg, etc. What these are specifically is not important at the moment - just know these are characteristics measured. When plotted, these characteristics are all spreading the data in different directions and by different magnitudes (shown by the length of the line – though these are largely similar). Each arrow (a) has an x (ax) and a y (ay) component:
Components of eigenvectors
For instance, mpg has a really small y component and a relatively large x component. This means that mpg contributed much more to PC1 (x) than to PC2 (y). qsec, on the other hand, has relatively even x and y components, meaning it contributed similarly to PC1 and PC2. Both PC1 and PC2 are composed of some weighted value of each measurement (mpg was weighted higher than gear in PC1, for example). Since mpg and cyl have a large influence on PC1, which explains 62.8% of the variance, these two measures are likely largely defining the variance in the data.
Another aspect of the data that becomes evident in PCA plotting is correlation – when several variables all contribute strongly to the same principal component, they are likely correlated. Mpg and cyl both contribute largely to PC1, and are correlated – cars with more cylinders consume more gas (and so have lower mpg).
You may see a lot of reference to eigenvalues and eigenvectors when looking into PCA plots. Simply
put, an eigenvector is a direction, such as "vertical" or "45 degrees", while an eigenvalue is a number
telling you how much variance there is in the data in that direction. The reddish brown lines in the
plot above are vectors of the raw data; the combined data used to explain the data (PC1 and PC2) are
eigenvectors, with eigenvalues (% variance explained). PCA calculates a bunch of eigenvectors and
eigenvalues. The eigenvector with the highest eigenvalue is, therefore, the first principal component.
Once the top 2-3 eigenvectors are selected, the plot is made rotating the data to make PC1 and PC2
horizontal and vertical.
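You can compute eigenvalues and eigenvectors directly in base R with the eigen() function; here on a tiny made-up covariance matrix:

```r
m = matrix(c(2, 1, 1, 2), nrow=2);   # a made-up 2x2 covariance matrix
e = eigen(m);
e$values    # eigenvalues: 3 and 1 - how much variance lies along each direction
e$vectors   # eigenvectors: the directions themselves
```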
Thought Questions
So wait, there are possibly more eigenvalues and eigenvectors to be found in one dataset?
That’s correct! The number of eigenvalues and eigenvectors that are produced is equal to the
number of dimensions the dataset has. In the example that you saw above, there were 9 variables,
so the dataset was nine-dimensional (hard to plot!). That means that there are nine eigenvectors
and eigenvalues. The highest two explained 85.9% of the variance, and they were plotted. The
others are available to be viewed in R, but are not really that informative in this case.
We can also re-frame a dataset in terms of these other eigenvectors and eigenvalues without changing the underlying information. For example, you may see PC2 vs. PC3 in plots. Re-framing a dataset in terms of a different set of eigenvalues and eigenvectors does not entail changing the data itself; you're just looking at it from a different angle (in nine-dimensional space...), one which may represent the data better.
For example, consider this three dimensional shape:
Three dimensional object in three different two dimensional projections, taken from
https://en.wikipedia.org/wiki/Ambigram/media/File:3d-ambigram.jpg
The shape of the object (data) does not change, but when you compress it into two dimensions, it
looks different from different angles. The same applies here, but in more dimensions!
Now that you've seen some of the theory behind PCA, you're ready to see it in action! We'll use the built-in mtcars dataset, which records the following measurements for 32 cars:
• mpg: Fuel consumption (Miles per (US) gallon): more powerful and heavier cars tend to
consume more fuel.
• cyl: Number of cylinders: more powerful cars often have more cylinders
• disp: Displacement (cu.in.): the combined volume of the engine’s cylinders
• hp: Gross horsepower: this is a measure of the power generated by the car
• drat: Rear axle ratio: this describes how a turn of the drive shaft corresponds to a turn of the
wheels. Higher values will decrease fuel efficiency.
• wt: Weight (1000 lbs): pretty self-explanatory!
• qsec: 1/4 mile time: the car's speed and acceleration
• vs: Engine block: this denotes whether the vehicle's engine is shaped like a "V", or is a more
common straight shape.
• am: Transmission: this denotes whether the car’s transmission is automatic (0) or manual (1).
• gear: Number of forward gears: sports cars tend to have more gears.
• carb: Number of carburetors: associated with more powerful engines
! Note that the units used vary and occupy different scales.
BEFORE YOU BEGIN: make sure ggplot2 is not loaded, as it will conflict with packages in this lab.
You can either uncheck the box in the Packages tab or you can run the following command:
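One way to do that unload from the console (this only runs the detach if ggplot2 is actually attached, so it is safe to run either way):

```r
# detach ggplot2 only if it is currently attached
if ("package:ggplot2" %in% search()) {
  detach("package:ggplot2", unload=TRUE);
}
```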
Problem 7.1 How would you grab this dataset (similar to how we grabbed wrld_map)?
Because PCA works best with numerical data, you’ll need to exclude the two categorical variables
(vs and am).
Problem 7.3 How would you subset the data to remove vs and am?
Problem 7.4 Now that all of the data is the same type (numerical), how would you make this a matrix
called mtcars_matrix?
You should now have a matrix of 9 columns (measures) and 32 rows (samples). We are now going
to use the prcomp() function.
Problem 7.5 What package is this part of? How do you know?
Problem 7.6 How would you find some options for this function? What does center do? What does
scale. do?
Let’s run the prcomp() function on our mtcars_matrix data, center it, and scale it. Let’s also save
the output as mtcars.pca.
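The call might look like this (the matrix subset is rebuilt here so the sketch is self-contained; vs and am are columns 8 and 9 of mtcars):

```r
# drop the two categorical columns (vs, am) and convert to a matrix
mtcars_matrix = as.matrix(mtcars[, c(1:7, 10, 11)]);
# center and scale each measure, then compute the principal components
mtcars.pca = prcomp(mtcars_matrix, center=TRUE, scale.=TRUE);
```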
Let’s take a look at this object with our handy summary() function. Remember, this function is
great for giving some general information about tabular data in particular.
summary(mtcars.pca);
Importance of components: PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9
Standard deviation 2.3782 1.4429 0.71008 0.51481 0.42797 0.35184 0.32413 0.2419 0.14896
Proportion of Variance 0.6284 0.2313 0.05602 0.02945 0.02035 0.01375 0.01167 0.0065 0.00247
Cumulative Proportion 0.6284 0.8598 0.91581 0.94525 0.96560 0.97936 0.99103 0.9975 1.00000
As we expected, we will get 9 principal components, because we have nine measures. Each of
these explains some portion of the total variation in the dataset. You can see that PC1 explains 62.8%
of the data, and PC2 explains another 23.13%, for a cumulative percentage of 85.9%. This should seem
very familiar!
Again, this means that by just knowing where a sample sits on the first two components, you get a
really good idea of where it sits in relation to all the other data.
Problem 7.8 How would you check if ggbiplot is installed? How would you install it and enable it?
Installing from CRAN won’t work – what should you do next?
ggbiplot takes a pca object, which we have! Let’s feed it our data and see what happens by default:
ggbiplot(mtcars.pca);
You should see the car PCA, which will look very familiar. Because we centered our data, we see
all vectors coming from the center point. We can see all the points, and get an idea of the distribution
of the data, but not a great idea of where any one sample sits or what that cluster under wt might be.
Problem 7.9 How would you add the row names of mtcars to be labels to the ggbiplot?
Now you can see which cars are similar to one another. For example, the Maserati Bora, Ferrari Dino and Ford Pantera L all cluster together at the top. This makes sense, as all of these are sports cars with larger engines, lower fuel efficiency, more gears, and higher weight. You can also see that the Camaro and Plymouth Duster 360 have terrible mpg and high horsepower (they're 1970s muscle cars!), whereas the Civic was a great efficiency car even back then.
! You can also plot this with biplot(), a similar plotting function to plot(), meaning it uses the same parameters (xlab, main, col, etc). However, ggbiplot adds a lot of cool features that we'll go through below. But if you want a basic plot, biplot(mtcars.pca); will get you there with simpler options for making it pretty! Remember, the plot() family is easier, the ggplot() family is more feature rich!
But let’s look at the data further – there is more to learn and see here!
mtcars.country=c("Japan","Japan","Japan","US","US","US","US","Europe","Europe","Europe",
"Europe","Europe","Europe","Europe","US","US","US","Europe","Japan","Japan","Japan","US",
"US","US","US","Europe","Europe","Europe","US", "Europe","Europe","Europe");
Problem 7.10 How would you get ggbiplot to use these groups and form ellipses around them?
Now we see something interesting: the American cars form a distinct cluster to the right. Looking
at the axes, you see that the American cars are characterized by high values for cyl, disp, and wt (this
was the muscle car era in America after all!). Japanese cars, on the other hand, are characterized by
high mpg. European cars are somewhat in the middle and less tightly clustered than either group.
We could also look at other PC’s that explain less information and look to see how the groups line
up with those eigenvectors. Let’s have a look at PC3 and PC4. Is there an option in ggbiplot to choose
which PC? There is!
Problem 7.11 Can you find the option to choose which Principal Components are plotted?
We don’t see much here, but this isn’t too surprising. PC3 and PC4 explain very small percentages
of the total variation, so it would be surprising if you found that they were very informative and
separated the groups or revealed apparent patterns. This is a good thing to try when exploring your data
though, as you may have less heavily weighted "top" PCs.
As ggbiplot is based on the ggplot function, you can use the same set of graphical parameters to
alter your biplots as you would for any ggplot. Think of this like the shared vocabulary in the various
plot functions we used while mapping, but in this case a shared vocabulary for most "gg" plotting
packages:
ggplot does have a bit more complex syntax, which we saw while working with ggmaps (Google
mapping package). Generally, you construct your plot first (as we have been doing) and then add in
aspects you want to tweak – just like we made the ggmap and then added (with "+") the points we
wanted to see.
Google maps example from before:
Mapping the PCA with similar syntax - + is used to add scale_colour_manual (just like col=) and
ggtitle (just like main=):
! This command is only split for the sake of printing - it must be all on the same line when you run
the code.
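A sketch of what that combined command might look like (the colour values and title are arbitrary choices; mtcars.country is the grouping vector defined above):

```r
ggbiplot(mtcars.pca, ellipse=TRUE, labels=rownames(mtcars), groups=mtcars.country) +
  scale_colour_manual(name="Origin", values=c("forest green", "red3", "dark blue")) +
  ggtitle("PCA of mtcars dataset");
```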
#bind the entry to the matrix – similar to how we did data.frame(name, lat, long, pop) before!
mtcarsplus <- rbind(mtcars_matrix, spacecar);
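If you need a stand-in for spacecar (nine deliberately extreme, made-up values, one per remaining measure), the bind and re-computed PCA might look like:

```r
# hypothetical extreme sample - one made-up value per measure
spacecar = c(1000, 60, 50, 500, 0, 0.5, 2.5, 0, 1);
mtcars_matrix = as.matrix(mtcars[, c(1:7, 10, 11)]);  # as built earlier
mtcarsplus = rbind(mtcars_matrix, spacecar);
# re-run the PCA with the outlier included
mtcarsplus.pca = prcomp(mtcarsplus, center=TRUE, scale.=TRUE);
```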
Yikes! That looks awful. Our nice PCA plot has changed dramatically – because we are including
a HIGHLY variable sample.
When you consider this result in a bit more detail, it actually makes perfect sense. In the original
dataset, you had strong correlations between certain variables (for example, cyl and mpg), which
contributed to PC1 and separated our groups from one another along this axis. However, when you
perform the PCA with the extra sample, the same correlations are not present, which warps the whole
dataset. In this case, the effect is particularly strong because your extra sample is an extreme outlier in
multiple respects.
If you want to see how the new sample compares to the groups produced by the initial PCA, you
need to project it onto that PCA, the second option we discussed.
But how do we scale the values the same way our magic prcomp() function did? Well, two things happened – we centered and scaled all the values, and we rotated the graph to make PC1 and PC2 the x and y axes. So we need to do both!
The centering and scaling vectors are nicely stored in the "center" and "scale" variables within the pca object. Let's use the scale() function to scale the spacecar vector using mtcars.pca$center and mtcars.pca$scale:
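A sketch of that scaling step (s.sc is just a name we're assuming; t() turns the vector into a one-row matrix so scale() can work on it):

```r
# center and scale the new sample exactly the way prcomp() treated the original data
s.sc = scale(t(spacecar), center=mtcars.pca$center, scale=mtcars.pca$scale);
```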
We also have to rotate the sample to match up with the vertical and horizontal eigenvectors. Remember we said that we aren't changing the data, just changing the angle? We have to rotate it to match the rotation we did when plotting the initial function. Remember that ABC image above? We want to make sure we are viewing the data from the same angle as the graph! The pca object stores the rotation it did in mtcars.pca$rotation:
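And the rotation step, using matrix multiplication (s.pred is another assumed name):

```r
# rotate the scaled sample into PC space using the stored rotation matrix
s.pred = s.sc %*% mtcars.pca$rotation;
```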
Thought Questions
What is %*%? It’s matrix multiplication! We are multiplying s.sc (the vector) by the rotation
factor in the pca.
Then we just add the pca object – which just places it on the graph without changing the calculation
of the eigenvectors – and re-graph!
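One way to do that (a sketch, assuming s.pred from above): copy the pca object, append the projected sample to its scores, and plot again:

```r
mtcars.plusproj.pca = mtcars.pca;                              # copy the pca object
mtcars.plusproj.pca$x = rbind(mtcars.plusproj.pca$x, s.pred);  # add the projected point
ggbiplot(mtcars.plusproj.pca, labels=c(rownames(mtcars), "spacecar"),
         groups=c(mtcars.country, "spacecar"), ellipse=TRUE);
```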
This result is drastically different. Note that all the other samples are back in their initial positions,
while spacecar is placed somewhat near the middle. Our extra sample is no longer skewing the overall
distribution, but it can’t be assigned to a particular group – it is so far in another direction it doesn’t
show up well in this view.
7.3 A note on functions
Thought Questions
So which is better, the projection or the re-computation of the PCA? As always, it depends
on the question that you want to answer. The re-computation shows the outlier, but makes
interpretation of the rest of the data harder. If you are interested in the outlier or demonstrating
the variance in the whole set, consider this. The projection makes it clear that the new sample
is not part of any existing group, but doesn’t warp your other data. This could be useful in
checking if subsampling is sufficient (won’t be warped by additional data), etc.
However, since these are exploratory analyses, performing both approaches is often useful. This
type of exploratory analysis is often a good starting point before you dive more deeply into a
dataset.
If you don’t know when to use what, it’s a great chance to talk to a statistician! Just know there are
different ways with different applications and you should consider it when doing your analyses!
8. Answers to Labs
8.1 Lab 1
2.1: How would you output the length of the sequence in dengueseq? Store the output as seqlength.
seqlength=length(dengueseq);
2.2: Why is the [] after the function? table(dengueseq) is a list - think of it as a named vector. So the
["g"] grabs the "g" entry in the table.
2.3: How would you manually calculate GC content given the above information? Can you do it using
variables instead of the raw numbers? Store the value as GCcontent.
GCcontent=(table(dengueseq)["g"]+table(dengueseq)["c"])/length(dengueseq);
2.4: Can you find a function in the seqinr package that would do this for you? Where do you look?
GC(dengueseq);
2.5: Use the help options to count the two letter words in the gene. Save the output as Count2words.
Count2words = count(dengueseq, 2);
a. count(dengueseq, 1);
b. fG = count(dengueseq,1)["g"]/sum(count(dengueseq,1));
fC = count(dengueseq, 1)["c"]/sum(count(dengueseq,1));
#these are frequencies, so they are divided by the total - otherwise you just end up with discrete counts!
c. count(dengueseq, 2);
d. fGC = count(dengueseq,2)["gc"]/sum(count(dengueseq,2));
e. pGC=fGC/(fG*fC);
#should be 0.8651367
2.7: Can you find a function that would calculate this for you (do it by hand first to practice R syntax)?
rho(dengueseq);
2.8: Using the command zscore, are the "gc" or "cg" words significantly over- or underrepresented?
You should get the following results using the "base" model:
rho(gc) = 0.8651367
zscore(gc) = -4.2314011
rho(cg) = 0.4516013
zscore(cg) = -17.2062694
For rho, the null hypothesis is 1, with anything less than 1 being underrepresented and anything greater than 1 being overrepresented. So, both of these dinucleotide "words" are underrepresented.
Zscore tells you how many standard deviations away from average the score is. So "cg" is far more
underrepresented than "gc".
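As a reminder of what a z-score measures, here is a toy base-R calculation with made-up numbers (observed count, expected count, and standard deviation are all invented for illustration):

```r
# made-up numbers: an observed count, its expected value, and the standard deviation
observed = 60;
expected = 100;
stdev = 10;
# z-score: how many standard deviations the observation sits from the expectation
z = (observed - expected) / stdev;
# -4: four standard deviations below expected, i.e. strongly underrepresented
print(z);
```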
8.2 Lab 2
4.1: Can you find the transpose function? What is the default output of transpose?
The function is t(), and the default output is a matrix!
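A quick self-contained check of that default behavior, using a tiny made-up data frame:

```r
# a small data frame of made-up measurements
df = data.frame(x = c(1, 2), y = c(3, 4));
# transpose: rows become columns
tdf = t(df);
# even though the input was a data frame, the output is a matrix
print(is.matrix(tdf));
print(dim(tdf));
```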
4.3: Look at the rda() function and compute the principal components of the Data_transposed (remem-
ber this is the dataframe without categorical data), and save the results as a variable called "pca".
?rda
pca=rda(Data_transposed)
4.4: Start typing "?biplot." into the command window, and look through the options it gives in the
dropdown. Which of these do you think R will use given our current data?
?biplot.rda
4.5: Do you see any pattern based on just the platform data?
No, there is no real pattern here.
4.6: Look at some of the other metadata in Data_with_metadata, and see if any show a technical
bias. Also look at groups, and see if there is a pattern there (this would be actual experimental effect).
points(pca, display=c("sites"), pch=20, col=factor(Data_with_metadata$Title))
points(pca, display=c("sites"), pch=20, col=factor(Data_with_metadata$BioProject))
points(pca, display=c("sites"), pch=20, col=factor(Data_with_metadata$groups))
4.7: How could you tell which distance options you have in vegdist()? Let’s use the vegdist() function,
and produce a Euclidean distance matrix of our car data.
?vegdist();
dist_matrix=vegdist(Data_transposed, method="euclidean");
4.8: Can you figure out how to do this from the help section? Call your new variable pcoa.
pcoa=wcmdscale(dist_matrix, eig=TRUE);
4.9: Take a crack at this yourself! Remake the graph, and add standard error ellipses.
plot(pcoa, type="p")
ordilabel(pcoa, labels=rownames(Data_transposed));
points(pcoa$points, pch=20, col=c("black", "red", "green")[Data_with_metadata$groups]);
legend(-50, 30, legend=c("adult","baby","elderly"), col=1:3,pch=20);
ordiellipse(pcoa, groups=Data_with_metadata$groups, display="sites", kind="se", conf=0.99,
label=FALSE,col="black", draw="lines", alpha=200, show.groups = c("adult"), border=FALSE);
ordiellipse(pcoa, groups=Data_with_metadata$groups, display="sites", kind="se", conf=0.99,
label=FALSE,col="red", draw="lines", alpha=200, show.groups = c("baby"), border=FALSE);
ordiellipse(pcoa, groups=Data_with_metadata$groups, display="sites", kind="se", conf=0.99,
label=FALSE,col="green", draw="lines", alpha=200, show.groups = c("elderly"), border=FALSE);
4.10: Try proving this to yourself by replotting the PCoA using cmdscale() rather than wcmdscale().
ord = cmdscale(dist_matrix, eig = TRUE);
plot(ord$points);
#remember plot changes with different input types, and the output of cmdscale is not the same as wcmdscale!
8.3 Lab 3
6.1: Review what you did to calculate rho and adapt it to calculate CpGoe.
fg=count(dengueseq,1)["g"]/sum(count(dengueseq,1));
fc=count(dengueseq,1)["c"]/sum(count(dengueseq,1));
fcg=count(dengueseq,2)["cg"]/sum(count(dengueseq,2));
CpGoe = fcg/(fc*fg);
6.2: Make a window from the first base through the 1789th base in the sequence, calling the variable
"window".
window = sequence[1:1790];
#note: R ranges include both endpoints, so this grabs bases 1 through 1790 - each window shares its last base with the next window's start
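It is worth double-checking how R treats range endpoints with a quick toy example (made-up vector):

```r
# a toy vector to test range indexing
x = 1:10;
# both endpoints are included in R, unlike in some other languages
w = x[2:5];
print(w);         # 2 3 4 5
print(length(w)); # 4 elements, not 3
```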
6.3: Calculate the rest of the windows manually to get practice using ranges of vectors. How do you
know when to stop?
start   end
1       1790
1790    3579
3579    5368
5368    7157
7157    8946
8946    10736
6.4: Create a vector (call it startpoints) of start points using seq()! Don’t hardcode the length of
our sequence - we do eventually want to make this usable for other sequences as well!
startpoints = seq(1,length(sequence)-1,windowsize);
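Since seq() does the heavy lifting here, a toy base-R check (with a made-up sequence length and window size) shows how the step argument behaves:

```r
# made-up toy values: a 20-base "sequence" and a window size of 6
len = 20;
windowsize = 6;
# from 1 up to len-1, stepping by windowsize
starts = seq(1, len - 1, windowsize);
print(starts);  # 1 7 13 19
```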
6.5: Not all of your code will be in the function - what needs to stay outside of the function so that
you can pass the function the sequence and windowsize variables? What will you need to add to the
script to run the function?
library(seqinr);
#load sequence - this stays outside the function
dengue = read.fasta("dengue.fasta");
#grab only the sequence as a vector
dengueseq = getSequence(dengue$NC_001477.1);
windowsize = 1789;
#the function itself - startpoints and the loop live inside it
CpGbyRange = function(sequence, windowsize){
    startpoints = seq(1, length(sequence)-1, windowsize);
    for (i in 1:length(startpoints)){
        #grab the current window of the sequence
        window = sequence[startpoints[i]:(startpoints[i]+windowsize)];
        fg=count(window,1)["g"]/sum(count(window,1));
        fc=count(window,1)["c"]/sum(count(window,1));
        fcg=count(window,2)["cg"]/sum(count(window,2));
        CpGoe = fcg/(fc*fg);
        print(CpGoe);
    }
}
#the call you add to the script to actually run the function
CpGbyRange(dengueseq, windowsize);
6.6: Add a vector (called vectCpG) to the function with the same length as startpoints.
library(seqinr);
#load sequence
dengue = read.fasta("dengue.fasta");
#grab only the sequence as a vector
dengueseq = getSequence(dengue$NC_001477.1);
windowsize = 1789;
CpGbyRange = function(sequence, windowsize){
    startpoints = seq(1, length(sequence)-1, windowsize);
    #a vector with one slot per window
    vectCpG = vector(length=length(startpoints));
    for (i in 1:length(startpoints)){
        window = sequence[startpoints[i]:(startpoints[i]+windowsize)];
        fg=count(window,1)["g"]/sum(count(window,1));
        fc=count(window,1)["c"]/sum(count(window,1));
        fcg=count(window,2)["cg"]/sum(count(window,2));
        CpGoe = fcg/(fc*fg);
        print(CpGoe);
    }
}
CpGbyRange(dengueseq, windowsize);
6.7: Store CpGoe in the i index of vectCpG, after you print the value!
library(seqinr);
#load sequence
dengue = read.fasta("dengue.fasta");
#grab only the sequence as a vector
dengueseq = getSequence(dengue$NC_001477.1);
windowsize = 1789;
CpGbyRange = function(sequence, windowsize){
    startpoints = seq(1, length(sequence)-1, windowsize);
    vectCpG = vector(length=length(startpoints));
    for (i in 1:length(startpoints)){
        window = sequence[startpoints[i]:(startpoints[i]+windowsize)];
        fg=count(window,1)["g"]/sum(count(window,1));
        fc=count(window,1)["c"]/sum(count(window,1));
        fcg=count(window,2)["cg"]/sum(count(window,2));
        CpGoe = fcg/(fc*fg);
        print(CpGoe);
        #store the value in the matching slot of vectCpG
        vectCpG[i]=CpGoe;
    }
}
#run the function on the dengue sequence
CpGbyRange(dengueseq, windowsize);
7.3: How would you subset the data to remove vs and am?
mtcars=mtcars[, c(1:7,10:11)];
7.4: Now that all of the data is the same type (numerical), how would you make this a matrix called
mtcars_matrix?
mtcars_matrix=as.matrix(mtcars);
7.6: How would you find some options for this function? What does center do? What does scale. do?
?prcomp
center:
a logical value indicating whether the variables should be shifted to be zero centered. Alternately, a
vector of length equal the number of columns of x can be supplied. The value is passed to scale.
scale.:
a logical value indicating whether the variables should be scaled to have unit variance before the
analysis takes place. The default is FALSE for consistency with S, but in general scaling is advisable.
Alternatively, a vector of length equal the number of columns of x can be supplied. The value is passed
to scale.
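Those two options are easy to see in action with base R's scale() on a tiny made-up matrix:

```r
# a small made-up matrix: two columns on very different scales
m = matrix(c(1, 2, 3, 10, 20, 30), ncol = 2);
# center only: subtract each column's mean, so column means become 0
centered = scale(m, center = TRUE, scale = FALSE);
print(colMeans(centered));
# center and scale: each column also divided by its sd, so column sds become 1
scaled = scale(m, center = TRUE, scale = TRUE);
print(apply(scaled, 2, sd));
```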
7.8: How would you check if ggbiplot is installed? How would you install it and enable it? Installing
from CRAN won’t work – what should you do next?
Google it! ggbiplot in R 3.4 works when installed from GitHub with devtools:
library(devtools);
install_github("vqv/ggbiplot");
library(ggbiplot);
7.9: How would you add the row names of mtcars as labels on the ggbiplot?
ggbiplot(mtcars.pca, labels=rownames(mtcars));
7.10: How would you get ggbiplot to use these groups and form ellipses around them?
ggbiplot(mtcars.pca, ellipse=TRUE, labels=rownames(mtcars), groups=mtcars.country);