Tuesday, 13 November 2018

Phylogenetics I: Zika virus outbreak investigation

This tutorial now runs in Docker. Instructions for setting up under MacOS are here and setting up under Ubuntu are here.

Before beginning the tutorial, create a working directory to store your results. Open a Terminal and type
WDIR=~/Downloads/phylogenetics
mkdir $WDIR

To start the tutorial, type in a Terminal (slightly different for Ubuntu: see here)
docker run -v $WDIR:/home -e DISPLAY=$(ifconfig en0 | grep 'inet ' | cut -f 2 -d' '):0 -it --name practical dannywilson/teaching-phylogenetics:march2020

During the tutorial, you will need to open multiple sessions. To start a new session open a new Terminal window and type
docker exec -it practical /bin/bash

To check the graphics are working correctly, type in your docker session
xclock &
which should open a new window showing a clock face with hour and minute hands.

Introduction

At the beginning of 2015, several cases of patients presenting symptoms of mild fever, rash, conjunctivitis and joint pain were reported by clinicians in South America. Investigation revealed infections with Zika virus, an obscure relative of dengue and chikungunya, that until the late 2000s had caused sporadic benign infections in Africa and Asia. Outbreaks emerged in Micronesia in 2007 and French Polynesia in 2013-14, during which an unusual increase in the incidence of severe neurological complications was first noticed. Over the course of 2015 and into 2016, clinicians in Brazil reported an upswing in the numbers of children born with microcephaly. The finger of blame was pointed towards Zika virus, a flavivirus, members of which group had been known to exhibit neurotropism and thereby carry the risk of causing neurological defects. As of March 2016, the epidemiological history and expectations for the ongoing spread of Zika virus were very uncertain.

The purpose of this practical is to demonstrate the tools available for phylogenetic analysis during early outbreak investigation. The aims are to address the following questions:
  • What was the geographic origin of the 2015/2016 Zika outbreak?
  • How was the virus related to previous Zika outbreaks?
  • When did the 2015/2016 Zika virus strain diverge from other variants?
  • Was there anything special about the 2015/2016 strain?
  • At what speed was Zika virus spreading through South America?
  • Could we have predicted the total number of cases?

Genome sequences

As of January 2016, only a modest number of Zika virus whole, or near-whole, genome sequences were available:

Click here for the Google Doc. The first step in the analysis is to obtain the whole genome sequences from Genbank using the accession numbers provided in publications or found through a Genbank search query.

In your web browser, navigate to https://www.ncbi.nlm.nih.gov/genbank/

☛ You can open Firefox in your docker session by typing firefox &

Copy the accession numbers from the spreadsheet and paste them into the query box, and click search. The 31 sequences will appear, split across two pages.



To show the sequences, click Summary, FASTA, and to download click Send to: File. (Leave Complete Record checked). By default, it will save the genomes to a file called sequence.fasta.txt in your working directory.

☛ Your working directory has different names from inside and outside the docker container. From inside docker, its location is /home. From outside docker it is wherever you set WDIR to reference.

Multiple sequence alignment

Phylogenetic reconstruction relies on the identification of character traits whose patterns of presence and absence is informative about shared evolutionary ancestry. In the past, many traits were used, but modern phylogenetics utilizes, almost exclusively, molecular sequence traits: DNA, RNA or protein sequences.

To compare any character across individuals, first homology must be established. Only traits that are homologous, i.e. directly co-inherited, can be used as the basis for valid comparison. Therefore the first challenge is to identify sequence homology between the virus genomes. By visualizing the Zika genomes in jalview, you will see they are all different lengths, partly because of differences in primers used for sequencing, partly because of mutations.

Open jalview by opening a docker session in Terminal and typing jalview &. (Ignore the annoying messages it opens.)

In jalview, open the genomes you downloaded via the File, Input Alignment, From File menu, and navigating to sequence.fasta.txt in your working directory, which is /home from inside docker.

You can assist visualization by selecting, from the menu, Colour, Nucleotide. Scanning through the sequences you will notice that there are similarities between some of the genomes, but it is very messy and inconsistent.


We're going to use mafft to align the sequences by identifying regions of sequence similarity. Sequences will be padded with indels (- characters, representing insertions or deletions) to improve the similarity within columns of the sequence alignment.

Open a docker session in the Terminal. Typing mafft -h will produce a bewildering list of all the mafft options. Multiple sequence alignment is a very difficult problem in bioinformatics and even the most popular aligners can only be considered heuristic at best. In this practical, I will give you a recipe for aligning the sequences. If you want to learn more, Google for mafft and read the online documentation as a starting point.

In docker, navigate to your working folder and run mafft as follows (increase the --thread option if you want to use more than two processors)
cd ~
mafft --thread 2 --anysymbol --bl 62 --op 1.53 --ep 0.123 --reorder --retree 1 --treeout --maxiterate 0 --localpair sequence.fasta.txt > genomes.mafft.fasta

☛ Some users have found they need to replace sequence.fasta.txt with sequence.fasta in the command above. Type cat sequence.fasta.txt  to check the file exists and is not empty.

This will take 10-20 minutes with two processors depending on your machine. Once the alignment is complete, you can open the genomes.mafft.fasta file in jalview. If the process says it was killed, you may need to increase the maximum system memory docker is allowed to use.

A good alignment between sequences that are genuinely homologous should
  • Maximize the percent identity in each column of the alignment
  • Produce a sequence alignment which is not spuriously longer than the original sequences
  • Not contain indels for the majority of sequences in most parts of the alignment
Between closely related individuals, the sequences should be identical for most of their length. The quality of the alignment is critical to all downstream analyses because they are based on the differences that remain between the sequences after alignment. These differences are phylogenetically informative, and they can be better visualized by selecting from the menu Colour, Percent Identity.


Basic alignment summaries

Before moving on to phylogenetic reconstruction, we're going to quantify the alignment quality in R. In a new docker session, type R and follow the instructions below.

#To load the functions for reading the alignment, type
source("https://raw.githubusercontent.com/danny-wilson/danny-wilson.github.io/main/files/fasta.R")
#Read in the alignment
setwd("~")
a = toupper(read.fasta("genomes.mafft.fasta",as.char=TRUE))

#Number of sequences 
nrow(a)

#Alignment length
ncol(a)

#Maximum sequence length
max(rowSums(a!='-'))

#Number of sites with one or more indel: for closely related sequences, most sites would not be expected to be indels, although differences in primers may exacerbate indel rates near the ends of the sequences
sum(colSums(a=='-')>0)

#Total frequency of indels in the alignment
mean(a=='-')

When I ran through the practical, the total alignment length was 10835 bases, only 0.2% longer than the longest sequence (10808 bases), indicating that mafft had found the sequences to be largely collinear, and had introduced indels only sparingly. Although 997 columns had at least one indel, only 3.1% of all bases were indels. These summary statistics indicated no major problems with the alignment process.

Tidying up the alignments

Nevertheless, sequence alignments can contain a variety of idiosyncrasies that are helpful to address prior to downstream analysis so they do not cause analysis artifacts. It is also helpful to tidy up the sequences before quantifying the sequence similarity of aligned bases.

#Nucleotide composition: should be in keeping with known base frequencies in the organism.
table(a)

#Because of sequencing quality limitations, there are often ambiguous base calls such as R (purine, A/G), Y (pyrimidine, C/T) and N (any), in addition to the indel symbol (-). Downstream analyses differ in their treatment of such ambiguity codes. However, they are all roughly equivalent in the sense that they convey little or no information as to what the base would have been. Therefore to simplify matters, it is reasonable to replace them all with 'N'. 
gd = a=='A' | a=='C' | a=='G' | a=='T' | a=='U' | a=='N'
a[!gd] = 'N'

#For RNA sequences, there can be inconsistency in whether uracil or its DNA counterpart thymine is reported. When sequences come from different sources, they might use different conventions. To simplify matters, replace all uracil codes with thymines.
a[a=='U'] = 'T'

#In downstream analyses, computational complexity strongly depends on the number of different phylogenetic patterns, i.e. the occurrence of As/Cs/Gs/Ts/Ns across individuals in a single column of the alignment. Ambiguous base calls that are scattered throughout the alignment fairly randomly can massively increase the number of different phylogenetic patterns. Particularly for columns that are otherwise invariant, this added cost comes with no benefit because only variable bases are phylogenetically informative. Therefore, it is reasonable at invariant sites to replace Ns with the consensus base.
cons = apply(a,2,function(x) names(sort(table(ifelse(x=='N',NA,x)),decreasing=TRUE))[1])
is.variable = apply(a,2,function(x) nlevels(factor(ifelse(x=='N',NA,x)))>1)
a[,!is.variable] = sapply(which(!is.variable),function(j) ifelse(a[,j]=='N',cons[j],a[,j]))

#The sequence labels in Genbank are often long and contain unusual characters that can upset downstream analysis programs. Truncate them at the first space
lab = sapply(rownames(a),function(s) unlist(strsplit(s,' '))[1])

#Write the modified sequences in FASTA format
rownames(a) = lab
write.fasta(a,"genomes.relab.fasta")


#One of the bugbears of bioinformatics is file format conversion. Many programs use different file formats. In the phylogenetics setting, PHYLIP format is common. For later use, we're also going to write the sequences in PHYLIP format.
outfile = "genomes.relab.phylip"
nc = max(nchar(lab))
lab = sapply(lab,function(s) paste0(s,paste0(rep(" ",nc-nchar(s)),collapse="")))
cat(nrow(a),ncol(a),file=outfile)
cat("\n",file=outfile,append=TRUE)
for(i in 1:nrow(a)) {
  cat(lab[i],paste0(a[i,],collapse=""),file=outfile,append=TRUE)
  cat("\n",file=outfile,append=TRUE)
}


Basic diversity measures

As a way of familiarizing yourself with the sample, it is helpful to quantify the relatedness of the individuals in terms of nucleotide similarity and total number of variable sites. Basic measures of pairwise genetic diversity between individuals also provide the starting point for distance-based phylogenetic tree building.

#Number of variable sites: for closely related sequences, most sites would not be expected to vary
(S = sum(colSums(t(t(a)!=cons))>0))

#Mean pairwise diversity: the mean, across all pairs of sequences, of the proportion of the sequence at which they vary. For closely related sequences, should be much less than 0.1
a2i = c('A'=1,'C'=2,'G'=3,'T'=4)
i = matrix(a2i[a],nrow(a),ncol(a))
DIST = rowMeans(apply(i,2,dist)>0,na.rm=TRUE)
(PI = mean(DIST))

#Distribution of pairwise diversity across sequences
summary(DIST)

The number of variable sites (4362/10835) and the mean pairwise diversity (0.087) are perhaps on the high side. Since the alignment quality metrics gave no cause for concern, this could indicate the presence of one or more outlier genomes in the sample, or it might simply reflect the high inherent diversity of the viruses.

This practical continues on the next page

No comments:

Post a Comment