journal.pone.0243927
journal.pone.0243927
journal.pone.0243927
RESEARCH ARTICLE
1 Institute of Natural and Applied Sciences, Akdeniz University, Antalya, Turkey, 2 Department of Agricultural
Biotechnology, Faculty of Agriculture, Akdeniz University, Antalya, Turkey, 3 Department of Biostatistics and
Medical Informatics, Faculty of Medicine, Akdeniz University, Antalya, Turkey
a1111111111
a1111111111 ☯ These authors contributed equally to this work.
¤ Current address: Department of Agricultural Biotechnology, Faculty of Agriculture, Akdeniz University,
a1111111111
Antalya, Turkey
a1111111111 * ubilge@akdeniz.edu.tr
a1111111111
Abstract
OPEN ACCESS Phylogenetic analyses can provide a wealth of information about the past demography of a
population and the level of genetic diversity within and between species. By using special
Citation: Toparslan E, Karabag K, Bilge U (2020) A
workflow with R: Phylogenetic analyses and computer programs developed in recent years, large amounts of data have been produced
visualizations using mitochondrial cytochrome b in the molecular genetics area. To analyze these data, powerful new methods based on
gene sequences. PLoS ONE 15(12): e0243927. large computations have been applied in various software packages and programs. But
https://doi.org/10.1371/journal.pone.0243927
these programs have their own specific input and output formats, and users need to create
Editor: Michael Scott Brewer, East Carolina different input formats for almost every program. R is an open source software environment,
University, UNITED STATES
and it supports open contribution and modification to its libraries. Furthermore, it is also pos-
Received: July 20, 2020 sible to perform several analyses using a single input file format. In this article, by using the
Accepted: November 30, 2020 multiple sequences FASTA format file (.fas extension) we demonstrate and share a work-
Published: December 15, 2020 flow of how to extract haplotypes and perform phylogenetic analyses and visualizations in
R. As an example dataset, we used 120 Bombus terrestris dalmatinus mitochondrial cyto-
Peer Review History: PLOS recognizes the
benefits of transparency in the peer review chrome b gene (cyt b) sequences (373 bp) collected from eight different beehives in Antalya.
process; therefore, we enable the publication of This article presents a short guide on how to perform phylogenetic analyses using R and
all of the content of peer review and author RStudio.
responses alongside final, published articles. The
editorial history of this article is available here:
https://doi.org/10.1371/journal.pone.0243927
Competing interests: The authors have declared analyses can be performed in a single framework. Although R (programming language) is a
that no competing interests exist. software environment for statistical computing and graphics, it is increasingly used in bioin-
formatics and phylogenetic data analysis thanks to advanced packages and libraries [2–11].
R is an environment for linear and nonlinear modeling, classical statistical tests, time-series
analysis, classification, clustering [10] and graphics. It is free and it enables static and dynamic
program analyses [12]. On the other hand, RStudio is an open-source Integrated Development
Environment (IDE) for the R programming language. RStudio is a software that combines var-
ious components of R, such as console, resource editing, graphics, history, help, in one work-
bench [13].
R and RStudio create an R Markdown document provided by the rmarkdown package,
which can store all code snippets, analyses, results, and images in a document [14]. With knitr
package, a new Markdown file is created and converted into different file formats such as PDF,
HTML, Word etc. by using pandoc() function [15]. Additionally, with R Markdown, jour-
nal articles and multi-part books can be written, and websites and blogs can be generated [14].
Therefore, they provide a wide range of options and are quite practical.
One of the strongest biomarkers used to estimate phylogenetic relationships is also mito-
chondrial DNA. It is frequently used in phylogenetic research and it is possible to group indi-
viduals as haplotypes by defining variations in the mtDNA for every population. Moreover, a
haplotype network based on nucleotide differences between haplotypes can be created [16].
We demonstrate how to perform phylogenetic analyses and graphics in a single workflow
using R for mtDNA sequences. Additionally, we have shared all the R commands in these anal-
yses for everyone to use. Furthermore, some of these commands can be used directly, and
some are in modifiable form for users who have samples with different sequence lengths and
numbers.
aligned objects, calculating distance matrices, performing MDS analysis, and visualizing
results [8].
adegenet: Exploratory analysis of genetic and genomic data. adegenet package (version
2.1.3) is a toolset to explore genetic and genomic data. It generates class "genind" for hierar-
chical population structure, class "genpop" for alleles counts by populations, and class
"genlight" for genome-wide SNP data [5].
The fasta2DNAbin() command reads FASTA format files and outputs a DNAbin
object containing either the full alignments or only SNPs. At the same time, this command
processes the massive datasets with its memory-efficiency [5].
ape: Analyses of phylogenetics and evolution. The ape (version 5.3) is a package that can
read, write, plot and manipulate phylogenetics data. Moreover, it has many functions such as
comparing these data in a phylogenetic framework, character analysis of ancestors, reading,
and writing nucleotide sequences [9]. The dist.dna() function calculates distances from
DNA sequences by computing a matrix of pairwise distances from DNA sequences [9]. Also,
the nj() function performs the neighbor-joining tree estimation of Saitou and Nei [18]. The
boot.phylo() function performs the bootstrap analysis, and it can be used with the
ggtree() command for visualization of the phylogenetic tree with bootstrap values. Addi-
tionally, it can extract data from Bioconductor and work together with adegenet and pegas
packages [2, 7].
ggtree: An R package for visualization of tree and annotation data. The ggtree, R/Bio-
conducter package (version 2.0.4), is created for visualization of phylogenetic analysis such as
annotation of the phylogenetic tree and other phylogenetic relationship structures. The
ggtree() command is visualizing the phylogenetic tree as a tree object or as a phylo object
by the as.phylo() command [11].
ggplot2: Create elegant data visualizations using the grammar of graphics. The ggplot2
package (version 3.3.1) that is the extension of ggtree is a package that declaratively generates
graphics based on Graphics Grammar [11, 19]. It explains how to match variables with aesthet-
ics and which graphical principles to use. It provides a better plotting of the graphics obtained
with the ggtree package with a set of layers such as geom_tiplab() or geom_treescale
() [11, 20].
stats-package: The R stats package. This package (version 3.6.3), which includes com-
mands for statistical calculations and random number generation, provides methods for hier-
archical cluster analysis based on a set of dissimilarities. The dist() command calculates the
distances between the lines of a data matrix. It can use six different distance measures which
are "euclidean", "maximum", "manhattan", "canberra", "binary" or "min-
kowski". The heatmap() function it provides creates a heat map using the distance [10].
haplotypes: Manipulating DNA sequences and estimating unambiguous haplotype net-
work with statistical parsimony. The haplotypes package (version 1.1.2) reads and manipu-
lates aligned DNA sequences, supports indel coding methods, shows base substitutions and
indels, calculates absolute pairwise distances between DNA sequences. It provides or infers
haplotypes by using identical DNA sequences or absolute pairwise character difference matrix.
Furthermore, this package gives genealogical relationships among haplotypes using estimation
of statistical parsimony and plots its networks [3].
pegas: Population and evolutionary genetics analysis system. The pegas package (ver-
sion 0.13) provides commands for reading, writing from different DNA sequences files includ-
ing from VCF files. It generates plots, analyzing and manipulating allelic and haplotypic data.
It requires packages ape and adegenet, making an integrated environment for population
genetic data analysis. Additionally, it realizes the analysis of basic statistics, linkage
disequilibrium, Fst and Amova, HWE, haplotype networks, minimum spanning tree and net-
work, and median-joining networks [2].
Extraction of haplotypes
We wrote dynamic short R commands to find out information about haplotypes and sequence
variations. Firstly, we converted the DNAbin object to the DNA matrix (120x373) using the
as.matrix() command provided by R base package [10]. Secondly, by comparing the
sequences, we extracted the haplotype number, haplotype frequency and variable regions.
Thirdly, we identified unique haplotype sequences by ignoring common nucleotides between
haplotypes and by printing variable regions.
The number of haplotypes per population was calculated using haplotypes package and
short R commands [3]. Firstly, DNAbin object was converted to an object of class "DNA"
using the as.dna() command which is provided by haplotypes package. Then haplotypes
were extracted and grouped using the haplotype() and grouping() commands, respec-
tively [3]. Finally, the population frequency matrix was extracted.
Haplotype network
The haplotype() and haploNet() functions implemented in pegas package were used
for the construction of the haplotype network [2]. In this section, we wanted to show that data
in R can be modified quickly and easily, creating multiple options for analysis. For this reason,
we have shown three different haplotype graphs that were represented with different colors as
hierarchical using the same data set. Thus, we have created options for those working both in
individual datasets and those working with larger populations or groups. While the first haplo-
type network was represented by individuals, the second haplotype network was represented
by populations and the third haplotype network was represented by groups. All haplotype net-
works were also plotted in different colors.
The haplotype network represented by individuals has been colored using rainbow colors
defined as the default and the names and colors of the samples were described using fill
argument in the legend() command, as below.
plot(net, size = attr(net, "freq"), scale.ratio = 2, cex = 0.6,
labels = TRUE, pie = ind.hap, show.mutation = 1, font = 2,
fast = TRUE)
legend(x = 57,y = 15, colnames(ind.hap), fill = rainbow(ncol(ind.
hap)),
cex = 0.52, ncol = 6, x.intersp = 0.2, text.width = 11)
We chose special colors for the haplotype network represented by the populations. For the
haplotype network, the desired colors were defined as a list in bg argument in the plot()
command, as below.
bg <- c(rep("dodgerblue4", 15), rep("olivedrab4",15),
rep("royalblue2", 15), rep("red",15), rep("olivedrab3",15),
rep("skyblue1", 15), rep("olivedrab1", 15),
rep("darkseagreen1", 15))
plot(net, size = attr(net, "freq"), bg = bg, scale.ratio = 2,
cex = 0.7,
labels = TRUE, pie = ind.hap,show.mutation = 1, font = 2,
fast = TRUE)
The names and colors of samples were described as a list in fill argument in the legend()
command, as below.
hapcol <- c("Aksu", "Demre", "Kumluca", "Firm", "Bayatbadem",
"Geyikbayir", "Phaselis", "Termessos")
ubg < -c(rep("dodgerblue4",1), rep("royalblue2",1),
rep("skyblue1",1),
rep("red",1), rep("olivedrab4",1), rep("olivedrab3",1),
rep("olivedrab1",1), rep("darkseagreen1",1))
legend(x = -35,y = 45, hapcol, fill = ubg, cex = 0.8, ncol = 1, bty =
"n",
x.intersp = 0.2)
For the construction of the haplotype network represented by groups, each individual has
been renamed with the name of the group to which it belongs. The sample names in the DNA-
bin object were replaced with the group names to which they belong with a few simple com-
mands, and the haplotype network represented by the groups was plotted. The desired color
set for the network diagram was defined in a list for the gbg argument in the plot() com-
mand, as below.
gbg <- c(rep("red"), rep("blue"), rep("green"))
plot(netg, size = attr(netg, "freq"), bg = gbg, scale.ratio = 2,
cex = 0.7,
labels = TRUE, pie = ind.hapg, show.mutation = 1, font = 2,
fast = TRUE)
Colors of the groups were defined as a list in fill argument in the legend() command,
as below.
legend(x = -35,y = 45, colnames(ind.hapg), fill = c
("red","blue","green"),
cex = 0.8, ncol = 1, bty = "n", x.intersp = 0.2)
Phylogenetic trees
We demonstrated the circular phylogenetic tree by using ggtree, ggplot2, ape, and stats pack-
ages [7, 10, 11]. To construct the phylogenetic tree, the dist.dna() and nj() commands
were used supported by stats package. We have shown two circular phylogenetic trees. In the
first tree, populations have been colored using the aes(color = Populations) com-
mand inherited from ggtree() and were drawn using ggplot2 package, as below.
emos <- ggtree(tree, layout = ‘circular’,
branch.length = ‘branch.length’, lwd = 0.5) +
xlim(-0.1, NA)
groupOTU(emos, krp, ‘Populations’) +
aes(color = Populations) +
theme(legend.position = “right”) +
geom_tiplab(names(nbin), cex = 1.7, offset = 0.002) +
guides(color = guide_legend(override.aes = list(size = 2.5))) +
Results
Multiple sequence alignment and plotting aligned FASTA format file
The mitochondrial cty b with length 373 base pairs belonging to 120 Bombus terrestris dalmati-
nus was aligned with the ClustalW method using the msa package [4]. The aligned sequences
were visualized by matching them with the phylogenetic tree by using the ggtree() and
msaplot() commands [11]. In Fig 1, the branch lengths of the tree have been colored to
represent the genetic distance. Each of the nucleotides was represented by a different color.
Thus, color changes on the plot have revealed nucleotide differences between the samples.
Fig 1. Plotting the multiple sequence alignments with a phylogenetic tree. The phylogenetic tree belonging to the mitochondrial cty b gene of 120 Bombus
terrestris dalmatinus and nucleotide differences have been schematized. Colors both on the phylogenetic tree and on the branch scale represent genetic distance.
mtDNA cyt b sequences have been colored by four different colors shown in the ‘seq’ column for each nucleotide. Color changes on the aligned sequences
represent nucleotide differences.
https://doi.org/10.1371/journal.pone.0243927.g001
The distance scale has shown 0.6% genetic variation per nucleotide substitution. Each row
of the multiple sequence alignments corresponding to the names at the end of the tree repre-
sents a complete sequence.
Extraction of haplotypes
Haplotype sequences were obtained from the DNA matrix object by comparing each row with
rows and each column with columns. As shown in Table 1, when nucleotides were compared,
the same nucleotides between haplotypes were expressed with dots while different nucleotides
were directly written.
The number of haplotypes per population was extracted using haplotypes package and short
R commands. The most common haplotype was H11 that was submitted on NCBI (GenBank:
MH221884.1), while the H16 and H14 haplotypes were observed in only one sample. While
https://doi.org/10.1371/journal.pone.0243927.t001
the Firm population has seven different haplotypes, only two haplotypes were detected in the
Demre, Kumluca, and Termessos populations (Table 2).
Haplotype network
Haplotype networks were constructed using pegas package [2]. Circle sizes were provided to
the plot() command using size = attr(x, "freq") argument; in this case based on
the number of individuals each hapolotype has [10]. Each link between the haplotypes is a dis-
tance that showed the number of nucleotides between the two haplotypes. Every line on the
links is a representation of one nucleotide. We demonstrated three haplotype networks repre-
sented by individuals (Fig 3), populations (Fig 4a) and groups (Fig 4b).
https://doi.org/10.1371/journal.pone.0243927.t002
Fig 2. Heat map based on the number of nucleotide differences between the haplotypes. Each branch of the phylogenetic tree represents the
corresponding haplotype in the matrix. We defined the close relationships with "darkred" color and far relationships with "white" color.
https://doi.org/10.1371/journal.pone.0243927.g002
In Fig 3, indivudals have been colored rainbow colors as default by the plot() command. In
Fig 4a, the populations representing the greenhouse group have been colored in 3 different
shades of blue, the Firm population belonging to the firm group was colored in red, and the
populations of the nature group have been colored in 4 different shades of green. In Fig 4b,
samples belonging to the greenhouse, firm, and nature groups have been colored blue, red and
green, respectively. Therefore, each pie chart here represents individuals belonging to that
group as a whole.
In Figs 3 and 4a, the largest circle in the center represents H11. Since the circles are schema-
tized in the form of a pie chart according to the number of individuals, the name of the H11
Fig 3. Haplotype network represented by individuals. Each slice in the circles represents an individual. Branch lengths between the circles have been represented by
genetic distance between haplotypes. Every dash on the lines is a representation of one nucleotide. The rainbow() command was used for coloring the individuals.
Haplotype network was demonstrated using the plot() command and names of individuals were added with the legend() command.
https://doi.org/10.1371/journal.pone.0243927.g003
haplotype does not appear in the small-sized images. But it looks quite clear in high resolution
and large scale images or in the form of the larger pie chart (Fig 4b). 20 haplotypes with 71
links were determined. The closest haplotypes were H9-H10-H11; H12-H15, and H15-H16,
while the most distant haplotypes were the H5-H14 haplotypes (Figs 3 and 4). Every dash on
Fig 4. Haplotype networks for populations (a) and groups (b). (a) Populations have been colored in blue, red, and green representing the groups (greenhouse, firm
and nature) they belong to. (b) Groups have been colored compatibly with the population haplotype network. Circle sizes were calculated based on the number of
individuals they had. Each slice in the circles represents an individual. Every dash on the lines is a representation of one nucleotide. To visualize the two plots at the
same time, the plot(new, TRUE) command was used after the first plot command and was continued with the second plot command.
https://doi.org/10.1371/journal.pone.0243927.g004
Fig 5. Colorized phylogenetic tree representing populations. The distance scale has shown 0.6% genetic variation per
nucleotide substitution.
https://doi.org/10.1371/journal.pone.0243927.g005
the lines is a representation of one nucleotide. The circles have been drawn according to their
number of samples using "size = attr(net, "freq")" command. Therefore, the
dashes (H11-H7; H11-H8, H11- H9, H11-H10) around the largest circle (H11) can’t be seen. If
this command is removed and run again, all the mutation difference numbers can be seen as
the dashes on the lines.
Phylogenetic trees
Phylogenetic trees were constructed using ggtree and ggplot2 packages. Tree estimation was
calculated by the neighbor-joining method supported by ape package. While phylogenetic tree
in Fig 5 was colored to represent populations, in Fig 6 it was colored to represent genetic dis-
tance. The biggest clade in both Figs 5 and 6 consisted of samples from Kumluca (12), Geyik-
bayir (6), Firm (4), Phaselis (7), Demre (13), Bayatbadem (6), and Aksu (7). Mostly, Aksu and
Firm samples created more than one different phylogenetic clades.
We demonstrated the phylogenetic relationship between haplotypes using the bootstrap
method (Fig 7). The distance was estimated by the Hamming distance method of nucleotide
differences between the two sequences [2]. The confidence interval was defined as strong for
85% and above, moderate for 70–85%, weak for 50–70%, and poor for 50% and below [25].
The bootstrap values were specified by coloring according to these confidence intervals.
Discussion
There are many software packages available today that compute and visualize population
genetic statistics and phylogenetics. Some of the most used software packages are MEGA,
DnaSP, splitsTree, TASSEL and Arlequin [1, 2]. With the tremendous advancements in
Fig 6. Colorized phylogenetic tree representing DNA distance. The distance scale has shown 0.6% genetic variation
per nucleotide substitution.
https://doi.org/10.1371/journal.pone.0243927.g006
Fig 7. Neighbor-joining (NJ) tree for cyt b haplotypes of B. terrestris dalmatinus. Colored internal nodes represent the bootstrap confidence level.
https://doi.org/10.1371/journal.pone.0243927.g007
technology in recent years, large amounts of molecular genetic data have been obtained. To
process and analyze these data, there is a need for powerful computer software and hardware.
Almost every software environment has its own specific input and output file format. While
some analyses can be completed in one program, one software or in one workflow, the major-
ity of them can only be completed in combined software packages. These multiple workflows
create an imperative to convert inter-program input/output format changes, which is some-
times difficult and mostly time-consuming. Here we show phylogenetic relationships and sta-
tistical findings in a single workflow using R, on a sample mtDNA dataset. By sharing all the
commands we use, we aim to present a ready-made format for researchers working in this
field. Thus, by using only one FASTA format file as input, we are able to output multiple
sequence alignments, haplotype sequences, heat map, haplotype networks, and phylogenetic
trees. We have demonstrated the use and plotting of R in phylogenetic analyses, using both
packages and R codes, taking advantage of R’s free language and free license. We tried to make
some commands as modifiable as possible. One of these was the creation of a haplotype net-
work [2]. We shared three haplotype networks representing individuals, populations, and
groups. Some arguments such as coloring or plot scaling can be modified in accordance with
other data sets. We demonstrated three different colored phylogenetic trees representing the
populations, DNA distance and haplotypes bootstrap tree [11, 20]. Likewise, by changing the
commands we use for phylogenetic trees, arguments such as colors, tree type, and graphic scal-
ing can be modified. We wanted to show that R can be used for researchers who study mtDNA
and are interested in phylogenetics. While input-output files are frequently needed in phyloge-
netic analysis software, we showed that 8 different outputs can be obtained from a single .fas
extension file and shared all the packages, libraries and commands we used in these analyses.
At the same time, we recommend using RStudio for visualization studies, with features such as
easier editing of the code with the source pane, easier manipulation of visuals with the plot
pane, and to be a reminder for arguments or definitions. Consequently, the haplotypes
sequences, heat map, haplotype networks, and phylogenetic trees gave results that are
completely compatible with each other. We believe that R and RStudio will be increasingly
used in phylogenetic analyses and visualization due to the fact that it is an open source and
always up-to-date environment, free for all as well as open for researcher contributions such as
this one.
Supporting information
S1 Appendix. R codes. It is for phylogenetic analyses by using FASTA format file.
(TXT)
S2 Appendix. 120 mitochondrial cyt b sequences of B. terrestris dalmatinus. It is an example
dataset as FASTA format sequences file.
(FAS)
Acknowledgments
The data used in this project were provided by the project numbered FYL-2016-1502, sup-
ported by the Scientific Research Projects Coordination Unit of Akdeniz University. We
would like to thank Philippa Price for proofreading this article.
Author Contributions
Conceptualization: Emine Toparslan, Kemal Karabag, Ugur Bilge.
References
1. Excoffier L, Heckel G. Computer programs for population genetics data analysis: a survival guide. Nat
Rev Genet. 2006; 7(10): 745–758. https://doi.org/10.1038/nrg1904 PMID: 16924258
2. Paradis E. pegas: an R package for population genetics with an integrated–modular approach. Bioinfor-
matics. 2010; 26(3): 419–420. https://doi.org/10.1093/bioinformatics/btp696 PMID: 20080509
3. Aktas C. haplotypes: Manipulating DNA Sequences and Estimating Unambiguous Haplotype Network
with Statistical Parsimony. [Internet]. 2020. https://cran.r-project.org/web/packages/haplotypes/index.
html
4. Bodenhofer U, Bonatesta E, Horejs-Kainrath C, Hochreiter S. msa: an R package for multiple sequence
alignment. Bioinformatics. 2015; 31(24): 3997–3999. https://doi.org/10.1093/bioinformatics/btv494
PMID: 26315911
5. Jombart T, Kamvar ZN, Collins, C, Lustrik, R, Beugin, M, Knaus BJ, et al. adegenet: Exploratory Analy-
sis of Genetic and Genomic Data. The Comprehensive R Archive Network. 2020 [cited 2020 Apr 10].
https://cran.r-project.org/web/packages/adegenet/adegenet.pdf
6. Jombart T. adegenet: a R package for the multivariate analysis of genetic markers. Bioinformatics.
2008; 24(11): 1403–1405. https://doi.org/10.1093/bioinformatics/btn129 PMID: 18397895
7. Paradis E, Claude J, Strimmer K. APE: analyses of phylogenetics and evolution in R language. Bioinfor-
matics. 2008; 20(2): 289–290.
8. Pele J, Becu JM, Abdi H, Chabbert M. Bios2mds: an R package for comparing orthologous protein fami-
lies by metric multidimensional scaling. BMC Bioinformatics. 2012; 13(1):133–139.
9. Popescu AA, Huber KT, Paradis E. ape 3.0: New tools for distance-based phylogenetics and evolution-
ary analysis in R. Bioinformatics. 2012; 28(11): 1536–1537. https://doi.org/10.1093/bioinformatics/
bts184 PMID: 22495750
10. R Core Team. 2019. R: A language and environment for statistical computing [Internet]. Vienna, Aus-
tria: R Foundation for Statistical Computing; 2019. http://www.R-project.org
11. Yu G, Smith DK, Zhu H, Guan Y, Lam TTY. ggtree: an R package for visualization and annotation of
phylogenetic trees with their covariates and other associated data. Methods Ecol Evol. 2017; 8(1): 28–
36.
12. Morandat F, Hill B, Osvald L, Vitek J. Evaluating the design of the R language. In: Noble J, editor.
Objects and Functions for Data Analysis. Berlin: Springer; 2012. pp. 104–131.
13. Racine JS. RStudio: a platform-independent IDE for R and Sweave. J Appl Econ. 2012; 17: 167–172.
14. Xie Y, Allaire JJ, Grolemund G. R markdown: The definitive guide [Internet]. 2018. Boca Raton, Florida:
Chapman; Hall/CRC; 2018. https://bookdown.org/yihui/rmarkdown
15. Xie Y. knitr: A General-Purpose Package for Dynamic Report Generation in R [Internet]. 2020. https://
cran.r-project.org/web/packages/knitr/index.html
16. Gupta A, Bhardwaj A, Sharma P, Pal Y. Mitochondrial DNA-a tool for phylogenetic and biodiversity
search in equines. J Biodivers Endanger Species. 2015; S1: 006.
17. Pagès H, Gentleman R, Aboyoun P, DebRoy S. Biostrings: String objects representing biological
sequences, and matching algorithms [Internet]. 2020. https://bioconductor.org/packages/release/bioc/
html/Biostrings.html
18. Saitou N, Nei M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol
Biol Evol. 1987; 4(4): 406–425. https://doi.org/10.1093/oxfordjournals.molbev.a040454 PMID: 3447015
19. Wickham H. ggplot2. Wiley Interdiscip Rev Comput Stat. 2011; 3(2): 180–185.
20. Yu G. Using ggtree to Visualize Data on Tree-Like Structures. Curr Protoc Bioinformatics. 2020; 69(1):
e96. https://doi.org/10.1002/cpbi.96 PMID: 32162851
21. Heibl C. ips: Interfaces to phylogenetic software in R [Internet]. 2019. https://cran.r-project.org/web/
packages/ips/index.html
22. Kimura M. Estimation of evolutionary distances between homologous nucleotide sequences. PNAS.
1981; 78: 454–458. https://doi.org/10.1073/pnas.78.1.454 PMID: 6165991
23. Pinheiro HP, de Souza Pinheiro A, Sen PK. Comparison of genomic sequences using the Hamming dis-
tance. J Stat Plan Infer. 2005; 130(1–2): 325–339.
24. Wang LG, Lam TTY, Xu S, Dai Z, Zhou L, Feng T, et al. Treeio: an R package for phylogenetic tree
input and output with richly annotated and associated data. Mol Biol Evol. 2019; 37(2): 599–603.
25. Kress WJ, Prince LM, Williams KJ. The phylogeny and a new classification of the gingers (Zingibera-
ceae): evidence from molecular data. Am J Bot. 2002; 89(10): 1682–1696. https://doi.org/10.3732/ajb.
89.10.1682 PMID: 21665595