Inferring Ethnicity From Mitochondrial DNA Sequence: Proceedings Open Access


Lee et al.

BMC Proceedings 2011, 5(Suppl 2):S11


http://www.biomedcentral.com/1753-6561/5/S2/S11

PROCEEDINGS Open Access

Inferring ethnicity from mitochondrial DNA sequence
Chih Lee1*, Ion I Măndoiu1*, Craig E Nelson2*
From 6th International Symposium on Bioinformatics Research and Applications (ISBRA’10)
Storrs, CT, USA. 23-26 May 2010

Abstract
Background: The assignment of DNA samples to coarse population groups can be a useful but difficult task. One
such example is the inference of coarse ethnic groupings for forensic applications. Ethnicity plays an important role
in forensic investigation and can be inferred with the help of genetic markers. Being maternally inherited, present in high
copy number, and robustly persistent in degraded samples, mitochondrial DNA may be useful for inferring coarse
ethnicity. In this study, we compare the performance of methods for inferring ethnicity from the sequence of the
hypervariable region of the mitochondrial genome.
Results: We present the results of comprehensive experiments conducted on datasets extracted from the mtDNA
population database, showing that ethnicity inference based on support vector machines (SVM) achieves an overall
accuracy of 80-90%, consistently outperforming nearest neighbor and discriminant analysis methods previously
proposed in the literature. We also evaluate methods of handling missing data and characterize the most
informative segments of the hypervariable region of the mitochondrial genome.
Conclusions: Support vector machines can be used to infer coarse ethnicity from a small region of mitochondrial
DNA sequence with surprisingly high accuracy. In the presence of missing data, utilizing only the regions common
to the training sequences and a test sequence proves to be the best strategy. Given these results, SVM algorithms
are likely to also be useful in other DNA sequence classification applications.

* Correspondence: chihlee@engr.uconn.edu; ion@engr.uconn.edu; craig.nelson@uconn.edu
1 Computer Science and Engineering Department, University of Connecticut, Storrs, CT, USA
2 Molecular and Cell Biology Department, University of Connecticut, Storrs, CT, USA
Full list of author information is available at the end of the article

© 2011 Lee et al; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons
Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in
any medium, provided the original work is properly cited.

Introduction
Human ethnic identity is a controversial and complex topic. Each human individual is a complex mosaic of
genetic material originating from a multitude of ancestral sources. However, despite this complexity, the
division of humans into coarse ethnic groupings can greatly assist forensic investigators and is also
increasingly being used as a predictor of drug effectiveness in the emerging fields of personalized medicine
and race-based therapeutics. Self-reported and investigator-assigned ethnicity typically rely on the
subjective interpretation of a complex combination of both genetic and non-genetic information including
behavior, cultural and societal norms, skin color, and other influences. For this reason, attempts to
accurately infer probable coarse ethnic identity can be difficult in contexts with limited access to the
most informative markers, such as skin and hair samples. In these situations genetic information can be
extremely valuable to forensic pursuits, significantly enhancing the accuracy of coarse ethnic
classification.
Several approaches to genetic-based inference of ethnicity have been proposed in the literature. In
particular, the use of panels of autosomal markers has been shown to provide excellent accuracy for
assigning samples to specific clades [1,2]. Unfortunately, these approaches rely on typing large numbers of
autosomal loci that may not survive long periods of degradation. Mitochondrial DNA, however, due to its
high copy number, is recoverable even from minute or highly
degraded samples. Furthermore, due to its high polymorphism and maternal inheritance, mitochondrial DNA has
proved to be an excellent marker for the inference of ethnic affiliation. Indeed, several studies including
[3-5] have previously shown the feasibility of inferring the probable ethnicity and/or geographic origin
from the sequence of the hypervariable region (HVR) of the mitochondrial genome. These studies clearly
demonstrate that, although the mitochondrial sequence alone does not by itself determine one's ethnicity,
the two are nevertheless strongly associated.
In this paper we test the utility and robustness of several methods for the classification of HVR
mitochondrial sequences into coarse ethnic groups as previously assigned by investigators from the FBI,
self-assigned by study subjects, or by anthropologists. The goal was to identify a method that could most
accurately reproduce these classifications using only a small region of the mitochondrial genome. Like
Egeland et al. [5], we consider a supervised learning approach to ethnicity inference. In this setting,
mtDNA sequences with annotated ethnicity are used to "train" a classification function that is then used to
assign ethnicities to new mtDNA sequences. Adopting this approach allows us to draw on the large body of
knowledge developed within the machine learning community (see, e.g., [6]). The main goal of the paper is
to assess the performance of four well-known classification algorithms (support vector machines, linear
discriminant analysis, quadratic discriminant analysis, and nearest neighbor) on a variety of benchmark
datasets including realistic levels of missing data and training data bias.
Comprehensive experiments conducted on mtDNA profiles extracted from the mtDNA population database [7] show
that the support vector machine algorithm is the most accurate of the compared methods, outperforming both
discriminant analysis methods previously employed in [3-5] as well as a nearest neighbor algorithm similar
to that used for haplogroup inference in [8]. In both cross-validation and experiments conducted on
independently collected training and test data, SVM achieves an overall accuracy of 80-90%, matching the
accuracy of human experts making ethnicity assignments based on physical measurements of the skull and
large bones [9,10], and coming close to the accuracy achieved by using approximately sixty autosomal loci
[11]. These results demonstrate that SVM effectively classifies sequences from a small segment of the
mitochondrial genome and that these classifications can be used to predict the probable assignment of
coarse ethnicity with reasonable accuracy. The superiority of SVM in this classification problem suggests
that it is also likely to be superior in similar sequence classification applications.

Methods
In this section, we introduce the four methods of ethnicity assignment investigated in this study and the
datasets used to evaluate their empirical performance. We begin by briefly introducing principal component
analysis (PCA), a dimensionality reduction technique used as a preprocessing step for three of the four
methods. We then describe the four classification algorithms: support vector machines (SVM), linear
discriminant analysis (LDA), quadratic discriminant analysis (QDA) and 1-nearest neighbor (1NN). Finally,
we describe the datasets used for evaluation, the conversion of mtDNA sequence profiles into feature
vectors, and methods of encoding sequences with missing regions.

Principal component analysis
PCA (see [6] for an introduction) is a factor analysis technique for dimensionality reduction. Given m
samples over n variables, the m samples can be represented as an m × n matrix $X$. We further assume that
the sample mean of each variable is 0, that is, $\sum_{i=1}^{m} X_{ij} = 0$ for every j. Projecting the m
samples onto n new axes yields another m × n matrix $Y = XP$, where $P$ is an n × n orthogonal matrix whose
columns are unit vectors defining the n new axes. PCA finds a $P$ such that the sample covariance matrix of
the n new variables is a diagonal matrix, that is,

$$\Sigma_Y = \frac{1}{m} Y^{T} Y = \frac{1}{m} (XP)^{T} XP = P^{T} \Sigma_X P = D, \qquad (1)$$

where $D$ is a diagonal matrix, and $\Sigma_X$ and $\Sigma_Y$ are the sample covariance matrices of the
original and new variables, respectively. The orthogonal matrix $P$ can be easily obtained by eigenvalue
decomposition of $\Sigma_X$. PCA is a dimensionality reduction technique in that only k of the n new
variables are kept for further analysis. A standard approach is to pick the k variables with the largest
sample variances. Therefore, all we need to do is to pick the value of k. Fortunately, when PCA is used in
conjunction with supervised learning algorithms such as classification algorithms, the best value of k can
be selected by performing cross-validation. In this study, k was selected by performing 5-fold
cross-validation (CV) on the training data for each combination of dataset and classification algorithm.

Classification algorithms
Support vector machines
The SVM [12] is a binary classification algorithm. In the case of perfectly separable classes, SVM seeks a
separating hyperplane with maximum margin, while for non-separable classes the goal is to maximize a linear
combination of the separation margin and the total amount by which
SVM predictions fall on the wrong side of their margin. Given n-element feature vectors $x_i$,
$i = 1,\ldots,m$, and an m-element label vector $y$ such that $y_i \in \{1, -1\}$, this amounts to solving
the following optimization problem:

$$\min_{\beta,\,\beta_0,\,\xi} \; \frac{1}{2}\beta^{T}\beta + C \sum_{i=1}^{m} \xi_i \quad \text{subject to} \quad \xi_i \ge 0, \;\; y_i\left(\beta^{T}\Phi(x_i) + \beta_0\right) \ge 1 - \xi_i, \;\; i = 1,\ldots,m \qquad (2)$$

where $C > 0$ is a penalty constant, $\xi_i$ is the slack variable allowing misclassification of sample i,
$\Phi(\cdot)$ is a function that maps $x_i$ to a high-dimensional space, often called the feature space, and
$\beta$, $\beta_0$ define the optimum separating hyperplane $\beta^{T} z + \beta_0 = 0$ in feature space.
Once the optimal separating hyperplane is found, a test sample $t$ is classified according to the sign of
$\beta^{T}\Phi(t) + \beta_0$.
In practice, the solution to the convex optimization problem (2) is obtained by solving the so-called Wolfe
dual. Instead of explicitly mapping samples to the feature space, solving the dual requires only a kernel
function $K(x_1, x_2) = \Phi(x_1)^{T}\Phi(x_2)$, which implicitly maps samples to the feature space and
simultaneously computes the inner product [12]. In this study, we used the software package LIBSVM [13] to
conduct all SVM experiments. LIBSVM uses the "one-against-one" approach [14] when more than two classes are
present. For all SVM experiments we used the radial basis kernel
$K(x_1, x_2) = \exp(-\gamma \|x_1 - x_2\|^2)$, where $\gamma$ is a parameter. The penalty constant $C$ and
the parameter $\gamma$ were tuned using 5-fold cross-validation on the training data.
Linear and quadratic discriminant analysis
LDA and QDA assume that for each class the feature vectors follow a multivariate normal distribution [6].
That is, the conditional probability of a sample $x$ given that it belongs to class g is given by

$$f_g(x) = \Pr(X = x \mid G = g) = \frac{1}{|2\pi\Sigma_g|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu_g)^{T}\Sigma_g^{-1}(x-\mu_g)\right) \qquad (3)$$

By applying Bayes' theorem, we obtain the posterior distribution as follows.

$$\Pr(G = g \mid X = x) = \frac{f_g(x)\,\pi_g}{\sum_{i=1}^{k} f_i(x)\,\pi_i} \qquad (4)$$

where $\pi_g$ is the prior probability of class g. The parameters of the multivariate normal distribution
are estimated using the training dataset. LDA assumes that the classes have a common covariance matrix
(i.e., $\Sigma_g = \Sigma$ for every g), so fewer parameters need to be estimated for LDA than for QDA. For
both methods, a given test sample $t$ is assigned to the class with the highest posterior probability,
$\arg\max_g \Pr(G = g \mid X = t)$.
In this study, we used MCLUST Version 3 [15] to conduct all LDA and QDA experiments.

1-nearest neighbor (1NN)
1NN is a simple non-parametric classification algorithm, which does not have a training process. Given a
set of reference samples and a test sample, 1NN searches the reference dataset for the sample nearest to
the test sample and assigns the test sample to the class to which the nearest sample belongs. In case there
are multiple nearest reference samples, voting is used to assign the test sample to the class containing
the largest number of nearest reference samples. As discussed below, mtDNA profiles are encoded into binary
feature vectors. We used the number of mismatch positions (a.k.a. the Hamming distance) to measure the
distance between samples, and did not apply PCA to the data before applying 1NN.

Datasets
We used the forensic and published tables in the mtDNA population database [7] to empirically evaluate the
performance of the four algorithms for ethnicity assignment. The forensic table contains 4,839 samples
collected and typed by the Federal Bureau of Investigation (FBI), while the published table contains 6,106
samples collected from the literature.
In this study, we focus only on the samples annotated as belonging to one of the four coarse ethnic groups:
Caucasian, African, Asian and Hispanic. Filtering the forensic and published tables by this criterion
results in 4,426 and 3,976 samples, respectively. In the rest of the paper we will refer to the two
filtered tables simply as the forensic and published datasets. The forensic dataset contains 1,674
Caucasian (37.8%), 1,305 African (29.5%), 761 Asian (17.2%) and 686 Hispanic (15.5%) samples, while the
published dataset comprises 2,807 Caucasian (70.6%), 254 African (6.4%) and 915 Asian (23%) samples.
Additional file 1 shows the percentage of samples sequenced at each position for the forensic and published
datasets. We note that the forensic dataset has significantly better coverage than the published dataset.
All the samples in the forensic dataset cover portions of both hypervariable region 1 (HVR1) and
hypervariable region 2 (HVR2) of mtDNA, whereas over 60% of samples in the published dataset do not cover
HVR2 and around 5% of them do not cover HVR1.
To better characterize and compare the forensic and published datasets, we assign each sample in the two
datasets to one of the 23 basal haplogroups defined in [8]. Haplogroup assignment was performed using the
unweighted 1NN algorithm described in [8] along with the Genographic Project open resource mitochondrial
DNA database (the consented database) of 21,164 samples [16]. Behar et al. [8] reported a leave-one-out
cross-validation accuracy of 96.72% on a reference database of 16,609 samples. We observed a comparable
accuracy of 96.51% on the consented database. Therefore, we expect the inferred haplogroups of samples in
the forensic and published datasets to have a similarly high accuracy.
The ethnicity composition of each haplogroup and the inferred haplogroup composition of each broad ethnic
group represented in the forensic and published datasets are given in Additional file 2. Additional file
2(A) supports the well known fact that many haplogroups are strongly associated with a specific ancestry.
For example, most samples with inferred haplogroup H, J, K, R0*, T, U*, and V are Caucasian, most samples
with inferred haplogroup B, D, M, N, and R9 are Asian, and most samples with inferred haplogroup L are
African. However, the association is not perfect, and significant percentages of these haplogroups are
present in other ethnic groups. For some haplogroups, such as B, N1*, W, and X, the association with
ethnicity is particularly weak, with two or three ethnicities being represented in almost equal
proportions. Additional file 2 further shows that the forensic and published datasets have significant
differences in their ethnic and haplogroup compositions. Most strikingly, Caucasians are significantly
over-represented and Hispanics are completely missing from the published dataset. Such differences are most
likely due to the procedure used to assemble the published dataset, and reflect preferential use of samples
from some ethnic groups in published studies.
For some of the experiments described in the Results section, we used specific subsets of the forensic and
published datasets. The full-length forensic dataset consists of the 1,904 samples typed for the most
extensive ranges of HVR1 (16024-16569) and HVR2 (1-576). This dataset comprises 222 Caucasian (11.7%), 820
African (43.1%), 415 Asian (21.8%) and 447 Hispanic (23.5%) samples. The trimmed forensic dataset was
produced by trimming the samples in the forensic dataset such that only the region 16024-16365 in HVR1 is
kept. It has the same ethnicity composition as the forensic dataset since all samples in the forensic
dataset are typed in this range. The trimmed published dataset was created in a similar fashion, except
that only 2,540 samples covering the 16024-16365 region were kept. This subset contains 1,956 Caucasian
(77%), 134 African (5.3%) and 450 Asian (17.7%) samples.

Encoding mtDNA profiles into feature vectors
Each sample in the forensic and published datasets is given as a list of polymorphic changes when compared
to the revised Cambridge Reference Sequence (rCRS). For example, 16298C denotes a substitution at position
16298 and 16124.1C denotes the insertion of a C after position 16124. For a fixed dataset, we represent
each sample as an n-element binary vector, where n is the number of unique polymorphisms present in the
dataset. An element in the binary vector of a sample is set to 1 if the sample harbors the corresponding
polymorphism, and to 0 otherwise. This encoding method works well when all the samples in the dataset are
sequenced over the same or very similar ranges. An example is the forensic dataset, in which all samples
cover range 16024-16365 of HVR1 and range 73-340 of HVR2. While most of our results were obtained using the
above binary encoding, we also discuss and evaluate in the Results section several alternative schemes for
encoding mtDNA profiles with significant amounts of missing data.

Results
Comparison of the four classification algorithms
For an initial evaluation of the four classification algorithms, we performed cross-validation (CV)
analysis using the trimmed forensic dataset. Cross-validation is one of the simplest and most widely used
methods for estimating the accuracy of classification algorithms. Briefly, available samples are randomly
split into K roughly equal parts, and then each part is used to evaluate the classification accuracy of a
model trained on the remaining K - 1 parts. In our experiments we used K = 5, i.e., 5-fold
cross-validation.
In addition to ethnicity-wise average accuracies, we also use micro- and macro-accuracy as measures of the
overall performance of the classification algorithms. These metrics, similar to the micro-average and
macro-average of [17], are defined as follows:

$$\text{Micro-Accuracy} = \frac{\sum_{i=1}^{K} C_i}{\sum_{i=1}^{K} N_i}; \qquad (5)$$

$$\text{Macro-Accuracy} = \frac{1}{K} \sum_{i=1}^{K} \frac{C_i}{N_i}, \qquad (6)$$

where K is the number of classes in the dataset, $N_i$ is the number of samples in class i and $C_i$ is the
number of samples correctly labeled by the classifier in class i. Note that micro- and macro-accuracy
become the same when class sizes are balanced, i.e., $N_1 = N_2 = \ldots = N_K$. For imbalanced class
sizes, micro-accuracy tends to over-emphasize the performance on the largest classes compared to
macro-accuracy, which gives equal weight to the accuracy achieved for each class.
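
As a minimal sketch of equations (5) and (6), the following Python snippet computes both metrics from per-class counts. The function names and the two-class counts are hypothetical illustrations chosen to show how the metrics diverge under class imbalance; they are not values from this study.

```python
from typing import Sequence


def micro_accuracy(correct: Sequence[int], totals: Sequence[int]) -> float:
    """Equation (5): pooled fraction of correctly labeled samples."""
    return sum(correct) / sum(totals)


def macro_accuracy(correct: Sequence[int], totals: Sequence[int]) -> float:
    """Equation (6): unweighted mean of the per-class accuracies."""
    per_class = [c / n for c, n in zip(correct, totals)]
    return sum(per_class) / len(per_class)


# Hypothetical two-class example (not data from this study):
# a large class classified well and a small class classified poorly.
totals = [80, 20]    # N_i, number of samples in each class
correct = [72, 10]   # C_i, number of correctly labeled samples in each class

micro = micro_accuracy(correct, totals)   # 82 / 100
macro = macro_accuracy(correct, totals)   # (0.9 + 0.5) / 2
print(f"micro = {micro:.2f}, macro = {macro:.2f}")
```

In this toy case the micro-accuracy (0.82) is dominated by the large class, while the macro-accuracy (0.70) exposes the weaker performance on the small class.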
Table 1 summarizes the 5-fold CV accuracy metrics for PCA-QDA, PCA-LDA, 1NN, and PCA-SVM on the trimmed
forensic dataset. PCA-SVM consistently outperforms the other three classification algorithms with respect
to all accuracy measures. Since the performance of different classification algorithms may depend
significantly on the typed mtDNA region, we conducted three additional experiments to assess its effect on
the classification accuracy of the four compared algorithms. In all three of them we started from the
full-length forensic dataset. In the first experiment, we iteratively deleted 10% of the polymorphisms,
starting from the HVR2 end non-adjacent to HVR1. Similarly, in the second experiment, we iteratively
deleted 10% of the polymorphisms starting from the HVR1 end non-adjacent to HVR2. Finally, in the third
experiment, we used a sliding window approach to generate 20 different datasets, each of which retained
10% of the nucleotides from the full-length forensic profiles.

Table 1 Comparison of 5-fold CV accuracy measures (%) on the trimmed forensic dataset
                              Classification algorithm
                  # Samples   PCA-QDA   PCA-LDA   1NN     PCA-SVM
Caucasian         1674        83.15     90.20     93.73   94.62
Asian             761         72.93     74.11     83.31   84.76
African           1305        84.60     88.28     86.59   89.81
Hispanic          686         71.57     68.22     72.01   72.59
Micro-Accuracy    4426        80.03     83.46     86.47   88.10
Macro-Accuracy    4426        78.06     80.20     83.91   85.45

Figure 1 gives the 5-fold CV micro-accuracy achieved by PCA-QDA, PCA-LDA, 1NN, and PCA-SVM in these three
experiments. Again, PCA-SVM consistently outperforms the other three classification algorithms investigated
in this study. PCA-QDA is typically outperformed by the other methods, except that it outperforms 1NN when
the entire HVR is used. 1NN and PCA-LDA have comparable performance, but PCA-LDA performs slightly better
than 1NN for near-complete mtDNA profiles. Conversely, 1NN performs better than PCA-LDA for some short
typed regions. Indeed, for short windows consisting of only 10% of the nucleotides in the entire dataset,
the performance of 1NN is often as good as that of PCA-SVM, see Figure 1(C).

Figure 1 Effects of incomplete data on accuracy. Comparison of PCA-QDA, PCA-LDA, 1NN, and PCA-SVM 5-fold CV micro-accuracy on regions
obtained by iteratively deleting groups of 10% polymorphisms starting from HVR1 towards HVR2 (A), respectively from HVR2 towards HVR1 (B),
and on sliding windows spanning 10% of the nucleotides in HVR1+HVR2 (C).

Figure 1(C) further shows that, regardless of the classification method used, certain regions of HVR1 and
HVR2 are more informative than others for the purpose of ethnicity inference. Additional file 3 gives the
5-fold CV micro-accuracy for 6 selected windows of 165-271bp spanning the most informative regions of HVR1
and HVR2. Interestingly, when using about 200bp from the information-rich region of HVR1, PCA-SVM yields a
micro-accuracy of over 80%, very close to the micro-accuracy achieved on this set when using the entire HVR
region, i.e., HVR1+HVR2.

Validating SVM on independent test data
Cross-validation may overestimate the practical performance of classifiers since it ignores potentially
significant biases in the assembly of reference databases. To obtain a more reliable estimate for the
practical accuracy of PCA-SVM, we evaluated its performance using the trimmed forensic dataset as training
data and the trimmed published dataset as test data. Table 2 gives the so-called confusion table for this
experiment. There is no "Hispanic" row since there are no samples annotated as Hispanic in the trimmed
published dataset used for testing. Since Hispanic samples are present in the trimmed forensic dataset used
for training, test samples may be misclassified as Hispanic, and thus we do include a "Hispanic" column.
PCA-SVM micro-accuracy, as well as ethnicity-wise accuracies for the Caucasian and African ethnic groups,
are similar to the cross-validation results in Table 1. However, ethnicity-wise accuracy for the Asian
group is almost 17 percentage points lower than the accuracy achieved in the cross-validation experiment.
This is largely explained by large mismatches between Asian profiles used for training and testing in this
experiment. The 761 Asian profiles in the forensic dataset used for training come from only 5 countries:
China (356 profiles), Japan (163), Korea (182), Pakistan (8), and Thailand (52), with a strong bias towards
East Asia. Not surprisingly, a large percentage of misclassification errors (90 out of the total of 145)
are for profiles collected from two countries (Kazakhstan and Kyrgyzstan) that are not represented in the
training dataset. Profiles with unknown country of origin are also poorly classified (10 errors out of 22
samples), suggesting that they may come from regions that are poorly represented in the forensic dataset
too.

Comparison of methods for handling missing data
In practice, forensic mtDNA profiles are determined by Sanger sequencing of PCR amplicons that span
hypervariable regions HVR1 and HVR2. Different laboratories use different PCR primer pairs, some of which
amplify only parts of HVR1 and HVR2. Quality trimming of Sanger chromatograms further results in confident
polymorphism calls for a (sample dependent) subinterval of each amplicon. The end result is a set of mtDNA
profiles with a variable degree of sequence coverage, i.e., with unknown polymorphism status for some parts
of HVR1 and/or HVR2. In the experiments reported in previous sections we relied on training and test
sequences covering essentially the same range, so missing data was not an issue. In this section we
reassess the accuracy of PCA-SVM under more realistic levels of missing data. Specifically, we report
results of experiments performed using as training and test data the (untrimmed) forensic and published
datasets, respectively; as shown in Additional file 1, the published dataset has indeed highly non-uniform
coverage of different HVR regions.
We investigated three different approaches of dealing with missing data:
• rCRS. In this approach we simply assume that missing regions are identical to the rCRS. While easy to
implement, this scheme is likely to introduce a strong bias towards the Caucasian ethnicity since the rCRS
sequence is of a Caucasian.
• Probability. In this approach we augment the feature encoding scheme described in the Methods section by
adding a set of l additional variables, where l is the
total length of HVR1 and HVR2 in bases. For typed bases, these variables hold the mutation status of the
base: 1 if there is a polymorphism at this base and 0 otherwise. For bases that are not covered by
sequencing, the corresponding variable is set to a fractional value between 0 and 1 representing the
polymorphism rate observed at this position in the training data. While less biased than the rCRS scheme,
this scheme may still introduce unwanted biases in case some ethnicities are over- or under-represented in
the training data.
• Common region. In this approach we compute, for each test profile, the intersection between the region
sequenced in the test profile and each training sample. Only these common regions of the training sequences
are then used to infer the ethnicity of the test sample. The common region approach is computationally more
demanding than the other two, since it may require running PCA and training a new SVM for each test sample.
Additional file 4 summarizes the results obtained by using the three approaches to handling missing data in
experiments in which the forensic and published datasets are used for training and for evaluating
classification accuracy, respectively. Consistent with its bias towards Caucasians, the rCRS approach has
almost 97% accuracy for this ethnicity but much lower accuracy for Asian and African ethnicities (about 31%
and 59%, respectively), resulting in relatively poor overall micro- and macro-accuracies. The probability
approach is still biased towards the Caucasian ethnicity, although less strongly than the rCRS approach.
The best overall performance is achieved by the common region approach, which has micro- and
macro-accuracies (as well as ethnicity-wise accuracies) very close to those observed in the experiments
performed on the trimmed forensic and published datasets (see Table 2). This suggests that the common
region approach is a good method of dealing with missing data, at least in conjunction with the PCA-SVM
method for ethnicity inference.

Table 2 Confusion table of the PCA-SVM test results (%) on the trimmed published dataset
                              Predicted ethnicity
True ethnicity    # Samples   Caucasian   Asian   African   Hispanic
Caucasian         1956        92.59       5.47    1.53      0.41
Asian             450         25.78       67.78   3.11      3.33
African           134         5.22        3.73    87.31     3.73
Micro-Accuracy: 87.91%
Macro-Accuracy: 82.56%

A potential concern with using the common region approach is that different amounts of training data are
used in classifying different test samples. This can make it difficult to compare posterior probabilities
returned by classification methods such as SVM, and may partly explain why, as shown in Additional file 5,
SVM posterior probabilities typically under-estimate the observed accuracy.

Discussion
Correspondence between investigator assigned ethnicity and mitochondrial haplogroup
Human mitochondrial haplogroups have arisen from mutation and migration during human evolution. As such,
these haplogroups have been extremely powerful tools in understanding human evolution and particularly in
understanding patterns of geographical migration of human populations. Prior to modern travel,
mitochondrial haplogroups were largely restricted to the geographic regions of their origin and subsequent
migration. For this reason, they are often superimposed on maps of the globe as representative of the human
populations derived from those regions of the planet. Similarly, but more crudely, the coarsest ethnic
groupings of humans are also reflective of geographic ancestry. Africans, Caucasians, and Asians all have
clear geographic associations, while Hispanic is often regarded as a less well defined mix of New World and
European ancestry. Because of the clear associations of both mitochondrial haplogroups and ethnic
categories with geography, one might naively expect a simple correlation between the two classifications.
When we analyze the association between mitochondrial haplogroup and investigator assigned ethnicity,
however, we find a complex relationship between the two categories. While, for instance, there is broad
correspondence between the L haplogroups and African ethnicity assignments, African ethnicity assignments
are present to varying degrees in virtually every haplogroup analyzed and almost every haplogroup contains
members of each of the four ethnicities. This is not particularly surprising given that mitochondrial DNA
represents only a very small segment of the complex mosaic of a human's genetic ancestry, and it suggests
that the ability to infer coarse ethnic identity from mitochondrial sequence would be very limited. In
fact, however, we find that mitochondrial DNA can be used to infer the probable assignment of coarse
ethnicity with almost 90% accuracy, a level approaching that obtainable with approximately sixty autosomal
loci [11]. This level of accuracy in predicting investigator assigned ethnicity could be very useful in
forensic investigations.

Information content in HVR1 and HVR2
As noted above, there is a great deal of variability in the precise regions of HVR1 and HVR2 genotyped in
practice. Sequence coverage within the mitochondrial control region is often laboratory and/or study
dependent. Variability of these boundaries severely limits the utility
of individual datasets in the assembly of large datasets representative of complex populations.
Recently, Tzen et al. [18] sought to redefine HVR1 on the basis of genetic diversity and
laboratory tractability. They show that the 237-bp segment from 16126-16362 (the “redefined”
HVR1, or rHVR1) had a global genetic diversity of 0.9905 and the 154-bp segment from
16209-16362 had a global diversity of 0.9735, where the genetic diversity for a sample with n
haplotypes with population frequencies $x_i$, $i = 1,\ldots,n$, is computed as
$\left(1 - \sum_{i=1}^{n} x_i^2\right) n/(n-1)$. The results of [18] match very closely with our
scans of the inferential power of windows across the control region; Tzen’s rHVR1 overlaps
precisely with the region of greatest discriminative power in HVR1. The correspondence between
these results suggests that HVR2 might be similarly standardized to a region between 93-310,
where the greatest discriminative power of HVR2 is found. The identification of small regions
of sequence that have maximal discriminative power could be quite useful in forensic and
anthropological settings where severe degradation can limit the size of PCR products
recoverable from sample material. Di Bernardo et al. [19] report that the longest amplifiable
DNA fragments extracted from 2000-year-old remains from Pompeii are between 139 and 360 bp.
Sequences of this size from the most informative regions of HVR1 and HVR2 would allow
inference of coarse ethnic identity with reasonably high accuracy.

SVM as classifier
Many applications in human genetics require the discriminative classification of samples into
groups, and a number of methods for this task have been proposed. Lately, machine learning
approaches have been used to good effect in a number of biological scenarios including the
classification of Y-haplogroups [20]. In this study we use support vector machines (SVM) to
develop statistical models capable of predicting the ethnicity of mitochondrial DNA samples.
We compare the performance of SVM under simulations of real-world scenarios with several other
methods previously proposed for the classification of mitochondrial sequences into
geographically defined groups, including QDA and LDA [3-5]. In all tests SVM provides accuracy
greater than or equal to that of the other methods tested. SVM consistently provides the best
accuracy in simulations of degradation from either end of the mitochondrial hypervariable
regions, and when small subsections of the hypervariable regions are used. With only 218 bp of
mtDNA sequence, the overall accuracy of SVM predictions exceeds 80%. The success of SVM in this
classification problem suggests that it may also be the best method for related classification
problems including inferring the geographic origin of DNA samples [4,5], haplogroup membership
[8], drug response profiles [21], and other “race based” therapeutics [22].
When applied to independent test data our SVM classifier performs reasonably well despite
significant differences between the training and test sets. In particular, the absence of a
Hispanic classification in the published dataset, and the inclusion of geographic regions in
the test set that are not represented in the training set (for instance Kazakhstan and
Kyrgyzstan) are likely to have contributed significantly to errors in our inferences. Such
errors are likely to recede as larger, more geographically balanced training sets are
assembled.

Handling missing data
In the last few years several authors have pointed out the presence of sequence errors in
public and forensic mtDNA databases [23-27]. Moreover, precise boundaries of HVR1 and HVR2 are
not always consistent across studies and real-world samples may be severely degraded, further
contributing to errors or missing data in samples to be classified. We evaluated several
statistical approaches to dealing with missing data, measuring their accuracy under simulated
scenarios of data dropout or loss. We found that despite a small loss of accuracy incurred by
data dropout, restricting analysis to the region of intersection between the test sample and
training samples provides the most reliable inference of the ethnicity of the sample. Attempts
to impute any missing data based on the rCRS or a probabilistic model based on the training set
resulted in prediction bias toward Caucasian due to the origin of the rCRS and the
preponderance of Caucasian samples in the FBI forensic data set. Until very large, ethnically
balanced training sets are available, restricting analysis to the region of intersection
between test and training samples is likely to remain the most accurate and unbiased approach
to inference.

Conclusions
In this study, we compared four classification algorithms for the prediction of probable
assignment of coarse ethnic identity using short DNA sequences from the hypervariable region
of mtDNA. Comprehensive empirical studies showed that, regardless of sequence length, support
vector classification is the most accurate classifier among those compared and approaches 90%
accuracy in predicting the assignment of coarse ethnic identity. Our experiments also
identified high accuracy segments in HVR, which agree well with the genetically diverse regions
reported in previous work. Finally, our experiments showed that, in dealing with missing data,
it is advisable to use only segments shared by reference sequences and the sequence under test.
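The recommendation to use only segments shared by the reference sequences and the sequence under test can be illustrated with a minimal sketch. It assumes that every sequence is already aligned to reference coordinates without indels and is stored with hypothetical `start`, `end`, `seq` and `label` fields, and that the common region is the intersection of the test interval with all training intervals. A 1-nearest-neighbor rule on Hamming distance stands in for the classifier purely to keep the example dependency-free; it is not the PCA-SVM pipeline evaluated in this study.

```python
def common_region(test, training):
    """Inclusive (lo, hi) interval covered by the test sample and every training sample."""
    lo = max([test["start"]] + [s["start"] for s in training])
    hi = min([test["end"]] + [s["end"] for s in training])
    if lo > hi:
        raise ValueError("no region is shared by all sequences")
    return lo, hi


def trim(sample, lo, hi):
    """Cut a sample down to positions lo..hi (assumes no indels, so offsets are direct)."""
    offset = lo - sample["start"]
    return sample["seq"][offset:offset + (hi - lo + 1)]


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def classify_common_region(test, training):
    """Label the test sample using only the shared region (1-NN stand-in classifier)."""
    lo, hi = common_region(test, training)
    query = trim(test, lo, hi)
    nearest = min(training, key=lambda s: hamming(query, trim(s, lo, hi)))
    return nearest["label"], (lo, hi)


# Toy data with made-up coordinates and labels (illustration only).
training = [
    {"start": 1, "end": 10, "seq": "ACGTACGTAC", "label": "group_1"},
    {"start": 3, "end": 12, "seq": "GTTCGGTACC", "label": "group_2"},
]
test = {"start": 2, "end": 9, "seq": "CGTACGTA"}

label, region = classify_common_region(test, training)
```

Because the shared interval is recomputed for each test sample, any model fitted on the trimmed training data would also have to be refitted per test sample, which is the cost of avoiding imputation.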

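The genetic diversity measure quoted in the discussion of rHVR1, (1 − Σ x_i²) n/(n − 1), can be computed directly from haplotype frequencies. The sketch below follows the formula as it is stated in the text, with n taken as the number of haplotypes; the function name and the input checks are assumptions added for illustration.

```python
def genetic_diversity(frequencies):
    """Diversity (1 - sum(x_i^2)) * n / (n - 1) for haplotype frequencies x_1..x_n."""
    n = len(frequencies)
    if n < 2:
        raise ValueError("at least two haplotypes are required")
    if abs(sum(frequencies) - 1.0) > 1e-9:
        raise ValueError("frequencies must sum to 1")
    return (1.0 - sum(x * x for x in frequencies)) * n / (n - 1)


# Four equally frequent haplotypes give the maximum value of the measure.
uniform = genetic_diversity([0.25, 0.25, 0.25, 0.25])
# A skewed distribution gives a lower value.
skewed = genetic_diversity([0.7, 0.2, 0.1])
```

Values close to 1, such as the 0.9905 reported for rHVR1, indicate that haplotype frequencies are spread nearly evenly across many haplotypes.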
Additional material

Additional file 1: Coverage of samples. Percentage of samples covering each position of HVR1
and HVR2 in the forensic (A) and published (B) datasets.
Additional file 2: Sample composition of the forensic and published datasets. Ethnicity
composition of each haplogroup (A) and haplogroup composition of each ethnic group (B) for the
forensic and published datasets.
Additional file 3: Accuracy of short segments of HVR. Comparison of PCA-QDA, PCA-LDA, 1NN, and
PCA-SVM 5-fold CV micro-accuracy on 6 selected windows of 165-271 bp spanning the most
informative regions of HVR1 and HVR2.
Additional file 4: Accuracy of PCA-SVM using different schemes for handling missing data.
Additional file 5: Calibration of PCA-SVM posterior probabilities for the FBI published
dataset. The actual accuracy rates are slightly higher than the estimated posterior
probabilities.

Acknowledgements
This work was supported in part by NSF grants CCF-0755373, DBI-0543365, IIS-0546457, and
IIS-0916948.
This article has been published as part of BMC Proceedings Volume 5 Supplement 2, 2011:
Proceedings of the 6th International Symposium on Bioinformatics Research and Applications
(ISBRA’10). The full contents of the supplement are available online at
http://www.biomedcentral.com/1753-6561/5?issue=S2.

Author details
1 Computer Science and Engineering Department, University of Connecticut, Storrs, CT, USA.
2 Molecular and Cell Biology Department, University of Connecticut, Storrs, CT, USA.

Authors' contributions
IIM and CEN conceived the study. CL conducted the experiments. All the authors contributed
valuable ideas to this study, drafted the manuscript, and were involved in manuscript revision.
All authors have read and approved the final manuscript.

Competing Interests
The authors declare that they have no competing interests.

Published: 28 April 2011

References
1. Shriver MD, Smith MW, Jin L, Marcini A, Akey JM, Deka R, Ferrell RE: Ethnic-affiliation
estimation by use of population-specific DNA markers. American Journal of Human Genetics 1997,
60(4):957-964.
2. Phillips C, Salas A, Sánchez J, Fondevila M, Gómez-Tato A, Álvarez Dios J, Calaza M, de Cal
MC, Ballard D, Lareu M, Carracedo A: Inferring ancestral origin using a single multiplex assay
of ancestry-informative marker SNPs. Forensic Science International: Genetics 2007,
1(3-4):273-280.
3. Connor A, Stoneking M: Assessing ethnicity from human mitochondrial DNA types determined by
hybridization with sequence-specific oligonucleotides. Journal of Forensic Sciences 1994,
39(6):1360-1371.
4. Rohl A, Brinkmann B, Forster L, Forster P: An annotated mtDNA database. International
Journal of Legal Medicine 2001, 115(29):39.
5. Egeland T, Bøvelstad HM, Storvik GO, Salas A: Inferring the most likely geographical origin
of mtDNA sequence profiles. Annals of Human Genetics 2004, 68(5):461-471.
6. Hastie T, Tibshirani R, Friedman JH: The Elements of Statistical Learning. 2nd edition.
Springer; 2009.
7. Monson KL, Miller KWP, Wilson MR, DiZinno JA, Budowle B: The mtDNA Population Database: An
Integrated Software and Database Resource for Forensic Comparison. Forensic Science
Communications 2002, 4(2).
8. Behar DM, Rosset S, Blue-Smith J, Balanovsky O, Tzur S, Comas D, Mitchell RJ, Quintana-Murci
L, Tyler-Smith C, Wells RS, Consortium TG: The Genographic Project Public Participation
Mitochondrial DNA Database. PLoS Genet 2007, 3(6):e104.
9. Dibennardo R, Taylor JV: Multiple discriminant function analysis of sex and race in the
postcranial skeleton. American Journal of Physical Anthropology 1983, 61(3):305-314.
10. İşcan MY: A Topical Guide to the American Journal of Physical Anthropology: Volumes 22-53
(1964-1980). Wiley-Liss; 1983.
11. Bamshad M, Wooding S, Salisbury BA, Stephens JC: Deconstructing the relationship between
genetics and race. Nature Reviews Genetics 2004, 5(8):598-609.
12. Vapnik V: Statistical Learning Theory. Wiley; 1998.
13. Chang CC, Lin CJ: LIBSVM: a library for support vector machines. 2001
[http://www.csie.ntu.edu.tw/~cjlin/libsvm].
14. Knerr S, Personnaz L, Dreyfus G: Single-layer learning revisited: a stepwise procedure for
building and training a neural network. In Neurocomputing: Algorithms, Architectures and
Applications. Edited by Fogelman J. Springer-Verlag; 1990.
15. Fraley C, Raftery AE: Enhanced Model-Based Clustering, Density Estimation, and Discriminant
Analysis Software: MCLUST. Journal of Classification 2003, 20(2):263-286.
16. Behar DM, Rosset S, Blue-Smith J, Balanovsky O, Tzur S, Comas D, Mitchell RJ,
Quintana-Murci L, Tyler-Smith C, Wells RS, Consortium TG: Correction: The Genographic Project
Public Participation Mitochondrial DNA Database. PLoS Genet 2007, 3(9):e169.
17. Lewis DD: Evaluating text categorization. In Proceedings of Speech and Natural Language
Workshop. Morgan Kaufmann; 1991, 312-318.
18. Tzen J, Hsu H, MN W: Redefinition of hypervariable region I in mitochondrial DNA control
region and comparing its diversity among various ethnic groups. Mitochondrion 2008,
8(2):146-154.
19. Di Bernardo G, Del Gaudio S, Galderisi U, Cipollaro M: 2000 Year-old ancient equids: an
ancient-DNA lesson from Pompeii remains. Journal of Experimental Zoology Part B: Molecular and
Developmental Evolution 2004, 302B(6):550-556.
20. Schlecht J, Kaplan ME, Barnard K, Karafet T, Hammer MF, Merchant NC: Machine-Learning
Approaches for Classifying Haplogroup from Y Chromosome STR Data. PLoS Comput Biol 2008,
4(6):e1000093.
21. Schelleman H, Limdi NA, Kimmel SE: Ethnic differences in warfarin maintenance dose
requirement and its relationship with genetics. Pharmacogenomics 2008, 9(9):1331-1346.
22. Yancy CW: Race-based therapeutics. Current Hypertension Reports 2008, 10(4):276-285.
23. Bandelt HJ, Lahermo P, Richards M, Macaulay V: Detecting errors in mtDNA data by
phylogenetic analysis. International Journal of Legal Medicine 2001, 115(2):64-69.
24. Bandelt H, Quintana-Murci L, Salas A, Macaulay V: The fingerprint of phantom mutations in
mitochondrial DNA data. American Journal of Human Genetics 2002, 71(5):1150-1160.
25. Forster P: To err is human. Annals of Human Genetics 2003, 67:2-4.
26. Dennis C: Error reports threaten to unravel databases of mitochondrial DNA. Nature 2003,
421(6925):773-774.
27. Bandelt H, Salas A, Bravi C: Problems in FBI mtDNA database. Science 2004,
305(5689):1402-1404.

doi:10.1186/1753-6561-5-S2-S11
Cite this article as: Lee et al.: Inferring ethnicity from mitochondrial DNA sequence. BMC
Proceedings 2011 5(Suppl 2):S11.
