Inferring Ethnicity From Mitochondrial DNA Sequence: Proceedings Open Access
Abstract
Background: The assignment of DNA samples to coarse population groups can be a useful but difficult task. One
such example is the inference of coarse ethnic groupings for forensic applications. Ethnicity plays an important role
in forensic investigation and can be inferred with the help of genetic markers. Because it is maternally inherited, present in high
copy number, and persists robustly in degraded samples, mitochondrial DNA may be useful for inferring coarse
ethnicity. In this study, we compare the performance of methods for inferring ethnicity from the sequence of the
hypervariable region of the mitochondrial genome.
Results: We present the results of comprehensive experiments conducted on datasets extracted from the mtDNA
population database, showing that ethnicity inference based on support vector machines (SVM) achieves an overall
accuracy of 80-90%, consistently outperforming nearest neighbor and discriminant analysis methods previously
proposed in the literature. We also evaluate methods of handling missing data and characterize the most
informative segments of the hypervariable region of the mitochondrial genome.
Conclusions: Support vector machines can be used to infer coarse ethnicity from a small region of mitochondrial
DNA sequence with surprisingly high accuracy. In the presence of missing data, utilizing only the regions common
to the training sequences and a test sequence proves to be the best strategy. Given these results, SVM algorithms
are likely to also be useful in other DNA sequence classification applications.
© 2011 Lee et al; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons
Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in
any medium, provided the original work is properly cited.
Lee et al. BMC Proceedings 2011, 5(Suppl 2):S11 Page 2 of 9
http://www.biomedcentral.com/1753-6561/5/S2/S11
SVM predictions fall on the wrong side of their margin. Given n-element feature vectors $x_i$, i = 1,…, m, and an m-element label vector y such that $y_i \in \{1, -1\}$, this amounts to solving the following optimization problem:

$$\min_{\beta,\,\beta_0,\,\xi}\ \frac{1}{2}\|\beta\|^2 + C\sum_{i=1}^{m}\xi_i \quad \text{subject to} \quad \xi_i \ge 0,\ \ y_i\left(\beta^T\Phi(x_i) + \beta_0\right) \ge 1 - \xi_i,\ \ i = 1,\ldots,m \tag{2}$$

where C > 0 is a penalty constant, $\xi_i$ is the slack variable allowing misclassification of sample i, $\Phi(\cdot)$ is a function that maps $x_i$ to a high-dimensional space, often called the feature space, and $\beta$, $\beta_0$ define the optimum separating hyperplane $\beta^T z + \beta_0 = 0$ in feature space. Once the optimal separating hyperplane is found, a test sample t is classified according to the sign of $\beta^T\Phi(t) + \beta_0$.
In practice, the solution to the convex optimization problem (2) is obtained by solving the so-called Wolfe dual. Instead of explicitly mapping samples to the feature space, solving the dual requires only a kernel function $K(x_1, x_2) = \Phi(x_1)^T\Phi(x_2)$, which implicitly maps samples to the feature space and simultaneously computes the inner product [12]. In this study, we used the software package LIBSVM [13] to conduct all SVM experiments. LIBSVM uses the “one-against-one” approach [14] when more than two classes are present. For all SVM experiments we used the radial basis kernel $K(x_1, x_2) = \exp(-\gamma\,|x_1 - x_2|^2)$, where $\gamma$ is a parameter. The penalty constant C and the parameter $\gamma$ were tuned using 5-fold cross-validation on the training data.

Linear and quadratic discriminant analysis
LDA and QDA assume that for each class the feature vectors follow a multivariate normal distribution [6]. That is, the conditional probability of a sample x given that it belongs to class g is given by

$$f_g(x) = \Pr(X = x \mid G = g) = \frac{1}{|2\pi\Sigma_g|^{1/2}} \exp\left(-\tfrac{1}{2}(x - \mu_g)^T \Sigma_g^{-1} (x - \mu_g)\right) \tag{3}$$

By applying Bayes’ theorem, we obtain the posterior distribution as follows.

$$\Pr(G = g \mid X = x) = \frac{f_g(x)\,\pi_g}{\sum_{i=1}^{k} f_i(x)\,\pi_i} \tag{4}$$

where $\pi_g$ is the prior probability of class g. The parameters of the multivariate normal distribution are estimated using the training dataset. LDA assumes that the classes have a common covariance matrix (i.e., $\Sigma_g = \Sigma$ for every g); therefore, fewer parameters need to be estimated for LDA than for QDA. For both methods, a given test sample t is assigned to the class with the highest posterior probability, $\arg\max_g \Pr(G = g \mid X = t)$.
In this study, we used MCLUST Version 3 [15] to conduct all LDA and QDA experiments.

1-nearest neighbor (1NN)
1NN is a simple non-parametric classification algorithm, which does not have a training process. Given a set of reference samples and a test sample, 1NN searches the reference dataset for the sample nearest to the test sample and assigns the test sample to the class to which the nearest sample belongs. In case there are multiple nearest reference samples, voting is used to assign the test sample to the class containing the largest number of nearest reference samples. As discussed below, mtDNA profiles are encoded into binary feature vectors. We used the number of mismatch positions (a.k.a. the Hamming distance) to measure the distance between samples, and did not apply PCA to the data before applying 1NN.

Datasets
We used the forensic and published tables in the mtDNA population database [7] to empirically evaluate the performance of the four algorithms for ethnicity assignment. The forensic table contains 4,839 samples collected and typed by the Federal Bureau of Investigation (FBI), while the published table contains 6,106 samples collected from the literature.
In this study, we focus only on the samples annotated as belonging to one of the four coarse ethnic groups: Caucasian, African, Asian and Hispanic. Filtering the forensic and published tables by this criterion results in 4,426 and 3,976 samples, respectively. In the rest of the paper we will refer to the two filtered tables simply as the forensic and published datasets. The forensic dataset contains 1,674 Caucasian (37.8%), 1,305 African (29.5%), 761 Asian (17.2%) and 686 Hispanic (15.5%) samples, while the published dataset is comprised of 2,807 Caucasian (70.6%), 254 African (6.4%) and 915 Asian (23%) samples.
Additional file 1 shows the percentage of samples sequenced at each position for the forensic and published datasets. We note that the forensic dataset has significantly better coverage than the published dataset. All the samples in the forensic dataset cover portions of both hypervariable region 1 (HVR1) and hypervariable region 2 (HVR2) of mtDNA, whereas over 60% of samples in the published dataset do not cover HVR2 and around 5% of them do not cover HVR1.
To better characterize and compare the forensic and published datasets, we assign each sample in the two datasets to one of the 23 basal haplogroups defined in [8]. Haplogroup assignment was performed using the
unweighted 1NN algorithm described in [8] along with the Genographic Project open resource mitochondrial DNA database (the consented database) of 21,164 samples [16]. Behar et al. [8] reported a leave-one-out cross-validation accuracy of 96.72% on a reference database of 16,609 samples. We observed a comparable accuracy of 96.51% on the consented database. Therefore, we expect the inferred haplogroups of samples in the forensic and published datasets to have a similarly high accuracy.
The ethnicity composition of each haplogroup and the inferred haplogroup composition of each broad ethnic group represented in the forensic and published datasets are given in Additional file 2. Additional file 2(A) supports the well known fact that many haplogroups are strongly associated with a specific ancestry. For example, most samples with inferred haplogroup H, J, K, R0*, T, U*, and V are Caucasian, most samples with inferred haplogroup B, D, M, N, and R9 are Asian, and most samples with inferred haplogroup L are African. However, the association is not perfect, and significant percentages of these haplogroups are present in other ethnic groups. For some haplogroups, such as B, N1*, W, and X, the association with ethnicity is particularly weak, with two or three ethnicities being represented in almost equal proportions. Additional file 2 further shows that the forensic and published datasets have significant differences in their ethnic and haplogroup compositions. Most strikingly, Caucasians are significantly over-represented and Hispanics are completely missing from the published dataset. Such differences are most likely due to the procedure used to assemble the published dataset, and reflect preferential use of samples from some ethnic groups in published studies.
For some of the experiments described in the Results section, we used specific subsets of the forensic and published datasets. The full-length forensic dataset consists of the 1,904 samples typed for the most extensive ranges of HVR1 (16024–16569) and HVR2 (1–576). This dataset is comprised of 222 Caucasian (11.7%), 820 African (43.1%), 415 Asian (21.8%) and 447 Hispanic (23.5%) samples. The trimmed forensic dataset was produced by trimming the samples in the forensic dataset such that only the region of 16024–16365 in HVR1 is kept. It has the same ethnicity composition as the forensic dataset since all samples in the forensic dataset are typed in this range. The trimmed published dataset was created in a similar fashion, except that only 2,540 samples covering the 16024–16365 region were kept. This subset contains 1,956 Caucasian (77%), 134 African (5.3%) and 450 Asian (17.7%) samples.

Encoding mtDNA profiles into feature vectors
Each sample in the forensic and published datasets is given as a list of polymorphic changes when compared to the revised Cambridge Reference Sequence (rCRS). For example, 16298C denotes a substitution at position 16298 and 16124.1C denotes the insertion of a C after position 16124. For a fixed dataset, we represent each sample as an n-element binary vector, where n is the number of unique polymorphisms present in the dataset. An element in the binary vector of a sample is set to 1 if the sample harbors the corresponding polymorphism, and to 0 otherwise. This encoding method works well when all the samples in the dataset are sequenced over the same or very similar ranges. An example is the forensic dataset, in which all samples cover range 16024–16365 of HVR1 and range 73–340 of HVR2. While most of our experimental results were obtained using the above binary encoding, we also discuss and evaluate in the Results section several alternative schemes for encoding mtDNA profiles with significant amounts of missing data.

Results
Comparison of the four classification algorithms
For an initial evaluation of the four classification algorithms, we performed cross-validation (CV) analysis using the trimmed forensic dataset. Cross-validation is one of the simplest and most widely used methods for estimating the accuracy of classification algorithms. Briefly, available samples are randomly split into K roughly equal parts, and then each part is used to evaluate classification accuracy of a model trained on the remaining K – 1 parts. In our experiments we used K = 5, i.e., 5-fold cross-validation.
In addition to ethnicity-wise average accuracies, we also use micro- and macro-accuracy as measures of the overall performance of the classification algorithms. These metrics, similar to the micro-average and macro-average of [17], are defined as follows:

$$\text{Micro-Accuracy} = \frac{\sum_{i=1}^{K} C_i}{\sum_{i=1}^{K} N_i}; \tag{5}$$

$$\text{Macro-Accuracy} = \frac{1}{K}\sum_{i=1}^{K} \frac{C_i}{N_i}, \tag{6}$$

where K is the number of classes in the dataset, $N_i$ is the number of samples in class i and $C_i$ is the number of samples correctly labeled by the classifier in class i. Note that micro- and macro-accuracy become the same when class sizes are balanced, i.e., $N_1 = N_2 = \ldots = N_K$. For imbalanced class sizes, micro-accuracy tends to over-emphasize the performance on the largest classes compared to macro-accuracy, which gives equal weight to the accuracy achieved for each class.
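The two overall metrics in Eqs. (5) and (6) can be illustrated with a minimal Python sketch. The function name `micro_macro_accuracy` and the toy labels below are hypothetical and are not taken from the study; the sketch only shows how the two measures diverge when class sizes are imbalanced.

```python
from collections import Counter


def micro_macro_accuracy(y_true, y_pred):
    """Return (micro, macro) accuracy following Eqs. (5) and (6)."""
    # N_i: number of samples in each true class i.
    n = Counter(y_true)
    # C_i: number of samples of class i that were labeled correctly.
    c = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    # Eq. (5): pooled fraction of correctly labeled samples.
    micro = sum(c.values()) / sum(n.values())
    # Eq. (6): unweighted mean of the per-class accuracies C_i / N_i.
    macro = sum(c[k] / n[k] for k in n) / len(n)
    return micro, macro


# Toy, made-up labels: a large class "A" and a small class "B".
y_true = ["A"] * 8 + ["B"] * 2
y_pred = ["A"] * 8 + ["A", "B"]
micro, macro = micro_macro_accuracy(y_true, y_pred)
print(micro, macro)
```

In this toy case the large class is classified perfectly while half of the small class is wrong, so the micro-accuracy (9 of 10 correct) is higher than the macro-accuracy (the mean of 1.0 and 0.5).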
Table 1 summarizes the 5-fold CV accuracy metrics for PCA-QDA, PCA-LDA, 1NN, and PCA-SVM on the trimmed forensic dataset. PCA-SVM consistently outperforms the other three classification algorithms with respect to all accuracy measures. Since the performance of different classification algorithms may depend significantly on the typed mtDNA region, we conducted three additional experiments to assess its effect on the classification accuracy of the four compared algorithms. In all three of them we started from the full-length forensic dataset. In the first experiment, we iteratively deleted 10% of the polymorphisms, starting from the HVR2 end non-adjacent to HVR1. Similarly, in the second experiment, we iteratively deleted 10% of the polymorphisms starting from the HVR1 end non-adjacent to HVR2. Finally, in the third experiment, we used a sliding window approach to generate 20 different datasets, each of which retained 10% of the nucleotides from the full-length forensic profiles.

Table 1 Comparison of 5-fold CV accuracy measures (%) on the trimmed forensic dataset
                              Classification Algorithm
                 # Samples    PCA-QDA   PCA-LDA   1NN     PCA-SVM
Caucasian        1674         83.15     90.2      93.73   94.62
Asian            761          72.93     74.11     83.31   84.76
African          1305         84.6      88.28     86.59   89.81
Hispanic         686          71.57     68.22     72.01   72.59
Micro-Accuracy   4426         80.03     83.46     86.47   88.10
Macro-Accuracy   4426         78.06     80.20     83.91   85.45

Figure 1 gives the 5-fold CV micro-accuracy achieved by PCA-QDA, PCA-LDA, 1NN, and PCA-SVM in these three experiments. Again, PCA-SVM consistently outperforms the other three classification algorithms investigated in this study. PCA-QDA is typically outperformed by the other methods, except that it outperforms 1NN when the entire HVR is used. 1NN and PCA-LDA have comparable performance, but PCA-LDA performs slightly better than 1NN for near-complete mtDNA profiles. Conversely, 1NN performs better than PCA-LDA for some short typed regions. Indeed, for short windows consisting of only 10% of the nucleotides in the entire dataset, the performance of 1NN is often as good as that of PCA-SVM; see Figure 1(C).
Figure 1(C) further shows that, regardless of the classification method used, certain regions of HVR1 and HVR2 are more informative than others for the purpose of ethnicity inference. Additional file 3 gives the 5-fold CV micro-accuracy for 6 selected windows of 165–271 bp spanning the most informative regions of HVR1 and HVR2. Interestingly, when using about 200 bp from the information-rich region of HVR1, PCA-SVM yields a micro-accuracy of over 80%, very close to the micro-accuracy achieved on this set when using the entire HVR region, i.e., HVR1+HVR2.

Validating SVM on independent test data
Cross-validation may overestimate the practical performance of classifiers since it ignores potentially significant biases in the assembly of reference databases. To obtain a more reliable estimate for the practical accuracy of PCA-SVM, we evaluated its performance using the trimmed forensic dataset as training data and the trimmed published dataset as test data. Table 2 gives the so-called confusion table for this experiment. There is no “Hispanic” row since there are no samples annotated as Hispanic in the trimmed published dataset used for testing. Since Hispanic samples are present in the trimmed forensic dataset used for training, test samples may be misclassified as Hispanic, and thus we do include a “Hispanic” column. PCA-SVM micro-accuracy, as well as ethnicity-wise accuracies for the Caucasian and African ethnic groups, are similar to the cross-validation results in Table 1. However, ethnicity-wise accuracy for the Asian group is almost 17 percentage points lower than the accuracy achieved in the cross-validation experiment. This is largely explained by large mismatches between Asian profiles used for training and testing in this experiment. The 761 Asian profiles in the forensic dataset used for training come from only 5 countries: China (356 profiles), Japan (163), Korea (182), Pakistan (8), and Thailand (52), with a strong bias towards East Asia. Not surprisingly, a large percentage of misclassification errors (90 out of the total of 145) are for profiles collected from two countries (Kazakhstan and Kyrgyzstan) that are not represented in the training dataset. Profiles with unknown country of origin are also poorly classified (10 errors out of 22 samples), suggesting that they may come from regions that are poorly represented in the forensic dataset too.

Comparison of methods for handling missing data
In practice, forensic mtDNA profiles are determined by Sanger sequencing of PCR amplicons that span hypervariable regions HVR1 and HVR2. Different laboratories use different PCR primer pairs, some of which amplify only parts of HVR1 and HVR2. Quality trimming of Sanger chromatograms further results in confident polymorphism calls for a (sample dependent) subinterval of each amplicon. The end result is mtDNA profiles with a variable degree of sequence coverage, i.e., with unknown polymorphism status for some parts of HVR1 and/or HVR2. In the experiments reported in previous sections we relied on training and test sequences covering essentially the same range, so missing data was not an issue. In this section we reassess the accuracy of
PCA-SVM under more realistic levels of missing data. Specifically, we report results of experiments performed using as training and test data the (untrimmed) forensic and published datasets, respectively; as shown in Additional file 1, the published dataset indeed has highly non-uniform coverage of different HVR regions.

Figure 1 Effects of incomplete data on accuracy. Comparison of PCA-QDA, PCA-LDA, 1NN, and PCA-SVM 5-fold CV micro-accuracy on regions obtained by iteratively deleting groups of 10% polymorphisms starting from HVR1 towards HVR2 (A), respectively from HVR2 towards HVR1 (B), and on sliding windows spanning 10% of the nucleotides in HVR1+HVR2 (C).

We investigated three different approaches to dealing with missing data:
• rCRS. In this approach we simply assume that missing regions are identical to the rCRS. While easy to implement, this scheme is likely to introduce a strong bias towards the Caucasian ethnicity since the rCRS sequence is that of a Caucasian.
• Probability. In this approach we augment the feature encoding scheme described in the Methods section by adding a set of l additional variables, where l is the
total length of HVR1 and HVR2 in bases. For typed bases, these variables hold the mutation status of the base: 1 if there is a polymorphism at this base and 0 otherwise. For bases that are not covered by sequencing, the corresponding variable is set to a fractional value between 0 and 1 representing the polymorphism rate observed at this position in the training data. While less biased than the rCRS scheme, this scheme may still introduce unwanted biases in case some ethnicities are over- or under-represented in the training data.
• Common region. In this approach we compute, for each test profile, the intersection between the region sequenced in the test profile and each training sample. Only these common regions of the training sequences are then used to infer the ethnicity of the test sample. The common region approach is computationally more demanding than the other two, since it may require running PCA and training a new SVM for each test sample.
Additional file 4 summarizes the results obtained by using the three approaches to handling missing data in experiments in which the forensic and published datasets are used for training and for evaluating classification accuracy, respectively. Consistent with its bias towards Caucasians, the rCRS approach has almost 97% accuracy for this ethnicity but much lower accuracy for Asian and African ethnicities (about 31% and 59%, respectively), resulting in relatively poor overall micro- and macro-accuracies. The probability approach is still biased towards the Caucasian ethnicity, although less strongly than the rCRS approach. The best overall performance is achieved by the common region approach, which has micro- and macro-accuracies (as well as ethnicity-wise accuracies) very close to those observed in the experiments performed on the trimmed forensic and published datasets (see Table 2). This suggests that the common region approach is a good method of dealing with missing data, at least in conjunction with the PCA-SVM method for ethnicity inference.

Table 2 Confusion table (%) of the PCA-SVM test results on the trimmed published dataset
                                 Predicted Ethnicity
True Ethnicity   # Samples    Caucasian   Asian   African   Hispanic
Caucasian        1956         92.59       5.47    1.53      0.41
Asian            450          25.78       67.78   3.11      3.33
African          134          5.22        3.73    87.31     3.73
Micro-Accuracy: 87.91%
Macro-Accuracy: 82.56%

A potential concern with using the common region approach is that different amounts of training data are used in classifying different test samples. This can make it difficult to compare posterior probabilities returned by classification methods such as SVM, and may partly explain why, as shown in Additional file 5, SVM posterior probabilities typically under-estimate the observed accuracy.

Discussion
Correspondence between investigator assigned ethnicity and mitochondrial haplogroup
Human mitochondrial haplogroups have arisen from mutation and migration during human evolution. As such, these haplogroups have been extremely powerful tools in understanding human evolution and particularly in understanding patterns of geographical migration of human populations. Prior to modern travel, mitochondrial haplogroups were largely restricted to the geographic regions of their origin and subsequent migration. For this reason, they are often superimposed on maps of the globe as representative of the human populations derived from those regions of the planet. Similarly, but more crudely, the coarsest ethnic groupings of humans are also reflective of geographic ancestry. Africans, Caucasians, and Asians all have clear geographic associations, while Hispanic is often regarded as a less well defined mix of New World and European ancestry. Because of the clear associations of both mitochondrial haplogroups and ethnic categories with geography, one might naively expect a simple correlation between the two classifications. When we analyze the association between mitochondrial haplogroup and investigator assigned ethnicity, however, we find a complex relationship between the two categories. While, for instance, there is broad correspondence between the L haplogroups and African ethnicity assignments, African ethnicity assignments are present to varying degrees in virtually every haplogroup analyzed and almost every haplogroup contains members of each of the four ethnicities. This is not particularly surprising, because mitochondrial DNA represents only a very small segment of the complex mosaic of a human’s genetic ancestry, and it suggests that the ability to infer coarse ethnic identity from mitochondrial sequence would be very limited. In fact, however, we find that mitochondrial DNA can be used to infer the probable assignment of coarse ethnicity with almost 90% accuracy, a level approaching that obtainable with approximately sixty autosomal loci [11]. This level of accuracy in predicting investigator assigned ethnicity could be very useful in forensic investigations.

Information content in HVR1 and HVR2
As noted above, there is a great deal of variability in the precise regions of HVR1 and HVR2 genotyped in practice. Sequence coverage within the mitochondrial control region is often laboratory and/or study dependent. Variability of these boundaries severely limits the utility
of individual datasets in the assembly of large datasets representative of complex populations. Recently, Tzen et al. [18] sought to redefine HVR1 on the basis of genetic diversity and laboratory tractability. They show that the 237-bp segment from 16126–16362 (the “redefined” HVR1, or rHVR1) had a global genetic diversity of 0.9905 and the 154-bp segment from 16209–16362 had a global diversity of 0.9735, where the genetic diversity for a sample with n haplotypes with population frequencies $x_i$, i = 1,…,n, is computed as $\left(1 - \sum_{i=1}^{n} x_i^2\right) n/(n-1)$. The results of [18] match very closely with our scans of the inferential power of windows across the control region; Tzen’s rHVR1 overlaps precisely with the region of greatest discriminative power in HVR1. The correspondence between these results suggests that HVR2 might be similarly standardized to a region between 93–310, where the greatest discriminative power of HVR2 is found. The identification of small regions of sequence that have maximal discriminative power could be quite useful in forensic and anthropological settings where severe degradation can limit the size of PCR products recoverable from sample material. Di Bernardo et al. [19] report that the longest amplifiable DNA fragments extracted from 2000-year-old remains from Pompeii are between 139 and 360 bp. Sequences of this size from the most informative regions of HVR1 and HVR2 would allow inference of coarse ethnic identity with reasonably high accuracy.

SVM as classifier
Many applications in human genetics require the discriminative classification of samples into groups, and a number of methods for this task have been proposed. Lately, machine learning approaches have been used to good effect in a number of biological scenarios including the classification of Y-haplogroups [20]. In this study we use support vector machines (SVM) to develop statistical models capable of predicting the ethnicity of mitochondrial DNA samples. We compare the performance of SVM under simulations of real-world scenarios with several other methods previously proposed for the classification of mitochondrial sequences into geographically defined groups, including QDA and LDA [3-5]. In all tests SVM provides accuracy greater than or equal to that of the other methods tested. SVM consistently provides the best accuracy in simulations of degradation from either end of the mitochondrial hypervariable regions, and when small subsections of the hypervariable regions are used. With only 218 bp of mtDNA sequence, the overall accuracy of SVM predictions exceeds 80%. The success of SVM in this classification problem suggests that it may also be the best method for related classification problems including inferring the geographic origin of DNA samples [4,5], haplogroup membership [8], drug response profiles [21], and other “race based” therapeutics [22].
When applied to independent test data our SVM classifier performs reasonably well despite significant differences between the training and test sets. In particular, the absence of a Hispanic classification in the published dataset, and the inclusion of geographic regions in the test set that are not represented in the training set (for instance Kazakhstan and Kyrgyzstan), are likely to have contributed significantly to errors in our inferences. Such errors are likely to recede as larger, more geographically balanced training sets are assembled.

Handling missing data
In the last few years several authors have pointed out the presence of sequence errors in public and forensic mtDNA databases [23-27]. Moreover, precise boundaries of HVR1 and HVR2 are not always consistent across studies and real-world samples may be severely degraded, further contributing to errors or missing data in samples to be classified. We evaluated several statistical approaches to dealing with missing data and assessed their accuracy under simulated scenarios of data dropout or loss. We found that despite a small loss of accuracy incurred by data dropout, restricting analysis to the region of intersection between the test sample and training samples provides the most reliable inference of the ethnicity of the sample. Attempts to impute any missing data based on the rCRS or a probabilistic model based on the training set resulted in prediction bias toward Caucasian due to the origin of the rCRS and the preponderance of Caucasian samples in the FBI forensic dataset. Until very large, ethnically balanced training sets are available, restricting analysis to the region of intersection between test and training samples is likely to remain the most accurate and unbiased approach to inference.

Conclusions
In this study, we compared four classification algorithms for the prediction of probable assignment of coarse ethnic identity using short DNA sequences from the hypervariable region of mtDNA. Comprehensive empirical studies showed that, regardless of sequence length, support vector classification is the most accurate classifier among those compared and approaches 90% accuracy in predicting the assignment of coarse ethnic identity. Our experiments also identified high accuracy segments in HVR, which agree well with the genetically diverse regions reported in previous work. Finally, our experiments showed that, in dealing with missing data, it is advisable to use only segments shared by reference sequences and the sequence under test.
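
The recommendation to use only shared segments can be illustrated with a small Python sketch. It rests on several assumptions that are not from the study: profiles are stored as hypothetical dictionaries holding a set of covered positions and a set of polymorphic positions, the common region is taken as the intersection of the test coverage with the coverage of every training profile, and a 1NN classifier with Hamming distance stands in for the PCA-SVM classifier actually used in the experiments.

```python
from collections import Counter


def common_region(test_covered, train_covered_sets):
    """Positions sequenced in the test profile and in every training profile."""
    region = set(test_covered)
    for covered in train_covered_sets:
        region &= covered
    return region


def classify_common_region(test, training):
    """1NN with Hamming distance, restricted to the common region.

    Stand-in for PCA-SVM: only the restriction step mirrors the text.
    """
    region = common_region(test["covered"], [s["covered"] for s in training])
    test_variants = test["variants"] & region
    # Hamming distance = number of positions where polymorphism status differs.
    dists = [(len(test_variants ^ (s["variants"] & region)), s["label"])
             for s in training]
    best = min(d for d, _ in dists)
    # Vote among all nearest reference samples.
    votes = Counter(label for d, label in dists if d == best)
    return votes.most_common(1)[0][0]


# Toy, made-up profiles; positions are small integers for readability.
training = [
    {"covered": set(range(1, 7)), "variants": {2}, "label": "X"},
    {"covered": set(range(1, 7)), "variants": {5}, "label": "Y"},
]
test = {"covered": {4, 5, 6, 7, 8}, "variants": {5, 8}}
region = common_region(test["covered"], [s["covered"] for s in training])
result = classify_common_region(test, training)
print(sorted(region), result)
```

Positions 7 and 8 of the toy test profile are ignored because no training profile covers them, so the polymorphism at position 8 neither helps nor penalizes any class.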