Purpose: Despite the successful progress next-generation sequencing technologies have achieved in diagnosing the genetic cause of rare Mendelian diseases, the current diagnostic rate is still far from satisfactory because of heterogeneity, imprecision, and noise in disease phenotype descriptions and insufficient utilization of expert knowledge in clinical genetics. To overcome these difficulties, we present a novel method called Xrare for the prioritization of causative gene variants in rare disease diagnosis. Methods: We propose a new phenotype similarity scoring method called Emission-Reception Information Content (ERIC), which is highly tolerant of noise and imprecision in clinical phenotypes. We utilize medical genetic domain knowledge by designing genetic features implementing American College of Medical Genetics and Genomics (ACMG) guidelines. Results: The ERIC score ranked consistently higher for disease genes than other phenotypic similarity scores in the presence of imprecise and noisy phenotypes. Extensive simulations and real clinical data demonstrated that Xrare outperforms existing alternative methods by 10-40% in various genetic diagnosis scenarios. Conclusion: The Xrare model is learned from a large database of clinical variants, and derives its strength from the tight integration of medical genetics features and phenotypic similarity scores. Xrare provides the clinical community with a robust and powerful tool for variant prioritization.
Single nucleotide polymorphisms (SNPs) are the most abundant form of genetic variation in eukaryotic genomes. SNPs may be functionally responsible for specific traits or phenotypes, or they may be informative for tracing the evolutionary history of a species or the pedigree of a variety. As genetic markers, SNPs are rapidly replacing simple sequence repeats (SSRs) because they are more abundant, stable, amenable to automation, efficient, and increasingly cost-effective. The integration of high-throughput SNP genotyping capability promises to accelerate genetic gain in a breeding program, but also imposes a series of economic, organizational and technical hurdles. To begin to address these challenges, SNP-based resources are being developed and made publicly available for broad application in rice research. These resources include large SNP datasets, tools for identifying informative SNPs for targeted applications, and a suite of custom-designed SNP assays for use in marker-assisted and genomic selection, association and QTL mapping, positional cloning, pedigree analysis, variety identification and seed purity testing. SNP resources also make it possible for breeders to more efficiently evaluate and utilize the wealth of natural variation that exists in both wild and cultivated germplasm with the aim of improving the productivity and sustainability of agriculture.
The Genome in a Bottle Consortium, hosted by the National Institute of Standards and Technology (NIST), is creating reference materials and data for human genome sequencing, as well as methods for genome comparison and benchmarking. Here, we describe a large, diverse set of sequencing data for seven human genomes; five are current or candidate NIST Reference Materials. The pilot genome, NA12878, has been released as NIST RM 8398. We also describe data from two Personal Genome Project trios, one of Ashkenazim Jewish ancestry and one of Chinese ancestry. The data come from 12 technologies: BioNano Genomics, Complete Genomics paired-end and LFR, Ion Proton exome, Oxford Nanopore, Pacific Biosciences, SOLiD, 10X Genomics GemCode™ WGS, and Illumina exome and WGS paired-end, mate-pair, and synthetic long reads. Cell lines, DNA, and data from these individuals are publicly available. Therefore, we expect these data to be useful for revealing novel information about the human genome and imp...
We apply an analysis based upon mixed models to the Genetic Analysis Workshop 15, Problem 3 simulated data. Such models are commonly used to mitigate the tendency for population structure, or cryptic relatedness, to inflate the false-positive rate of test statistics. They also allow for explicit modeling of varying degrees of relatedness in samples in which some individuals are related by (possibly unknown) pedigree, whereas others are not. Furthermore, the implementation of the method we describe here is quick enough to be used effectively on genome-wide data. We present an analysis of the data for Genetic Analysis Workshop 15, Problem 3, in which we show that these methods can effectively find signals in these data. Somewhat disappointingly, the false-positive rate does not appear to be reduced, but this is largely because the method used to simulate the data appears not to have encompassed effects, such as population stratification, that might have led to inflation of p-values.
Given the increasing size of modern genetic data sets and, in particular, the move towards genome-wide studies, there is merit in considering analyses that gain computational efficiency by being more heuristic in nature. With this in mind, we present results of cladistic analysis methods on the Genetic Analysis Workshop 15 Problem 3 simulated data (answers known). Our analysis attempts to capture similarities between individuals using a series of trees, and then looks for regions in which mutations on those trees can successfully explain a phenotype of interest. Existing varieties of such algorithms assume haplotypes are known, or have been inferred, an assumption that is often unrealistic for genome-wide data. We therefore present an extension of these methods that can successfully analyze genotype, rather than haplotype, data.
We present an overview of a research platform that provides essential germplasm, genotypic and phenotypic data and analytical tools for dissecting phenotype-genotype associations in rice. These resources include a diversity panel of 400 Oryza sativa and 100 Oryza rufipogon accessions that have been purified by single seed descent, a custom-designed Affymetrix array consisting of 44,100 SNPs, an Illumina GoldenGate assay consisting of 1,536 SNPs, and a suite of low-resolution 384-SNP assays for the Illumina BeadXpress Reader that are designed for applications in breeding, genetics and germplasm management. Our long-term goal is to empower basic research discoveries in rice by linking sequence diversity with physiological, morphological, and agronomic variation. This research platform will also help increase breeding efficiency by providing a database of diversity information that will enable researchers to identify useful DNA polymorphisms in genes and germplasm of interest and convert that information into cost-effective tools for applied plant improvement.
The domestic dog exhibits greater diversity in body size than any other terrestrial vertebrate. We used a strategy that exploits the breed structure of dogs to investigate the genetic basis of size. First, through a genome-wide scan, we identified a major quantitative trait locus (QTL) on chromosome 15 influencing size variation within a single breed. Second, we examined genetic variation in the 15-megabase interval surrounding the QTL in small and giant breeds and found marked evidence for a selective sweep spanning a single gene (IGF1), encoding insulin-like growth factor 1. A single IGF1 single-nucleotide polymorphism haplotype is common to all small breeds and nearly absent from giant breeds, suggesting that the same causal sequence variant is a major contributor to body size in all small dogs. Size variation in the domestic dog is extreme and surpasses that of all other living and extinct species in the dog family, Canidae (1,2). However, the genetic origin of this diversity is obscure. Explanations include increased recombination or mutation rates (3,4), a unique role of short repeat loci near genes (3), expansion of specific short interspersed nuclear elements (5),
Background: The domestication of Asian rice (Oryza sativa) was a complex process punctuated by episodes of introgressive hybridization among and between subpopulations. Deep genetic divergence between the two main varietal groups (Indica and Japonica) suggests domestication from at least two distinct wild populations. However, genetic uniformity surrounding key domestication genes across divergent subpopulations suggests cultural exchange of genetic material among ancient farmers. Methodology/Principal Findings: In this study, we utilize a novel 1,536 SNP panel genotyped across 395 diverse accessions of O. sativa to study genome-wide patterns of polymorphism, to characterize population structure, and to infer the introgression history of domesticated Asian rice. Our population structure analyses support the existence of five major subpopulations (indica, aus, tropical japonica, temperate japonica and GroupV) consistent with previous analyses. Our introgression analysis shows that most accessions exhibit some degree of admixture, with many individuals within a population sharing the same introgressed segment due to artificial selection. Admixture mapping and association analysis of amylose content and grain length illustrate the potential for dissecting the genetic basis of complex traits in domesticated plant populations. Conclusions/Significance: Genes in these regions control a myriad of traits including plant stature, blast resistance, and amylose content. These analyses highlight the power of population genomics in agricultural systems to identify functionally important regions of the genome and to decipher the role of human-directed breeding in refashioning the genomes of a domesticated species.
Background: Canine hip dysplasia (HD) is a common polygenic trait characterized by hip malformation that results in osteoarthritis (OA). The condition in dogs is very similar to developmental dysplasia of the human hip, which also leads to OA. Methodology/Principal Findings: A total of 721 dogs, including both an association and linkage population, were genotyped. The association population included 8 pure breeds (Labrador retriever, Greyhounds, German Shepherd, Newfoundland, Golden retriever, Rottweiler, Border Collie and Bernese Mountain Dog). The linkage population included Labrador retrievers, Greyhounds, and their crosses. Of these, 366 dogs were genotyped at ~22,000 single nucleotide polymorphism (SNP) loci and a targeted screen across 8 chromosomes with ~3,300 SNPs was performed on 551 dogs (196 dogs were common to both sets). A mixed linear model approach was used to perform an association study on this combined association and linkage population. The study identified 4 susceptibility SNPs associated with HD and 2 SNPs associated with hip OA. Conclusion/Significance: The identified SNPs included those near known genes (PTPRD, PARD3B, and COL15A1) reported to be associated with, or expressed in, OA in humans. This suggested that the canine model could provide a unique opportunity to identify genes underlying natural HD and hip OA, which are common and debilitating conditions in both dogs and humans.
Background: The genome-wide association (GWA) approach represents an alternative to biparental linkage mapping for determining the genetic basis of trait variation. Both approaches rely on recombination to rearrange the genome, and seek to establish correlations between phenotype and genotype. The major advantages of GWA lie in being able to sample a much wider range of the phenotypic and genotypic variation present, in being able to exploit multiple rounds of historical recombination in many different lineages and to include multiple accessions of direct relevance to crop improvement. Results: A 191-accession eggplant (Solanum melongena L.) association panel, comprising a mixture of breeding lines, old varieties and landrace selections originating from Asia and the Mediterranean Basin, was SNP genotyped and scored for anthocyanin pigmentation and fruit color at two locations over two years. The panel formed two major clusters, reflecting geographical provenance and fruit type. The global level of linkage disequilibrium was 3.4 cM. A mixed linear model appeared to be the most appropriate for GWA. A set of 56 SNP locus/phenotype associations was identified and the genomic regions harboring these loci were distributed over nine of the 12 eggplant chromosomes. The associations were compared with the location of known QTL for the same traits.
Molybdenum (Mo) is an essential micronutrient for plants, serving as a cofactor for enzymes involved in nitrate assimilation, sulfite detoxification, abscisic acid biosynthesis, and purine degradation. Here we show that natural variation in shoot Mo content across 92 Arabidopsis thaliana accessions is controlled by variation in a mitochondrially localized transporter (Molybdenum Transporter 1, MOT1) that belongs to the sulfate transporter superfamily. A deletion in the MOT1 promoter is strongly associated with low shoot Mo, occurring in seven of the accessions with the lowest shoot content of Mo. Consistent with the low Mo phenotype, MOT1 expression in low Mo accessions is reduced. Reciprocal grafting experiments demonstrate that the roots of Ler-0 are responsible for the low Mo accumulation in shoot, and GUS localization demonstrates that MOT1 is expressed strongly in the roots. MOT1 contains an N-terminal mitochondrial targeting sequence, and expression of GFP-tagged MOT1 in protoplasts and transgenic plants establishes the mitochondrial localization of this protein. Furthermore, expression of MOT1 specifically enhances Mo accumulation in yeast by 5-fold, consistent with MOT1 functioning as a molybdate transporter. This work provides the first molecular insight into the processes that regulate Mo accumulation in plants and shows that novel loci can be detected by association mapping.
There is currently tremendous interest in the possibility of using genome-wide association mapping to identify genes responsible for natural variation, particularly for human disease susceptibility. The model plant Arabidopsis thaliana is in many ways an ideal candidate for such studies, because it is a highly selfing hermaphrodite. As a result, the species largely exists as a collection of naturally occurring inbred lines, or accessions, which can be genotyped once and phenotyped repeatedly. Furthermore, linkage disequilibrium in such a species will be much more extensive than in a comparable outcrossing species. We tested the feasibility of genome-wide association mapping in A. thaliana by searching for associations with flowering time and pathogen resistance in a sample of 95 accessions for which genome-wide polymorphism data were available. In spite of an extremely high rate of false positives due to population structure, we were able to identify known major genes for all phenotypes tested, thus demonstrating the potential of genome-wide association mapping in A. thaliana and other species with similar patterns of variation. The rate of false positives differed strongly between traits, with more clinal traits showing the highest rate. However, the false positive rates were always substantial regardless of the trait, highlighting the necessity of an appropriate genomic control in association studies.
A potentially serious disadvantage of association mapping is the fact that marker-trait associations may arise from confounding population structure as well as from linkage to causative polymorphisms. Using genome-wide marker data, we have previously demonstrated that the problem can be severe in a global sample of 95 Arabidopsis thaliana accessions, and that established methods for controlling for population structure are generally insufficient. Here, we use the same sample together with a number of flowering-related phenotypes and data-perturbation simulations to evaluate a wider range of methods for controlling for population structure. We find that, in terms of reducing the false-positive rate while maintaining statistical power, a recently introduced mixed-model approach that takes genome-wide differences in relatedness into account via estimated pairwise kinship coefficients generally performs best. By combining the association results with results from linkage mapping in F2 crosses, we identify one previously known true positive and several promising new associations, but also demonstrate the existence of both false positives and false negatives. Our results illustrate the potential of genome-wide association scans as a tool for dissecting the genetics of natural variation, while at the same time highlighting the pitfalls. The importance of study design is clear; our study is severely under-powered both in terms of sample size and marker density. Our results also provide a striking demonstration of confounding by population structure. While statistical methods can be used to ameliorate this problem, they cannot always be effective and are certainly not a substitute for independent evidence, such as that obtained via crosses or transgenic experiments.
Ultimately, association mapping is a powerful tool for identifying a list of candidates that is short enough to permit further genetic study.
The detection of footprints of natural selection in genetic polymorphism data is fundamental to understanding the genetic basis of adaptation, and has important implications for human health. The standard approach has been to reject neutrality in favor of selection if the pattern of variation at a candidate locus was significantly different from the predictions of the standard neutral model. The problem is that the standard neutral model assumes more than just neutrality, and it is almost always possible to explain the data using an alternative neutral model with more complex demography. Today's wealth of genomic polymorphism data, however, makes it possible to dispense with models altogether by simply comparing the pattern observed at a candidate locus to the genomic pattern, and rejecting neutrality if the pattern is extreme. Here, we utilize this approach on a truly genomic scale, comparing a candidate locus to thousands of alleles throughout the Arabidopsis thaliana genome. We demonstrate that selection has acted to increase the frequency of early-flowering alleles at the vernalization requirement locus FRIGIDA. Selection seems to have occurred during the last several thousand years, possibly in response to the spread of agriculture. We introduce a novel test statistic based on haplotype sharing that embraces the problem of population structure, and so should be widely applicable.
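The model-free comparison described in the abstract above (a candidate locus judged against the genome-wide pattern) can be illustrated with a minimal sketch. It assumes some per-locus summary statistic has already been computed for the candidate and for many background loci, and that larger values are more extreme. The function name and the example numbers are hypothetical, and this is not the haplotype-sharing statistic introduced in the paper.

```python
from typing import Sequence


def empirical_tail_fraction(candidate: float, genome_wide: Sequence[float]) -> float:
    """Fraction of genome-wide values at least as large as the candidate value.

    Assumption: larger values of the (unspecified) statistic are more extreme.
    A small fraction means the candidate is an outlier relative to the genome.
    """
    if not genome_wide:
        raise ValueError("genome_wide must contain at least one value")
    # Count background loci whose statistic matches or exceeds the candidate.
    n_extreme = sum(1 for value in genome_wide if value >= candidate)
    return n_extreme / len(genome_wide)


# Hypothetical background values for ten loci and a candidate value of 9.0.
background = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
fraction = empirical_tail_fraction(9.0, background)
```

Because the reference distribution is the genome itself, no demographic model has to be specified; the cost is that the result is only as informative as the number and choice of background loci.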
We resequenced 876 short fragments in a sample of 96 individuals of Arabidopsis thaliana that included stock center accessions as well as a hierarchical sample from natural populations. Although A. thaliana is a selfing weed, the pattern of polymorphism in general agrees with what is expected for a widely distributed, sexually reproducing species. Linkage disequilibrium decays rapidly, within 50 kb. Variation is shared worldwide, although population structure and isolation by distance are evident. The data fail to fit standard neutral models in several ways. There is a genome-wide excess of rare alleles, at least partially due to selection. There is too much variation between genomic regions in the level of polymorphism. The local level of polymorphism is negatively correlated with gene density and positively correlated with segmental duplications. Because the data do not fit theoretical null distributions, attempts to infer natural selection from polymorphism data will require genome-wide surveys of polymorphism in order to identify anomalous regions. Despite this, our data support the utility of A. thaliana as a model for evolutionary functional genomics.
High-throughput phenotyping of root systems requires a combination of specialized techniques and adaptable plant growth, root imaging and software tools. A custom phenotyping platform was designed to capture images of whole root systems, and novel software tools were developed to process and analyse these images. The platform and its components are adaptable to a wide range of root phenotyping studies using diverse growth systems (hydroponics, paper pouches, gel and soil) involving several plant species, including, but not limited to, rice, maize, sorghum, tomato and Arabidopsis. The RootReader2D software tool is free and publicly available and was designed with both user-guided and automated features that increase flexibility and enhance efficiency when measuring root growth traits from specific roots or entire root systems during large-scale phenotyping studies. To demonstrate the unique capabilities and high-throughput capacity of this phenotyping platform for studying root systems, genome-wide association studies on rice (Oryza sativa) and maize (Zea mays) root growth were performed and root traits related to aluminium (Al) tolerance were analysed on the parents of the maize nested association mapping (NAM) population.
Purpose: Despite the successful progress next-generation sequencing technologies has achieved in ... more Purpose: Despite the successful progress next-generation sequencing technologies has achieved in diagnosing the genetic cause of rare Mendelian diseases, the current diagnostic rate is still far from satisfactory because of heterogeneity, imprecision, and noise in disease phenotype descriptions and insufficient utilization of expert knowledge in clinical genetics. To overcome these difficulties, we present a novel method called Xrare for the prioritization of causative gene variants in rare disease diagnosis. Methods: We propose a new phenotype similarity scoring method called Emission-Reception Information Content (ERIC), which is highly tolerant of noise and imprecision in clinical phenotypes. We utilize medical genetic domain knowledge by designing genetic features implementing American College of Medical Genetics and Genomics (ACMG) guidelines. Results: ERIC score ranked consistently higher for disease genes than other phenotypic similarity scores in the presence of imprecise and noisy phenotypes. Extensive simulations and real clinical data demonstrated that Xrare outperforms existing alternative methods by 10-40% at various genetic diagnosis scenarios. Conclusion: The Xrare model is learned from a large database of clinical variants, and derives its strength from the tight integration of medical genetics features and phenotypic features similarity scores. Xrare provides the clinical community with a robust and powerful tool for variant prioritization.
Single nucleotide polymorphisms (SNPs) are the most abundant form of genetic variation in eukaryo... more Single nucleotide polymorphisms (SNPs) are the most abundant form of genetic variation in eukaryotic genomes. SNPs may be functionally responsible for specific traits or phenotypes, or they may be informative for tracing the evolutionary history of a species or the pedigree of a variety. As genetic markers, SNPs are rapidly replacing simple sequence repeats (SSRs) because they are more abundant, stable, amenable to automation, efficient, and increasingly cost-effective. The integration of high throughput SNP genotyping capability promises to accelerate genetic gain in a breeding program, but also imposes a series of economic, organizational and technical hurdles. To begin to address these challenges, SNP-based resources are being developed and made publicly available for broad application in rice research. These resources include large SNP datasets, tools for identifying informative SNPs for targeted applications, and a suite of custom-designed SNP assays for use in marker-assisted and genomic selection, association and QTL mapping, positional cloning, pedigree analysis, variety identification and seed purity testing. SNP resources also make it possible for breeders to more efficiently evaluate and utilize the wealth of natural variation that exists in both wild and cultivated germplasm with the aim of improving the productivity and sustainability of agriculture.
The Genome in a Bottle Consortium, hosted by the National Institute of Standards and Technology (... more The Genome in a Bottle Consortium, hosted by the National Institute of Standards and Technology (NIST) is creating reference materials and data for human genome sequencing, as well as methods for genome comparison and benchmarking. Here, we describe a large, diverse set of sequencing data for seven human genomes; five are current or candidate NIST Reference Materials. The pilot genome, NA12878, has been released as NIST RM 8398. We also describe data from two Personal Genome Project trios, one of Ashkenazim Jewish ancestry and one of Chinese ancestry. The data come from 12 technologies: BioNano Genomics, Complete Genomics paired-end and LFR, Ion Proton exome, Oxford Nanopore, Pacific Biosciences, SOLiD, 10X Genomics GemCodeTM WGS, and Illumina exome and WGS paired-end, mate-pair, and synthetic long reads. Cell lines, DNA, and data from these individuals are publicly available. Therefore, we expect these data to be useful for revealing novel information about the human genome and imp...
We apply an analysis based upon mixed-models to the Genetic Analysis Workshop 15, Problem 3 simul... more We apply an analysis based upon mixed-models to the Genetic Analysis Workshop 15, Problem 3 simulated data. Such models are commonly used to mitigate the tendency for population structure, or cryptic relatedness, to inflate the false-positive rate of test statistics. They also allow for explicit modeling of varying degrees of relatedness in samples in which some individuals are related by (possibly unknown) pedigree, whereas others are not. Furthermore, the implementation of the method we describe here is quick enough to be used effectively on genome-wide data. We present an analysis of the data for Genetic Analysis Workshop 15, Problem 3, in which we show that these methods can effectively find signals in this data. Somewhat disappointingly, the false-positive rate does not appear to be reduced, but this is largely because the method used to simulate the data appears not to have encompassed effects, such as population stratification, that might have led to inflation of p-values.
Given the increasing size of modern genetic data sets and, in particular, the move towards genome... more Given the increasing size of modern genetic data sets and, in particular, the move towards genome-wide studies, there is merit in considering analyses that gain computational efficiency by being more heuristic in nature. With this in mind, we present results of cladistic analyses methods on the Genetic Analysis Workshop 15 Problem 3 simulated data (answers known). Our analysis attempts to capture similarities between individuals using a series of trees, and then looks for regions in which mutations on those trees can successfully explain a phenotype of interest. Existing varieties of such algorithms assume haplotypes are known, or have been inferred, an assumption that is often unrealistic for genome-wide data. We therefore present an extension of these methods that can successfully analyze genotype, rather than haplotype, data.
We present an overview of a research platform that provides essential germplasm, genotypic and ph... more We present an overview of a research platform that provides essential germplasm, genotypic and phenotypic data and analytical tools for dissecting phenotype-genotype associations in rice. These resources include a diversity panel of 400 Oryza sativa and 100 Oryza rufipogon accessions that have been purified by single seed descent, a customdesigned Affymetrix array consisting of 44,100 SNPs, an Illumina GoldenGate assay consisting of 1,536 SNPs, and a suite of low-resolution 384-SNP assays for the Illumina BeadXpress Reader that are designed for applications in breeding, genetics and germplasm management. Our longterm goal is to empower basic research discoveries in rice by linking sequence diversity with physiological, morphological, and agronomic variation. This research platform will also help increase breeding efficiency by providing a database of diversity information that will enable researchers to identify useful DNA polymorphisms in genes and germplasm of interest and convert that information into cost-effective tools for applied plant improvement.
The domestic dog exhibits greater diversity in body size than any other terrestrial vertebrate. W... more The domestic dog exhibits greater diversity in body size than any other terrestrial vertebrate. We used a strategy that exploits the breed structure of dogs to investigate the genetic basis of size. First, through a genome-wide scan, we identified a major quantitative trait locus (QTL) on chromosome 15 influencing size variation within a single breed. Second, we examined genetic variation in the 15megabase interval surrounding the QTL in small and giant breeds and found marked evidence for a selective sweep spanning a single gene (IGF1), encoding insulin-like growth factor 1. A single IGF1 single-nucleotide polymorphism haplotype is common to all small breeds and nearly absent from giant breeds, suggesting that the same causal sequence variant is a major contributor to body size in all small dogs. Size variation in the domestic dog is extreme and surpasses that of all other living and extinct species in the dog family, Canidae (1,2). However, the genetic origin of this diversity is obscure. Explanations include increased recombination or mutation rates (3,4), a unique role of short repeat loci near genes (3), expansion of specific short interspersed nuclear elements (5),
Background: The domestication of Asian rice (Oryza sativa) was a complex process punctuated by ep... more Background: The domestication of Asian rice (Oryza sativa) was a complex process punctuated by episodes of introgressive hybridization among and between subpopulations. Deep genetic divergence between the two main varietal groups (Indica and Japonica) suggests domestication from at least two distinct wild populations. However, genetic uniformity surrounding key domestication genes across divergent subpopulations suggests cultural exchange of genetic material among ancient farmers. Methodology/Principal Findings: In this study, we utilize a novel 1,536 SNP panel genotyped across 395 diverse accessions of O. sativa to study genome-wide patterns of polymorphism, to characterize population structure, and to infer the introgression history of domesticated Asian rice. Our population structure analyses support the existence of five major subpopulations (indica, aus, tropical japonica, temperate japonica and GroupV) consistent with previous analyses. Our introgression analysis shows that most accessions exhibit some degree of admixture, with many individuals within a population sharing the same introgressed segment due to artificial selection. Admixture mapping and association analysis of amylose content and grain length illustrate the potential for dissecting the genetic basis of complex traits in domesticated plant populations. Conclusions/Significance: Genes in these regions control a myriad of traits including plant stature, blast resistance, and amylose content. These analyses highlight the power of population genomics in agricultural systems to identify functionally important regions of the genome and to decipher the role of human-directed breeding in refashioning the genomes of a domesticated species.
Background: Canine hip dysplasia (HD) is a common polygenic trait characterized by hip malformation that results in osteoarthritis (OA). The condition in dogs is very similar to developmental dysplasia of the human hip which also leads to OA. Methodology/Principal Findings: A total of 721 dogs, including both an association and linkage population, were genotyped. The association population included 8 pure breeds (Labrador retriever, Greyhounds, German Shepherd, Newfoundland, Golden retriever, Rottweiler, Border Collie and Bernese Mountain Dog). The linkage population included Labrador retrievers, Greyhounds, and their crosses. Of these, 366 dogs were genotyped at ~22,000 single nucleotide polymorphism (SNP) loci and a targeted screen across 8 chromosomes with ~3,300 SNPs was performed on 551 dogs (196 dogs were common to both sets). A mixed linear model approach was used to perform an association study on this combined association and linkage population. The study identified 4 susceptibility SNPs associated with HD and 2 SNPs associated with hip OA. Conclusion/Significance: The identified SNPs included those near known genes (PTPRD, PARD3B, and COL15A1) reported to be associated with, or expressed in, OA in humans. This suggested that the canine model could provide a unique opportunity to identify genes underlying natural HD and hip OA, which are common and debilitating conditions in both dogs and humans.
Background: The genome-wide association (GWA) approach represents an alternative to biparental linkage mapping for determining the genetic basis of trait variation. Both approaches rely on recombination to rearrange the genome, and seek to establish correlations between phenotype and genotype. The major advantages of GWA lie in being able to sample a much wider range of the phenotypic and genotypic variation present, in being able to exploit multiple rounds of historical recombination in many different lineages and to include multiple accessions of direct relevance to crop improvement. Results: A 191-accession eggplant (Solanum melongena L.) association panel, comprising a mixture of breeding lines, old varieties and landrace selections originating from Asia and the Mediterranean Basin, was SNP genotyped and scored for anthocyanin pigmentation and fruit color at two locations over two years. The panel formed two major clusters, reflecting geographical provenance and fruit type. The global level of linkage disequilibrium was 3.4 cM. A mixed linear model appeared to be the most appropriate for GWA. A set of 56 SNP locus/phenotype associations was identified and the genomic regions harboring these loci were distributed over nine of the 12 eggplant chromosomes. The associations were compared with the location of known QTL for the same traits.
Molybdenum (Mo) is an essential micronutrient for plants, serving as a cofactor for enzymes involved in nitrate assimilation, sulfite detoxification, abscisic acid biosynthesis, and purine degradation. Here we show that natural variation in shoot Mo content across 92 Arabidopsis thaliana accessions is controlled by variation in a mitochondrially localized transporter (Molybdenum Transporter 1, MOT1) that belongs to the sulfate transporter superfamily. A deletion in the MOT1 promoter is strongly associated with low shoot Mo, occurring in seven of the accessions with the lowest shoot content of Mo. Consistent with the low Mo phenotype, MOT1 expression in low Mo accessions is reduced. Reciprocal grafting experiments demonstrate that the roots of Ler-0 are responsible for the low Mo accumulation in shoot, and GUS localization demonstrates that MOT1 is expressed strongly in the roots. MOT1 contains an N-terminal mitochondrial targeting sequence, and expression of GFP-tagged MOT1 in protoplasts and transgenic plants establishes the mitochondrial localization of this protein. Furthermore, expression of MOT1 specifically enhances Mo accumulation in yeast by 5-fold, consistent with MOT1 functioning as a molybdate transporter. This work provides the first molecular insight into the processes that regulate Mo accumulation in plants and shows that novel loci can be detected by association mapping.
There is currently tremendous interest in the possibility of using genome-wide association mapping to identify genes responsible for natural variation, particularly for human disease susceptibility. The model plant Arabidopsis thaliana is in many ways an ideal candidate for such studies, because it is a highly selfing hermaphrodite. As a result, the species largely exists as a collection of naturally occurring inbred lines, or accessions, which can be genotyped once and phenotyped repeatedly. Furthermore, linkage disequilibrium in such a species will be much more extensive than in a comparable outcrossing species. We tested the feasibility of genome-wide association mapping in A. thaliana by searching for associations with flowering time and pathogen resistance in a sample of 95 accessions for which genome-wide polymorphism data were available. In spite of an extremely high rate of false positives due to population structure, we were able to identify known major genes for all phenotypes tested, thus demonstrating the potential of genome-wide association mapping in A. thaliana and other species with similar patterns of variation. The rate of false positives differed strongly between traits, with more clinal traits showing the highest rate. However, the false positive rates were always substantial regardless of the trait, highlighting the necessity of an appropriate genomic control in association studies.
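A minimal sketch of one common way to quantify and correct the kind of test-statistic inflation that population structure produces: the genomic-control inflation factor, computed as the median of the observed 1-df chi-square statistics divided by the median expected under the null (about 0.4549). This illustrates the general technique only and is not necessarily the procedure used in the study; the function names and example values are hypothetical.

```python
from statistics import median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def inflation_factor(chi2_stats):
    """Genomic-control lambda: observed median statistic / expected null median."""
    if not chi2_stats:
        raise ValueError("need at least one test statistic")
    return median(chi2_stats) / CHI2_1DF_MEDIAN


def genomic_control_adjust(chi2_stats):
    """Deflate every statistic by lambda (lambda is floored at 1, so nothing is inflated)."""
    lam = max(1.0, inflation_factor(chi2_stats))
    return [s / lam for s in chi2_stats]


# Hypothetical association statistics from a structured sample (illustrative only).
stats = [0.2, 0.9, 1.4, 2.5, 6.0]
lam = inflation_factor(stats)
adjusted = genomic_control_adjust(stats)
```

A value of `lam` well above 1 suggests that many associations are inflated by confounding; after adjustment, the median statistic matches the null expectation.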
A potentially serious disadvantage of association mapping is the fact that marker-trait associations may arise from confounding population structure as well as from linkage to causative polymorphisms. Using genome-wide marker data, we have previously demonstrated that the problem can be severe in a global sample of 95 Arabidopsis thaliana accessions, and that established methods for controlling for population structure are generally insufficient. Here, we use the same sample together with a number of flowering-related phenotypes and data-perturbation simulations to evaluate a wider range of methods for controlling for population structure. We find that, in terms of reducing the false-positive rate while maintaining statistical power, a recently introduced mixed-model approach that takes genome-wide differences in relatedness into account via estimated pairwise kinship coefficients generally performs best. By combining the association results with results from linkage mapping in F2 crosses, we identify one previously known true positive and several promising new associations, but also demonstrate the existence of both false positives and false negatives. Our results illustrate the potential of genome-wide association scans as a tool for dissecting the genetics of natural variation, while at the same time highlighting the pitfalls. The importance of study design is clear; our study is severely under-powered both in terms of sample size and marker density. Our results also provide a striking demonstration of confounding by population structure. While statistical methods can be used to ameliorate this problem, they cannot always be effective and are certainly not a substitute for independent evidence, such as that obtained via crosses or transgenic experiments.
Ultimately, association mapping is a powerful tool for identifying a list of candidates that is short enough to permit further genetic study.
The detection of footprints of natural selection in genetic polymorphism data is fundamental to understanding the genetic basis of adaptation, and has important implications for human health. The standard approach has been to reject neutrality in favor of selection if the pattern of variation at a candidate locus was significantly different from the predictions of the standard neutral model. The problem is that the standard neutral model assumes more than just neutrality, and it is almost always possible to explain the data using an alternative neutral model with more complex demography. Today's wealth of genomic polymorphism data, however, makes it possible to dispense with models altogether by simply comparing the pattern observed at a candidate locus to the genomic pattern, and rejecting neutrality if the pattern is extreme. Here, we utilize this approach on a truly genomic scale, comparing a candidate locus to thousands of alleles throughout the Arabidopsis thaliana genome. We demonstrate that selection has acted to increase the frequency of early-flowering alleles at the vernalization requirement locus FRIGIDA. Selection seems to have occurred during the last several thousand years, possibly in response to the spread of agriculture. We introduce a novel test statistic based on haplotype sharing that embraces the problem of population structure, and so should be widely applicable.
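The model-free comparison described above can be sketched as an empirical p-value: the candidate locus is ranked against the same statistic computed at many other loci across the genome. This is only an illustration of the general idea; it does not implement the haplotype-sharing statistic introduced in the paper, and the function name and example values are hypothetical.

```python
def empirical_p_value(candidate_stat, genome_stats):
    """Share of loci whose statistic is at least as extreme as the candidate's.

    Adding 1 to numerator and denominator counts the candidate itself,
    so the result is never exactly zero.
    """
    n_extreme = sum(1 for s in genome_stats if s >= candidate_stat)
    return (n_extreme + 1) / (len(genome_stats) + 1)


# Hypothetical statistic values at nine background loci (illustrative only).
genome_stats = [0.1, 0.3, 0.2, 0.5, 0.4, 0.9, 0.6, 0.7, 0.8]

# A candidate more extreme than every background locus gets the smallest
# p-value this background size allows: 1 / (9 + 1).
p = empirical_p_value(0.95, genome_stats)
```

Because the reference distribution is the genome itself, the smallest attainable p-value depends on the number of background loci, which is why a comparison against thousands of loci is more informative than one against a handful.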
We resequenced 876 short fragments in a sample of 96 individuals of Arabidopsis thaliana that included stock center accessions as well as a hierarchical sample from natural populations. Although A. thaliana is a selfing weed, the pattern of polymorphism in general agrees with what is expected for a widely distributed, sexually reproducing species. Linkage disequilibrium decays rapidly, within 50 kb. Variation is shared worldwide, although population structure and isolation by distance are evident. The data fail to fit standard neutral models in several ways. There is a genome-wide excess of rare alleles, at least partially due to selection. There is too much variation between genomic regions in the level of polymorphism. The local level of polymorphism is negatively correlated with gene density and positively correlated with segmental duplications. Because the data do not fit theoretical null distributions, attempts to infer natural selection from polymorphism data will require genome-wide surveys of polymorphism in order to identify anomalous regions. Despite this, our data support the utility of A. thaliana as a model for evolutionary functional genomics.
High-throughput phenotyping of root systems requires a combination of specialized techniques and adaptable plant growth, root imaging and software tools. A custom phenotyping platform was designed to capture images of whole root systems, and novel software tools were developed to process and analyse these images. The platform and its components are adaptable to a wide range of root phenotyping studies using diverse growth systems (hydroponics, paper pouches, gel and soil) involving several plant species, including, but not limited to, rice, maize, sorghum, tomato and Arabidopsis. The RootReader2D software tool is free and publicly available and was designed with both user-guided and automated features that increase flexibility and enhance efficiency when measuring root growth traits from specific roots or entire root systems during large-scale phenotyping studies. To demonstrate the unique capabilities and high-throughput capacity of this phenotyping platform for studying root systems, genome-wide association studies on rice (Oryza sativa) and maize (Zea mays) root growth were performed and root traits related to aluminium (Al) tolerance were analysed on the parents of the maize nested association mapping (NAM) population.
Papers by Keyan Zhao