Structure - Distruct Tutorial

Download as pdf or txt
Download as pdf or txt
You are on page 1of 6

Structure - Bodega Phylogenetics Wiki http://bodegaphylo.wikispot.

org/structure

User name:

Bodega Phylogenetics Wiki


Password:
(new user) Login

Structure Info
Search:

Front Page Primary Contact(s) Created Required Software Example Datafile

2013 Workshop Bob Thomson 1 March 2007 Structure (command line version), Distruct marm_struct.input

Topics
Summary
Software

Recent Changes This tutorial outlines the basics of running Jonathan Pritchard's Structure and Noah Rosenberg's Distruct
software packages. It uses a small example dataset containing SNP data for Western Pond Turtles (Emys
Contacts
marmorata).
Structure
Introduction

Structure is a versatile tool that can be used in several ways. This tutorial shows how the command line version
of "Structure" can be used with the sister program "Distruct" to investigate population structure in a group of
California turtles. Although we advise anyone interested in "Structure" to learn the command line version of this
software, we provide supplementary notes about installation and use of the "Structure" graphical front end as a
foot-note.

Structure works by grouping individuals into clusters such that Hardy-Weinberg equilibrium is maximized within
clusters. It can use most commonly employed genetic markers as data, including AFLPS, SNPs, microsatellites,
sequence haplotypes, Hapstrs, SNPstrs, etc. The program works by assuming a model of K populations, each of
which are characterized by a set of allele frequencies at each locus. The program groups individuals into the K
populations in such a way as to maximize HW equilibrium within populations. By varying the K parameter across
several runs of the program and inspecting the resulting probabilities under different values, we can infer a
likely value of K which best captures the variation present in the data. This program is often used in conjunction
with a second program, Distruct, which makes it easy to visualize the results. A final note before proceeding is
that a graphical version of the program exists for windows and various flavors of Unix, we're going to use the
command line version here because its more general, easier to customize, and easy to batch.

The data set we are using consists of 6 SNP loci (4 nuclear, 2 mitochondrial) for 43 individuals of the western
pond turtle (Emys marmorata). These turtles fall out into 5 well-supported mitochondrial DNA clades, which are
not well supported on a gene-by-gene basis at the nuclear level. You'd want more loci in a real analysis, but I've
kept the dataset small here to allow for reasonable run times. The instructions ought to work for both OS X and
Windows, though the Unix commands may not work for Windows users. I've tried to include instructions for both
operating systems where it's critical.

Structure

Getting Started
Start by going into the folder containing the datafiles for structure...

cd ~/desktop/bodega_structure/datafiles/structure (or wherever you've stored your copy)

and list the contents...


ls (or dir for you non-unix folks)

You'll see three files; 2 parameter files for structure (mainparams and extraparams) and a datafile
(marm_struct.input). The settings for each run are controlled by these two parameter files, which structure
always looks for upon execution. It will also look for an input data file, the name of which you specify in the
mainparams file.

File Format
open -e marm_struct.input (or just open in wordpad in Windows)
This is the basic Structure input file format. Eight columns correspond respectively to the name of the individual,
which mtDNA clade each individual belongs to, and six SNP loci. Each SNP locus is coded using 1's and 0's (for
the alternative alleles), and -9's (for missing data). Each diploid individual takes up two rows in the data file,
which allows you to code homozygous and heterozygous positions, e.g.:

3259_Mader 1 1 1 0 0 -9 0
3259_Mader 1 1 1 0 1 -9 0

So individual 3259_Mader is in mtDNA clade 1, is homozygous for SNP loci 1,2,3, and 6; is heterozygous for
locus 4; and is missing data for locus 5. Easy!

Now on to setting up Structure 's two parameter files.

1 of 6 10/5/12 8:15 PM
Structure - Bodega Phylogenetics Wiki http://bodegaphylo.wikispot.org/structure

Main Parameter File


This file contains the main parameters that you'll need to set in order to run the program. Each parameter looks
something like:

#define MISSING -9 // (int) value given to missing genotype data

This ought to look familiar to those of you with experience programming in C, the define statement tells
Structure that the value for MISSING data is -9. If you code your missing data some other way, simply change
the number to match your coding scheme. Notice that each parameter takes a certain type of data, the options
are integer (int), Boolean(B), or String(str), which means that you can't code your missing data with '?', you can
only use integers.

On to the parameters themselves...

Most are self-explanatory, INFILE and OUTFILE correspond to your input datafile name and name of the file you
want to output to, respectively. Note that the program will overwrite the outputfile if it already exists, this is
important to remember when you are doing multiple runs at different values of K.

The next 14 parameters refer to the format of the data file, and tell the program what, precisely, we have in
there. NUMINDS and NUMLOCI refer to the numbers of individuals and loci in your input data file. For our
case,these should be set to 43 and 6 respectively. LABEL wants to know if we have labeled our individuals in the
first column of the input file. We have, so it should be set to 1. Likewise, we've labeled individuals according to
mtDNA clades so POPDATA should be set to 1. Note that we have the option of using this population data when
inferring K and clustering individuals during the structure run. Here we are interested in letting the genetic data
tell us what K is, so we won't use the population information. If you did want to do this, you would set the
USEPOPINFO parameter in the extraparams file to 1, or you could include the statement USEPOPINFO==1 in
your inputfile, this is what the POPFLAG parameter refers to, we didn't use this flag so set it to 0.

PHENOTYPE, EXTRACOLS, PHASEINFO, and MARKOVPHASE all refer to additional data that you can include in
your data file for various reasons, none of which we've done here so they should all be 0.

MISSING and PLOIDY refer to exactly what they sound like. Our missing data is coded as -9 and turtles are
diploid. The last three formatting parameters refer to more file format options and data that can be included, set
these to 0 for this exercise.

Three final parameters in this file refer to the settings for the actual program run. You need to tell structure
which value of K to assume using the MAXPOPS parameter, we want to infer the best value of K from the data
so we'll run the program at several values and go back and compare the results (more on this below). Our
mtDNA phylogeny suggests 5 clades so we'll run the program with MAXPOPS set at 1 through 5. Start with
MAXPOPS equal to 2. Finally, we need to set the BURNIN and NUMREPS to do after the burnin. We're purposely
using a simple dataset here so that we can run the program fast and complete this exercise in a reasonable
amount of time. For us, these values can be set to 1000 and 100000 respectively. For more complex (i.e.
realistic) datasets you'd want each of these to be an order of magnitude or so higher in order to make sure that
the run converges. After the run finishes we'll check to make sure that we've run the markov chain long enough.
Save and close this file.

Extra Parameter File


This file is set up similarly to the mainparams file, it contains parameters concerning what model of population
genetic structure Structure uses, output options, priors and a few other parameters. We won't go through them
all, as you'll rarely need to change most of them. If you're still curious, the program's documentation goes
through them in detail.

One parameter to be concerned with here is FREQSCORR. Setting FREQSCORR to 1 causes structure to use a
model in which the allele frequencies among closely related populations can be correlated. This can be useful in
trying to differentiate between closely related populations, and other 'hard' structure problems. It is prone to
overestimating K however so it should be used with care, particularly if you are interested in species boundaries.

Enough parameters, lets run the program already...

Running Structure
Structure automatically looks for the mainparams and extraparams files, as well as whatever input file you tell it
(in the mainparams file) when you run the program. So just make sure these three files are in the same folder
as the structure executable. If you're using OS X execute the program by typing ./structureOSX, on a PC just
type structure.exe. You'll see a lot of output on the screen, and when the run finishes it will print the results
to the file you've specified.

Before going any further, check the parameter estimates printed by the program throughout the run. You want
to make sure that the estimates of alpha, D, and ln Likelihood have converged before the end of the burnin
period is reached. It is also advisable to repeat runs several times to ensure that you get consistent answers.
The run lengths I've recommended appear to be adequate for this dataset, but you would need to worry about
this more with a real dataset.

If the parameter estimates look like they've converged, go ahead and open your output file.

open -e marm_struct.output_f

2 of 6 10/5/12 8:15 PM
Structure - Bodega Phylogenetics Wiki http://bodegaphylo.wikispot.org/structure

This file has some information about the settings for the run, a table that shows the proportion of each
predefined population assigned to each cluster, a table of likelihoods, a table of the proportion of each
individual's alleles assigned to each cluster, as well as some other information. Compare your estimate for the ln
probability of the data around before moving on, it may not be exactly the same for everyone but it ought to be
close.

Distruct

Now that we've got output from Structure let's make one of those pretty figures that everyone's putting in their
papers these days. This is done by giving some of the output from Structure to the separate program Distruct,
which uses this data to produce a figure showing the proportion of each individual's ancestry inferred to be from
each cluster, as well as how this relates to your predefined populations. You can glean the same information by
staring at the tables in the Structure output file, but thats no fun and this way is much easier.

Distruct operates much like Structure in that it looks for a certain set of files when it runs. It has a single
parameter file and needs two data files, it will also optionally take a few more files that tell it how to label
populations, which colors to use, etc.

File Format
The two datafiles are simply the individual and population tables from the Structure output file. Simply copy and
paste these two tables into their own plain text files. You only want the data itself, NOT the labels and
formatting that structure puts around it. Your population file should contain text that looks like the following:

1: 0.109 0.891 23
2: 0.075 0.925 2
3: 0.987 0.013 5
4: 0.969 0.031 10
5: 0.883 0.117 3

and your individual file should look like (but longer, this is only the top of the file):

1 0001_Merce (0) 4 : 0.970 0.030 (0.800,1.000) (0.000,0.200)


2 0002_Merce (0) 4 : 0.955 0.045 (0.694,1.000) (0.000,0.306)
3 0227_Kern (0) 4 : 0.970 0.030 (0.802,1.000) (0.000,0.198)
4 0331_San_D (0) 3 : 0.990 0.010 (0.941,1.000) (0.000,0.059)
5 0669_XX (0) 1 : 0.304 0.696 (0.000,0.814) (0.186,1.000)

Distruct Parameter File


Once you have each data file made, open Distruct 's parameter file, which is called drawparams. Input the
names of the two datafiles into the first two parameters. The next two parameters want the names of files which
contain labels for your predefined populations. Our predefined populations are labeld 1 through 5, but we can
have distruct translate these numbers to geographic locations. So just make a file containing the text:

1 North
2 San Bernardino
3 South
4 San Joaquin
5 Santa Barbara

save it, and input the filename into the INFILE_LABEL_BELOW parameter. We won't use any labels on top so
skip the INFILE_LABEL_ATOP parameter. The INFILE_CLUST_PERM parameter takes a file which specificies the
order in which to print the clusters as well as which colors to use. This file looks like

1 red
2 green
3 yellow
4 blue
5 orange

The order of numbers controls the order in which the clusters are printed, and the color labels tell the program
which color to use. Finally, give the program the name of a file to output to.

The next section of the drawparams file wants the value of K used, set this to match whichever value of
MAXPOPS you set back in the structure run, the NUMPOPS is 5 (the number of mtDNA clades), and the
NUMINDS is 43. The remaining parameters are all printing options and are self-explanatory. Most can be left at
the defaults, but make sure that you set PRINT_LABEL_ATOP to 0 and PRINT_LABEL_BELOW to 1, because this
is how we've set up our label files.

Now you should be all set to actually run the program.

Running Distruct
Again, make sure all the necessary files are in the same folder and type ./distruct_OSX or distruct.exe as
the case may be...
You should now have a postscript file named whatever you put into the drawparams file (Figure1). If you're on a
mac just open it with Preview, and it should automatically convert to a pdf and open. On a PC, you can use
whatever postscript program you already have, or download Ghostscript and Ghostview, which is a free
postscript viewing package.

3 of 6 10/5/12 8:15 PM
Structure - Bodega Phylogenetics Wiki http://bodegaphylo.wikispot.org/structure

Each vertical bar represents a single individual.


The individuals are grouped into our 5 predefined
populations. The colors represent the proportion
of each individual's loci that are drawn from each
of K = 2 clusters. For example, the leftmost
individual in the above figure is from the
northern mtDNA clade, and about a third of it's
alleles are drawn from the other cluster. With
K=2, Structure lumps the the northern mtDNA
clade and the San Bernadino and Monterey clade
into one cluster and all the other clades into
another. The San Bernadino/Monterey clade
seems odd, however this clade is only
represented by 2 individuals and there is some
evidence that the San Bernadino population is
non-natural. It's possible that the San Bernadino
individual is a relatively recent introduction from
the Monterey area. This aside, the groupings
make good biological sense.

Figure 1 - Example output from distruct with K set to 2.

Inferring K
In order to infer the `best' value of K, we now need to do a run at each value of K that we are considering. Go
back to you mainparams file change the value of K and the outputfile (so you don't overwrite your results), and
repeat the run for K=1,...,5.

There is no set way to choose the value of K which best fits the data, there are several informal guides that can
be used however. You should always be careful in considering the biological reality of the `best' value of K,
because Structure can be prone to overestimating K given even mild departures from the underlying model.
However, when you're careful it appears to work well in practice. The goal should be to choose the lowest value
of K which effectively captures the major structure present in the data. One way to do this is to compare the
probabilities of the data given K for each value of K that you tried. It is often the case that the probability will be
low at values of K less than the appropriate value, then plateau at and above the appropriate value. For
example, in one set of my runs I got

K ln P(X|K)
1 -276.2
2 -186.7
3 -155.6
4 -155
5 -158.3

as you can see, ln P(X|K) plateaus around -155 at K=3. Alternatively, you can make this comparison by
computing posterior probabilities using a uniform prior on K. I won't go through this here, however the
calculation is straightforward and is laid out in detail in the Structure documentation. In practice, one tends to
see the same pattern as for ln P(X|K), with very low P(K) at values less than the appropriate value, and then a
plateau.

4 of 6 10/5/12 8:15 PM
Structure - Bodega Phylogenetics Wiki http://bodegaphylo.wikispot.org/structure

So inspection of the probabilities suggests that K is around 3 or 4. Now lets look at the Distruct outputs and see
if this is biologically realistic. We've already seen the breakdown for K=2, K=3 and K=4 are printed below.

Figure 2 - Example output from distruct with K set to 3. Figure 3 - Example output from distruct with K set to 4.

For K=3, Structure cleanly splits the Southern CA population from San Joaquin, while the Santa Barbara
population appears to be drawn from each of these clusters. The northern population remains relatively
unchanged. So K=3 is a better fit statistically, and it appears to capture more structure in the data than K=2.
When we move to K=4, little changes except that the Northern/SB/Monterey cluster splits into two clusters with
each individual admixed from both clusters. This occurs because the Northern population consists of largely
invariant individuals. Because little variation exists, little change in Hardy-Weinberg equilibrium occurs by
splitting the individuals into more and more populations and so the program chooses this as the least costly
place to introduce an additional cluster. Clearly this doesn't make biological sense, so in this case I'd say K=3
provides the best fit to the data.

With more complex datasets, you may find that it's difficult to decide between two similar values of K. In these
cases it can be useful to pull out the ambiguous populations and analyze them separately. This reduces the
problem to a K=1 vs K=2 problem, which is generally easier to decide between than say K=15 vs. K=16.

Conclusions

These are the basic steps involved in using Structure and Distruct. It is also possible to use this tool to assign
individuals to populations as well as to do association tests (by using locality and phenotype data in the
clustering process, respectively). Keys to remember are to always be cautious in the biological interpretation of
K and carefully consider how you to set up the parameters of each run. Without careful thought, odd conclusions
are likely. Even in the simple dataset we've used here, straightforward computation of the posterior on K would
lead you to conclude that K=4 is more likely than K=3, a result that makes little biological sense. In larger more
complex datasets, this problem can become more difficult to overcome. Additionally, with large datasets it
becomes increasingly critical to check that your burnin is long enough and to perform multiple runs to make
sure that the chain has converged. As you might guess, the run time for the program increases with the size of
the data set, and with the size of K; so complex data sets can require seriously long run times.

Downloading and Installing the Structure 2.2. Graphical Front End


Pritchard et al. have developed a graphical front end for Structure, however the installation instructions are brief
and have caused some students trouble. Users who are just becoming familiar with the basics of file architecture
may benefit from the supplementary instructions below, which are particularly intended to help those running
MacOSX on an Intel processor.

1. Download the compressed installation file to your desktop and decompress this file by double-clicking it (if
you're OS hasn't done this automatically). Place this folder in MacOSXs Applications folder.
2. Open the MacOSX Terminal application and navigate to the folder you just unzipped. If you haven't changed
the default terminal preferences (which should start you off in the home directory of the default user) and have
absolutely no experience navigating around folders in the Terminal you should be able to do this by typing "cd
../../Applications/structure2.2.3_install" at the prompt.
3. To install the package, all you need to do is to use the Terminal to run the install script included in the
unzipped folder. You may find that simply typing "./install" results in the following error "mkdir: /usr/local
/Structure/: Permission denied." If this is the case, you can either change permissions on this folder, or simply
install the required information elsewhere by typing the name of the desired location after "./install".
4. Assuming everything worked, you should be able to launch the graphical front-end by typing "./structure" in
the Terminal window. Of course, your working directory (i.e., the place you are in the Terminal) must
correspond with the location of the structure executable, which should be the folder that you uncompressed

5 of 6 10/5/12 8:15 PM
Structure - Bodega Phylogenetics Wiki http://bodegaphylo.wikispot.org/structure

earlier.

Further Reading

-Pritchard JK, and W Wen. 2002. Documentation for Structure software: Version 2 ( Documentation for
Structure.)

-Rosenberg NA. 2003. Distruct: a program for the graphical display of population structure (Documentation for
Distruct.)

-Pritchard JK, M Stephens, P Donnelly. 2000. Inference of Population Structure using multilocus genotype data.
Genetics 155:945-959 (This is the original paper describing the method used by Structure, if you want the
mathematical nitty-gritty this is where its at. Its relatively accessible for such a math-y paper.)

-Falush D, M Stephens, JK Pritchard. 2003. Inference of population structure using multilocus genotype data:
linked loci and correlated allele frequencies. Genetics 164: 1567-1587( This paper describes Structure version
2.0 which allows for correlations in allele frequencies between closely related populations as well as linkage.)

-Rosenberg NA, et al. 2001. Empirical evaluation of genetic clustering methods using multilocus genotypes from
20 chicken breeds. Genetics 159: 699-713( This is one of the early applications of the method using
microsatellites for chicken breeds. Its a good example of using the method for complex datasets where K is
large.)

Comments:
Note: You must be logged in to add comments

Except where otherwise noted, this content is licensed under a Creative Commons Attribution License. See
Copyrights.

This is a Wiki Spot wiki. Wiki Spot is a 501(c)3 non-profit organization that helps communities
collaborate via wikis.

6 of 6 10/5/12 8:15 PM

You might also like