Fstat 294
Jérôme Goudet
Department of Ecology & Evolution,
BB, Lausanne University,
CH-1015 Dorigny
Switzerland.
E-mail: jerome.goudet@ie-zea.unil.ch
http://www.unil.ch/izea/softwares/fstat.html
1
This software could not have been developed without the financial support of the
Swiss National Science Foundation, Grants N◦ 31-43443.95, 31-55945.98 and 31- .2000. I
am indebted to the many users who sent me comments, bugs or cheerful messages that
helped improve Fstat: François Balloux, Lars Berg, Michel Chapuisat, Thierry De Meeüs,
Loïc Degen, Greg Douhan, Guillaume Evanno, Arnaud Estoup, Laurent Excoffier, Pierre
Fontanillas, Luca Fumagalli, Barbara Giles, Chris Gliddon, Alexandre Hirzel, Michael
Krawczak, Martin Lascoux, Lance Barrett-Lennard, Margaret Mackinnon, Patrick Meirmans,
Eric Petit, Alan Raybould, Michel Raymond, Max Reuter, François Rousset, Sandrine
Trouvé and many others whose names slipped my mind.
Contents
1 Introduction
  1.1 What the program FSTAT does
    1.1.1 What's new in version 2.9.4?
  1.2 What the program FSTAT does NOT do
3 File menu
  3.1 Open submenu
  3.2 Save Default Options subMenu
  3.3 Exit subMenu
6 Global statistics
  6.1 Nei's estimators of F-statistics
  6.2 Weir & Cockerham estimators of F-statistics
    6.2.1 Jackknife variance and Bootstrap confidence interval
  6.3 RST
  6.4 FST per pair
7 Testing
  7.1 Introduction
  7.2 Global tests
    7.2.1 Hardy Weinberg Within samples
    7.2.2 Hardy Weinberg Overall samples
    7.2.3 Population Differentiation NOT assuming Hardy Weinberg within
    7.2.4 Population Differentiation assuming Hardy Weinberg within
  7.3 Tests per sample or pair of samples
8 Composite disequilibrium
  8.1 Estimating composite disequilibrium
  8.2 Testing for genotypic disequilibrium
9 Run button
11 Biased dispersal
  11.1 Introduction
  11.2 Design of the tests
    11.2.1 Assignment index AI
    11.2.2 FST
    11.2.3 FIS
    11.2.4 HO
    11.2.5 HS
    11.2.6 Testing
  11.3 Running the program
    11.3.1 Input file format
    11.3.2 Running the program
    11.3.3 Results
12 Mantelize it!
  12.1 Matrices to Columns
  12.2 Multiple regression and partial mantel
13 Utilities menus
  13.1 Reset seeds Random number generator
  13.2 File conversion
14 Files of results
Chapter 1
Introduction
Hardware requirements
A PC running Windows 95 or above. It might work with Windows 3.1 and Win32s,
but I have not tried it... I guess a Pentium processor is a necessity if
randomisations are to be carried out. You should also be able to run it on a (recent
and powerful) Macintosh with PC emulation software.
Cite as
Goudet, J. 2003. Fstat (ver. 2.9.4), a program to estimate and test population
genetics parameters. Available from http://www.unil.ch/izea/softwares/fstat.html.
Updated from Goudet [1995].
• Fstat is not suited for dominant data, obtained from e.g. RAPDs or AFLPs.
If you have such data, please have a look at Hickory (Holsinger et al. [2002])
[http://darwin.eeb.uconn.edu/hickory/hickory.html].
• Fstat is not suited for sequence data. One suitable program for this
is Arlequin (Schneider et al. [2000]) [http://lgb.unige.ch/arlequin/].
• Fstat will not assign individuals to populations. A very good program for
this is Structure (Pritchard et al. [2000]) [http://pritch.bsd.uchicago.edu/].
• Check Genepop (Raymond and Rousset [1995b]) [ftp://ftp.cefe.cnrs-mop.fr/genepop/]
for testing for isolation by distance among individuals.
• The following web sites contain links to programs of interest for population
genetics analysis:
– Department of Biological Sciences at Louisiana State University:
http://www.biology.lsu.edu/general/software.html
– Joe Felsenstein's list at
http://evolution.genetics.washington.edu/phylip/software.html
Chapter 2
Running the program
The program is run by double clicking on the icon (a red butterfly) named
Fstat294.exe.
The first time you run the program, you will be prompted for 2 numbers, which
are the seeds of the Random Number Generator. Any 2 numbers will do. The
chosen generator is described in L'Ecuyer [1988]. It combines 2 of the best
Multiplicative Linear Congruential Generators known, and has been thoroughly checked
(Goudet [1993]). In future runs, the seeds will be loaded from the file Fstat.INI,
created when you exit the program and updated after each run. If, for one reason
or another, you want to use a set of specific seeds, just edit the file Fstat.INI or
use the Utility, Reset seeds random number generator menu.
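To illustrate why two seeds are requested, here is a minimal Python sketch of a combined generator in the style of L'Ecuyer [1988]. The multipliers and moduli are those usually quoted for that paper; whether Fstat uses exactly these constants, and the way arbitrary seeds are folded into the valid range, are assumptions made for this example only.

```python
class CombinedMLCG:
    """Sketch of two multiplicative congruential generators combined into one.

    Constants are the ones usually quoted for L'Ecuyer (1988); their use in
    Fstat itself is an assumption.
    """

    M1, A1 = 2147483563, 40014
    M2, A2 = 2147483399, 40692

    def __init__(self, seed1, seed2):
        # Each state must lie in [1, M - 1]; fold any integer into that range
        # (this folding is a choice made for the sketch).
        self.s1 = seed1 % (self.M1 - 1) + 1
        self.s2 = seed2 % (self.M2 - 1) + 1

    def random(self):
        """Return a pseudo-random float strictly between 0 and 1."""
        self.s1 = (self.A1 * self.s1) % self.M1
        self.s2 = (self.A2 * self.s2) % self.M2
        z = (self.s1 - self.s2) % (self.M1 - 1)
        if z == 0:
            z = self.M1 - 1
        return z / self.M1


generator = CombinedMLCG(12345, 67890)
draws = [generator.random() for _ in range(5)]
```

Re-creating the generator with the same two seeds reproduces the same sequence, which is what makes a run repeatable when specific seeds are written to Fstat.INI.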
there are fewer than nu alleles (e.g. if only alleles 641 and 657 are present, nu = 657).
An example of 2 valid input files is given in Appendix A.
If you have a file in the Genepop format, Fstat can now translate this format
into its own. See the File Conversion menu under Utilities.
It might be difficult memory-wise to run a data set with all these parameters at
the maximum, but I guess it would also mean quite some time reading gels!
If you need more capacity, please contact me: jerome.goudet@ie-zea.unil.ch.
Chapter 3
File menu
1 by clicking on the desired files while keeping the Ctrl key pressed
Chapter 4
Choose options menu
This menu is greyed out as long as you have not chosen a data file (menu File, Open).
Under this menu, you select what you want Fstat to do. Some of the
options are selected by default (those with a tick mark). To select an unselected
option, or to deselect one, just click on it. The tick mark will then appear or
disappear.
If you choose the default filename, it will be the original name, followed by -P
(for populations), followed by the numbers identifying the selected samples.
The chosen filename will be used for the input data as well as for all the output
files, with the appropriate extensions.
The samples in the new file will appear in the order they had in the original
.dat file. That is, if you select (in this order) samples 5, 6, 2 & 1, the new .dat file
will contain (in this order) samples 1, 2, 5 & 6. If a label file was associated with
the original .dat file, a new label file will be created, with labels corresponding to
the sampled populations (and in the right order; see footnote 1).
1 If you would like samples to be identified by a label, you have to first select the file containing
the labels, and only then pick the samples. Proceeding the other way around (unless you have
already created a label file with the subset of samples you are interested in) will not correctly
associate samples with their respective labels.
Chapter 5
Per locus and sample statistics
\[
A_iA_i = \frac{n\,p_i\,(2np_i - 1)}{2n - 1},
\qquad
A_iA_j = \frac{2n\,p_i\,(2np_j)}{2n - 1}
\]
where Ni is the number of alleles of type i among the 2N genes. Note that each
term under the sum corresponds to the probability of sampling allele i at least once
in a sample of size 2n. If allele i is so common that we are certain to sample it
(when 2n > 2N − Ni), the ratio is undefined but the probability of sampling the
allele is set to 1.
For Rt, the same sub-sample size n is kept, but N is now the total number of
individuals genotyped at the locus under consideration over all samples. Rt is
reported in the last column.
Differences in allelic richness among groups of populations can be tested, see
tab-sheet Comp. Among groups of samples.
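The description above can be sketched in Python as follows. This assumes the usual rarefaction form, in which the probability of missing allele i in a sub-sample of 2n genes is C(2N − Ni, 2n)/C(2N, 2n); the function name and its arguments are illustrative and are not part of Fstat.

```python
from math import comb


def allelic_richness(allele_counts, n):
    """Expected number of alleles in a sub-sample of 2n genes.

    allele_counts: the counts N_i of each allele among the 2N genes sampled.
    n: sub-sample size, in diploid individuals.
    """
    total_genes = sum(allele_counts)      # 2N
    subsample = 2 * n                     # 2n
    if subsample > total_genes:
        raise ValueError("sub-sample larger than the sample")
    richness = 0.0
    for n_i in allele_counts:
        # comb() returns 0 when 2n > 2N - N_i, so an allele that cannot be
        # missed contributes a probability of exactly 1.
        p_absent = comb(total_genes - n_i, subsample) / comb(total_genes, subsample)
        richness += 1.0 - p_absent
    return richness


# 10 individuals (20 genes), one allele seen 19 times and one seen once,
# rarefied to 5 individuals (10 genes).
print(allelic_richness([19, 1], 5))
```

In this example the common allele is certain to be sampled, while the rare one has one chance in two of being included.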
5.6 Fis
Estimates FIS for each locus and sample as $F_{IS}^k = 1 - H_O^k / H_S^k$.
Here, the generic name FIS is given, rather than GIS or f (see below). This is
because the statistic is estimated for each sample, in which case the difference
between GIS and f (due to the different weightings of varying sample size, see
below) vanishes.
Chapter 6
Global statistics
F-statistics are a set of tools devised by Wright [1921, 1969] to partition
heterozygote deficiency into a within and an among population component. They are
widely used by population biologists to assess levels of structuring in samples of
natural populations. FIS measures the heterozygote deficit within populations, FST
among populations (a measure of the Wahlund effect), and FIT the global deficit of
heterozygotes.
Estimation of F-statistics has been debated in the literature for the past 30 years,
since the early work of Cockerham [1969, 1973] and Nei [1973, 1975]. Advantages
and problems of either estimator have been discussed at length (Slatkin and Barton
[1989], Chakraborty and Danker-Hopfe [1991], Cockerham and Weir [1993]). A good
review of these and other estimators can be found in Excoffier [2001].
Fstat calculates Nei and Weir & Cockerham estimators of gene diversities and
differentiation.
• $H_T' = H_S + D_{ST}'$
• $F_{ST} = D_{ST} / H_T$ (see footnote 2)
• $F_{ST}' = D_{ST}' / H_T'$
• $F_{IS} = 1 - H_O / H_S$
Here, the $p_{ki}$ are unweighted by sample size. These statistics are estimated for
each locus, and an overall estimate across loci is also given, as the unweighted
average of the per locus estimates. In this way, monomorphic loci are accounted for
(with an estimated value of 0) in the overall estimates.
Note that the equations used here all rely on genotypic rather than allelic numbers
and are corrected for heterozygosity (see Nei and Chesser [1983]). A thorough
description of many estimators of gene diversity and differentiation is available in
the excellent review of Laurent Excoffier (Excoffier [2001]).
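As a small worked illustration of how Nei's ratios combine, the sketch below derives them from already estimated HO, HS and HT. It assumes Nei's usual definitions DST = HT − HS and D'ST = np/(np − 1) × DST, where np is the number of samples; the function and key names are made up for this example, and returning 0 for a monomorphic locus is one reading of the convention described for Nei's family.

```python
def nei_derived(h_o, h_s, h_t, n_samples):
    """Derive Nei's differentiation statistics from HO, HS and HT."""
    d_st = h_t - h_s                                   # assumed: D_ST = H_T - H_S
    d_st_prime = n_samples / (n_samples - 1) * d_st    # assumed: correction for np
    h_t_prime = h_s + d_st_prime

    def ratio(numerator, denominator):
        # Monomorphic locus: report 0 rather than an undefined value.
        return numerator / denominator if denominator > 0 else 0.0

    return {
        "ht_prime": h_t_prime,
        "fst": ratio(d_st, h_t),
        "fst_prime": ratio(d_st_prime, h_t_prime),
        "fis": 1.0 - h_o / h_s if h_s > 0 else 0.0,
    }


print(nei_derived(h_o=0.4, h_s=0.5, h_t=0.6, n_samples=3))
```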
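\[
\sigma_a^2 = \frac{1}{n_c}\left[\frac{\sum_k n_k (p_k - \bar{p})^2}{np - 1}
 - \frac{\sum_k n_k\,\bar{p}(1-\bar{p}) - \sum_k n_k (p_k - \bar{p})^2 - \frac{1}{4}\sum_k n_k H_{Ok}}{\sum_k n_k - np}\right],
\qquad
n_c = \frac{1}{np - 1}\left(\sum_k n_k - \frac{\sum_k n_k^2}{\sum_k n_k}\right)
\]
where $\bar{p}_i = \sum_k n_k p_{ki} / \sum_k n_k$, the between individual within sample component
\[
\sigma_b^2 = \frac{\sum_k n_k\,\bar{p}(1-\bar{p}) - \sum_k n_k (p_k - \bar{p})^2 - \frac{1}{4}\sum_k n_k H_{Ok}}{\sum_k n_k - np}
 - \frac{\sum_k n_k H_{Ok}}{4 \sum_k n_k}
\]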
For more details about these different quantities, refer to the original paper by
Weir and Cockerham [1984] as well as Weir [1996].
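For readers who prefer code to formulas, here is a hedged Python sketch of the three variance components for a single allele, written from the expressions of Weir and Cockerham [1984] as I understand them. The within individual component is taken as half the weighted mean heterozygosity for the allele, which is an assumption since that definition is not reproduced here; the function name and the input layout are illustrative only.

```python
def wc_components(sizes, freqs, hets):
    """Variance components for one allele (sketch after Weir & Cockerham 1984).

    sizes: number of individuals n_k in each sample
    freqs: frequency p_k of the allele in each sample
    hets:  proportion of individuals heterozygous for the allele in each sample
    Returns (sigma_a2, sigma_b2, sigma_w2).
    """
    n_samples = len(sizes)
    n_total = sum(sizes)
    p_bar = sum(n * p for n, p in zip(sizes, freqs)) / n_total
    n_c = (n_total - sum(n * n for n in sizes) / n_total) / (n_samples - 1)
    ss_among = sum(n * (p - p_bar) ** 2 for n, p in zip(sizes, freqs))
    sum_het = sum(n * h for n, h in zip(sizes, hets))

    within_samples = (n_total * p_bar * (1 - p_bar) - ss_among - sum_het / 4) / (
        n_total - n_samples
    )
    sigma_w2 = sum_het / (2 * n_total)             # assumed definition
    sigma_b2 = within_samples - sum_het / (4 * n_total)
    sigma_a2 = (ss_among / (n_samples - 1) - within_samples) / n_c
    return sigma_a2, sigma_b2, sigma_w2


a, b, w = wc_components([10, 10], [0.2, 0.8], [0.32, 0.32])
theta = a / (a + b + w)        # FST
f = b / (b + w)                # FIS
```

For several alleles or loci, the components would be summed before taking the ratios, rather than averaging per-allele ratios.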
If all the samples have the same size, then $\sigma_w^2$ is the same as $H_O$,
$(\sigma_w^2+\sigma_b^2)$ as $H_S$, $(\sigma_w^2+\sigma_b^2+\sigma_a^2)$ as $H_T'$,
$\theta$ as $F_{ST}'$ and $f$ as $F_{IS}$. This is no longer true if sample
sizes are different. Weir & Cockerham (WC) weight allele frequencies according to
sample sizes (as would be done in an ANalysis Of VAriance), whereas Nei weights
all samples equally, whatever the sample size is. When sample sizes vary a lot, this
can lead to large differences between the two families of estimators.
2 This is not the same as Nei's GST. Nei's GST is an estimator of FST based on allele
frequencies only.
Nei and WC also treat monomorphic loci differently: under Nei's family, when
a locus is completely monomorphic, $F_{IS}$ and $F_{ST}$ (and $F_{ST}'$) are all 0,
while WC consider that the estimators cannot be defined (see footnote 3).
A measure of Hamilton [1971] relatedness is also included, calculated using an
estimator strictly equivalent to Queller and Goodnight [1989] (see in particular
appendix B of this paper), namely: R = 2θ/(1 + F ). This measure is the average
relatedness of individuals within samples when compared to the whole.
Pamilo [1984, 1985] pointed out that when there is inbreeding, relatedness is
biased and proposed an "inbreeding corrected relatedness", labelled relatc in Fstat,
and estimated as: Rc = [R − 2F/(1 + F)]/[1 − 2F/(1 + F)]. One should be wary,
however, that this estimator is not bounded between 0 and 1. In particular, when
working with species undergoing partial selfing, relatc seems to produce invalid
results. On the other hand, when the population is structured, relatc adequately
removes the increase in relatedness due to this structuring (see Chapuisat et al.
[1997] for an example).
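The two relatedness formulas can be checked numerically with the short sketch below. It simply transcribes R = 2θ/(1 + F) and the corrected Rc given above; the function names are illustrative, and θ and F are assumed to have been estimated beforehand.

```python
def relatedness(theta, f_it):
    """R = 2 * theta / (1 + F), the average relatedness within samples."""
    return 2.0 * theta / (1.0 + f_it)


def corrected_relatedness(theta, f_it):
    """Inbreeding corrected relatedness Rc, as written in the text."""
    r = relatedness(theta, f_it)
    inbreeding_term = 2.0 * f_it / (1.0 + f_it)
    return (r - inbreeding_term) / (1.0 - inbreeding_term)


# Without a global heterozygote deficit (F = 0) both measures coincide.
print(relatedness(0.1, 0.0), corrected_relatedness(0.1, 0.0))
# With theta = 0.05 and F = 0.25 the corrected value is negative,
# illustrating that Rc is not bounded between 0 and 1.
print(corrected_relatedness(0.05, 0.25))
```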
Differences among groups for $H_O$ ($\sigma_w^2$), $H_S$ ($\sigma_b^2+\sigma_w^2$),
$f$, $\theta$, relat and relatc can all be tested, see tab-sheet Comp. Among groups
of samples.
6.3 RST
RST is an estimator of gene differentiation accounting for variance in allele size and
defined for genetic markers undergoing a stepwise mutation model (Slatkin [1995]).
It is not worth bothering about unless you are pretty sure that mutation follows
a stepwise mutation model quite strictly, and that mutation cannot be neglected
compared to other forces such as migration (Balloux et al. [2000], Balloux and
Goudet [2002]). Fstat estimates RST for each locus following Rousset [1996].
This estimator does not depend on the number of samples. It also outputs the
different components of variance of allele size: Va for among samples, Vb for among
individuals within samples and Vw for within individuals. Vt = Va + Vb + Vw is thus
an unbiased estimate of the total overall variance in allele size.
Three estimators of overall loci RST are provided:
1. one according to Rousset [1996], where each locus is weighted by its amount
of allelic variance.
2. Because variances in allele size commonly vary by orders of magnitude among
loci, Fstat also gives the estimator of Goodman [1997]. Calling Mo and Vo
the mean and variance in allele size at a locus, Goodman [1997] suggested
using the centred normalised allele size $(x - M_o)/\sqrt{V_o}$ instead of allele
size x in the estimation of RST. Individual loci RST will not be affected by this
centering-rescaling, but the overall loci estimator will, because each individual
locus variance component will be divided by its locus allelic size variance (since
$s^2(aX + b) = a^2 s^2(X)$). Goodman's RST will thus be expressed as:
\[
UR_{ST} = \frac{\sum_{i=1}^{l} V_{a_i}/V_{o_i}}{\sum_{i=1}^{l} V_{t_i}/V_{o_i}}
\]
3 It is sometimes found in the literature that FST or F'ST cannot be negative. While this is true
for the statistic GST because this statistic does not include a correction for heterozygosity (Nei
[1973]), it is wrong for subsequent versions accounting for heterozygosity. For details on this issue
and more about the comparison between Nei and WC estimators, see Cockerham and Weir [1993].
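The effect of the rescaling on the overall estimate can be seen with a few lines of Python. The sketch assumes that the per locus components Va, Vt and the allele size variance Vo are already available; the unstandardised ratio shown for comparison (sum of Va over sum of Vt) is my reading of "weighted by its amount of allelic variance" and is an assumption, as are the function names.

```python
def goodman_rst(v_among, v_total, v_size):
    """Overall RST with each locus component divided by its allele size variance."""
    numerator = sum(va / vo for va, vo in zip(v_among, v_size))
    denominator = sum(vt / vo for vt, vo in zip(v_total, v_size))
    return numerator / denominator


def unstandardised_rst(v_among, v_total):
    """Ratio of summed components (assumed form of the variance-weighted estimator)."""
    return sum(v_among) / sum(v_total)


# Two loci whose allele size variances differ by more than one order of magnitude.
v_a = [1.0, 100.0]
v_t = [10.0, 200.0]
v_o = [10.0, 200.0]
print(goodman_rst(v_a, v_t, v_o))     # both loci contribute equally
print(unstandardised_rst(v_a, v_t))   # dominated by the high-variance locus
```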
4 In the old days (v 1.2), Fstat produced a file containing the so-called "Nm" values. These
values were simply a function of pairwise FST, namely 1/(4 FST) - 1/4. In many cases however,
the assumptions necessary to transform FST into a number of migrants are not fulfilled (Whitlock
and McCauley [1999]). To avoid misuse, Fstat does not produce these values anymore.
Chapter 7
Testing
7.1 Introduction
All the tests in Fstat are randomisation based. The principle behind randomisation
based tests is the following: Data sets fitting the null hypothesis to be tested are
generated by randomising the appropriate units (alleles, genotypes. . . see below).
A statistic is calculated on these randomised data sets and its value compared to
the statistic obtained from the observed data set. The proportion of statistics from
randomised data sets that give a value as large or larger than the observed provides
an unbiased estimate of the P-value of the test, that is, the probability of
obtaining a value at least as large as the observed one if the null hypothesis is
true. These tests are carried out for each locus individually and then over
all loci. Bonferroni procedures should be applied when using individual population
tests or individual locus tests (Rice [1989]).
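The principle can be sketched generically in Python. The example below uses a difference in means between two groups of numbers as a stand-in statistic, not one of the statistics Fstat computes, and it counts the observed data set as one of the randomised ones; both choices are assumptions made for the illustration.

```python
import random


def randomisation_p_value(group_a, group_b, n_randomisations=999, seed=1):
    """One-sided randomisation test on the difference in group means."""
    rng = random.Random(seed)

    def statistic(a, b):
        return sum(a) / len(a) - sum(b) / len(b)

    observed = statistic(group_a, group_b)
    pooled = list(group_a) + list(group_b)
    size_a = len(group_a)

    as_large = 1  # the observed data set counts as one realisation
    for _ in range(n_randomisations):
        rng.shuffle(pooled)  # randomise the units under the null hypothesis
        if statistic(pooled[:size_a], pooled[size_a:]) >= observed:
            as_large += 1
    return as_large / (n_randomisations + 1)


p = randomisation_p_value([10, 11, 12, 13, 14], [0, 1, 2, 3, 4])
```

In Fstat the unit being shuffled (alleles, genotypes, samples) changes with the null hypothesis, but the counting step is the same.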
Units of randomisation are NOT the same for the different tests:
For HW overall
Alleles are permuted among samples. The statistic used to compare the randomised
data sets to the observed is FIT (F).
over loci is valid. If you'd like to use all possible single locus genotypes to estimate the probability
of differentiation for individual loci, you'll need to create a separate file for each locus (using the
options, loci to use sub-menu).
\[
G = -2 \sum_{k=1}^{np} \sum_{i=1}^{nu} n_{ik} \log \frac{n_{ik}}{n_k p_i},
\]
where nik is the number of allele i in sample k, nk is the number of alleles (twice
the number of individuals) in sample k, and pi is the frequency of allele i in the
whole dataset for the overall test, and in the two samples considered for the pairwise
test (with np = 2 in this case). To test for population differentiation among all loci,
the following statistic is used to classify contingency tables:
\[
G = -2 \sum_{l=1}^{nl} \sum_{k=1}^{np} \sum_{i=1}^{nu} n_{ikl} \log \frac{n_{ikl}}{n_{kl} p_{il}},
\]
where the first sum is over loci and the second over alleles. For the test based
on F, it is:
\[
F = \frac{\sum^{nl} \sum^{nu} (\sigma_a^2 + \sigma_b^2)}{\sum^{nl} \sum^{nu} (\sigma_a^2 + \sigma_b^2 + \sigma_w^2)},
\]
and for the tests of differentiation based on the likelihood ratio statistic, Fstat uses
as an overall test statistic the sum of individual loci G-values
\[
G = -2 \sum_{l=1}^{nl} \sum_{k=1}^{np} \sum_{i=1}^{nu} n_{ikl} \log \frac{n_{ikl}}{n_{kl} p_{il}},
\]
A brief description of the performance of this last test is given in Petit et al.
[2001].
gives the non-adjusted P-value for each pair. The second gives the pairwise
significance after standard Bonferroni corrections (see footnote 3). "***" corresponds
to significance at the 0.1% nominal level, "**" to significance at the 1% nominal
level and "*" to significance at the 5% nominal level. "NS" stands for
non-significant and "NA" for not available.
3 the reported significance levels are after strict (not sequential) Bonferroni corrections based
Chapter 8
Composite disequilibrium
When two or more loci are present, it is of interest to verify whether alleles at the
different loci assort independently. The classical measure of gametic disequilibrium
is D = pAB − pA pB, where pAB represents the frequency of gametes carrying allele A
at the first locus and B at the second, and pA and pB represent the frequencies of
allele A and allele B respectively (Lewontin and Kojima [1960]). With genotypic
data, estimation of gametic disequilibrium is impossible unless one is willing to
assume Hardy-Weinberg at each locus (Weir [1979]) because gametic type in the
double heterozygotes cannot be inferred. This is where the composite disequilibrium
is useful.
         AA    AĀ    ĀĀ
BB       n1    n2    n3
BB̄       n4    n5    n6
B̄B̄       n7    n8    n9
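\[
\Delta'_{AB} =
\begin{cases}
\dfrac{\Delta_{AB}}{2 \times \min(p_A p_B,\; p_{\bar{A}} p_{\bar{B}})}, & \Delta_{AB} < 0 \\[2ex]
\dfrac{\Delta_{AB}}{2 \times \min(p_A p_{\bar{B}},\; p_{\bar{A}} p_B)}, & \Delta_{AB} \ge 0
\end{cases}
\]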
The statistical properties of multiallelic Δ′ have not been investigated, but those
of the equivalent estimator of gametic disequilibrium D′ have been thoroughly
investigated by Zapata [2000] and Zapata et al. [2001]. In particular, Zapata et al.
[2001] showed that under many conditions, the distribution of multi-allelic D′ does
not deviate from normality.
An overall samples estimator of Δ′ is also provided as:
\[
\Delta' = \frac{\sum_{k=1}^{np} n_k \Delta'_k}{\sum_{k=1}^{np} n_k}
\]
• $n_{AB} = 2n_1 + n_2 + n_4 + n_5/2$
• $\hat{\Delta}_{AB} = \frac{n}{n-1}\left(n_{AB}/n - p_A p_B\right)$
• $\Delta'_{AB}$
• $R_{AB}$
Last, it might be useful to have a list of all the contingency tables of genotypes
by genotypes. If you check the box Save contingency Tables, a file with the
extension -tables.ld will be created. It can be quite large, as it will contain for
each pair of loci in each sample the cross-table of 2 locus genotypes.
In most cases you are only interested in the overall samples level of significance.
You should then choose the option Tests between all pairs of loci. The number
of randomisations for this option is fixed by the nominal level for multiple test
radio-group box. It will be 20×nl×(nl−1)/2, 100×nl×(nl−1)/2 or 1000×nl×(nl−1)/2
randomisations for a 5%, 1% or 0.1% nominal level, respectively. With 10 loci,
these numbers are 900, 4'500 and 45'000 respectively. It can still take some time
for loci with a large number of alleles, as for each randomisation, the test needs to
reconstruct the cross-table. With 20 alleles at each of the 2 loci, for instance,
there are 20 × 21/2 = 210 possible genotypes per locus, so the dimension of the
table to scan is 210 × 210!
Because the number of randomisations can turn out to be very large, I added
the possibility to fix this number, using the fixed number option combined with
the number of permutations box (1000 by default).
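The arithmetic behind these figures is easy to reproduce. The sketch below recomputes the number of randomisations from the number of loci and the nominal level, and the number of possible genotypes from the number of alleles; the function names and the way nominal levels are keyed are illustrative choices.

```python
# Multiplier applied to the number of pairs of loci for each nominal level.
RANDOMISATIONS_PER_PAIR = {"5%": 20, "1%": 100, "0.1%": 1000}


def n_randomisations(n_loci, nominal_level):
    """Randomisations used when testing all pairs of loci."""
    n_pairs = n_loci * (n_loci - 1) // 2
    return RANDOMISATIONS_PER_PAIR[nominal_level] * n_pairs


def n_genotypes(n_alleles):
    """Number of distinct diploid genotypes at a locus with n_alleles alleles."""
    return n_alleles * (n_alleles + 1) // 2


for level in ("5%", "1%", "0.1%"):
    print(level, n_randomisations(10, level))

rows = n_genotypes(20)
print(rows, "x", rows, "=", rows * rows, "cells in the cross-table")
```

With 10 loci there are 45 pairs, and a cross-table for two loci with 20 alleles each has 44,100 cells, which is why fixing the number of randomisations by hand can be useful.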
Chapter 9
Run button
This button is greyed out as long as you have not chosen a data file.
Once you have selected all the options you want, just click here to run the
analysis. The results are written (appended if the file already exists) to the .OUT
result files.
If you have selected a subset of samples and/or loci, when you click on this
button, Fstat will prompt you for an alternative filename for the subset of data.
The default is the original filename, followed by "-L" and numbers corresponding
to the selected loci, followed by "-P" and numbers corresponding to the selected
samples. It is a good idea to change this default filename, which could turn out to be
very long, and not very meaningful after a few weeks without using the data set!
Chapter 10
Comp. among groups of samples
Under this tab-sheet, tests for differences among groups of populations for a
number of statistics (allelic richness, observed heterozygosity, gene diversity, FIS,
FST, relatedness and corrected relatedness) can be carried out. The number of
groups can be anything between 2 and 12. Each group needs to be made of at least
2 samples. If only two groups are compared, the tests can be one- or two-sided;
otherwise, they are two-sided.
10.1.6 Run
Run the tests. Results will be stored in a file with the name filename_test.out.
Each time you run a new test, results are appended to this file.
Hint: the units of randomisation for these tests are the samples. To increase
the degrees of freedom (and therefore the power of the tests), it is better to compare
many samples with perhaps fewer individuals rather than the opposite. I guess if you
are using Fstat now, it might already be too late to change the sampling design.
Chapter 11
Biased dispersal
11.1 Introduction
This option tests for biases in dispersal among two a priori defined groups of
individuals using information from codominant genetic markers. See Goudet et al. [2002]
for details of the principle and methods, and a power analysis of the different tests.
This option can also be used to test for things other than dispersal, such as whether
levels of inbreeding (using the observed heterozygosity HO, see below) differ between
groups defined by colour, size, parasitic state, etc.
for l loci and n individuals in the kth sample. The distribution of AIc will
therefore be centered around 0. A positive value indicates a genotype more likely
than average to occur in its sample (likely a resident individual), while a negative
value indicates a genotype less likely than average (potentially a disperser).
mean of AIc (mAIc) Because immigrants tend to have lower AIc values than
residents, under sex biased dispersal, the average index for the sex that disperses
most is expected to be lower than that for the more philopatric sex. A t-statistic
is used for the test:
\[
t = \frac{mAIc_p - mAIc_d}{\sqrt{s^2_{AIc_p}/n_p + s^2_{AIc_d}/n_d}},
\]
where np and nd are the number of individuals in the more philopatric and the more
dispersing group respectively.
variance of AIc (vAIc ) Because members of the dispersing sex will include both
residents (with common genotypes) and immigrants (with rare genotypes), vAIc for
the sex dispersing most should be largest.
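A minimal Python sketch of the two AIc summaries follows. It takes the corrected assignment indices of each group as plain lists and uses the sample variance (n − 1 denominator), which is an assumption about how s² is computed; the function names are illustrative.

```python
from math import sqrt
from statistics import mean, variance


def mean_aic_t(aic_philopatric, aic_dispersing):
    """t-statistic comparing mean AIc of the philopatric and dispersing groups."""
    n_p, n_d = len(aic_philopatric), len(aic_dispersing)
    standard_error = sqrt(
        variance(aic_philopatric) / n_p + variance(aic_dispersing) / n_d
    )
    return (mean(aic_philopatric) - mean(aic_dispersing)) / standard_error


def aic_variances(aic_philopatric, aic_dispersing):
    """Variance of AIc in each group, returned as (philopatric, dispersing)."""
    return variance(aic_philopatric), variance(aic_dispersing)


philopatric = [1.0, 2.0, 3.0]
dispersing = [0.0, 1.0, 2.0]
print(mean_aic_t(philopatric, dispersing))
print(aic_variances(philopatric, dispersing))
```

A positive t is in the direction expected when the second group disperses more; significance is then assessed by randomisation rather than by reference to a t distribution.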
11.2.2 FST
FST is a statistic expressing the proportion of the total genetic variance that resides
among populations (Hartl and Clark [1997]). Allelic frequencies for individuals of
the sex dispersing most should be more homogeneous than those for individuals of
the more philopatric sex. We therefore expect FST for the more philopatric sex to
be higher than that of the more dispersing sex. Among the available estimators of
FST, we chose that of Weir and Cockerham [1984], because it is the most commonly
used and is also unbiased.
11.2.3 FIS
FIS is a statistic describing how well the genotype frequencies within populations
fit Hardy-Weinberg expectations (Hartl and Clark [1997]). If only males
disperse, the males sampled from a single patch will be a mixture of two populations,
residents and immigrants; due to the Wahlund effect, the sample should show a
heterozygote deficit and a positive FIS. In general, members of the dispersing sex
should therefore display a higher FIS than the more philopatric sex. Among the
several estimators of FIS, we also chose that of Weir and Cockerham [1984].
Relatedness
Relatedness is related to FST as relat = 2 FST/(1 + FIT). A test based on relatedness
has essentially the same properties as one based on FST, and is provided here for
convenience.
11.2.4 HO
The observed heterozygosity, HO , is NOT expected to change with the dispersal
status (nor is the total gene diversity HT , providing it is calculated with a weighting
proportional to the size of each group). But HO should differ among inbred and
outbred individuals. It might be of interest to test whether an individual’s status
is linked to inbreeding. For instance, one could test whether parasitized individuals
tend to be more inbred than healthy ones. If you have such categories, then a test
based on HO is of interest. See Trouvé et al. [2003] for an application of this test.
11.2.5 HS
The within group gene diversity HS should be largest for the group dispersing most.
11.2.6 Testing
To test whether these statistics differ significantly between the two sexes, a ran-
domisation approach is used. Under the null hypothesis that males and females
disperse equally, the four statistics do not depend on the variable ’sex’. Letting
Xd and Xp be the statistic of interest for the dispersing and the philopatric sex
respectively, the program proceeds as follows for one sided tests.
1. It first calculates the statistic for each sex over all populations and takes
either the difference for FIS, HS and HO (for HO, the group supposed to be most
outbred is treated as the group dispersing most); the t-statistic defined
above for mAIc; the difference for FST and relatedness; or the ratio for vAIc.
2. It randomly assigns a sex to each of the multi-locus genotypes (keeping the
genotypes in their original sample, and the sex ratio in each sample constant).
3. It recalculates the appropriate difference or ratio for the randomised data set.
4. Steps 2 and 3 are repeated numbperm times.
The P-value of the test (the probability of a statistic at least as large as the
observed one if dispersal is unbiased by sex) is then estimated as the proportion
of times where the relevant statistic is larger than or equal to the observed one. The
two-tailed tests are constructed under the same principle, using either the absolute
value of the differences, or the ratio of the largest to the smallest variance.
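Step 2 is the part that differs from a plain permutation test, so it is sketched below in Python. Each sample is represented as a list of (sex, genotype) pairs; this data layout and the function name are assumptions made for the example and do not reflect how Fstat stores data internally.

```python
import random


def shuffle_sex_within_samples(samples, rng):
    """Reassign sex labels at random within each sample.

    Genotypes stay in their original sample and each sample keeps its
    original number of individuals of each sex.
    """
    randomised = []
    for sample in samples:
        sexes = [sex for sex, _ in sample]
        rng.shuffle(sexes)  # permute the labels only, within this sample
        randomised.append(
            [(sex, genotype) for sex, (_, genotype) in zip(sexes, sample)]
        )
    return randomised


samples = [
    [("F", "g1"), ("F", "g2"), ("M", "g3")],
    [("M", "g4"), ("M", "g5"), ("F", "g6"), ("F", "g7")],
]
shuffled = shuffle_sex_within_samples(samples, random.Random(42))
```

The statistic of interest would then be recalculated on the shuffled data and compared with the observed value, repeating the whole step numbperm times.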
• The file name has the extension .GEN. No other extension will be recognised
by the program.
• The first line contains 3 numbers:
1 See menu utilities, file conversion for creating Genepop format using Fstat
Here is a screen shot of the program windows. To run the program, do the
following:
1. choose the file containing the data under the File, open menu.
2. type in each box (Letter for the most philopatric group and Letter
for the other group) the letter used in the data file to define members of
the two groups
3. choose whether you want a one or a two-sided test
4. choose which test statistic(s) you want to use
5. enter the number of randomizations for the tests (between 100 and 10’000)
6. click the GO button.
11.3.3 Results
The results are stored in a file with the extension .res and the same name as the
input file. An example of such a file is given in Appendix B. The program also
produces 4 other files:
• One has the extension .per and contains the results of all the permutations.
The first line of the file contains the observed values for the differences between
groups of the different statistics (in the order Mean ass, Var ass, FST, FIS,
relatedness, HO and HS), preceded by the number 0. The remaining lines
contain the same information for each of the (numbperm-1) randomisations.
With this file, you can generate the distribution of the statistics under the
null hypothesis, as is done in the article describing the tests (Goudet et al.
[2002]).
• Three have the extension .dat and contain the data reformatted so that
they can be analysed with Fstat. One of these files contains the whole
data set; the two others contain the data for each group, with the letter
corresponding to the group identifier appended to the file name.
Chapter 12
Mantelize it!
Under this menu, one can carry out multiple regression or partial Mantel tests. The
latter option is available only if the data originate from distance matrices.
While Raufaste and Rousset [2001] have criticized partial Mantel tests, these tests
are still useful.
You also have the possibility to save residuals in a file. The default number of
permutations for the test is set at 2000. Once everything is selected, click on run.
The results appear in the window at the lower right hand of the panel, while the
two upper right panels show scatterplots of the estimates and residuals, and fitted
and observed values. By left-clicking a graph, you will see the residuals (right
panel) and the estimates (left panel) plotted as a function of the estimates (right
panel) and of the observed values (left panel), respectively.
The result file is structured as follows:
A brief description of the input file is given (name and comments). Then follows
the number of randomisations used in the test.
The total sum of squares of the dependent variable is then given, followed by the
result of the regression: the (partial) correlation of each explanatory variable with
the dependent variable, the coefficient associated with each explanatory variable
and the sum of squares explained by the explanatory variables.
The overall percentage of the variance explained by the model (R-squared) is
then given.
Last, the P-value for the coefficient associated with each variable is given,
together with the P-value associated with the proportion of variance explained by each
explanatory variable. These P-values are given as percentages; that is, a 5 would mean
that 5% of the randomisations gave values as large or larger than the observed.
Each time a new test is run, it is appended at the end of the result file.
Chapter 13
Utilities menus
Chapter 14
Files of results
Results will be stored in a file named FILENAME.OUT, located in the same folder as
the input file.
This file first contains a line saying when the analysis was run. On the
following lines, one finds the identifier of the sample, then, for each locus, the size
of each sample, followed by the frequency per sample and overall frequency of each
allele present at the locus. The rest of the output file should be self explanatory.
If you requested that pairwise FST be estimated under the menu choose options,
F-statistics, FST per pair, 2 files are created:
• A file FILENAME.FST. This file contains the estimated pairwise FST
• A File FILENAME.MAT. It contains two half matrices. The first one corresponds
to the pairwise sum of variance components (σa2 , σb2 and σw
2
) representing the
denominator of Cockerham estimator of FST . The second corresponds to the
pairwise variance component σa2 , the numerator of Cockerham estimator of
FST .
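Since the two half matrices hold the numerator and the denominator of the estimator, a pairwise FST can be recovered by dividing them element by element. The Python sketch below shows only this arithmetic; the function name is hypothetical, the numbers are made up for illustration, and reading the matrices from the .MAT file is left out.

```python
def pairwise_fst(numerators, denominators):
    """Divide two lower half matrices element by element (sigma_a^2 / sum of components)."""
    if [len(row) for row in numerators] != [len(row) for row in denominators]:
        raise ValueError("the two half matrices must have the same shape")
    return [[num / den for num, den in zip(num_row, den_row)]
            for num_row, den_row in zip(numerators, denominators)]

# Hypothetical half matrices for three samples (not taken from an Fstat run).
sigma_a = [[0.02],
           [0.05, 0.01]]
total = [[1.00],
         [1.25, 0.80]]
fst = pairwise_fst(sigma_a, total)
for row in fst:
    print(" ".join(f"{value:.4f}" for value in row))
```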
If you requested that genotypic frequencies be calculated under the menu choose
options, gene diversities, genotypic frequencies, a file FILENAME.X2 will be
created. It contains for each genotype at each locus the observed and expected
genotypic count, where the expected genotypic count is calculated using unbiased
estimation:
\[
A_iA_i = n\,p_i \left[ \frac{2np_i - 1}{2n - 1} \right], \qquad
A_iA_j = 2n\,p_i \left[ \frac{2np_j}{2n - 1} \right]
\]
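The two expressions can be checked numerically with a short Python sketch. It assumes that n is the number of diploid individuals typed in the sample and p_i the frequency of allele i, so that 2np_i is simply the number of copies of allele i; the function name is hypothetical. The allele counts used are those of the first sample at loc-2 in the example file DIPLOID.DAT (one copy of allele 3, fifteen copies of allele 4).

```python
from itertools import combinations_with_replacement

def expected_genotype_counts(allele_counts):
    """Expected count of each genotype, given allele counts in a sample of n diploids."""
    two_n = sum(allele_counts.values())            # 2n gene copies
    n = two_n / 2
    expected = {}
    for a, b in combinations_with_replacement(sorted(allele_counts), 2):
        p_a = allele_counts[a] / two_n
        if a == b:
            # homozygote: n p_i (2n p_i - 1) / (2n - 1)
            expected[(a, b)] = n * p_a * (allele_counts[a] - 1) / (two_n - 1)
        else:
            # heterozygote: 2n p_i (2n p_j) / (2n - 1)
            expected[(a, b)] = 2 * n * p_a * allele_counts[b] / (two_n - 1)
    return expected

counts = expected_genotype_counts({3: 1, 4: 15})
for genotype, value in counts.items():
    print(genotype, round(value, 3))
print("total:", round(sum(counts.values()), 3))    # the expected counts add up to n
```

The expected counts sum to n, which is one way to check that the two formulas are consistent with each other.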
detailed output, a file -details.cd will be generated. Last, if you requested Fstat
to save contingency tables, a file with extension -tables.ld will be generated.
All the separators in the output files are tabs, which allows direct importation
of the results into commercial packages such as the Microsoft spreadsheet Excel,
therefore facilitating printing and graphical representation of the results.
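Because the separators are tabs, the result files can also be read by a script. The following Python sketch splits tab-separated text into rows with the standard csv module; the function name and the two sample lines are hypothetical and only mimic the layout of a frequency table, they are not an actual Fstat result file.

```python
import csv
import io

def read_tab_separated(text):
    """Return the rows of a tab-separated text as lists of strings."""
    return [row for row in csv.reader(io.StringIO(text), delimiter="\t")]

# Hypothetical excerpt with tab separators.
sample_text = "Locus\tpop1\tpop2\nloc-2\t0.063\t0.250\n"
rows = read_tab_separated(sample_text)
for row in rows:
    print(row)
```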
Chapter 15
Special use of Fstat
Note that all genotypes are homozygous. When running this file, FIS (f) and
FIT (F) will be meaningless. FST (θ) will however be an appropriate measure of the
Wahlund effect. The file HAPLO.DAT, given in Weir [1996] 'Genetic Data Analysis',
is distributed with the example files.
CAUTION: one should not mix haploid and diploid data in the same data file.
If you have autosomal markers and mitochondrial or Y chromosome markers, do
not put them in the same file but rather create two files, one for the autosomal loci,
the other for the haploid loci. The same goes for haplo-diploid species. This is to
ensure that your statistics over all loci have some meaning!
Appendix A
Examples of input files
A.1 One digit encoding for input data file
A file encoded with 1 digit number (This is the example file DIPLOID.DAT, given
in Weir's (1990) 'Genetic Data Analysis').
6 5 4 1
loc-1
loc-2
loc-3
loc-4
loc-5
1 44 43 43 33 44
1 44 44 43 33 44
1 44 44 43 43 44
1 44 44 0 33 44
1 44 44 24 34 44
1 44 44 0 43 44
1 44 44 43 43 44
1 44 44 0 43 44
2 44 44 33 32 44
2 44 33 44 43 44
2 44 43 44 43 44
2 44 44 33 33 44
2 44 43 44 44 44
2 44 44 44 22 44
2 44 44 43 43 44
2 44 44 44 44 44
3 44 44 44 43 44
3 44 44 44 44 44
3 44 44 43 21 44
3 44 44 33 43 44
3 44 44 43 21 44
4 44 44 43 44 44
4 44 44 43 43 44
4 44 44 43 43 44
4 44 44 43 44 44
4 44 44 43 44 44
4 44 44 44 33 44
4 44 44 44 44 44
5 44 44 44 21 44
5 44 44 44 33 44
5 44 44 43 43 44
5 44 44 43 43 44
5 44 44 44 44 44
5 44 44 44 43 44
5 44 44 43 43 44
5 44 44 44 0 44
5 44 43 44 43 44
6 44 44 44 43 44
6 44 44 43 33 44
6 44 44 44 32 44
6 44 44 43 41 44
6 44 44 44 44 44
6 44 44 44 42 44
6 44 44 44 43 44
A.2 Two digits encoding for input data file
The same file with alleles encoded with a 2 digit number:
6 5 4 2
loc-1
loc-2
loc-3
loc-4
loc-5
1 0404 0403 0403 0303 0404
1 0404 0404 0403 0303 0404
1 0404 0404 0403 0403 0404
1 0404 0404 0 0303 0404
1 0404 0404 0204 0304 0404
1 0404 0404 0 0403 0404
1 0404 0404 0403 0403 0404
1 0404 0404 0 0403 0404
2 0404 0404 0303 0302 0404
2 0404 0303 0404 0403 0404
2 0404 0403 0404 0403 0404
2 0404 0404 0303 0303 0404
2 0404 0403 0404 0404 0404
2 0404 0404 0404 0202 0404
2 0404 0404 0403 0403 0404
2 0404 0404 0404 0404 0404
3 0404 0404 0404 0403 0404
3 0404 0404 0404 0404 0404
3 0404 0404 0403 0201 0404
3 0404 0404 0303 0403 0404
3 0404 0404 0403 0201 0404
4 0404 0404 0403 0404 0404
4 0404 0404 0403 0403 0404
4 0404 0404 0403 0403 0404
4 0404 0404 0403 0404 0404
4 0404 0404 0403 0404 0404
4 0404 0404 0404 0303 0404
4 0404 0404 0404 0404 0404
5 0404 0404 0404 0201 0404
5 0404 0404 0404 0303 0404
5 0404 0404 0403 0403 0404
5 0404 0404 0403 0403 0404
5 0404 0404 0404 0404 0404
5 0404 0404 0404 0403 0404
5 0404 0404 0403 0403 0404
5 0404 0404 0404 0 0404
5 0404 0403 0404 0403 0404
6 0404 0404 0404 0403 0404
6 0404 0404 0403 0303 0404
6 0404 0404 0404 0302 0404
6 0404 0404 0403 0401 0404
6 0404 0404 0404 0404 0404
6 0404 0404 0404 0402 0404
6 0404 0404 0404 0403 0404
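To make the relation between the two encodings concrete, here is a minimal Python sketch that splits a genotype code into its two alleles and tallies allele frequencies for one sample at one locus. It assumes, from the listings above, that the last number of the first line (1 or 2) gives the number of digits per allele and that a code of 0 marks a missing genotype; the function names are hypothetical and this is not Fstat's own reader.

```python
def decode_genotype(code, digits):
    """Split a genotype code such as '43' or '0403' into its two alleles.

    Returns None for a code of 0, assumed here to mark a missing genotype.
    """
    if int(code) == 0:
        return None
    padded = code.zfill(2 * digits)       # restore a dropped leading zero if any
    return int(padded[:digits]), int(padded[digits:])

def allele_frequencies(codes, digits):
    """Allele frequencies among the non-missing genotypes of one sample at one locus."""
    counts = {}
    for code in codes:
        alleles = decode_genotype(code, digits)
        if alleles is None:
            continue
        for allele in alleles:
            counts[allele] = counts.get(allele, 0) + 1
    total = sum(counts.values())
    return {allele: count / total for allele, count in sorted(counts.items())}

# Sample 1 at loc-2, copied from the one digit and two digits listings.
one_digit = ["43", "44", "44", "44", "44", "44", "44", "44"]
two_digit = ["0403", "0404", "0404", "0404", "0404", "0404", "0404", "0404"]
freq_1 = allele_frequencies(one_digit, 1)
freq_2 = allele_frequencies(two_digit, 2)
print(freq_1)
print(freq_1 == freq_2)
```

Both encodings give frequencies of 0.0625 and 0.9375 for alleles 3 and 4, which agree with the values 0.063 and 0.938 reported for the first sample at loc-2 in DIPLOID.OUT.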
Appendix B
Files of results from the input file in Appendix 1.
B.1 DIPLOID.OUT
***************************************************************************************************************************
* The following results were generated the 09.08.2001 at 13:44:36 with Fstat for windows, V2.9.3 from file diploid2.dat. *
***************************************************************************************************************************
          Stade   Twicke  Arms P  Millen  Lansdo  Murray  All_W   All_UW
Locus: loc-1
N         8       8       5       7       9       7
p: 4      1.000   1.000   1.000   1.000   1.000   1.000   1.000   1.000
Locus: loc-2
N         8       8       5       7       9       7
p: 3      0.063   0.250   0.000   0.000   0.056   0.000   0.068   0.061
p: 4      0.938   0.750   1.000   1.000   0.944   1.000   0.932   0.939
Locus: loc-3
N         5       8       5       7       9       7
p: 2      0.100   0.000   0.000   0.000   0.000   0.000   0.012   0.017
p: 3      0.400   0.313   0.400   0.357   0.167   0.143   0.280   0.297
p: 4      0.500   0.688   0.600   0.643   0.833   0.857   0.707   0.687
Locus: loc-4
N         8       8       5       7       8       7
p: 1      0.000   0.000   0.200   0.000   0.063   0.071   0.047   0.056
p: 2      0.000   0.188   0.200   0.000   0.063   0.143   0.093   0.099
p: 3      0.688   0.375   0.200   0.286   0.438   0.357   0.407   0.390
p: 4      0.313   0.438   0.400   0.714   0.438   0.429   0.453   0.455
Locus: loc-5
N         8       8       5       7       9       7
p: 4      1.000   1.000   1.000   1.000   1.000   1.000   1.000   1.000
************************************************
Nei’s estimation of heterozygosity
Overall 0.234 0.235 0.239 0.004 0.005 0.240 0.018 0.022 0.005
************************************************
Weir & Cockerham (1984) estimation of Fit (CapF), Fst (theta) and Fis (smallF).
relat is Relatedness estimated following Queller & Goodnight (1989)
relatc is relatedness inbreeding corrected following Pamilo (1984, 1985)
sig_a, sig_b and sig_w are the component of variance
among samples, among individuals within samples and within individuals respectively.
************************************************
************************************************
Rst Over all samples estimated following Rousset (1996) and Goodman (1997)
************************************************
************************************************
Jackknifing over populations.
************************************************
************************************************
Bootstrapping over Loci.
************************************************
************************************************
Randomising alleles overall samples
************************************************
Randomising genotypes among samples.
*******************************************************
P-value for genotypic disequilibrium
based on 6000 permutations.
Adjusted P-value for 5% nominal level is : 0.000833
Adjusted P-value for 1% nominal level is : 0.000167
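The two adjusted levels printed above are numerically consistent with dividing the nominal level by 60 tests, as in a Bonferroni-type correction. The short Python sketch below only checks this arithmetic; the number of tests (60) is an assumption inferred from the printed values, it is not stated in the output, and the function name is hypothetical.

```python
def adjusted_level(nominal, n_tests):
    """Nominal significance level divided by the number of tests."""
    return nominal / n_tests

n_tests = 60  # assumption inferred from 0.05 / 0.000833, not read from the output
for nominal in (0.05, 0.01):
    print(f"{nominal:.2f} -> {adjusted_level(nominal, n_tests):.6f}")
# 0.05 -> 0.000833
# 0.01 -> 0.000167
```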
DIPLOID.MAT
1.3980
1.3177 1.4438
1.1724 1.2953 1.0982
1.1452 1.2730 1.1690 1.0243

0.0290
0.0876 -0.0563
0.1291 0.0155 -0.0006
0.0806 -0.0235 0.0171 0.0351
0.1245 -0.0168 -0.0116 0.0374 -0.0508
B.3 DIPLOID.X2
Observed and expected genotype frequencies
****************************************************************************************************************************
* The following results were generated the 03.08.2001 at 16:26:37 with Fstat for windows, V2.9.3 from file diploid2.dat. *
****************************************************************************************************************************
Samples of group 1 (G1) : pop1, pop2, pop3,
Allelic richness: 1.772 Ho: 0.241 Hs: 0.273 Fis: 0.117 Fst: 0.013 Rel: 0.024 Relc: -0.264
Samples of group 2 (G2) : pop4, pop5, pop6,
Allelic richness: 1.621 Ho: 0.214 Hs: 0.197 Fis: -0.083 Fst: 0.005 Rel: 0.011 Relc: 0.153
Put here any comments you want data from the file :
E:\fstatdev\Fstat2.9.3\data\multiple regression\YANOM.ALL
Correlation (Partial if # expl. variables > 1), Coefficient (Beta) and Sum of squares (SS) for the observed data:
Variable (Partial) Corr. Beta SS
------------------------------------------------------------
var1 (YANOM.ANT) 0.299551 0.029754 2719.9512
------------------------------------------------------------
Error sum of squares: 27592.4707
Bibliography
S.J. Goodman. Rst calc: a collection of computer programs for calculating
estimates of genetic differentiation from microsatellite data and determining their
significance. Molecular Ecology, 6:881–885, 1997.
URL http://helios.bto.ed.ac.uk/evolgen/rst/rst.html.
J. Goudet. The genetics of geographically structured populations. PhD thesis,
University of Wales at Bangor, 1993.
URL http://www.unil.ch/popgen/research/reprints/.
J. Goudet. Fstat (vers.1.2): a computer program to calculate F-statistics.
Journal of Heredity, 86:485–486, 1995.
URL http://www.unil.ch/izea/softwares/fstat.html.
J. Goudet. An improved procedure for testing key innovations. American
Naturalist, 53:549–555, 1999.
URL http://www.unil.ch/popgen/research/reprints/goudet_amnat_1999.pdf.
J. Goudet, T. De Meeüs, A.J. Day, and C.J. Gliddon. Genetics and Evolution
of Aquatic Organisms, chapter The different levels of population structuring of
dogwhelks, Nucella lapillus, along the south Devon coast. Chapman and Hall,
London, 1994. URL http://www.unil.ch/popgen/research/reprints/.
J. Goudet, N. Perrin, and P. Waser. Tests for sex-biased dispersal using bi-parentally
inherited genetic markers. Molecular Ecology, 11:1103–1114, 2002.
URL http://www.unil.ch/popgen/research/reprints/goudetetal_mec_2002.pdf.
J. Goudet, M. Raymond, T. De Meeüs, and F. Rousset. Testing differentiation in
diploid populations. Genetics, 144:1933–1940, 1996.
URL http://www.unil.ch/popgen/research/reprints/goudetetal_genetics_1996.pdf.
W.D. Hamilton. Man and Beast: Comparative Social Behavior, chapter Selection of
selfish and altruistic behaviour in some extreme models, pages 57–91. Eisenberg
and Dillon, Washington, DC, 1971.
D.L. Hartl and A.G. Clark. Principles of Population Genetics. Sinauer Associates,
third edition, 1997.
P.W. Hedrick. Genetics of Populations. Jones and Bartlett, second edition, 2000.
K. E. Holsinger, P. O. Lewis, and D. K. Dey. A Bayesian method for analysis
of genetic population structure with dominant marker data. Molecular Ecology,
11:1157–1164, 2002. URL http://darwin.eeb.uconn.edu/hickory/hickory.html.
S.H. Hurlbert. The nonconcept of species diversity: a critique and alternative
parameters. Ecology, 52:577–586, 1971.
P. L'Ecuyer. Efficient and portable random number generators. Communications
of the ACM, 31:147–157, 1988.
R.C. Lewontin and K. Kojima. The evolutionary dynamics of complex
polymorphism. Evolution, 14:458–472, 1960.
B.J.F. Manly. The Statistics of Natural Selection. Chapman and Hall, 1985.
B.J.F. Manly. Randomization and Monte Carlo methods in biology. Chapman and
Hall, second edition, 1997.
M. Nei. Analysis of gene diversity in subdivided populations. Proceedings of the
National Academy of Sciences USA, 70:3321–3323, 1973.
M. Nei. Molecular Population Genetics and Evolution. Elsevier, 1975.
URL http://www.bio.psu.edu/People/Faculty/Nei/Lab/BOOK.pdf.
M. Nei. Molecular Evolutionary Genetics. Columbia University Press, 1987.
M. Nei and R.K. Chesser. Estimation of fixation indices and gene diversities. Annals
of Human Genetics, 47:253–259, 1983.
M. Slatkin and N.H. Barton. A comparison of three methods for estimating average
levels of gene flow. Evolution, 43:1349–1368, 1989.
S. Trouvé, L. Degen, F. Renaud, and J. Goudet. Population structure of the
freshwater snail Lymnaea truncatula: the importance and consequences of selfing.
Evolution, 57:In Press, 2003.
URL http://www.unil.ch/popgen/research/reprints/trouveetal_evolution_2003_pp.pdf.
B.S. Weir and C.C. Cockerham. Estimating F-statistics for the analysis of
population structure. Evolution, 38:1358–1370, 1984.
M.C. Whitlock and D. McCauley. Indirect measures of gene flow and migration:
FST ≠ 1/(4Nm + 1). Heredity, 82:117–125, 1999.
S. Wright. Systems of mating. Genetics, 6:111–178, 1921.
S. Wright. Evolution and the genetics of populations. II. The theory of gene
frequencies, volume 2. University of Chicago Press, 1969.
C. Zapata. The D′ measure of overall gametic disequilibrium between pairs of
multiallelic loci. Evolution, 54:1809–1812, 2000.
C. Zapata, C. Carollo, and S. Rodriguez. Sampling variance and distribution of the
D′ measure of overall gametic disequilibrium between multiallelic loci. Annals of
Human Genetics, 65:395–406, 2001.
D. Zaykin, L. Zhivotovsky, and B.S. Weir. Exact tests for association between alleles
at arbitrary numbers of loci. Genetica, 96:169–178, 1995.