Tess Version 2.3 - Reference Manual August 2009: Eric Durand Chibiao Chen Olivier François
Tess Version 2.3 - Reference Manual August 2009: Eric Durand Chibiao Chen Olivier François
Tess Version 2.3 - Reference Manual August 2009: Eric Durand Chibiao Chen Olivier François
3 - Reference Manual
August 2009∗
Eric Durand Chibiao Chen Olivier François
Contents
1 Overview 2
3 Method Description 3
3.1 Model Without Admixture . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
3.2 Model With Admixture (tess > 2.0) . . . . . . . . . . . . . . . . . . . . . 4
4 Dominant markers 5
5 Data Format 5
11 Tutorial 25
12 FAQs 27
∗
contact: [email protected] or [email protected]
1
1 Overview
tess is a free computer program that implements a Bayesian clustering algorithm for
spatial population genetic studies (Chen et al. 2007 for tess 1.1, François et al. 2006).
The method is based on a hierarchical mixture model where the prior distribution on clus-
ter labels (without admixture model) or on admixture proportions (admixture model) is
defined as a Hidden Markov Random Field (HMRF, no admixture model) or as a Hidden
Gaussian Random Field (HGRF, admixture model) on a spatial individual network (tes-
sellation). The program seeks population structure from individual multilocus genotypes
sampled at distinct geographical locations without assuming predefined populations. It
takes input data files in a format compatible with existing non-spatial clustering algo-
rithms like structure (Pritchard et al., 2000), and returns membership probabilities
and geographical cluster assignments of individuals.
tess has been implemented using the C++ programming language. It contains a
command-line engine and a Graphical User Interface (GUI) shell. The command-line
engine is mainly designed for expert users who demand simplicity and flexibility and
for users who need to batch-analyze a large amount of data. It accepts data files in a
structure-like format, including individual spatial coordinates in their first columns.
It produces output results in both textual and graphical formats. The textual format
stores the estimated membership probabilities and allele frequencies in a plain ASCII
file. The text files may be post-processed using the distruct program (Rosenberg,
2004) or the clumpp program (Jakobsson and Rosenberg, 2007). There are four types of
graphical outputs. The first graphics shows the log-likelihood history of a run, which can
be used for the diagnosis of convergence of the algorithm; the second shows the estimated
membership probabilities in a bar chart, as does distruct, so that users can check
whether individuals share ancestry to multiple clusters or see the fraction of individual
genomes assigned to each cluster (admixture model); the third displays the geographical
assignment of individuals (hard clustering with geographical information); the fourth
shows the convergence history of trend coefficients and can be used to diagnose their
convergence. When invoked without any options, the command-line engine shows its
typical usage with some explanatory notes. Mandatory options are input data file, number
of individuals in the sample, ploidy, number of loci, (maximal) number of clusters, spatial
interaction parameter, parameter of allele frequency model, the type of run (with ou
without admixture), total number of sweeps of MCMC, and burn in number of sweeps of
MCMC. Other (optional) options will include the use of geographical distances between
individuals, the use of dummy individuals, etc and they will be explained later.
The GUI shell can help newbies to familiarize themselves with the software, and it
is generally a convenient way to use the tess program. It provides facilities for creating
and managing projects. A project is a coherent unit which groups the input data, the
algorithmic parameter settings, and the output results altogether. Projects are saved in
chosen folders automatically. By interacting with the GUI shell, users can check their
data, specify the parameter settings, run the MCMC algorithm, and visualize the results
without mastering the usage of the command-line engine.
tess also provides a couple of additional features. It is able to start from a pre-
specified clustering pattern obtained from the neighbor-joining algorithm, continue from
a previous run and it allows its user to modify the automatically generated neighborhood
system (accessible only via the GUI shell). It contains a routine that can help users
to generate individual spatial coordinates from population sample coordinates or spatial
2
ranges and a routine that can compute geographic distances between individuals. It can
perform multiple runs with distinct numbers of clusters, it can display a summary of
all runs and sort them by their values of the Deviance Information Criterion (DIC), a
statistical measure of the model prediction capabilities.
3 Method Description
Here we give a brief description of the methods used in the software. Denoting by (si ),
i = 1, . . . , N , the set of observed sites, each si is surrounded by points which are closer
to si than to any other sites. This set of points is known as the Dirichlet cell. We say
that two sites are neighbors if their corresponding cells share a common edge. The tess
program is based on a hierarchical MRF model whose neighborhood system is obtained
from the Voronoi tessellation (François et al., 2006; Chen et al., 2007). The program
offers the possibility to modify the neighborhood system by connecting additional sites
or by breaking links between sites. This option may be useful in order to include known
geographical barriers, and it generally allows users to specify their particular individual
network. The default weights on the network are set to one. For irregular sampling
designs, it is possible and useful to incorporate weights that depend on geographic distance
between sampling sites. This can be done via the “Compute Geographic Distance” option
before creating projects. In the case where geographic distances are avaible, the weight
between individuals i and j is set to wij = exp(−dij /θ) where dij is the distance between
individuals i and j and θ is a scale parameter (default value is the average distance
between individuals).
The data z = (zi` ) consist of N multilocus genotypes obtained from individuals located
at the sampled sites. An individual genotype zi records paired alleles at L loci, where
the number of possible alleles at locus ` is equal to J` . Although we are assuming diploid
individuals, the program is also able to handle haploid individuals, by setting the ploidy
parameter to A = 1 (default A = 2).
3
3.1 Model Without Admixture
We denote by ci the cluster from which the individual i originates, and we assume the
existence of at most Kmax clusters (ci ∈ {1, . . . , Kmax }). In practice the constant Kmax
could be considered larger than the true (or presumed true) number of clusters, K. As
in the structure program, the tess program performs the statistical inference of the
multidimensional parameter (c, f ) with c = {ci }i=1,...,N the cluster labels, f = (fk`j ),
k = 1, . . . , Kmax , ` = 1, . . . , L, and j = 1, . . . , J` the allele frequencies. The priors on
allele frequencies are Dirichlet distributions D(λ, . . . , λ). The prior distribution on the set
of cluster configurations is defined as a Gibbs distribution
where ψ is a nonnegative parameter called the interaction parameter, U (c) is the number
of neighboring pairs that share the same labels in c, and Z is a normalizing constant called
the partition function. With ψ equal to 0, this HMRF model assumes a non-informative
spatial prior, and then corresponds to the (no admixture, uncorrelated allele frequencies)
clustering model of Pritchard et al. (2000) which can be considered as a special case
of our model. Typical values of the interaction parameter could be taken in the range
ψ ∈ (0.5, 1) for K = 2−10. Inferences on (c, f ) are carried out by simulating the posterior
distribution π(c, f |z) through a Markov Chain Monte Carlo (MCMC) sampling algorithm.
where
log(α.k ) ∼ N µ.k , σk2 (Id − ψW )−1 ,
µ.k is a mean trend effect, W is a weight matrix defined on the Voronoi tessellation,
Id is the identity matrix, and ψ is a spatial interaction parameter. For all individuals
i ∈ {1, . . . , N }, and all cluster labels k ∈ {1, . . . , K}, we denote yik = log(αik ) and we
have
N
wij (yjk − µjk ) , σk2 ,
X
yik |yjk , j 6= i ∼ N µik + ψ
j=1
where µik is the ith coordinate of the mean µ.k and wij represents the weight of the inter-
action between individuals i and j. We have µik = X̃i βk , where X̃i is a vector containing
monomials of latitude and longitude (see section Main options to choose the degree), and
βk is a vector of regression coefficients which is estimated by the algorithm. If the degree
of the trend surface is ≥ 1, µ.k models geographic trends. The term N j=1 wij (yjk − µjk )
P
represents the part of randomness that can be captured by spatial autocorrelation, and
that is likely to result from isolation-by-distance. In this new model, the spatial interac-
tion parameter ψ represents the strength of the spatial autocorrelation. Note that this
interaction parameter has not the same meaning as in the HMRF model, and the MCMC
4
algorithm is able to estimate ψ. It may be beneficial to run the model without admixture
before trying to estimate admixture, because it could provide an upper bound on the
number of clusters in the data. In the presence of clines and if the samples are irregularly
spaced, the model without admixture might create patches along the clines. Performing
runs of the admixture model may then be useful to further determine the presence of
clines. The DIC could then be helpful to decide which patterns best explain the genetic
data.
The above described model is called the CAR model, and tess contains an alternative
modelling of admixture, called the BYM model, which is described in (Durand et al.,
2009).
4 Dominant markers
For dominant markers, such as AFLPs, it is not possible to distinguish between the
heterozygote genotypes and the dominant homozygote genotype. Following (Falush et al.,
2007) and starting from version 2.3, we implement a model to deal with dominant markers.
For each loci, we assume that there may be a single allele that is recessive to all other
alleles, while all other markers are codominant. Full details are given in (Falush et al.,
2007). In order to perform these computations the algorithm must be told which allele (if
any) is recessive at each locus. This is done by checking the box “Row of recessive alleles”
when creating a new project (see section Using the GUI Shell), and including a single row
of L integers in the input le, between the (optional) extra lines and the genotypes. This
row indicates the recessive allele at each of the L loci in the data set (see section Data
Format). If at a given locus all the markers are codominant then the recessive value at
that locus must be set to the missing data value (which must be a negative integer). If the
phenotype is unambiguous, then it is coded in the input le as it is. If it is ambiguous then
it is coded as homozygous for the dominant allele(s). For example, assume that allele A
is recessive versus alleles B and C, and that B and C are codominant. Then, phenotype
A is coded AA, B is coded BB, BC is coded BC. The genotypes AB, AC, etc are illegal
in the input le when A is recessive.
5 Data Format
tess allows the user to enter his/her data in two distinct formats. The first one (the
same as in tess 1.x) uses the same data format as structure. This format uses two
rows to represent the spatial coordinates and the multilocus genotype of one diploid in-
dividual. Besides the “pure data”, ie (XY)spatial coordinates + genotypic data, there
can be additional rows or columns present in the data file for informational or other pur-
poses. XY coordinates could be input either as longitude-latitude degrees or as standard
coordinates in the metric system. However, when attempting to predict on an ascii-raster
map, it is crucial that the first coordinate is longitude, and the second one latitude. tess
accepts population sample coordinates instead of individual coordinates. However, tess
will automatically add a small perturbation to population coordinates, so that all individ-
uals have distinct coordinates. See also the “Generate Spatial Coordinates” paragraph to
generate individual coordinates from population coordinates. All missing data should be
represented by a single negative integer. The content of a data file should look like this:
No Info X Y L1 L2 L3 L4 L5 L6 L7 L8 L9 L10
5
r1 r2 r3 r4 r5 r6 r7 r8 r9 r10
01 Info1 103.4 35.1 120 128 -9 129 156 234 148 124 182 98
01 Info1 103.4 35.1 120 128 -9 129 142 228 142 118 182 98
02 Info2 96.8 53.6 128 128 124 137 156 234 142 124 182 98
02 Info2 96.8 53.6 120 128 124 129 142 234 142 112 182 98
03 Info3 79.4 47.4 128 128 146 135 142 228 144 124 182 98
03 Info3 79.4 47.4 120 128 124 129 142 228 134 124 182 98
04 Info4 99.2 40.8 120 128 146 135 142 234 142 124 182 98
04 Info4 99.2 40.8 120 124 124 129 142 228 134 112 182 98
05 Info5 79.8 67.3 120 128 146 129 156 228 144 124 182 98
05 Info5 79.8 67.3 120 128 124 129 142 228 142 116 182 98
06 Info6 92.5 79.8 120 124 146 129 142 228 146 124 182 98
06 Info6 92.5 79.8 120 124 124 129 142 228 142 112 182 98
07 Info7 100.2 61.9 120 128 146 129 142 234 140 124 186 98
07 Info7 100.2 61.9 120 124 124 129 142 228 140 112 182 98
08 Info8 89.4 52.1 126 124 146 129 142 228 146 124 186 98
08 Info8 89.4 52.1 120 124 144 129 142 228 140 112 180 98
09 Info9 93.0 55.8 120 128 -9 129 156 234 146 116 -9 98
09 Info9 93.0 55.8 120 128 -9 129 142 228 142 112 -9 98
10 Info0 89.3 55.7 128 128 146 135 156 228 142 124 182 98
10 Info0 89.3 55.7 128 126 124 135 142 228 134 112 182 98
This example contains genotypic data from 10 diploid individuals at 10 loci and missing
data are represented by “-9”. Note that tess only makes use of the last columns in the
data. Therefore, for this example data file, the “Number of Extra Rows” should be set to
1 and the “Number of Extra Columns” should be set to 2. The line containing r1, . . . , r10
is optional. Each ri, i = 1 . . . 10 denotes the recessive allele for locus i. It must be an
integer. If no recessive allele is found at locus i, ri should be set to the missing data value
(which must be a negative integer).
tess also tolerates data stored in a format with one single row for each individual.
For diploid individuals, twice the number of loci (plus two columns) are then required for
encoding the whole data set, which should look like this (only the two first individuals
and the 3 first loci are reported):
Note that the recessive alleles (optional) are reported in the same way for the two data
formats.
6
the command-line engine internally and present the analysis results to users visually. The
GUI shell can be launched by double-click “tessgui.exe” in the tess home directory.
When using the GUI shell, users work with projects. A project is a coherent unit
which groups the input data, the algorithmic parameter settings, and the output results
altogether. To create a new project, access the menu “File⇒New Project...” or the cor-
responding button on the tool bar (see Figure 1). The GUI shell will show the “New
Project” dialog box asking the user to key in the required project information (see Figure
2).
The user should name his/her project and choose the project path and data (tip:
using the “Browse...” is a convenient way to input this information). Note that tess
requests its user to put his/her data in the same directory of the project file itself. We also
recommend users to store different projects using separate directories for clear organization
of information, although tess does not request this. The user also needs to input the
information and format of the project data. These include number of individuals, ploidy,
number of loci, missing data value, number of extra rows, and number of extra columns.
If the user is not clear of this information, he/she can always click the “View Data...”
button to check the data first (see Figure 3). From the left picture in Figure 3, we can
see there is no extra row or extra column and the first two columns store the spatial
coordinates of individuals; from the right picture in Figure 3, we know there are 792 rows,
therefore the number of individuals should be 396 and there are 7 columns, hence the
number of loci should be 5. After viewing the data, user can close the “Project Data”
window and continue to input information. Users should input information and format of
the data with care. Although the software can catch most common errors, it is impossible
to prevent users from making mistakes and wrong inputs may result in strange analysis
results. The user can also enter a geographic distance file, which should contain pairwise
7
Figure 2: Input information for a new project.
Figure 3: View data to check the number of individuals, the ploidy, the number of loci,
the missing data value, the number of extra rows, and the number of extra columns.
8
Figure 4: Input the level of spatial influence (with or without admixture), the number of
runs, the maximal number of clusters, the total number of sweeps, the burn in number of
sweeps, and the other parameters.
geographic distances between individuals. This file is optional. If the user does not enter
a geographic distance file, uniform weights will be used on the neighborhood diagram.
Otherwise, the neighborhood will incorporate weights that depend on the geographic
distances between individuals (see section Method description). When finishing inputs,
click the “OK” button to confirm the creation of the new project.
When the project is created, the GUI shell will automatically load it and show its
data to the user. The user can then start to run the project by accessing the menu
“Project⇒Run...” or the corresponding button on the tool bar. The GUI shell will show
the “New Run(s)” dialog box asking the user to key in the required run information (see
Figure 4).
9
Figure 5: Set the advanced options: change the spatial interaction parameter, the model
variance, the parameter of allele frequency model, and the geographic options.
10
Figure 6: Check the results by double-click items under the run name in the “Project”
window.
diagram and show the “Modify Neighborhood System” dialog box (see Figure 7). This
can be done before or after runs of project. Modified neighborhood system will be used
in subsequent runs. To modify the neighborhood system, first choose an individual; its
current neighbors will be displayed in the “Neighbors” window. The user can remove
neighbors for the chosen individual by select and move neighbors from the “Neighbors”
window to the “Removal List” window; he/she can also add neighbors by select and move
candidates from the “Candidates” window to the “Addition List”. After modification
for one individual, the user can click “Apply” to accept the modification and continue
to modify neighborhood for other individuals. At any time, the user can click “OK” to
accept or click “Cancel” to reject the overall modifications. Alternatively, introducing
dummy points may be a more direct way of modifying the spatial prior network of tess.
11
Figure 7: Visually modify the neighborhood system.
The “Generate Spatial Coordinates” option (Figure 8) can be used to generate ran-
dom spatial coordinates from population sample coordinates or from prescribed ranges of
coordinates. This option should be used prior to any run. Simulations from prescribed
ranges of spatial coordinates can be a convenient way to use tess when the individual
birth places are not available to the study. The user will be asked to input a 7 column
data file containing information on the number of individuals in each sample, the min
X coordinate, the max X coordinate and a SD value, the min Y coordinate, the max Y
coordinate and a SD value for each sample. The following example of a spatial coordinate
file has 2 population samples with 4 and 5 individuals. The X range in pop 1 is (7.4, 9.2)
and the Y range in pop 2 is (3.1, 6.5). The program will draw random coordinates within
the prescribed range
If Xmin = Xmax , the program will use the SD parameter to sample from a normal
distribution.
The program also asks the user to input a genotype matrix (see the Data Format
section) and then combines the simulated spatial coordinates and the multilocus genotypes
into a file with the correct tess format.
12
Figure 8: The “Generate Spatial Coordinates" option.
13
Figure 9: The “Generate Geographic Distances" option.
14
Figure 10: The resampling option.
15
6.6 What are the dummy points?
Using dummy points is another way to modify the tess network topology. Dummy
individuals should indeed represent points where no population can be sampled (i.e. major
water-masses, high mountain ranges, desert, etc). Dummy points could also be placed
between very distant samples, and may cut the links between individuals at the opposite
sides of the dummy cell. Dummy points will actually create additional white cells which
may anihilate the local correlations between individuals.
Dummy points must be placed at the end of the data set, and they must be input in
the last rows. They can be coded as standard individuals. Because their genotypic data
will be ignored, a convenient way to input dummy points is by using the missing value
symbol at all loci (eg −9). Here is an example data set with 10 diploid individuals and 2
additional dummy individuals, placed at the end of the data set. In this example, empty
cells will be located at points with spatial coordinates (84.5, 45.7) and (92.3, 52.7).
No Info X Y L1 L2 L3 L4 L5 L6 L7 L8 L9 LA
01 Info1 103.4 35.1 120 128 -9 129 156 234 148 124 182 98
01 Info1 103.4 35.1 120 128 -9 129 142 228 142 118 182 98
02 Info1 96.8 53.6 128 128 124 137 156 234 142 124 182 98
02 Info1 96.8 53.6 120 128 124 129 142 234 142 112 182 98
03 Info1 79.4 47.4 128 128 146 135 142 228 144 124 182 98
03 Info1 79.4 47.4 120 128 124 129 142 228 134 124 182 98
04 Info1 99.2 40.8 120 128 146 135 142 234 142 124 182 98
04 Info1 99.2 40.8 120 124 124 129 142 228 134 112 182 98
05 Info1 79.8 67.3 120 128 146 129 156 228 144 124 182 98
05 Info1 79.8 67.3 120 128 124 129 142 228 142 116 182 98
06 Info1 92.5 79.8 120 124 146 129 142 228 146 124 182 98
06 Info1 92.5 79.8 120 124 124 129 142 228 142 112 182 98
07 Info1 100.2 61.9 120 128 146 129 142 234 140 124 186 98
07 Info1 100.2 61.9 120 124 124 129 142 228 140 112 182 98
08 Info1 89.4 52.1 126 124 146 129 142 228 146 124 186 98
08 Info1 89.4 52.1 120 124 144 129 142 228 140 112 180 98
09 Info1 93.0 55.8 120 128 -9 129 156 234 146 116 -9 98
09 Info1 93.0 55.8 120 128 -9 129 142 228 142 112 -9 98
10 Info1 89.3 55.7 128 128 146 135 156 228 142 124 182 98
10 Info1 89.3 55.7 128 126 124 135 142 228 134 112 182 98
D1 Dummy 84.5 45.7 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9
D1 Dummy 84.5 45.7 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9
D2 Dummy 92.3 52.7 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9
D2 Dummy 92.3 52.7 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9
16
select which runs to display in the summary table. By default, all runs are displayed. The
average DIC is computed over the selected lower and upper runs. The summary table is
dynamically updated. One can still use tess while the summary window is open, and the
summary table will be automatically populated with any new run. Likewise, deleting a
run automatically removes it from the table. Figure 11 illustrates the summary window.
This window also provides a tool to easily export runs to the clumpp format. clumpp
can only average estimates of models with the same value of Kmax . A drop-down menu
allows the user to select a particular value of Kmax . A list of candidate runs is then
automatically filled with runs between the "lower" and "upper" ranges that correspond
to the selected value of Kmax . The user can then further filter the candidate runs by
choosing to keep only a percentage of them using the "Select Lowest DIC runs" drop-
down menu. For example, selecting 10% in this menu will automatically remove all runs
from the candidate list that are not within the 10% runs with the lowest DIC values.
Then, the user selects runs from the candidate list (multi-selection is enabled) and moves
them to the "Runs to Export" list by clicking on the right-arrow button. It is still possible
to remove runs from the rightmost list by selecting them and clicking of the left-arrow
button. Finally, the user chooses a file to store the exported runs in the clumpp format
(the popfile). The file will be created if it does not exist. If the "Write clumpp paramfile"
button is ticked, an additional file called "paramfile" is created in the same directory as
the selected popfile. This file contains all the parameters needed to run clumpp. The
user simply needs to copy the popfile and the paramfile to the clumpp directory and run
the clumpp software.
17
Figure 12: File Menu
tessm -FData -N500 -A2 -L10 -K8 -P0.6 -D1.0 -S12000 -B2000
The program will start to run and report its progress and timing. Four result files will
be produced in the same directory: “DataLH.png”, “DataAP.png”, “DataHC.png”, and
“DataTR.txt”. DataLH.png shows the log-likelihood history of the run; DataAP.png
contains a bar chart that shows the estimated assignment probabilities of individuals to
different clusters; DataHC.png shows the final geographical assignment of individuals;
DataTR.txt contains the estimated assignment probabilities and allele frequencies in text
format. User can also check “DataVD.png” for the Voronoi tessellation of the data,
“DataND.png” for the generated neighborhood system based on the Voronoi tessellation,
and “DataIM.png” for initial geographical assignment of individuals. These files are for
information only. Use a picture viewer to view the “.png” files and a text editor to view
the “.txt” files, or use the GUI shell to manipulate the command-line engine (then users
can analyze their data and visualize the results without using any external utilities).
A configuration file called “DataCF” is also generated each time a run obtains a lower
DIC value (always generated for the first run). The file can help the user to continue a
run by utilizing the results generated from a previous run. To continue from the run with
the lowest DIC value, issue this command:
tessm -FData -N500 -A2 -L10 -K8 -P0.6 -D1.0 -S12000 -B2000 -pDataCF
18
Figure 13: Tess Command-Line Usage (without admixture)
By continuing from a previous run, a user can gradually improve the analysis result until
his/her satisfaction. This can save a lot of time when he/she is exploring a large, complex
data set.
To read data file from a sub-directory “D Dir” and put result files into a sub-directory
“R Dir”, use the following command:
tessm -FData -N500 -A2 -L10 -K8 -P0.6 -D1.0 -S12000 -B2000 -i"D Dir" -o"R Dir"
For file (directory) name contains spaces, enclose it into quotes, as did in the above
example. The data directory and result directory must already exist (tess won’t create
these directories).
Should a user have a large number of data files to analyze, he/she is recommended
to use the command-line engine and he/she should put the commands into a batch file
(“.bat”). In this way, the data can be automatically analyzed in an unattended manner.
The content of an example batch file can be:
tessm -FData0 -N500 -A2 -L10 -K8 -P0.6 -D1.0 -S12000 -B2000
tessm -FData1 -N500 -A2 -L10 -K8 -P0.6 -D1.0 -S12000 -B2000
tessm -FData2 -N500 -A2 -L10 -K8 -P0.6 -D1.0 -S12000 -B2000
tessm -FData3 -N500 -A2 -L10 -K8 -P0.6 -D1.0 -S12000 -B2000
tessm -FData4 -N500 -A2 -L10 -K8 -P0.6 -D1.0 -S12000 -B2000
tessm -FData5 -N500 -A2 -L10 -K8 -P0.6 -D1.0 -S12000 -B2000
tessm -FData6 -N500 -A2 -L10 -K8 -P0.6 -D1.0 -S12000 -B2000
tessm -FData7 -N500 -A2 -L10 -K8 -P0.6 -D1.0 -S12000 -B2000
tessm -FData8 -N500 -A2 -L10 -K8 -P0.6 -D1.0 -S12000 -B2000
tessm -FData9 -N500 -A2 -L10 -K8 -P0.6 -D1.0 -S12000 -B2000
Now, we shortly describe the command-line engine for the model with admixture. It
can be invoked in the following way: on the Windows platform, first launch the “Com-
mand Prompt”, then change directory to the tess home and invoke “tessmAdmCAR.exe”
19
(or ”tessmAdmBYM.exe”) followed by its options. When no options are given to tess-
mAdmCAR.exe, it will show its typical usage and exit (see Figure 14).
As an example, we want to analyze the same data set as illustrated for the program
without admixture: say that there is a data file “Data” in the same directory of tessm.exe,
which contains 500 diploid individuals genotyped at 10 loci. Assuming there are at most
8 subpopulations, we set the initial spatial interaction parameter to 0.6, the parameter
of Dirichlet allele frequency model to 1.0, and the trend degree to 2. In addition, we
want the algorithm to estimate the spatial interaction parameter. If we want to run the
MCMC algorithm (admixture model) for a total 12,000 sweeps with the first 2,000 sweeps
discarded as burn-in, we can use the following command:
tessmAdmCAR -FData -N500 -A2 -L10 -K8 -P0.6 -T2 -D1.0 -S12000 -B2000 -upy
The ”-upy” argument tells tess to update the spatial interaction parameter. The program
will produce all the result files described for the program without admixture. In addition,
it will produce a file named “DataBH.png” which shows the convergence history of the
regression coefficients. It will also create a text file named “DataBetaHat.txt” which
contains the estimated regression coefficients.
DIC = D̄ + pD ,
where D̄ denotes the posterior mean of the deviance. There are different ways to compute
pD . We follow (Spiegelhalter et al., 2002; Gelman et al., 2003) by setting
pD = D̄ − D(θ̄) ,
i.e., the difference between the posterior mean of the deviance and the deviance taken at
the posterior mean of the parameters. Roughly speaking, pD represents the gain in fit
expected when estimating the model parameters.
The DIC can be used to select the maximal number of cluster (model without admix-
ture) or the maximal number of parental populations (admixture model), Kmax , that best
suits the data, in a way similar to the ∆K criterion traditionally used for structure
(Evanno et al., 2005). Alternatively, the DIC can be used to compare different admixture
models (or no-admixture models). For example, one can use the DIC to compare models
20
Figure 14: Tess Command-Line Usage (with admixture)
21
with different trend degrees. However, we do not recommend to use the DIC to compare
between the admixture and the no-admixture models.
An important intrinsic feature of imposing spatially structured priors is the possibility
for the MCMC algorithm to eliminate a number of spurious clusters automatically. When
we input a maximum of Kmax clusters to the model, the effective number of cluster in the
data may be a smaller value, K. In this case, the DIC sometimes selects models in which
Kmax is greater than K. Section 11 illustrates the use of the DIC to perform a full tess
analysis.
NCOLS xxx
NROWS xxx
CELLSIZE xxx
NODATA_VALUE xxx
row 1
row 2
...
row n
Here is an example of ASCII-raster map (only the first line is displayed here) :
#NCOLS 188
#NROWS 132
#XLLCENTER 0
#YLLCENTER 0
#CELLSIZE 1
22
#NODATA_VALUE -99
-99 -99 -99 -99 -99 -99 -99 -99 -99 -99 -99 -99 -99 -99 -99 -99 -99 -99 -99
-99 -99 -99 -99 -99 -99 -99 -99 -99 -99 -99 -99 -99 -99 -99 -99 -99 -99 -99
-99 -99 -99 -99 -99 -99 -99 -99 -99 -99 -99 -99 -99 -99 -99 -99 -99 -99 -99
-99 -99 -99 -99 -99 -99 -99 -99 -99 -99 -99 -99 -99 -99 -99 -99 -99 -99 -99
-99 -99 -99 -99 -99 -99 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 -99 -99 -99 -99 -99 -99 -99 -99 -99 -99 -99 -99 1 -99 -99
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 -99 -99
-99 -99 1 -99 -99 -99 1 1 1 -99 -99 -99 -99 -99 -99 -99 -99 -99
The first six lines of the ASCII-raster file describes the map parameters. Their meanings
are listed in table 1 below.
Parameter Description Requirements
Users input the map as a run parameter in tess (see section 6.1). It is crucial that the
individual coordinates in tess input data are ordered as longitude followed by latitude
for the prediction to work properly. The program will then generate Kmax new files in
the output directory. Each file contains the predicted value on every cell of the map that
is not equal to NODATA_VALUE. Typically, NODATA_VALUE can code for ocean if
the studied species is terrestrial. The user can then use a GIS to display the posterior
predictive map. Alternatively, one can use the R software (for instance using the image()
function and replacing the NODATA_VALUEs by NA values). Figure 15 shows an ex-
ample of a posterior predictive map diplayed using the R software. It corresponds to a
simulation analysis with K = 3 clusters. The 3 clusters are displayed on the same map.
23
Figure 15: Posterior predictive map for a data set simulating a demographic expansion
in Europe with K = 3 clusters. The map was drawn with the R software.
24
10.2 Spatial display of tess outputs
In order to display tess estimations of admixture proportions spatially, we provide an R
script that interpolates expected admixture proportions on every point on a grid. The
script can be found in the file R Script/krigAdmixProportions.R within the tess di-
rectory. The interpolation technique is known as universal kriging. We do not provide
technical details on universal kriging here. The interested reader can find further details
and references in (Ripley, 1981). Figure 16 shows an example of such an interpolation.
In order to use the R script, one first needs to store the estimated admixture proportions
in a text file in the clumpp output format. There are two ways to do so. The first (and
easiest) one is to directly copy-paste the estimates from tess textual results into a new
text file. The other option is to use the "Export to CLUMPP" feature implemented in
tess (see section 6.7). Let us assume we have created a file named estimates.txt with
one of the two methods, which contains the estimated individual admixture proportions
in the clumpp format. Also, let us denote data.txt the file containing the tess data.
Furthermore, let us assume that the two files are placed in the R Script directory (which
is inside the tess directory). Here is a typical instruction sequence to run the script:
• Load the files in R and store them into internal variables by typing: data <-
read.table("data.txt");
estimates <- read.table("estimates.txt")
The krigAdmixProportions function has several options, which we detail here. The full
list of arguments with their default values is
11 Tutorial
In this section, we provide a short tess tutorial. In particular, we highlight how to choose
Kmax for two particular examples.
The first example consist of a simulated data set which has already been studied in
(Chen et al., 2007). It consists of a group of individuals structured into five subpopulations
with a pairwise FST equal to 0.04. Each subpopulation contains 100 individuals genotyped
at 10 independent microsatellite loci. The genotypes are drawn from subpopulation-
specific allele frequencies. The geographic coordinates where simulated from Gaussian
distributions, so that the five subpopulations where organized in a star shape on a ring.
The regions of the five subpopulations overlap geographically. We ran tess without
admixture for Kmax ranging from 2 to 9 for 10,000 sweeps, discarding the first 5,000. The
25
Parameter Description Requirements
data R data frame (or matrix) containing tess data. Must be correctly formated
Alternatively, can contain coordinates only.
estimates R data frame (or matrix) containing estimated Must be correctly formated
admixture proportions.
twoRows Is "data" stored in two row per individual? Boolean (default: TRUE). Set it
If false, data contains one row per individual. to FALSE for haploid individuals
extraCol Number of extra columns in the file "data" Positive integer (default: 0)
(columns before the coordinates)
onePlot If TRUE, all maps are drawn in the same Boolean (default: TRUE)
window. Otherwise, one plot per cluster
Figure 16: Spatial interpolation of admixture proportions for an example data set analyzed
with tess.
other parameters were set to their default values. For each value of Kmax ,we computed
the DIC, and we plotted its value against Kmax . The membership coefficients displayed
in figure 17 (left) provide unambiguous evidence of 5 main clusters in the data. The
regularization implied by the spatial prior is visible on figure 17(C). Indeed, no clear
additional cluster is detected for Kmax > 5. The DIC curve decreases sharply and then
exhibits a plateau at Kmax = 5 (figure 17(A)). To emphasize the connection with the ∆K
method traditionally used to select Kmax with structure, figure 17 (right) displays the
same analysis performed with structure. It also correctly concludes that Kmax = 5.
The second example consists of a real data set. It contains 76 individuals genotyped
at 822 loci, from the Arabidopsis thaliana species. Keeping the same analysis scheme as
26
A B
Tess results (Fst = .04) Structure results (Fst = .04)
−20500
43000
−21000
ln P(D)
DIC
42000
−21500
41000
−22000
2 4 6 8 10 2 4 6 8 10
Number of cluster (Kmax) Number of cluster (Kmax)
C D
2 2
3 3
4 4
5 5
Kmax
Kmax
6 6
7 7
8 8
9 9
10 10
Figure 17: Analysis of an example data set (5-islands, Fst = .04) A) DIC for 9 TESS runs
with Kmax ranging from 2 to 10. The plateau starting at Kmax = 5 is an indication that the
number of clusters is K = 5. B) Logarithm of evidence, log P(D), for 9 STRUCTURE runs
with K ranging from 2 to 10, Evanno et al criterion indicates that there are 5 clusters in the
data set. C) Posterior estimates of cluster membership for TESS. For Kmax ≥ 5, 5 main clusters
are visible. D) Posterior estimates of cluster membership for STRUCTURE.
in the first example, we varied Kmax from 2 to 9. As this is a real data set, for which
the "true structure" is not known, we ran the admixture model of tess 100 independent
times for each value of Kmax . Each run consisted of 50,000 sweeps with a burn-in period
of 30,000. For each value of Kmax , we computed the DIC. We averaged the estimated
admixture coefficients over the 10% runs with the lowest values of the DIC using the
software clumpp. Figure 18 shows that the DIC selects Kmax = 4. Remark that the
admixture estimates provide evidence for four almost identical populations for all Kmax
in 4-9. The use of clumpp is greatly facilitated by using the GUI shell of tess, which
allows to export directly runs to the clumpp format. This functionnality is described in
section 6.7.
12 FAQs
1. How long should I run tess on my data set?
Answer: The default values suggest using very short runs, but we generally rec-
27
Germany-Switzerland
Belgium-Netherlands
Czech R.-Austria
Finland-Norway
A B
Ukraine-Russia
Poland-Estonia
France-Spain
-N. Sweden
British Isles
S. Sweden
Italy-Libya
-Portugal
2
Kmax
<
6
Figure 18: Analysis of the Arabidopsis thaliana data set (76 individuals genotyped at 822 loci).
A) DIC as a function of Kmax . Each dot corresponds to the average DIC value computed over
the 10 lowest DIC values. B) Posterior estimates of individual admixture coefficients for distinct
values of Kmax .
ommend to use longer runs (eg, burnin = 10,000 sweeps and runing period =
50,000 sweeps). Preliminary (short) runs may be useful to calibrate length. The
no-admixture program usually converges fast.
Answer: For the model without admixture, start with Kmax = 2, then increase
the number of cluster until the barplot stabilizes. The DIC should also stabilize
– or slowly vary – close to the correct value. The general strategy is to gradually
increment Kmax and then plot the DIC value against Kmax . In tess, the actual
number of cluster, K, may be less than Kmax . Analyze the results on the basis of
the DIC and their barplots. Be aware that this may not be the perfect solution.
28
4. How many runs should I perform?
Answer: For moderate size data sets, we recommend running hundreds of runs, and
then interpret the results from the runs having small DICs.
Answer: Keep the 20% lowest DIC runs, and use clumpp to perform averages over
these runs. The results may differ from the “Hard Clustering” colored tessellation of
tess. These averages may generally be closer to a Bayesian estimate than estimates
from a single run. You may use distruct to display the estimated membership
coefficients. You may also use the kriging functions provided by the program R to
interpolate them spatially, and overlay geographical maps to the results (R package
’maps’ and ’spatial’ or ’fields’ recommended). R scripts are available inside the tess
directory. They may also produce nice outputs.
Answer: It may be a good idea to try including realistic features and to remove the
very long edges in the tess network. Weighting by geographic distance or using
dummy points may be efficient alternative ways to do this.
Answer: No. Removing a link between two individuals means that these individuals
are not related a priori. Genetic evidence may be strong enough so that these two
individuals may be assigned to a same cluster. If all the links were removed, the
program would behave like structure.
29
References
Chen, C., Durand, E., Forbes, F., and François, O. (2007). Bayesian Clustering Algorithms Ascertaining Spatial Population
Structure: A New Computer Program and a Comparison Study. Molecular Ecology Notes, 7, 747–756.
Durand, E., Jay, F., Gaggiotti, O. E., and François, O. (2009). Spatial inference of admixture proportions and secondary contact
zones. Molecular Biology and Evolution, 26(9), 1963–1973.
Evanno, G., Regnaut, S., and Goudet, J. (2005). Detecting the number of clusters of individuals using the software STRUCTURE:
a simulation study. Molecular Ecology, 14(8), 2611–2620.
Falush, D., Stephens, M., and Pritchard, J. K. (2007). Inference of population structure using multilocus genotype data: dominant
markers and null alleles. Molecular Ecology Notes, 7(4), 574.
François, O., Ancelet, S., and Guillot, G. (2006). Bayesian Clustering Using Hidden Markov Random Fields in Spatial Population
Genetics. Genetics, 174, 805–816.
Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. (2003). Bayesian Data Analysis, Second Edition. Chapman & Hall/CRC.
Jakobsson, M. and Rosenberg, N. A. (2007). CLUMPP: a Cluster Matching and Permutation Program for Dealing with Label
Switching and Multimodality in Analysis of Population Structure. Bioinformatics, 23, 1801–1806.
Pritchard, J., Stephens, M., and Donnelly, P. (2000). Inference of population structure using multilocus genotype data. Genetics,
155(2), 945–959.
Rosenberg, N. A. (2004). DISTRUCT: A program for the graphical display of population structure. Molecular Ecology Notes, 4,
137–138.
Spiegelhalter, S. D., Best, N. G., Carlin, B. P., and Linde, A. V. D. (2002). Bayesian measures of model complexity and fit. Journal
of the Royal Statistical Society: Series B (Statistical Methodology), 64(4), 583–639.
30