Bio-Algorithms and Med-Systems 2020; 16(4): 20200057
Andrzej Krajka, Ireneusz Panasiuk, Adam Misiura and Grzegorz M. Wójcik*
On the mutation model used in the fingerprinting
DNA
https://doi.org/10.1515/bams-2020-0057
Received September 29, 2020; accepted November 13, 2020;
published online December 4, 2020
Abstract
Objectives: The most common technique of determining
biological paternity or another relationship among people
are the investigations of DNA polymorphism called
Fingerprinting DNA. The key concept of these investigations is the statistical analysis, which leads to
obtain the likelihood ratio (LR), sometimes called the paternity index.
Methods: Among the different assumptions stated in
these computations is a mutation model (this model is used
for all the computations).
Results and conclusions: Although its influence on LR is
usually negligible, there are some situations (when the
mother–child mutation arises) when it is crucial.
Keywords: fingerprinting DNA; microsatellite; mutation
models; simulation; STR.
Introduction
In paper [1] it was shown that the mutation model used in a
DNA view does not satisfy the Hardy–Weinberg hypothesis
and in consequence, and it is not recommended for long
time population simulations. Furthermore, in the case of
paternity testing when the maternal mutation occurs,
different mutation models may result in essentially different
conclusions (men accepted as fathers or excluded). However, there are biological reasons suggesting that the base,
on which this model is built, is correct. Xu et al. [2]
*Corresponding author: Grzegorz M. Wójcik, Chair of
Neuroinformatics and Biomedical Engineering, Maria CurieSklodowska University, Akademicka 9, 20-033, Lublin, Poland,
E-mail:
[email protected]. https://orcid.org/0000-00024678-9874
Andrzej Krajka, Chair of Intelligent Systems, Maria Curie-Sklodowska
University, Lublin, Poland
Ireneusz Panasiuk and Adam Misiura, Chair of Neuroinformatics and
Biomedical Engineering, Maria Curie-Sklodowska University, Lublin,
Poland
established a model containing the two terms: contraction
and expansion mutations. This investigation was continued
in Refs. [3, 4] There are a many consequences of those paper
results, and there arises many mutation models, which may
be considered as a two families of models:
– In paper [5] page 783 equation (3) there was described
a wide class of mutations models containing two
parts—contraction and expansion.
– In paper [6] Sainudiin Raazesh et al. (cf. the citations
given there) joined all described mutations models
into one family (equation (3) on page 385).
In paper [2] there was a wide discussion as to when
models give the stationary allele frequencies and try to fit the
parameters of mutations models family to the case of
chimpanzees and humans evolution (by comparing the
appropriate allele frequencies). The results were depressing,
suggesting that the mutation models are changing in time;
however authors emphasize the too small number of observations to build allele frequencies. The experiment determining which parameters of family of mutation models are
for the present human population (or some subpopulation)
true is extremely expensive – the mutations occurs rarely.
In this paper, using the data from the large Lublin
Medical University database:
– We check the assumptions of model [7].
– We compute the parameters suggested by us in an easy
mutation model. More precisely we mixed DNA View
mutation model with that obtained in Ref. [7]. This
model is very simple and is not included in models
described in Ref. [6]. Although in Ref. [6] the choice of
parameters of best fitted model are validated according to the statistical criterium (likelihood ratio, AIC),
we evaluate the best parameters with the genetic algorithm (as we mentioned early experimental observation is too expensive).
Materials and methods
All the computations are done on the large database of allelic DNA
frequencies obtained from the Medical University of Lublin. This database contains the DNA results of 7,076 individuals including personal
information such as name, date, and place of birth etc. From this
database we confine our consideration to unrelated persons. Thus we
2
Krajka et al.: On the mutation model
obtain 3,973 individuals who were typed using the classical electrophoresis method and 437 ones by the NGS (next generation sequencing)
technology. All individuals were genotyped within 15 STR (short
tandem repeats) markers located on the autosomal chromosomes.
The 3,973 persons loci were amplified using the commercial kits:
PowerPlex 16Hs System and PowerPlex 17ESX System (Promega
Corporation). PCR products were separated and detected on an ABI
3130 Genetic Analyzer (APplied Biosystem). The data were analyzed
with the Gene Mapper ID v3.2 software. Laser-induced fluorescence
has been used in the CE systems with the detection (by sensors) as low
as 10−18–10−21 mol. The sensitivity of techniques is attributed to the
high intensity of the incident light and the ability to focus the light
precisely on the capillary. The multicolor fluorescence detection can
be achieved by including multiple dichroic mirrors and bandpass
filters to separate the fluorescence emission amongst multiple
detectors (e.g. photomultiplier tubes), or by using a prism or grating to
project spectrally resolved fluorescence emission onto a positionsensitive detector (sensor) such as the CCD array. The CE systems with
4– and 5– colour LIF detection systems are used routinely for capillary
DNA sequencing and genotyping (DNA fingerprint) applications.
Four hundred and thirty seven individuals were profiled by the next
generation sequences (NGS) technology. Libraries were prepared using
ForenSeq DNA Signature Prep. Kit (Illumina) according to the manufacturer’s protocol. Sequencing was performed on the MiSeq FGx using the
Miseq FGx Reagent Kit v.3 (600 cycles) and ForenSeq Universal Analysis
Software package. Following the cluster generation, clusters are imaged
using LED and filter combinations specific to each of the four fluorescently labelled dideoxynucleotides. When imaging of a tile is complete, the flow cell is moved into the place to expose the nest tile. The
process is repeated for each cycle of sequencing. Following image
analysis, the software performs base calling, filtering and quality scoring.
All amplification results were stored in the MySQL database. All
computations were done in the R-3.6.0 language on the platform
RSTUDIO 1.2.1335.
Theoretical background (mutation
model in the DNA VIEW)
In a very popular application, called DNA VIEW, there is
an implemented algorithm of mutations, based on the
assumption:
Every microsatellite can decrease or increase a number
of repetitions by Δn with a probability.
P Δn
5
.
10Δn
(1)
Thus P |i−j| is the conditional probability that parent’s
gene with i repetitions arises as a gene with j repetitions (we
will write shortly i → j), provided mutation arises.
The important note is that usually:
∑ p|i−j| ≠ 1,
(2)
i, j∈N
So in implementation these results should be normalized.
We remark that some repetitions does not exist in some
populations and that the number of repetitions may not be
sometimes an integer value because the last repetition may
not be complete (for example 9.3 in TH01 or 30.2 in D21S11
loci). Let Λ denote the set of all possible repetitions in the
south-eastern Polish population in the considered loci.
Therefore the probability of mutation of allele with i repetitions into the allele with j repetitions (i → j) according to
the model assumed in the DNA VIEW is equal to:
⎪ 10⌊−|i−j|⌋
⎧
P |i−j| ⎪
⎨
, i ≠ j,
ωi, j
, i, j ∈ Λ,
(3)
⎪
αi
⎪
αi
⎩
0,
i j,
where
αi ∑ 10⌊−|i−j|⌋ , i ∈ Λ .
j∈Λ
It was shown in our previous studies [1] that this model
does not satisfy the Hardy–Weinberg equilibrium, and in a
period of time it leads to uniformly distributed frequencies;
however, there are biological reasons indicating that the
base of this model is correct.
Long studies have shown that the direction of the mutation is not random. The mutation model should take into account the influence of the length of the allele on the direction
of a mutation [2, 8–10], but the above mentioned DNA VIEW
mutation model does not include this factor.
In [1] it was shown that shorter microsattellites often
mutate into the longer ones—it was called the expansion
mutation, and longer STRs often mutate into shorter microsatellite—this was named the contraction mutation. This
approach was carried on in Ref. [2]. The authors consider two
types of mutations: expansion mutation (ωe, i, j, i < j)
and contraction mutation (ωc, i, j, i > j). They assumed that the
frequency of expansion mutation is uniformly distributed with
a respect to a repetition length whereas the frequency of
expansion mutation has the logistic distribution. The assumptions of Xu et al. [2] model can be summarized as follows:
(1) The overall rates of expansion and contraction mutations are equal
(2) The rate of expansion mutations is constant for all
alleles (in Ref. [2] this value was equal to 0.001)
(3) The rate of contraction mutations increases exponentially with a repetition length as given below:
ωc, i, j
aebi
, i > j; i, j ∈ Λ .
1 + aebi
(4)
(1) The alleles distribution is approximated by a logistic
distribution with the following probability density
function:
(x−Θ)
1
e− λ
P[X x]
2 ,x ∈ R.
λ 1 + e− (x−Θ)
λ
(5)
where λ and Θ are the parameters investigated by
observations.
3
Krajka et al.: On the mutation model
Although this model allows mutations which involve
genes that differ from themselves with more than one
repetition, which is consistent with other observations
[4, 11], in paper [2] only one step mutation was considered.
As in the former models [7, 12, 13] the probability of an
expansion mutation is independent of repetition length.
The general aim of our paper is:
– Using the large database of south-eastern Polish
population’s allele frequencies, we verify Assumption
4 of the Xu’s model. We compute the optimal parameters Θ and λ and verify the logistic law with χ 2 test.
– Joining the generalized, on the not necessarily, one
step Xu’s mutation model with the DNA View mutation
model rules, we find the numerically optimal parameters a and b from Assumption 3. The crucial for these
computations is the behavior of a population applied
to this mutation model in a long period of time. The
values a and b are optimal based on the assumptions
of the Hardy–Weinberg hypothesis.
i > j,
i j , , i, j ∈ Λ .
i < j,
ωc, i ∑ ωc, i, j
j:i>j
ωe, i ∑ ωe, i, j
j:i<j
(1 − m)aebi
,
1 + aebi
(6)
m
(1 − m)aebi
,
, 1−m ∑
1 + aebi
i∈Λ l
i∈Λ
(8)
(9)
which implies
ebi
1.
bi
i∈Λ 1 + ae
a∑
(12)
j∈Λ:i>j
j∈Λ:i<j
Let ω denote the overall mutation rate. We put (i → i) as
a probability that no mutation takes place. Then we have:
(i → j)
ωi, j ω,
1 − ∑k∈Λ, k≠i ωi, k ω,
i ≠ j,
i, j ∈ Λ ,
i j,
(13)
because for every i ∈ Λ, ∑j∈Λ (i → j) 1 , should hold.
The Hardy–Weinberg hypothesis can be rewritten as
Λ
j∈Λ
∑ νi (i → j) νj ,
(14)
i∈Λ
2
2
+ ∑ ∑ νi (i → j) − νj
j∈Λ
.
(15)
i∈Λ
We dropped the assumption (9) slightly allowing m to
be in the (0.3, 0.7) interval.
Simulation
(7)
where m is the fraction of contraction mutations among
all the mutations. Obviously Assumption 1 implies that
m=0.5 , but for arbitrary m we have:
m∑
αc, i ∑ 10⌊−|i−j|⌋ , αe, i ∑ 10⌊−|i−j|⌋ , i ∈ Λ .
with
ebi
−1
bi
i∈Λ 1 + ae
Let l denote the number of possible allele lengths (the
cardinality of Λ), then
m
,
l
(11)
f a∑
Thus our equation (3) should have the form as follows:
f or
f or
f or
i,j ∈ Λ,
where νi is a frequency of i-th allele. Using the genetic
algorithm we found the optimal a, b and m minimizing
f and such that (10) and (14) hold:
Suggested model of mutations
⎪
⎧
⎨ ωc, i, j ,
ωi, j ⎪ 0 ,
⎩
ωe, i, j ,
10⌊−|i−j|⌋
⎪
⎧
⎪
m
,
f or i>j,
⎪
⎪
⎪
lαc,i
⎪
⎪
⎨
ωi,j ⎪ 0,
f or ij,
⎪
⎪
⎪
⎪
⎪
aebi 10⌊−|i−j|⌋
⎪
⎩ (1−m)
, f or i<j,
1+aebi αe,i
(10)
In the area of contraction mutation as well as in the
area of expansion mutation we use the DNA View model
(3). Thus we have
Based on the large database of alleles frequencies, obtained from the Medical University of Lublin (cf. Section
Materials and Methods), we performed:
(i) Verification of assumptions of the Xu’s mutation
model,
(ii) Investigations of the best parameters for the population of south-eastern Poland.
At first, for aim (i), using 4,410 samples of 15 genes and
R-libraries fitdistr and MASS we found the optimal parameters Θ and λ of the law (5). The results are shown in
Table 1 (in the center). Then, using the χ2 test we compared
the distribution of alleles frequencies in a particular loci
with the theoretical law given by (5). There were some
problems due to the situation that some alleles in some loci
are not observed or rarely observed (fewer than five times).
In this situation, we bound this allele with the smaller
4
Krajka et al.: On the mutation model
Table : The best Θ and λ values and the result of genetic algorithm (a, b, m).
Allele
CSFPO
FGA
TPOX
DS
VWA
PENTA E
DS
DS
TH
DS
DS
DS
DS
DS
PENTA D
l
Θ
λ
a
b
m
f
l
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
−.
−.
−E−
−.
−.
−.
−.
−.
−.
−.
−.
−.
−.
−.
−.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.E−
.E−
.E−
.E−
.E−
.E−
.
.E−
.
.
.
.E−
.E−
.E−
.E−
neighbor. The Kolmogorov–Smirnov’s test cannot be used
because the distribution of alleles is not continuous.
For the aim (ii) we implemented the genetic algorithm
which found the optimal values a, b and m satisfying (10)
and (14). This algorithm is presented here
# Input parameters
T <- 5000 # Simulation time
Nind <- 20 # A number of individuals
mut<-c(0.0016, 0.0028, 0.0001, 0.0014, 0.0017,
0.0016, 0.0022, 0.0019, 0.0001, 0.0012, 0.0011,
0.001, 0.0014, 0.0011, 0.0014) # Mutations rate for
loci (cf. [4])
inputData<-read.csv2("Freq.csv") #
Table
of
frequencies
# inputData is the frequencies data frame having
the following structure:
#
- in the first column there is a name of allele
#
- in the i-th column there is (i-1)-th loci
# ####### cf. equation (6)
prepare <- function(freq, m, a, b, l){
wynx<-matrix(c(0),l,l)
for (i in 2:l){
alpha<-sum(10^floor(-abs(freq[c(1:(i-1)),1]-freq
[i,1])))
for (j in 1:(i-1)) {
wynx[i,j]<-(1-m)*a*exp(b*freq[i,1])*10^floor(abs(freq[i,1]-freq[j,1]))/
(alpha*(1+a*exp(b*freq[i,1])))
}}
for (i in 1:(l-1)){
alpha<-sum(10^floor(-abs(freq[c((i+1):l),1]-freq
[i,1])))
for (j in (i+1):l) {
wynx[i,j]<-m*10^floor(-abs(freq[i,1]-freq[j,1]))/
(alpha*l)
}}
wynx
}
# ######## cf. equations (7) and (9)
kar <-function(a,b,freq,w,wmut,l){
f<-(a*sum(exp(b*freq[,1])/(1+a*exp(b*freq
[,1])))-1)^2
for (j in 1:l) {
x=y=0.0;
for (i in 1:l) {
if (i!=j) {y<-y+freq[i,2]*wmut[i,j]*w; x<x+wmut[i,j]*w}
}
y<-y+(1-x)*freq[j,2]
f<-f+(y-freq[j,2])^2
}
f
}
# gene-number of loci, a,b,m - investigated parameters, f-error
result <- data.frame(gene=rep(0.0,15), a=c(0.0),
b=c(0.0),m=c(0.0),f=c(0.0))
# Restricted area
mx<-c(0.3,0.7); ax<-c(1.0e-8,0.8);
1.0e-8)
bx<-c(-30,-
Krajka et al.: On the mutation model
5
ind<-matrix(0,Nind,4) # Matrix of individuals
Nind x 4
for (gene in c(2:16)) { # Loop on loci
# initialisation of individuals: (m,a,b,f)
ind[, 1] <- runif(20, mx[[1]], mx[[2]])
ind[, 2] <- runif(20, ax[[1]], ax[[2]])
ind[, 3] <- runif(20, bx[[1]], bx[[2]])
# Set allele's names and frequencies
freq <- inputData[inputData[,gene]!=0,
gene)]
for
(i
in
1:2)
freq[,
i]
as.numeric(as.character(freq[, i]))
freq <- freq[order(freq[,1]), ]
freq[,2] <- freq[,2]/sum(freq[,2])
l <- nrow(freq)
c(1,
<-
for (iter in 1:T) { # Loop on generations
for (j in 1:Nind) {
# Individuals - matrix of
mut. allele i in j cf. (6)
wmut <- matrix(rep(0,l*l), l, l)
wmut <- prepare(freq, ind[j, 1], ind[j, 2], ind
[j, 3], l) # Generating mutation's
array
ind[j,4]<-kar(ind[j,2],ind[j,3],freq,mut[[gene1]],wmut,l) # Function f
}
ind <- ind[order(ind[,4]),] # Ordering
# Only the best 2 individuals remain in the next
iteration
for (j in 3:Nind) {
ind[j, 1] <- min(max(ind[j, 1] + runif(1, -0.1,
0.1), mx[[1]]), mx[[2]])
ind[j, 2] <- min(max(ind[j, 2] + runif(1, -0.1,
0.1), ax[[1]]), ax[[2]])
ind[j, 3] <- min(max(ind[j, 3] + runif(1, -0.2,
0.2), bx[[1]]), bx[[2]])
}
}
# End of loop on generations
result[gene-1, ] <- c(gene-1, ind[1, 2], ind[1,
3], ind[1, 1], ind[1, 4])
}
# End loop on loci
write.csv2(result, file = "results1.csv") #
Storing results
Figure 1: Fitness of the logistic, normal and Weibull laws to the
D8S1179 and VWA frequencies.
and 2.53 · 10−9 for different loci. Two examples of the fitness
of a law by the χ 2 test, performed to these loci are presented
in Figure 1 (the values in the legend are the p-value of
χ 2 test).
Thus the main assumptions of Xu and others [2] fail in
the case of for our database of frequencies—the allele distributions are not logistic. However, as we assumed that for
the biological reasons this model is true, we can find parameter’s values that fit the best. They are presented in
Table 1.
From the computations described in the Listing 1 algorithm, we obtained the best values for the constructed
mutation models (13) and (14). These values of parameters
a, b, m for different loci are summarized in Table 1. These
parameters can be used in practical paternity for another
consanguinity computing purpose.
Conclusions
Results
For all alleles, the chi-square test excludes the assumption
of the logistic distribution with p value between 4.02 · 10−318
–
The Lublin Medical University allele frequencies do
not satisfy the assumptions of Xu et. al [2] mutation
model. The allele frequencies have the gamma rather
than the logistic distribution.
6
–
Krajka et al.: On the mutation model
The best parameters of our mutation model (cf. (11)) are
summarized in Table 1. It has the important practical
consequence—these values should be used in paternity testing with our model.
4.
5.
Research funding: None declared.
Author contributions: All the authors have accepted
responsibility for the entire content of this submitted
manuscript and approved submission.
Competing interests: The funding organization(s) played
no role in the study design; in the collection, analysis, and
interpretation of data; in the writing of the report; or in the
decision to submit the report for publication. Authors state
no conflict of interest.
Ethical Approval: The conducted research is not related to
either human or animal use at Maria Curie-Sklodowska
University in Lublin, Poland, even though it concerns DNA
data analysis.
Employment or leadership: None declared.
Honorarium: None declared.
6.
7.
8.
9.
10.
11.
References
1. Krajka T. Mutation rate model used in the dna view program. Appl
Sci 2020;10:3585.
2. Xu X, Peng M, Fang Z, Xu X. The direction of microsatellite mutations is
dependent upon allele length. Nature genetics 2000;24:396–9.
3. Harr B, Schlötterer C. Long microsatellite alleles in drosophila
melanogaster have a downward mutation bias and short
12.
13.
persistence times, which cause their genome-wide
underrepresentation. Genetics 2000;155:1213–20.
Huang Q-Y, Xu F-H, Shen H, Deng H-Y, Liu Y-J, Liu Y-Z, et al.
Mutation patterns at dinucleotide microsatellite loci in humans.
Am J Hum Genet 2002;70:625–34.
Whittaker JC. Likelihood-based estimation of microsatellite
mutation rates. Genetics 2003;2:781–7.
Sainudiin R. Microsatellite mutation models: insights from a
comparison of humans and chimpanzees. Genetics 2004;1:
383–95.
FuY-X, ChakrabortyR. Simultaneousestimationof allthe parameters
of a stepwise mutation model. Genetics 1998;150:487–97.
Kelkar YD, Tyekucheva S, Chiaromonte F, Makova KD. The
genome-wide determinants of human and chimpanzee
microsatellite evolution. Genome Res 2008;18:30–8.
Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J,
et al. Erratum: initial sequencing and analysis of the
human genome: international human genome sequencing
consortium (nature (2001) 409 (860-921)). Nature 2001;412:
565–6.
Shao C, Lin M, Zhou Z, Zhou Y, Shen Y, Xue A, et al. Mutation
analysis of 19 autosomal short tandem repeats in Chinese
han population from Shanghai. Int J Leg Med 2016;130:
1439–44.
Harr B, Todorova J, Schlötterer C. Mismatch repair-driven
mutational bias in d. melanogaster. Mol Cell 2002;10:199–205.
Tachida H, Iizuka M. Persistence of repeated sequences that
evolve by replication slippage. Genetics 1992;131:471–8.
Walsh JB. Persistence of tandem arrays: implications for satellite
and simple-sequence dnas. Genetics 1987;115:553–67.
Supplementary Material: The online version of this article offers
supplementary material (https://doi.org/10.1515/bams-2020-0057).