Academia.eduAcademia.edu

On the mutation model used in the fingerprinting DNA

Abstract

Objectives: The most common technique of determining biological paternity or another relationship among people are the investigations of DNA polymorphism called Fingerprinting DNA. The key concept of these investigations is the statistical analysis, which leads to obtain the likelihood ratio (LR), sometimes called the paternity index. Methods: Among the different assumptions stated in these computations is a mutation model (this model is used for all the computations). Results and conclusions: Although its influence on LR is usually negligible, there are some situations (when the mother-child mutation arises) when it is crucial.

Bio-Algorithms and Med-Systems 2020; 16(4): 20200057 Andrzej Krajka, Ireneusz Panasiuk, Adam Misiura and Grzegorz M. Wójcik* On the mutation model used in the fingerprinting DNA https://doi.org/10.1515/bams-2020-0057 Received September 29, 2020; accepted November 13, 2020; published online December 4, 2020 Abstract Objectives: The most common technique of determining biological paternity or another relationship among people are the investigations of DNA polymorphism called Fingerprinting DNA. The key concept of these investigations is the statistical analysis, which leads to obtain the likelihood ratio (LR), sometimes called the paternity index. Methods: Among the different assumptions stated in these computations is a mutation model (this model is used for all the computations). Results and conclusions: Although its influence on LR is usually negligible, there are some situations (when the mother–child mutation arises) when it is crucial. Keywords: fingerprinting DNA; microsatellite; mutation models; simulation; STR. Introduction In paper [1] it was shown that the mutation model used in a DNA view does not satisfy the Hardy–Weinberg hypothesis and in consequence, and it is not recommended for long time population simulations. Furthermore, in the case of paternity testing when the maternal mutation occurs, different mutation models may result in essentially different conclusions (men accepted as fathers or excluded). However, there are biological reasons suggesting that the base, on which this model is built, is correct. Xu et al. [2] *Corresponding author: Grzegorz M. Wójcik, Chair of Neuroinformatics and Biomedical Engineering, Maria CurieSklodowska University, Akademicka 9, 20-033, Lublin, Poland, E-mail: [email protected]. https://orcid.org/0000-00024678-9874 Andrzej Krajka, Chair of Intelligent Systems, Maria Curie-Sklodowska University, Lublin, Poland Ireneusz Panasiuk and Adam Misiura, Chair of Neuroinformatics and Biomedical Engineering, Maria Curie-Sklodowska University, Lublin, Poland established a model containing the two terms: contraction and expansion mutations. This investigation was continued in Refs. [3, 4] There are a many consequences of those paper results, and there arises many mutation models, which may be considered as a two families of models: – In paper [5] page 783 equation (3) there was described a wide class of mutations models containing two parts—contraction and expansion. – In paper [6] Sainudiin Raazesh et al. (cf. the citations given there) joined all described mutations models into one family (equation (3) on page 385). In paper [2] there was a wide discussion as to when models give the stationary allele frequencies and try to fit the parameters of mutations models family to the case of chimpanzees and humans evolution (by comparing the appropriate allele frequencies). The results were depressing, suggesting that the mutation models are changing in time; however authors emphasize the too small number of observations to build allele frequencies. The experiment determining which parameters of family of mutation models are for the present human population (or some subpopulation) true is extremely expensive – the mutations occurs rarely. In this paper, using the data from the large Lublin Medical University database: – We check the assumptions of model [7]. – We compute the parameters suggested by us in an easy mutation model. More precisely we mixed DNA View mutation model with that obtained in Ref. [7]. This model is very simple and is not included in models described in Ref. [6]. Although in Ref. [6] the choice of parameters of best fitted model are validated according to the statistical criterium (likelihood ratio, AIC), we evaluate the best parameters with the genetic algorithm (as we mentioned early experimental observation is too expensive). Materials and methods All the computations are done on the large database of allelic DNA frequencies obtained from the Medical University of Lublin. This database contains the DNA results of 7,076 individuals including personal information such as name, date, and place of birth etc. From this database we confine our consideration to unrelated persons. Thus we 2 Krajka et al.: On the mutation model obtain 3,973 individuals who were typed using the classical electrophoresis method and 437 ones by the NGS (next generation sequencing) technology. All individuals were genotyped within 15 STR (short tandem repeats) markers located on the autosomal chromosomes. The 3,973 persons loci were amplified using the commercial kits: PowerPlex 16Hs System and PowerPlex 17ESX System (Promega Corporation). PCR products were separated and detected on an ABI 3130 Genetic Analyzer (APplied Biosystem). The data were analyzed with the Gene Mapper ID v3.2 software. Laser-induced fluorescence has been used in the CE systems with the detection (by sensors) as low as 10−18–10−21 mol. The sensitivity of techniques is attributed to the high intensity of the incident light and the ability to focus the light precisely on the capillary. The multicolor fluorescence detection can be achieved by including multiple dichroic mirrors and bandpass filters to separate the fluorescence emission amongst multiple detectors (e.g. photomultiplier tubes), or by using a prism or grating to project spectrally resolved fluorescence emission onto a positionsensitive detector (sensor) such as the CCD array. The CE systems with 4– and 5– colour LIF detection systems are used routinely for capillary DNA sequencing and genotyping (DNA fingerprint) applications. Four hundred and thirty seven individuals were profiled by the next generation sequences (NGS) technology. Libraries were prepared using ForenSeq DNA Signature Prep. Kit (Illumina) according to the manufacturer’s protocol. Sequencing was performed on the MiSeq FGx using the Miseq FGx Reagent Kit v.3 (600 cycles) and ForenSeq Universal Analysis Software package. Following the cluster generation, clusters are imaged using LED and filter combinations specific to each of the four fluorescently labelled dideoxynucleotides. When imaging of a tile is complete, the flow cell is moved into the place to expose the nest tile. The process is repeated for each cycle of sequencing. Following image analysis, the software performs base calling, filtering and quality scoring. All amplification results were stored in the MySQL database. All computations were done in the R-3.6.0 language on the platform RSTUDIO 1.2.1335. Theoretical background (mutation model in the DNA VIEW) In a very popular application, called DNA VIEW, there is an implemented algorithm of mutations, based on the assumption: Every microsatellite can decrease or increase a number of repetitions by Δn with a probability. P Δn  5 . 10Δn (1) Thus P |i−j| is the conditional probability that parent’s gene with i repetitions arises as a gene with j repetitions (we will write shortly i → j), provided mutation arises. The important note is that usually: ∑ p|i−j| ≠ 1, (2) i, j∈N So in implementation these results should be normalized. We remark that some repetitions does not exist in some populations and that the number of repetitions may not be sometimes an integer value because the last repetition may not be complete (for example 9.3 in TH01 or 30.2 in D21S11 loci). Let Λ denote the set of all possible repetitions in the south-eastern Polish population in the considered loci. Therefore the probability of mutation of allele with i repetitions into the allele with j repetitions (i → j) according to the model assumed in the DNA VIEW is equal to: ⎪ 10⌊−|i−j|⌋ ⎧ P |i−j| ⎪ ⎨ , i ≠ j, ωi, j  , i, j ∈ Λ, (3) ⎪ αi ⎪ αi ⎩ 0, i  j, where αi  ∑ 10⌊−|i−j|⌋ , i ∈ Λ . j∈Λ It was shown in our previous studies [1] that this model does not satisfy the Hardy–Weinberg equilibrium, and in a period of time it leads to uniformly distributed frequencies; however, there are biological reasons indicating that the base of this model is correct. Long studies have shown that the direction of the mutation is not random. The mutation model should take into account the influence of the length of the allele on the direction of a mutation [2, 8–10], but the above mentioned DNA VIEW mutation model does not include this factor. In [1] it was shown that shorter microsattellites often mutate into the longer ones—it was called the expansion mutation, and longer STRs often mutate into shorter microsatellite—this was named the contraction mutation. This approach was carried on in Ref. [2]. The authors consider two types of mutations: expansion mutation (ωe, i, j, i < j) and contraction mutation (ωc, i, j, i > j). They assumed that the frequency of expansion mutation is uniformly distributed with a respect to a repetition length whereas the frequency of expansion mutation has the logistic distribution. The assumptions of Xu et al. [2] model can be summarized as follows: (1) The overall rates of expansion and contraction mutations are equal (2) The rate of expansion mutations is constant for all alleles (in Ref. [2] this value was equal to 0.001) (3) The rate of contraction mutations increases exponentially with a repetition length as given below: ωc, i, j  aebi , i > j; i, j ∈ Λ . 1 + aebi (4) (1) The alleles distribution is approximated by a logistic distribution with the following probability density function: (x−Θ) 1 e− λ P[X  x]  2 ,x ∈ R. λ 1 + e− (x−Θ) λ  (5) where λ and Θ are the parameters investigated by observations. 3 Krajka et al.: On the mutation model Although this model allows mutations which involve genes that differ from themselves with more than one repetition, which is consistent with other observations [4, 11], in paper [2] only one step mutation was considered. As in the former models [7, 12, 13] the probability of an expansion mutation is independent of repetition length. The general aim of our paper is: – Using the large database of south-eastern Polish population’s allele frequencies, we verify Assumption 4 of the Xu’s model. We compute the optimal parameters Θ and λ and verify the logistic law with χ 2 test. – Joining the generalized, on the not necessarily, one step Xu’s mutation model with the DNA View mutation model rules, we find the numerically optimal parameters a and b from Assumption 3. The crucial for these computations is the behavior of a population applied to this mutation model in a long period of time. The values a and b are optimal based on the assumptions of the Hardy–Weinberg hypothesis. i > j, i  j , , i, j ∈ Λ . i < j, ωc, i  ∑ ωc, i, j j:i>j ωe, i  ∑ ωe, i, j  j:i<j (1 − m)aebi , 1 + aebi (6) m (1 − m)aebi , , 1−m ∑ 1 + aebi i∈Λ l i∈Λ (8) (9) which implies ebi  1. bi i∈Λ 1 + ae a∑ (12) j∈Λ:i>j j∈Λ:i<j Let ω denote the overall mutation rate. We put (i → i) as a probability that no mutation takes place. Then we have: (i → j)   ωi, j ω, 1 − ∑k∈Λ, k≠i ωi, k ω, i ≠ j, i, j ∈ Λ , i  j, (13) because for every i ∈ Λ, ∑j∈Λ (i → j)  1 , should hold. The Hardy–Weinberg hypothesis can be rewritten as Λ j∈Λ ∑ νi (i → j)  νj , (14) i∈Λ 2 2 + ∑ ∑ νi (i → j) − νj j∈Λ . (15) i∈Λ We dropped the assumption (9) slightly allowing m to be in the (0.3, 0.7) interval. Simulation (7) where m is the fraction of contraction mutations among all the mutations. Obviously Assumption 1 implies that m=0.5 , but for arbitrary m we have: m∑ αc, i  ∑ 10⌊−|i−j|⌋ , αe, i  ∑ 10⌊−|i−j|⌋ , i ∈ Λ . with ebi −1 bi i∈Λ 1 + ae Let l denote the number of possible allele lengths (the cardinality of Λ), then m  , l (11) f  a∑ Thus our equation (3) should have the form as follows: f or f or f or i,j ∈ Λ, where νi is a frequency of i-th allele. Using the genetic algorithm we found the optimal a, b and m minimizing f and such that (10) and (14) hold: Suggested model of mutations ⎪ ⎧ ⎨ ωc, i, j , ωi, j  ⎪ 0 , ⎩ ωe, i, j , 10⌊−|i−j|⌋ ⎪ ⎧ ⎪ m , f or i>j, ⎪ ⎪ ⎪ lαc,i ⎪ ⎪ ⎨ ωi,j  ⎪ 0, f or ij, ⎪ ⎪ ⎪ ⎪ ⎪ aebi 10⌊−|i−j|⌋ ⎪ ⎩ (1−m) , f or i<j, 1+aebi αe,i (10) In the area of contraction mutation as well as in the area of expansion mutation we use the DNA View model (3). Thus we have Based on the large database of alleles frequencies, obtained from the Medical University of Lublin (cf. Section Materials and Methods), we performed: (i) Verification of assumptions of the Xu’s mutation model, (ii) Investigations of the best parameters for the population of south-eastern Poland. At first, for aim (i), using 4,410 samples of 15 genes and R-libraries fitdistr and MASS we found the optimal parameters Θ and λ of the law (5). The results are shown in Table 1 (in the center). Then, using the χ2 test we compared the distribution of alleles frequencies in a particular loci with the theoretical law given by (5). There were some problems due to the situation that some alleles in some loci are not observed or rarely observed (fewer than five times). In this situation, we bound this allele with the smaller 4 Krajka et al.: On the mutation model Table : The best Θ and λ values and the result of genetic algorithm (a, b, m). Allele CSFPO FGA TPOX DS VWA PENTA E DS DS TH DS DS DS DS DS PENTA D l Θ λ a b m f l                . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . −. −. −E− −. −. −. −. −. −. −. −. −. −. −. −. . . . . . . . . . . . . . . . .E− .E− .E− .E− .E− .E− . .E− . . . .E− .E− .E− .E− neighbor. The Kolmogorov–Smirnov’s test cannot be used because the distribution of alleles is not continuous. For the aim (ii) we implemented the genetic algorithm which found the optimal values a, b and m satisfying (10) and (14). This algorithm is presented here # Input parameters T <- 5000 # Simulation time Nind <- 20 # A number of individuals mut<-c(0.0016, 0.0028, 0.0001, 0.0014, 0.0017, 0.0016, 0.0022, 0.0019, 0.0001, 0.0012, 0.0011, 0.001, 0.0014, 0.0011, 0.0014) # Mutations rate for loci (cf. [4]) inputData<-read.csv2("Freq.csv") # Table of frequencies # inputData is the frequencies data frame having the following structure: # - in the first column there is a name of allele # - in the i-th column there is (i-1)-th loci # ####### cf. equation (6) prepare <- function(freq, m, a, b, l){ wynx<-matrix(c(0),l,l) for (i in 2:l){ alpha<-sum(10^floor(-abs(freq[c(1:(i-1)),1]-freq [i,1]))) for (j in 1:(i-1)) { wynx[i,j]<-(1-m)*a*exp(b*freq[i,1])*10^floor(abs(freq[i,1]-freq[j,1]))/ (alpha*(1+a*exp(b*freq[i,1]))) }} for (i in 1:(l-1)){ alpha<-sum(10^floor(-abs(freq[c((i+1):l),1]-freq [i,1]))) for (j in (i+1):l) { wynx[i,j]<-m*10^floor(-abs(freq[i,1]-freq[j,1]))/ (alpha*l) }} wynx } # ######## cf. equations (7) and (9) kar <-function(a,b,freq,w,wmut,l){ f<-(a*sum(exp(b*freq[,1])/(1+a*exp(b*freq [,1])))-1)^2 for (j in 1:l) { x=y=0.0; for (i in 1:l) { if (i!=j) {y<-y+freq[i,2]*wmut[i,j]*w; x<x+wmut[i,j]*w} } y<-y+(1-x)*freq[j,2] f<-f+(y-freq[j,2])^2 } f } # gene-number of loci, a,b,m - investigated parameters, f-error result <- data.frame(gene=rep(0.0,15), a=c(0.0), b=c(0.0),m=c(0.0),f=c(0.0)) # Restricted area mx<-c(0.3,0.7); ax<-c(1.0e-8,0.8); 1.0e-8) bx<-c(-30,- Krajka et al.: On the mutation model 5 ind<-matrix(0,Nind,4) # Matrix of individuals Nind x 4 for (gene in c(2:16)) { # Loop on loci # initialisation of individuals: (m,a,b,f) ind[, 1] <- runif(20, mx[[1]], mx[[2]]) ind[, 2] <- runif(20, ax[[1]], ax[[2]]) ind[, 3] <- runif(20, bx[[1]], bx[[2]]) # Set allele's names and frequencies freq <- inputData[inputData[,gene]!=0, gene)] for (i in 1:2) freq[, i] as.numeric(as.character(freq[, i])) freq <- freq[order(freq[,1]), ] freq[,2] <- freq[,2]/sum(freq[,2]) l <- nrow(freq) c(1, <- for (iter in 1:T) { # Loop on generations for (j in 1:Nind) { # Individuals - matrix of mut. allele i in j cf. (6) wmut <- matrix(rep(0,l*l), l, l) wmut <- prepare(freq, ind[j, 1], ind[j, 2], ind [j, 3], l) # Generating mutation's array ind[j,4]<-kar(ind[j,2],ind[j,3],freq,mut[[gene1]],wmut,l) # Function f } ind <- ind[order(ind[,4]),] # Ordering # Only the best 2 individuals remain in the next iteration for (j in 3:Nind) { ind[j, 1] <- min(max(ind[j, 1] + runif(1, -0.1, 0.1), mx[[1]]), mx[[2]]) ind[j, 2] <- min(max(ind[j, 2] + runif(1, -0.1, 0.1), ax[[1]]), ax[[2]]) ind[j, 3] <- min(max(ind[j, 3] + runif(1, -0.2, 0.2), bx[[1]]), bx[[2]]) } } # End of loop on generations result[gene-1, ] <- c(gene-1, ind[1, 2], ind[1, 3], ind[1, 1], ind[1, 4]) } # End loop on loci write.csv2(result, file = "results1.csv") # Storing results Figure 1: Fitness of the logistic, normal and Weibull laws to the D8S1179 and VWA frequencies. and 2.53 · 10−9 for different loci. Two examples of the fitness of a law by the χ 2 test, performed to these loci are presented in Figure 1 (the values in the legend are the p-value of χ 2 test). Thus the main assumptions of Xu and others [2] fail in the case of for our database of frequencies—the allele distributions are not logistic. However, as we assumed that for the biological reasons this model is true, we can find parameter’s values that fit the best. They are presented in Table 1. From the computations described in the Listing 1 algorithm, we obtained the best values for the constructed mutation models (13) and (14). These values of parameters a, b, m for different loci are summarized in Table 1. These parameters can be used in practical paternity for another consanguinity computing purpose. Conclusions Results For all alleles, the chi-square test excludes the assumption of the logistic distribution with p value between 4.02 · 10−318 – The Lublin Medical University allele frequencies do not satisfy the assumptions of Xu et. al [2] mutation model. The allele frequencies have the gamma rather than the logistic distribution. 6 – Krajka et al.: On the mutation model The best parameters of our mutation model (cf. (11)) are summarized in Table 1. It has the important practical consequence—these values should be used in paternity testing with our model. 4. 5. Research funding: None declared. Author contributions: All the authors have accepted responsibility for the entire content of this submitted manuscript and approved submission. Competing interests: The funding organization(s) played no role in the study design; in the collection, analysis, and interpretation of data; in the writing of the report; or in the decision to submit the report for publication. Authors state no conflict of interest. Ethical Approval: The conducted research is not related to either human or animal use at Maria Curie-Sklodowska University in Lublin, Poland, even though it concerns DNA data analysis. Employment or leadership: None declared. Honorarium: None declared. 6. 7. 8. 9. 10. 11. References 1. Krajka T. Mutation rate model used in the dna view program. Appl Sci 2020;10:3585. 2. Xu X, Peng M, Fang Z, Xu X. The direction of microsatellite mutations is dependent upon allele length. Nature genetics 2000;24:396–9. 3. Harr B, Schlötterer C. Long microsatellite alleles in drosophila melanogaster have a downward mutation bias and short 12. 13. persistence times, which cause their genome-wide underrepresentation. Genetics 2000;155:1213–20. Huang Q-Y, Xu F-H, Shen H, Deng H-Y, Liu Y-J, Liu Y-Z, et al. Mutation patterns at dinucleotide microsatellite loci in humans. Am J Hum Genet 2002;70:625–34. Whittaker JC. Likelihood-based estimation of microsatellite mutation rates. Genetics 2003;2:781–7. Sainudiin R. Microsatellite mutation models: insights from a comparison of humans and chimpanzees. Genetics 2004;1: 383–95. FuY-X, ChakrabortyR. Simultaneousestimationof allthe parameters of a stepwise mutation model. Genetics 1998;150:487–97. Kelkar YD, Tyekucheva S, Chiaromonte F, Makova KD. The genome-wide determinants of human and chimpanzee microsatellite evolution. Genome Res 2008;18:30–8. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, et al. Erratum: initial sequencing and analysis of the human genome: international human genome sequencing consortium (nature (2001) 409 (860-921)). Nature 2001;412: 565–6. Shao C, Lin M, Zhou Z, Zhou Y, Shen Y, Xue A, et al. Mutation analysis of 19 autosomal short tandem repeats in Chinese han population from Shanghai. Int J Leg Med 2016;130: 1439–44. Harr B, Todorova J, Schlötterer C. Mismatch repair-driven mutational bias in d. melanogaster. Mol Cell 2002;10:199–205. Tachida H, Iizuka M. Persistence of repeated sequences that evolve by replication slippage. Genetics 1992;131:471–8. Walsh JB. Persistence of tandem arrays: implications for satellite and simple-sequence dnas. Genetics 1987;115:553–67. Supplementary Material: The online version of this article offers supplementary material (https://doi.org/10.1515/bams-2020-0057).