Bioqt 01
Bioqt 01
Bioqt 01
ABSTRACT:
Bioinformatics is a multi-disciplinary science focusing on the applications of
computational methods and mathematical statistics to molecular biology. Choosing
bioinformatics as specialization gives an opportunity to get involved with the most
interesting computational techniques dealing with biological data to contribute to
cure and diagnose some of genetic disorders that affect biological machines. The
purpose of this library (which defines namespace BioQt), is to provide a set of
routines for handling biological sequence data for Qt/C++ users (the full source
code available on https://github.com/alrawab/BioQt). This thesis will shed the light
on some modules of BioQt SDK such as exact string matching problem,
Microsatellite Repeats, Palindromic sequences and sequence alignment algorithms
(Longest Common Subsequence, Needleman-Wunsch and Smith-Waterman). This
thesis examines and evaluates these challenging problems in bioinformatics by
using Qt/C++.
Contents
1- Chapter One ...........................................................................................................1
Introduction ................................................................................................................1
1.1 Bioinformatics: ..............................................................................................1
1.2 The importance of bioinformatics: ................................................................2
1.3 Development of bioinformatics:....................................................................2
1.4 Bioinformatics application areas: ..................................................................3
1.4.1
1.4.2
1.4.3
1.4.4
Mutations detection:.............................................................................4
1.4.5
Phylogeny: ...........................................................................................5
1.4.6
1.4.7
1.4.8
1.4.9
Objective: .................................................................................................10
Alphabet: ............................................................................................12
2.2.2
2.2.3
Subsequence:......................................................................................12
2.2.4
Substring: ...........................................................................................12
2.2.5
2.2.6
Prefix: .................................................................................................13
2.2.7
2.2.8
Suffix:.................................................................................................13
2.2.9
Objective: .................................................................................................20
2.11
Material: ...................................................................................................20
2.12
Methods: ...................................................................................................20
2.13
2.14
Objective: .................................................................................................23
2.15
Material: ...................................................................................................23
2.16
Methods: ...................................................................................................24
2.17
Identity: ..............................................................................................46
3.3.2
Similarity: ..........................................................................................46
3.3.3
Homology: .........................................................................................46
3.9.2
Limitations: ........................................................................................56
3.10
Objective: .................................................................................................56
3.11
Material: ...................................................................................................56
3.12
Methods: ...................................................................................................56
Results: ...............................................................................................61
Experiment: ........................................................................................64
4.3.2
Results: ...............................................................................................64
4.4.1
Experiment: ........................................................................................68
4.4.2
Results: ...............................................................................................68
- Refrences: ......................................................................................................74
In God we trust
1- Chapter One
Introduction
This is an introductory chapter introduce the fundamental concepts about
bioinformatics (Sections 1.1 to 1.5), Qt framework Section(1.6), biological
macromolecules Sections(1.7 to 1.8) and Finally in both Sections 1.9 and 1.10
describes the motivations and the objectives of this thesis respectively.
1.1 Bioinformatics:
May not be a clear-cut definition and measurement of bioinformatics as term but it
can be referred to it as the administration, dealing and exploitation of biological
information by using computer methods that is why this field is one of
overlapping fields of knowledge among other sciences. It is clear that this
discipline
computer science but for accuracy this field guided for molecular science in
particular
includes the Biosimulation , Bioimaging and others to pour in the end to the
management and analysis of
biological
bioinformatics coming from the analysis of biomolecules such as DNA, RNA and
Proteins on the one hand and on the other hand the computer classifying and
arranging of this information, which is termed as Data Biomining.
of releases lasted until the seventies of the last century. In 1970 Needleman and
Wunsch [2] introduce an algorithm that searches the similarities in the amino acid
sequence of two proteins. Since the eighties of the last century has witnessed a
steady growth in the number of genomes that have been identified there sequences.
From the foregoing it is clear that bioinformatics beyond the bimolecular aspects
also needs:
1. An effective data warehouse.
2. Availability of data which stored in data warehouse.
3. Manipulation of stored data by different methods to extract useful
information.
processes
1.4.5 Phylogeny:
Its important to understand evolutionary and genetic relationships
between organisms; bioinformatics is very helpful to find out the
dimensional time for evolution through molecular sequencing data
and morphological data matrices.
1.4.6 Protein expression analysis:
Gene expression is measured in many ways including mRNA and
protein expression however protein expression is one of the best clues
of actual gene activity since proteins are usually final catalysts of cell
activity. Protein, microarrays and high throughput (HT) mass
spectrometry (MS) can provide a snapshot of the proteins present in a
biological sample. Bioinformatics is very much involved in making
sense of protein microarray and HT MS data.
1.4.7 Gene expression analysis:
The expression of many genes can be determined by measuring mRNA
levels with various techniques such as microarrays, expressed cDNA
sequence tag (EST) sequencing, serial analysis of gene expression
(SAGE) tag sequencing, massively parallel signature sequencing
(MPSS), or various applications of multiplexed in-situ hybridization
etc. All of these techniques are extremely noise-prone and subject to
bias in the biological measurement.
1.4.8 Protein structure prediction:
Polypeptide chain (so-called primary structure) can be easily
determined from the sequence on the gene that codes for it and this
primary structure uniquely determines a structure in its native
environment. To understand the function for a protein its vital to have
5
proper knowledge of this structure. For lack of better terms the protein
structure is classified as secondary, tertiary and quaternary structure.
Prediction of protein structure is very important task for
bioinformatics for drug design and the design of novel enzymes.
1.4.9 Primer design:
In the past decade molecular biology has been transformed from the
art of cloning a single gene to a statistical science measuring and
calculating properties of entire genomes. Its become an urgent
necessity for effective Programs for designing appropriate primers for
polymerase chain reaction (PCR).
1.6 Qt:
Qt (cute) is a cross-platform development framework that can be run on various
software and hardware platforms with little or no change in the underlying
codebase. The Qt framework first became publicly available in May 1995 which
initially developed by Haavard Nord and Eirik Chambe-Eng (from Norwegian
Institute of Technology in Trondheim and both graduated with master's degrees in
computer science)[4]. Qt uses standard C++ with extensions including signals and
slots that simplify handling of events, and this helps in development of both GUI
and server applications which receive their own set of event information and
should process them accordingly. Because of simplicity, robustness, native
performance, cross-platform compatibility and both commercial and open source
licenses, many organizations in many parts of the world use Qt.
Figure 01.1 Watson - Crick Model for the structure of DNA [10].
1.8 Proteins:
Proteins are built from twenty different amino acids (A, C, D, E, F, G, H, I, K, L,
M, N, P, Q, R, S, T, V, M, Y). Each protein has its distinct amino-acid sequence
and 3-dimensional structure. Proteins generated from DNA in process called gene
expression. The process of gene expression divided into two steps, the first is
transcription where the information coded in DNA is copied into a molecule of
RNA whose bases are complementary to those of the DNA. The second is
translation, where the information now encoded in RNA which translated into
instructions for manufacturing a protein utilizing the ribosome protein machine.
1.9 Motivations:
There are many good reasons to prefer Qt/C++ to other languages for writing this
library:
Portability:
Qt is Platform-independent this means the code can be complied
under Windows or Linux running on x86 hardware or Solaris running
on SPARC hardware.
Productivity:
Qt comes with full set of effective built-in programing modules
which help programing and save time such as QtGui, QtWebKit,
QtNetwork, QtOpenGL, QtSql and many others .
1.10 Objective:
The purpose of this thesis is to provide a programing library for scientists and
Programmers working on problems in bioinformatics and computational biology. It
may also appeal to programmers who want to improve their programming skills or
programmers who have been working in bioinformatics and computational biology
but are familiar with languages other than Qt/C++. A reasonable level of
programming skill is presumed as is some familiarity with some of the basic tasks
that need to be carried out in bioinformatics.
10
2 - Chapter Two:
String Processing Problems
DNA, RNA, and protein are represented as strings in bioinformatics for this reason
string processing is the cornerstone in the field of bioinformatics and these
problems take a variety of manifestations each of which has a specific meaning.
This topic will shed some light on some traditional string problems such as: local
sequence alignment problem, global sequence alignment problem, exact pattern
matching problem, approximate pattern matching problem, finding all maximal
palindromes problem, finding all tandem repeats problem, finding all tandem
arrays problem, etc. There are quite rich researches for these problems. This thesis,
will propose the major algorithms in this respect which implemented in BioQt.
2.1 Keywords:
Computational Biology; Bioinformatics ; Naive Algorithm ; Boyer-Moore ;
KnuthMorrisPratt; Palindromes; Microsatellite; Manacher Algorithm; dynamic
algorithms;
Needleman-Wunsch;
Smith-Waterman
11
Longest
Common
2.2.4 Substring:
A subsequence of consecutive characters e.g. TAC is a substring of
ACTACA, TAC is not a substring of ATGAC.
12
2.2.6 Prefix:
Substring containing the first character of a string, including the
empty string e.g. in the string ACT, a prefix may be the empty string,
A, AC, or ACT.
2.2.8 Suffix:
Substring containing the last character of a string e.g. in the string
ACT, may be T, CT or ACT.
13
distribution throughout the genome have made microsatellites one of the most
popular genetic markers for use in plant breeding programs(Gous Miah, Mohd
Y.Rafii et.al) [16].
2.4 Objective:
Design Qt/C++ class to Detect Microsatellite Repeats in Sequences and
investigated the position and frequencies of repeats.
14
polyglutamine (polyQ)
Non-Polyglutamine
15
Gene
DRPLA
ATN1 or
(Dentatorubropallidoluysian
DRPLA
Normal PolyQ
repeats
Pathogenic PolyQ
repeats
6 - 35
49-88
6 - 35
36 - 250
AR
9 - 36
38 - 62
ATXN1
6 - 35
49 - 88
ATXN2
14 - 32
33 - 77
ATXN3
12 - 40
55 - 86
CACNA1A
4 - 18
21 - 30
ATXN7
7 - 17
38 - 120
TBP
25 - 42
47 - 63
atrophy)
HD (Huntington's disease)
HTT
(Huntingtin)
Gene
Codon
Normal/wild
type
Pathogenic
FRAXA (Fragile
X syndrome)
FXTAS (Fragile
X-associated
tremor/ataxia
syndrome)
FRAXE (Fragile
XE mental
retardation)
CGG
6 - 53
230+
CGG
6 - 53
55-200
CCG
6 - 35
200+
GAA
7 - 34
100+
DMPK
CTG
5 - 37
50+
OSCA or SCA8
CTG
16 - 37
110 - 250
PPP2R2B or
SCA12
7 - 28
66 - 78
FRDA
(Friedreich's
ataxia)
DM (Myotonic
dystrophy)
SCA8
(Spinocerebellar
ataxia Type 8)
SCA12
(Spinocerebellar
ataxia Type 12)
AFF2 or FMR2,
on the Xchromosome
FXN or X25,
(frataxin
reduced
expression)
2.6 Material:
C++ compiler gnu GCC for Unix/mac or VC++ for MS windows.
Qt SDK.
Any Computer platforms (Pc/Mac or UNIX).
17
2.7 Methods:
FindMicrosatelliteRepeats class is written in Qt/C++ and calculates Microsatellite
Repeats positions and frequencies uses sliding window algorithm.
2.8 Palindromes:
A palindrome, in the literary sense, refers to a word or a phrase that reads the same
in both directions, i.e. when it is read in forward and reverse. One of the oldest
palindromes known is the phrase Sator arepo tenet opera rotas [19]. But in the
biological context definition of palindromes is slightly differs, in case of proteins
there is no difference compared to regular English language definition of
palindrome, in contrast the case of either DNA or RNA shows evident difference.
For a nucleotide sequence to be considered as a palindrome, its complementary
strand must read the same in the opposite direction i.e. the sequence 5`-CGATCG3` is considered a palindrome since its reverse complement 3`-GCTAGC-5` reads
the same (David Roy Smith et. al) [20]. Palindromes can be even or odd. An odd
palindrome is one in which there is no asymmetry in the center or mismatch in any
part of the sequence, such that its reverse compliment reads the same as the
original sequence itself.
2.10 Objective:
Implement Qt/C++ class to detect Palindrome Subsequence.
2.11 Material:
C++ compiler gnu GCC for Unix/mac or VC++ for MS windows
Qt SDK
Any Computer platforms (Pc/Mac or UNIX).
2.12 Methods:
Dynamic programing approach.
Dynamic Programing Approach
2- Extract Palindrome
1- Populate Matrix
20
22
2.14 Objective:
Implement Qt/C++ class to solve the exact string matching.
2.15 Material:
C++ compiler gnu GCC for Unix/mac or VC++ for MS windows.
Qt SDK.
Any Computer platforms (Pc/Mac or UNIX).
23
2.16 Methods:
1. Brute Force (Naive) algorithm.
2. Knuth-Morris-Pratt (KMP) algorithm.
3. Boyer-Moore Algorithm.
4. Baeza-YatesGonnet (shift-or/and) algorithm.
24
b
a
a
b
25
Its obvious the algorithm is simple and redundant e.g. if m=3 the
algorithm will compare positions: 0 1 2
, 1 2 3 etc. to avoid
26
27
1
a
0
2
b
0
3
a
4
b
5
a
6
c
7
a
5
a
6
c
7
a
5
a
6
c
7
a
1
a
0
2
b
0
3
a
1
4
b
1
a
0
2
b
0
3
a
1
4
b
2
29
1
a
0
2
b
0
3
a
1
4
b
2
5
a
3
6
c
7
a
5
a
3
6
c
1
7
a
5
a
3
6
c
1
7
a
1
1
a
0
2
b
0
3
a
1
4
b
2
1
a
0
2
b
0
3
a
1
4
b
2
Step 3: Compare the first character of the pat tern with first character
of Text. If match i s not found, substitute the value of [q] to q.
I f match is found, and then increment the value of q by 1.
Step 4: Check whether all the pat tern elements are matched with the
text elements if not, repeat the search process. If yes, print the number
of shifts taken by the pat -tern.
Step 5: look for the next match.
31
String
Pattern
b
b
a
a
b
b
a
c
String
Pattern
b
b
a
a
b
b
a
c
P [1] does not match with S [1]. P will be shifted one position to the
right.
Step 2: i = 2, q = 0.
Comparing p [1] with S [2].
String
Pattern
b
a
b
b
a
a
b
b
33
a
a
a
c
c
a
Step 3: i = 3, q = 1.
Comparing p [2] with S [3] p [2] does not match with S [3].
String
Pattern
a
a
b
b
a
c
Step 4: i = 4, q = 0.
Comparing p [1] with S[4] p [1] does not match with S[ 4].
String
Pattern
b
b
a
a
b
b
a
c
c
a
Step 5: i = 5, q = 0.
Comparing p [ 1] with S[5].
String
Pattern
b
a
a
b
a
a
b
b
34
a
a
a
c
c
a
Step 6: i = 6, q = 1.
Comparing p [2] with S [6] p [2] matches with S [6].
String
Pattern
a
b
a
a
b
b
a
c
Step 7: i = 7, q = 2.
Comparing p [3] with S [7] p [3] matches with S [7].
String
Pattern
b
a
b
b
a
c
c
a
Step 8: i = 8, q = 3.
Comparing p [4] with S [8] p [4] matches with S [8].
String
Pattern
b
a
b
b
b
b
a
35
a
a
a
c
c
a
Step 9: i = 9, q = 4.
Comparing p [5] with S [9] p [5] matches with S [9].
String
Pattern
b
b
a
a
a
c
String
Pattern
b
b
a
a
b
b
String
a
36
Pattern
String
Pattern
b
b
a
a
b
b
a
c
String
Pattern
b
a
b
b
a
a
b
b
a
a
a
c
c
a
2.18.4 Advantage:
Best known for linear time for exact matching.
Compares from left to right.
Shifts more than one position.
Preprocessing approach of Pattern to avoid trivial
Comparisons.
Avoids recomputing matches.
2.18.5 Disadvantages:
Doesnt work as well as the size of the alphabets increases. By
which more chances of mismatch occurs.
to increase both efficiency and accuracy and as result of that there are too many
variations of this algorithm e.g. Boyer-Moore Smith (MBS), Turbo Boyer Moore
(TBM), Two Way algorithm (TW), Berry Ravindran algorithm (BR), Reverse
Colussi algorithm (RC) and Sunday algorithm .
39
40
41
42
2.19.4 Advantages:
The both good-suffix and bad-char combined provides a good shift value as
maximum of two is taken as shift value.
2.19.5 Disadvantages:
The preprocessing of good-suffix is complex to implement and understand.
Bad-char of mismatch character may give small shift, if mismatch after
many matches.
43
3 - Chapter Three:
String Matching Evaluation Methods
Identifying patterns among biological sequences was the title for a lot of researches
in bioinformatics. Evaluation of string matching or approximate string matching
(often colloquially referred to as fuzzy string searching) is performed by
implementations of different algorithms on the theoretical side, some of the famous
algorithms used in sequences comparison such as: Longest Common Substring
(LCS), Subsequence (LCSS), Global alignment (Needleman-Wunsch) and Local
alignment (Smith-Waterman) algorithms. Many biological machines share very
similar gene sequences while some regions of sequences vary from spice to spice
or even among individuals from the same domain. The comparison committed by
recognizing the regions of similarity that may be a consequence of functional,
structural, or evolutionary relationships between the sequences. i.e. the similarity
shows that the sequences share a common ancestral sequence. As a result similar
sequences have similar functionality. Such sequences are said to be homologous.
Furthermore the knowledge of a DNA sequence and gene analysis can be used in
several biological, medical and agricultural research fields such as: possible
disease or abnormality diagnoses, forensics, pattern matching, biotechnology, etc.
The analysis and comparison studies for DNA sequences can be used to detect
possible errors or abnormality in a DNA sequence and predict the function of a
specific gene and compare it with other similar genes from same or different
organisms. Each newly discovered DNA sequence is classified according to its
Similarity with known DNA sequences.
44
Multiple
sequence
alignment
Pairwise
sequence
alignment
45
3.3 Terminology:
3.3.1 Identity:
Proportion of pairs of identical residues between two aligned
sequences. Generally expressed as a percentage. This value strongly
depends on how the two sequences are aligned.
3.3.2 Similarity:
Proportion of pairs of similar residues between two aligned sequences.
If two residues is similar is determined by a substitution matrix.
This value also depends strongly on how the two sequences are
aligned, as well as on the substitution matrix used.
3.3.3 Homology:
Two sequences are homologous if and only if they have a common
ancestor. There is no such thing as a level of homology (It's either yes
or no).
Figure 3.1 a simple dot matrix comparison of two DNA sequences AGCTAGGA
and GACTAGGC [29].
Dynamic Programming methods (Rigorous mathematical) this approach is
slow but optimal and uses two algorithms Needleman-Wunsch algorithm
(for global pairwise alignment) and Smith-Waterman algorithm (for local
47
As mentioned before in section 3.4 there is more than one method for performing
Pairwise Sequence Alignment. This thesis will focus on DP (Dynamic
Programming) methods which first introduced by Richard Bellman in 1953 to
study multistage decision problems [30]. The idea behind this style of problem
solving is breaking down the big problem to smaller sub problems (often many of
these sub problems are really the same) by solving each sub problems only once
this will reducing the number of computations. This approach is especially useful
when the number of repeating sub problems grows exponentially as a function of
the size of the input. In this method each pair of characters is compered in the two
48
sequences and generates an alignment. This alignment will include matched and
mismatched characters and gaps in the two sequences that are positioned so that
the number of matches between identical or related characters is the maximum
possible. The dynamic programming algorithm provides a reliable computational
method for aligning DNA and protein sequences. The method has been proven
mathematically to produce the best or optimal alignment between two sequences
under a given set of match conditions. Optimal alignments provide useful
information to biologists concerning sequence relationships by giving the best
possible information as to which characters in a sequence should be in the same
column in an alignment and which are insertions in one of the sequences (or
deletions on the other). This information is important for making functional,
structural and evolutionary predictions on the basis of sequence alignments. There
are two important algorithms used in DP pairwise sequence alignment:
Needleman and Wunsch Algorithm (for pairwise global sequence
alignment).
Smith and Waterman algorithm (for pairwise local sequence alignment).
Score
Match
+1
Mismatch
-1
Gap
-2
in which F(i, j) equals the best score of the alignment of the two prefixes (x 1,
x2, . . . , xi) and (y1, y2, . . . , yj).This will be done recursively by setting F(0,
0) = 0 and then computing F(i, j) from F(i-1, j-1), F(i-1, j) and F(i, j-1).
50
-2
-4
-6
-8
-10
-12
-2
-1
-3
-5
-7
-9
-4
-1
-2
-4
-4
-6
-6
-3
-1
-3
-5
-8
-5
-2
-1
-2
-10
-7
-4
-3
51
-2
-4
-6
-8
-10
-12
-2
-1
-3
-5
-7
-9
-4
-1
-2
-4
-4
-6
-6
-3
-1
-3
-5
-8
-5
-2
-1
-2
-10
-7
-4
-3
53
54
PAM series :
Percent Accepted Mutation (Dayhoff M., 1968, 1972, 1978)[34]. A unit
introduced by Dayhoff et al [34]. To quantify the amount of evolutionary
change in a protein sequence. 1.0 PAM unit is the amount of evolution
which will change on average 1% of amino acids in a protein sequence. A
PAM(x) substitution matrix is a look-up table in which scores for each
amino acid substitution have been calculated based on the frequency of that
substitution in closely related proteins that have experienced a certain
amount (x) of evolutionary divergence:
Based on 1572 protein sequences from 71 families.
3.9.2 Limitations:
Substitution matrices do not take into account long range
interactions between residues.
They assume that identical residues are equal (whereas in real
life a residue at the active site has other evolutionary constraints
than the same residue outside of the active site).
They assume evolution rate to be constant.
3.10 Objective:
Implement Qt/C++ class for Similarity measure.
3.11 Material:
C++ compiler gnu GCC for Unix/mac or VC++ for MS windows
Qt SDK
Any Computer platforms (Pc/Mac or UNIX).
3.12 Methods:
- Dynamic programing approach:
Global alignment (Needleman-Wunsch algorithm).
Local alignment (Smith-Waterman algorithm).
56
4 - Chapter Four
Case Study and Results
4.1 Microsatellite Repeats Case Study:
Apply FindMicrosatelliteRepeats on two types of Trinucleotide repeat disorder
such as FXN gene with NCBI Reference Sequence: NM_181425.2 [35] and FMR1
(fragile X mental retardation 1) with NCBI Reference Sequence: L19493.1 [36]:
Motif
MS Sequence
TA
TATATA
TT
333
TTTTTT
TT
340
TTTTTT
AT
547
ATATAT
TG
687
TGTGTGTG
AA
721
AAAAAA
TT
745
TTTTTT
TT
883
TTTTTT
TTGTTT
931
TTGTTTTTGTTTTTGTTT
GA
981
GAGAGAGA
GA
1037
GAGAGA
57
TT
1138
TTTTTTTTTT
GA
1207
GAGAGA
AT
1312
ATATAT
TT
1373
TTTTTTTT
TT
1382
TTTTTT
TT
1395
TTTTTT
TT
1423
TTTTTT
TT
1434
TTTTTTTTTT
CA
1782
CACACA
AT
1938
ATATAT
TG
2339
TGTGTG
Motif
MS Sequence
CCCAG
265
CCCAGCCCAGCCCAG
CT
483
CTCTCT
TT
544
TTTTTT
TT
934
TTTTTT
GTT
940
GTTGTTGTTGTTGTT
TT
957
TTTTTTTT
TT
1070
TTTTTT
AA
1157
AAAAAA
TG
1499
TGTGTG
AT
1506
ATATATAT
58
AA
1560
AAAAAA
AA
1589
AAAAAA
AG
1707
AGAGAG
AG
1808
GAGAGA
TA
1924
TATATA
AA
2128
AAAAAA
AAT
2284
AATAATAAT
ATA
2299
ATAATAATAATAATA
AA
2618
AAAAAAAAAAAAAA
AA
2834
AAAAAA
AA
2842
AAAAAAAA
TAA
2856
TAATAATAA
TT
2898
TTTTTT
TT
2918
TTTTTT
AA
3403
AAAAAA
GG
3483
GGGGGG
AAAT
3558
AAATAAATAAATAAATAAAT
TG
3658
TGTGTG
TT
3823
TTTTTTTT
TT
3841
TTTTTT
TT
3978
TTTTTT
GT
4081
GTGTGT
CT
4191
CTCTCT
GT
4219
GTGTGT
TT
4316
TTTTTTTTTTTTTT
59
TG
4765
TGTGTG
AC
4821
ACACAC
TT
5025
TTTTTT
AA
6027
AAAAAA
TA
6078
TATATA
TC
6548
TCTCTC
GG
6853
GGGGGG
CT
6901
CTCCTCCTC
CA
7086
CACACA
4.2.1 Results:
4.2.1.1 Experiment 1:
dynamic programing approach :
Test KC733827 sequence against BioQt dynamic programing
Palindrome finder algorithm with minimum palindrome length 10 and
maximum length 20:
Palindromic sequence
Position
AGTCCGGACT
617
GGATCGATCC
708
ACTTTTAAAGT
1021
ATTCTCAGAAT
1100
TGAGTACTCA
1119
TTTTAATTTAAAA
1729
AAGCTAGCTT
2703
CATGTTTAAACATG
2720
61
GTTAATCATTAA
2762
TTTTAATTTAAAA
4320
Crashed down.
4.2.1.2 Experiment 2:
dynamic programing approach :
Test dummy sequence against BioQt dynamic programing Palindrome
finder algorithm with minimum palindrome length 6 and maximum
length 20:
Palindromic sequence
Position
ATGCCAT
CATGATG
CGTTCAACG
75
ACGTAACGT
87
CGTAACG
93
ACACTGT
115
GTTAAC
120
Naive Algorithm :
Only one palindrome detected at position 120 GTTAAC.
62
4.3.1 Experiment:
Using the sequence TGAGAT as pattern.
4.3.2 Results:
All algorithms give the same result:
The pattern TGAGAT matched at text indices 1644, 1849, 2219, 4235, 4440 and
4810.
Name of
algorithms
Brute Force
Boyer Moor
KnuthMorris
Pratt
Shift-or
Numbers of
running time
1
2
3
4
5
6
7
8
9
10
795995
804442
774492
769500
773724
777180
773724
767580
796763
803290
364783
344048
516072
324848
513000
523367
336752
496105
334832
491881
513768
323313
540647
329456
342128
322929
556006
337520
342896
338288
Table 4.5 Exact String Matching.
64
874327
802906
529895
809818
772956
828249
803674
756445
761052
773724
Time in Nanoseconds
810000
800000
790000
780000
Brute Force
770000
760000
750000
740000
1
10
Number of
repeated trials
Time in Nanoseconds
600000
500000
400000
300000
Boyer Moor
200000
100000
0
1
10
Number of
repeated
trials
Time in Nanoseconds
KnuthMorrisPratt Algorithm
600000
500000
400000
300000
KnuthMorrisPratt
200000
100000
Number of
repeated trials
0
1
10
Time in Nanoseconds
Shift-Or Algorithm
1000000
900000
800000
700000
600000
500000
Shift-or
400000
300000
200000
Number of
repeated
trials
100000
0
1
10
1000000
Time in Nanoseconds.
900000
800000
700000
600000
Brute Force
500000
Boyer Moor
KnuthMorrisPratt
400000
Shift-Or
300000
200000
100000
Number of repeated
trials
0
1
10
4.4.1 Experiment:
Using two dummy sequences:
s1=TTGTCAGATTCACCAAAGTTGAAATGAAGGAAAAAATGCTAAGGGCAGCC
s2=AGAGAGAGGTCAGGTTACCCACAAAGGGAAGCCCATCAGACTAACAGCGG
4.4.2 Results:
69
Name of
algorithms
Hirschberg's
Algorithm
(LCS)
Needleman
Wunsch
algorithm
SmithWaterman
Algorithm
Numbers of
running time
1
2
3
4
5
6
7
8
9
10
7179236
2366456
5582271
1955983
5668282
1937936
5566528
1892627
8565780
1878420
9563355
1899923
5701689
1933329
5696313
1876116
5600318
1945616
5661371
1891091
Table 4.6 Sequence Alignment.
Time in Nanoseconds
2314619
2316923
2277757
3778727
4030232
3731114
2354936
2305787
2310395
2275069
12000000
10000000
8000000
6000000
Hirschberg's Algorithm
(LCS)
4000000
2000000
0
1
9 10
Number of
repeated trials
70
Time in Nanoseconds
NeedlemanWunsch algorithm
2500000
2000000
1500000
NeedlemanWunsch
algorithm
1000000
500000
Number of
repeated trials
0
1
10
Time in Nanoseconds
Smith-Waterman Algorithm
4500000
4000000
3500000
3000000
2500000
Smith-Waterman
Algorithm
2000000
1500000
1000000
500000
0
1
10
Number of
repeated trials
71
12000000
Time in Nanoseconds
10000000
8000000
Hirschberg's Algorithm
(LCS)
6000000
NeedlemanWunsch
algorithm
4000000
Smith-Waterman Algorithm
2000000
Number of repeated
trials
0
1
10
72
5 - Chapter Five:
Conclusion and Future work
In this thesis the evaluation of code implementation has been successfully done for
all algorithms: exact string matching problem, Microsatellite Repeats, Palindromic
sequences, longest common subsequence and Needleman-Wunsch and SmithWaterman sequence alignment. . In this implementation we use C++ language and
the Qt SDK 4.8 with the GCC 4.8 compiler to test the algorithms under the Centos
Linux
sequences on different algorithms may show different results. While some of the
differences are shown to be expected and are part of the different default
considerations or interpretations of those algorithms. Implementation of dynamic
programing algorithm to compare sequences shows remarkable waste of time
compared to BLAST or FASTA (this is accepted due to nature of algorithms).
Furthermore for future work complete the classes deal with open reading frames,
restriction enzyme, PCR primer design and database file handler.
73
6 - Refrences:
1- Margaret O. Dayhoff. Atlas of Protein Sequence and Structure: Supplement,
Volume 3. National Biomedical Research Foundation.(1978).
2- Neil C.Jones and Pavel A.Pevzner. An introduction to:bioinformatics
algorithms. The MIT Press.(2004).
3- Dinesh Bhatia.Medical Informatics.PHI Learning Pvt. Ltd.(2015).
4- Jasmin Blanchette; Mark Summerfield. C++ GUI Programming with Qt 4,
Second Edition. Prentice Hall. (2008).
5- The Qt Company and ESA to develop 3D mapping and visualization
software to support the Rosetta science operations. December 18, 2014.
Helsinki, Finland. https://www.qt.io/qt-news/qt-company-esa-develop-3dmapping-visualization-software-support-rosetta-science-operations/.
6- DreamWorks Animation keynote at Qt Developer Days. September 7th,
2010. https://blog.qt.io/blog/2010/09/07/dreamworks-animation-keynote-atqt-developer-days/.
7- Peter Kreuel .Lucasfilm Goes for Trolltechs Qt. Oct 16, 2007.
http://www.linux-magazine.com/Online/News/Lucasfilm-Goes-forTrolltech-s-Qt.
8- Gary Towsend. Panasonic Avionics Inflight Entertainment. https://mar-eu-1pa6s6zn8.qtcloudapp.com/case-panasonic/.
9- Qt on Android. https://www.qt.io/case-adeneo-embedded/.
10- David L.Nelson and Michael M.Cox. Lehninger Principles of Biochemistry,
sixth edition. Freeman, W. H. & Company.(2012).
11- Barbara Hansen and Lynn Jorde. USMLE Step 1 - Biochemistry and
Medical Genetics Lecture Notes. Kaplan.(2013).
12- Mona Snigh .Topics in computational molecular biology. lecture2. (1999).
74
13- Ben Langmead. Strings and exact matching. johns hopkins university
whiting school of engineering computer science.(2014).
14- Litt, M.; Luty, J.A. A hypervariable microsatellite revealed by in vitro
amplication of a dinucleotide repeat within the cardiac muscle actin gene.
Am. J. Hum. Genet. (1989).
15- Peter D.Turnpenny and Sian Ellard.Emery's Elements of Medical
Genetics,13th. ed.Elsevier.(2007).
16- Gous Miah, Mohd Y.Rafii et.al.A Review of Microsatellite Markers and
Their Applications in Rice Breeding Programs to Improve Blast Disease
Resistance.Int. J. Mol. Sci. 2013, 14, 22499-22528;
doi:10.3390/ijms141122499.
17- Hans Ellegren.Microsatellites: simple sequences with complex
evolution.Nature Reviews Genetics 5, 435-445 (June 2004) |
doi:10.1038/nrg1348.
18- E.Yu.Siyanova and S.M.Mirkin.Expansion of Trinucleotide
Repeats.Department of Molecular Genetics, University of Illinois at
Chicago, Chicago, IL 60607, USA.( October 2, 2000).
19- Duncan FISHWICK, M.A.An Early Christian Cryptogram.University of St.
Michaels College, Toronto.CCHA, Report, 26 (1959), 29-41.
20- David Roy Smith et al.Palindromic Genes in the Linear Mitochondrial
Genome of the Nonphotosynthetic Green Alga Polytomella magna.Genome
Biol Evol. 2013; 5(9): 16611667.[PMCID: PMC3787674].
21- Horace R. Drew, Denise Lewy, Jason Conaty, Keith N. Rand, Philip
Hendry and Trevor Lockett.RNA hairpin loops repress protein synthesis
more strongly than hammerhead ribozymes.CSIRO Division of Molecular
Science, North Ryde, Australia.Eur. J. Biochem. 266, 260273 (1999).
22- Choi and C.Q. DNA palindromes found in cancer.Genome Biology.(2005).
75
76
77