Genoogle: An Indexed and Parallelized Search Engine For Similar DNA Sequences
Genoogle: An Indexed and Parallelized Search Engine For Similar DNA Sequences
Genoogle: An Indexed and Parallelized Search Engine For Similar DNA Sequences
1 Motivation:
The search for similar genetic sequences is one of the main bioinformatics
tasks. The genetic sequences data banks are growing exponentially and the
searching techniques that use linear time are not capable to do the search in
the required time anymore. Another problem is that the clock speed of the
modern processors are not growing as it did before, instead, the processing
capacity is growing with the addiction of more processing cores and the tech-
niques which does not use parallel computing does not have benefits from
these extra cores. This work aims to use data indexing techniques to reduce
the searching process computation cost united with the parallelization of the
searching techniques to use the computational capacity of the multi core pro-
cessors. To verify the viability of using these two techniques simultaneously,
a software which uses parallelization techniques with inverted indexes was
developed.
2 Results:
Experiments were executed to analyze the performance gain when parallelism
is utilized, the search time gain, and also the quality of the results when
it compared with others searching tools. The results of these experiments
∗
Department of Computer and Systems Engineering, Military Institute of Engineering
(IME), Rio de Janeiro, Brazil
1
were promising, the parallelism gain overcame the expected speedup, the
searching time was 20 times faster than the parallelized NCBI BLAST, and
the searching results showed a good quality when compared with this tool.
3 Availability:
The software and its source code are available at http://genoogle.pih.bio.br.
4 Introduction
One of the most important tasks at the bioinformatics is the search for similar
genetic sequences in the data banks. With the new sequencing technologies,
the size of these genetics data banks are growing exponentially [4], and con-
sequently the search time is growing too.
The alignment algorithms, Needleman-Wunsch [21] and Smith-Waterman [24]
are algorithms of the dynamic programming class [6]. They are very sen-
sible alignments algorithms, but their computation and memory costs are
quadratic O(mn) (where m is the input sequence length and n the data
bank length). This computation and memory costs turn them impractical
at large data banks sets. To address this problem, heuristics was developed
to reduce the memory and processing consumption of the similar sequences
searching processes. Among the algorithms which use heuristics for similar
sequences searching, the FASTA [23] and BLAST [1, 2, 5] algorithms are the
more used and know algorithms. These algorithms search for areas which has
similarities, called HSP (High Scoring Pairs), and then, they make the align-
ment of the best HSP found using a dynamic programming algorithm. The
FASTA and BLAST algorithms also optimized the alignment cost, because
the alignment is made between words which have a previously detected sim-
ilarity only. Nevertheless, the problem complexity continues being O(nmq)
(q is the quantity of sequences in the data banks) because it still necessary
to read the data bank sequences entirely to find the HSPs.
The searching process optimization can be made using inverted index to
localize HSPs. Inverted indexes are data structures which allow to find the
localization of the indexed data at constant time (O(1)). This type of indexes
are utilized at Information Retrieval area, especially in web searching engine,
as Google, Yahoo, and libraries for data indexing, as the Apache Lucene [9].
2
For the search for similar genetic sequences, the sub-sequences of the data
bank sequences are indexed, eliminating the necessity of finding the HSPs
thought a linear search. It is quite similar with the indexes found at the
end of the books, where rather than to have the text entry, it have a sub-
sequences entries informing where these sub-sequence can be found. The two
most common data structures which are usually used to index the genetic
sequences data bank are the suffixes trees and vectors. Suffixes tress are
used at the work of Gusfield [11], this data structure allows to access the
sub-sequence position in linear time, but the real virtue is to find repeating
sequences regions and to obtain the longest common ancestral. One example
of suffixes tress application is presented by [10], where an algorithm for build-
ing tree to find near-exact sequences is utilized. At this example, the access
time gain has a high memory consumption cost. At [22] is said that the
Delcher et al. [7] has a memory spent of 37 bytes by base at her implemen-
tation when using suffixes trees. As comparison, using the human genome
with approximately three billions base pairs, are necessary more than 103 gi-
gabytes to store the suffix tree. At [8] an optimization is shown in relation
to previous work, where the algorithm is three times faster and it uses one
third of the memory. Even with this reduction, it is necessary more than
34 gigabytes to store the suffix tree to index the human genome.
Vector is a data structure which is mainly an array of elements, where
each position of this array represents an information and inside this posi-
tion, have another array informing where this information can be found.
Some similar searching sequences techniques that uses vectors as inverted
indexes are: SSAHA[22], BLAT [15], PatternHunter [17] the miBLAST [16],
Megablast[25], MegaBlast[18], and Kalafus [14] which uses hash tables to
align whole genomes. Transforming based methods are considered out of the
scope of this work, but Jing [13] presents methods that use this technique.
Another possible way to reduce the searching time is through parallel
computing. The importance of the parallelization has grown along with the
growing multiprocessing capabilities, like the number of the cores, in the
modern processors. Some tools for similar sequences searching have ways
to use these multiprocessing technologies. As one example, it is possible
to inform the NCBI BLAST to divide the data bank and perform parallel
searches at these data banks fragments. As the data bank is divided and
the search parallelized, the computational complexity is O( nmq f
) (f is the
quantity of fragments that the data bank will be divided), or it means, even
3
with the search parallelization, it still have a linear computation complexity.
The computational complexity can be reduced using inverted index, but
searching at the literature, it was not found any technique which uses index-
ing with parallel programming, and consequently they do not use the capacity
of moderns the multi-core processors. Thus, the objective of this work is to
use the index techniques to optimize the searching process along with the
use of the multi-core processor to reduce the searching time of the similar ge-
netic sequences. The developed software, called Genoogle uses indexing tech-
niques with inverted index, index search parallelization, data bank division
and parallelization, bit level sequences codification and alignment algorithms
optimizations. This software was developed utilizing the Java environment
and it has a web page, web services and text mode interfaces.
5 Methods
This work presents the Genoogle, that is a similar sequence searching engine
software which has as objective to execute fast and with good sensibility
searches. To achieve these goals, it uses inverted index to find nimbly the
HSPs and it also uses parallelization techniques to use the multi-core pro-
cessors capabilities. Genoogle works similarly to the BLAST, where given an
input sequence, a parameter set, and a data bank where the search will be
made, the software returns the similar sequences from the given input and
parameters.
Defining a genetic sequence as a sequence where Σ = {A, C, G, T } (DNA),
Σ={A, C, G, U } (RNA) and a sub-sequence is a sequence wich is contained
partiality of fully into other sequence. The sequences at Genoogle are divided
into fixed length sub-sequences, where the length is defined by the user be-
fore the data bank formatting, and them, they are codified using 2 bits for
each sequence base. The sub-sequences are stored as a bit vector into 32
bits integer and the length can vary from 1 to 16 bases. Changing this value
impacts at the search speed and sensibility, as big is the value, as fast will be
the searching process, but the sensibility will be lower. To save memory, not
overlapping windows are utilized to codify the data bank sequences, but for
input sequences, to have more sensibility during the searching process, over-
lapped windows are utilized. These codified sequences, with other sequence
information: name, identification code, and description are stored into a file
disk, composing the data bank sequences.
4
Masks are used at the indexed sub-sequences to improve the searching
sensibility of the inverted index. The masks are based on the Pattern-
Hunter [17] work. The masks inform which sub-sequences bases should be
maintained or removed and, consequently they increase the probability to
find sub-sequences at the index. The masks provide two gains: the sensibil-
ity, which allows to search at the index with not-exact sequences, and it saves
index space, because longer sub-sequences will be transformed into smaller,
having less index entries and less sub-sequences at the data bank.
Genoogle has some run time parameters which can improve the sensibility
or the search performance. The parameters for the sensibility are: the max-
imum distance between index entries for be considered for the same HSP,
the minimum HSP size and the drop off for sequences extension. Changing
these parameters impact on the sensibility, allows to find more HSP, but it
is expected to have more false positives and to slow down the performance.
5
the sequences are divided into 18 bases sub-sequences, saving inverted index
structure memory.
Genoogle needs approximately (((l/m)s)8) + ((4s )16) bytes to store the
inverted index. Being l the length of the data bank in bases, m, the total mask
length, and s the sub-sequences length. For a data bank with 4 billions bases
and sub-sequences with 11 bases, there will be 363 of millions sub-sequences,
requiring 2.833 megabytes to store the inverted index. Using masks with 18
bases length, there will be 222 millions sub-sequences and the inverted index
will need 1.759 megabytes, resulting on a gain of 30% of the total memory
required. To build the inverted indexes it is used the sort-based method [26]
and the inverted index and the formatted data bank are stored at the disk.
During the Genoogle start time, the whole inverted index is read and loaded
into the main memory, the data bank meta informations, like file offset, are
also loaded into the main memory, but the data bank sequences are read
from the disk when the sequences informations are necessary.
The input sequence processing firstly applies the mask at each overlapped
sub-sequence of the input sequence and codify the resulting sub-sequence to
the binary representation used by Genoogle. The input sequence processing
is shown at Fig. 1. As the input sub-sequences are codified as binary data
6
into an integer, it is possible to obtain the sub-sequence value directly from
this data. Because the determined sub-sequence position at the index is its
own value it turns the index searching process simpler and direct.
Sub-sequence mask
XXXOOXXX
At Fig. 2 is shown the process to retrieve the informations from the in-
verted index. For each masked and encoded sub-sequence from the input
sequence, using its encoded value, is retrieved from the inverted index all
places which have this sub-sequence at the data bank sequences. The re-
trieved informations are stored into an array of arrays, where each position
represents a data bank sequence. If two or more retrieved information are
are closer than a specified parameter, they are merged into one retrieved in-
formation. These informations are filtered by their length and the remaining
ones are them called High Scoring Pairs (HSP ). The HSP have five informa-
tion: Initial and final positions at the input sequence and at the data bank
sequence, and the length of this area, where it gets the length of the HSP
in relation of the data bank and of the input sequence and get the smaller
value.
After the index searching phase, the located HSPs are extended to the
both directions to try to enlarge their length. During the extension phase,
two or more HSPs which were closed can be overlapped, generating duplicate
results. Hence, it is verified after the extension phase if the HSPs have
overlapped regions, and if they do, they are merged into one new HSP.
After the extension and merging phase, the HSPs are sorted by their
lengths in a decreasing way. It is possible to specify a parameter to inform the
7
Inverted index
Input sub-sequence
CGTGACGA
Masked sub-sequence
CGTCGA
Codified sub-sequence
011011011000
8
6 Parallelization Methods
To improve the searching time and to use the multi-processing capabilities,
Genoogle uses three parallelization techniques: inverted index access paral-
lelization, extension and alignment parallelization, and data bank division
parallelization.
The data bank division parallelization is made dividing the data bank
in fragments, similarly how NCBI BLAST does. The indexes searches and
HSPs construction are made independently into threads for each data bank
fragment. After locating the HSPs by the individual threads, they are ex-
tended, merged, sorted, filtered, aligned, sorted by their score, and returned
to the user. This technique is interesting for large data banks, but this par-
allelization technique does not parallelize the whole searching process: the
extension, sorting and alignment phases are not parallelized by this tech-
nique.
Analyzing the searching data bank fragmentation process, it was verified
that not every searching thread have the same execution time. It happens
because some fragments have more similar sequences than others. Analyzing
the searching time, it was verified that the index searching time represents
about 60% of the whole searching time. About 30% of the searching time
is for the HSPs alignments, because it, a component was developed to par-
allelize the sequences extension, sorting and alignment using all available
computer processing core.
At this parallelization technique, the threads search the input sub-sequences
at the data bank fragments index and then put the HSPs into a collection
shared by all index searching threads. After the index searching phase, the
HSPs collection is sorted by the HSP length and the longest are selected
using a parameter that informs how many alignments should be returned.
The remaining HSPs are put into a FIFO queue, where it has extenders and
aligners which will perform the extension and alignment. These extenders
and aligners use independent threads, they read a HSP from the queue, them
perform the extension and alignment and put the result into another shared
collection. These threads are managed by an executor, when a thread finish
its job, the manager sends to it another HSP to extend and align until all
HSPs are extended and aligned. It is possible to configure the quantity of
simultaneous threads.
The memory required by the inverted index structure is the main prob-
lem of the data bank division. For example: using 11 base pairs length
9
sub-sequences, they are necessary 32 megabytes to store only the inverted
index structure. Theoretically, to use the whole computational capacity of
a computer with 8 processing cores, it is necessary to divide the data bank
in 8 parts, becoming necessary to use up 512 megabytes to store only the
inverted index structure. Because of the memory requirement for the data
bank division, it is important to use a complementary approach. Along with
the data bank division, the input sequences are also divided and performed
the search parallel. Thus it is possible to parallelize the searching process
without overloading the memory with more data structures.
This parallelization divides the input sequence in sub-inputs, and it searches
each sub-input at the inverted index independently. It is also useful because
it also parallelize the input query processing. After the index search, the
HSPs that are from the same data bank sequence and are closer, are merged
into one HSP. Using the data bank division in two fragments and the input
sequence in two sub-inputs, they are used 4 threads to search at the inverted
index, being the memory overload of only two inverted indexes structures.
7 Implementation
Genoogle is developed using the Java Environment version 1.6. The Java
Environment was choose because it is multi-platform and it has a framework
and primitives for parallel computing. The library BioJava [12] was used
during the first part of the development, but now it was removed from the
project. At the first implementations, BioJava was used to reading, parsing,
storing and genetic sequences alignment, but it was verified that the reading
and parsing methods require too much memory, so, new and optimized classes
was developed to perform these tasks.
The main user interface is an text mode interface, where the user types
the search command and the results are stored into a XML file that is de-
fined by the user. This interface has commands to perform the search, to list
the available data banks, to obtain the parameters list, to run the garbage
collection, to run the last command executed, and to run a batch file con-
taining commands. The batch command is interesting because with it is
possible to write a file with all commands that should be executed and to
inform the Genoogle to execute them without any used intervention during
the execution of these commands.
Genoogle has also an embedded simple web page interface, best suitable
10
for testing, which contains only one input field, for the input sequence and a
button to perform the search. The search results of the web page interface are
a XML document formatted and shown as a HTML web page by a XSL doc-
ument. Together with the web page, Genoogle has a web-services interface,
implemented using JAX-WS. Using the web-services interface, it is possible
to execute queries, to set parameters, retrieve the data bank available list,
and others tasks inside of a programming script. The users can write scripts
to access Genoogle services automatically using their preferred programming
language and perform their searches without manual intervention.
8 Results
For the experiments was utilized a data bank with sequences of the phase 3
of the human genome project [19] along with RefSeq [20] data banks. The
RefSeq data banks were used because they are verified and they have high
quality. The human genome project data bank is the hs phase3 and has
approximately 3.8 Gb. The RefSeq data bank are the cow with approximately
57 M b, frog with 20 M b, human with 112 M b, mouse with 93 M b, rat with
73 M b, and zebrafish with 62 M b, totalizing approximately 417 M b. At
the experiments was used the data banks from the RefSeq along with the
hs phase3, totalizing 4.25 Gb and being necessary approximately 2 gigabytes
of memory to store it.
For the execution of the experiments, they were generated 11 sets of in-
put sequences. Each set has 11 sequences with approximately the same size,
being 1 sequence got from the data bank and 10 are mutations of these se-
quences. The sets are with sequences with 80 base pair (bp), 200bp, 500bp,
1.000bp, 5.000bp, 10.000bp, 50.000bp, 100.000bp, 500.000bp e 1.000.000bp. The
searches were made using the data bank parallelism, dividing the input se-
quence parallelism and extending and aligning the sequences simultaneously.
At the experiments it was used a computer with 16 gigabytes of RAM,
Linux version 2.6.18 and Java Environment version 1.6 with the JVM JRockit ver-
sion 3.0.3.
For entries up to 5.000bp the gain using the parallelism is low, giving a gain
of only 2 times and the total time increases when the input sequence is divided
into more than 4 parts. It happens because the searching time for these
small input sequences are too low, less than 200ms, and the synchronization
overload impacts directly to the searching time. For input sequences with
11
Speedup relative to the size of the input sequence
9
Achieved
Desirable
8
6
Speedup (times)
1
100 1000 10000 100000 1e+06
Sequence size
12
only against the NCBI BLAST.
Comparing the sequential search times of Genoogle and BLAST it is
shown that Genoogle is almost 20 times faster and comparing the parallel
times, Genoogle is 26, 60 times faster. It is interesting to realize that for
smaller input, until 5.000bp, the time gains in relation with BLAST are not
so good because the parallelization techniques does not use all their potential
in these small inputs. The time difference at bigger inputs comes until 29
times for 100.000bp input. It is important to realize that the sequential
version of Genoogle is faster than the parallel executions of the BLAST.
13
Base pairs BLAST (ms) Genoogle (ms) Gain (times)
80 1.061 150 7, 00
200 2.145 270 7, 94
500 3.170 210 15, 09
1.000 2.853 270 10, 56
5.000 10.387 1.341 7, 74
10.000 13.027 1.050 12, 40
50.000 78.067 4.440 17, 58
100.000 276.779 9.380 29, 50
500.000 1.206.212 45.120 26, 73
1.000.000 193.090 8.780 22, 00
Total 1.786.791 67.011 26, 66
This graphic was generated from the data where it was verified which of the
alignments found by the BLAST were found by the Genoogle too. In this
graphic it is possible to observe that until the E-value 10e−35 more than 90%
of the alignments were found by the Genoogle. Until the E-Value 10e− 15
more than 60% of the alignments were found and with the E-Value 10e−10
almost 55% of the alignments were found. Above this E-Value the quantity
of alignments found is bellow 40%.
Analyzing this graphic is realized that the Genoogle is a good tool to
find alignments with the E-Value lower than 10e−25 . It happens because
alignments with this E-Value are usually long, with more than 100bp, and
consequently are found easily. From the E-Value 10e−30 there is a drop in the
results quality, where it is stabilized close to the E-Value 1. Alignments with
the E-Value highest than 0.005 are alignments which can not be possible to
infer a close homology [3]. This, it can be observed that Genoogle has a very
good search quality until the maximum 10e−20 E-Value, but it has a quality
drop until 10e−5 . The alignments should not to be considered to homology
inference for values higher than 10e−4 .
The experiments showed that the Genoogle quality results is comparable
to the BLAST when the alignment E-Value is representative, demonstrating
a possible homology between the two aligned sequences. More sensible search
can be archived changing the searching parameters, as the maximum distance
14
Percentage of similar areas found in relation to the BLAST
100
Similar areas found
80
60
Found (%)
40
20
0
10e−80 10e−70 10e−60 10e−50 10e−40 10e−30 10e−20 10e−10 10e+00
E−Value
between the sub-sequences information got from the index, and the minimum
HSP length. Changing the minimum HSP length for alignments with E-Value
higher than 1e−5 , the HSP found for this E-Value grew to approximately 80%
and the search time was raised only 3%.
9 Conclusion
This work presented a genetic similarity sequences searching software which
uses data bank sequences indexing along with parallel computing. To ensure
the effectiveness and the search quality of to use index and parallel comput-
ing was developed and implemented the Genoogle tool. This software was
implemented using the Java 1.6 and it can be executed at Windows, Linux,
and Mac environment. Experiments were executed to verify the results ex-
ecution time and the results quality. The searching time was really good,
with speedup of more than 20 times in relation to parallelized BLAST. The
15
results quality was good, finding relevant alignments, but it can be opti-
mized by changing the searching parameters. Thus, merging the indexing
techniques with the three parallelization techniques and the option to opti-
mize the search configurations, Genoogle proved to be an effective tool and
its results have good quality.
As the main contributions of this work, it should be noted primarily as the
first tool in the literature to do the genetic sequences search using indexing
and parallel computing. Considering that parallel computing importance
has increased with increasing number of cores at the processors and also the
importance of the data indexing to optimize the searching process in data
banks wich has an exponential grow, this work has a relevance for addressing
these two issues together.
References
[1] F. Altschul, W. Gish, W. Miller, E. Myers, and D. Lipman. A basic local
alignment search tool. Jornal of Molecular Biology, 215(3):403–410, out.
1995.
[3] Isobel Anderson and Andy Brass. Searching dna databases for similar-
ities to dna sequences: when is a match significant? Bioinformatics,
14(4):349–356, January 1998.
16
[7] A.L. Delcher, S. Kasif, R.D. Fletschmann, J. Petterson, O. White, and
S. Salzberg. Alignment of whole genomes. Nucleic Acids Res., 27:2369–
2376, 1999.
[10] Eldar Giladi, Michael G. Walker, Jamis Z. Wang, and Wayne Volk-
muth. Sst: an algorithm for finding near-exact sequences matches in
time proportional to the logarithm of the database size. Bioinformatics,
18(6):873–879, 2001.
[13] Xianyang Jiang, Peiheng Zhang, Xinchun Liu, Stephen S., and T. Yau.
Survey on index based homology search algorithms. The Journal of
Supercomputing, 40(2):185–212, may 2007.
[15] W. James Kent. Blat - the blast-like alignment tool. Genome Research,
12:656–664, 2002.
[16] Kim, You Jung, Boyd, Andrew, Athey, Brian D., Patel, and Jignesh M.
miblast: Scalable evaluation of a batch of nucleotide sequence queries
with blast. Nucleic Acids Research, 33(18):4335–4344, 2005.
[17] Bin Ma, John Tromp, and Ming Li. Patternhunter: faster and more
sensitive homology search. Bioinformatics, 18(3):440–445, 2002.
17
[18] Aleksandr Morgulis, George Coulouris, Yan Raytselis, Thomas L. Mad-
den, Richa Agarwala, and Alejandro A. Schäffer. Database indexing
for production megablast searches. Bioinformatics, 24(18):1757–1764,
2008.
[22] Zemin Ning, Anthony J. Cox, and James C. Mullinkin. Ssaha: A fast
search method for large dna databases. Genome Research, 11:1725–1729,
2001.
[23] W.R. Pearson and D.J. Lipman. Improved tools for biological sequence
comparison. Proc. Natl. Acad. Sci, 85:2444–2448, 1988.
[25] Guangming Tan, Lin Xu, Yishan Jiao, Shengzhong Feng, Dongbo Bu,
and Ninghui Sun. An optimized algorithm of high spatial-temporal ef-
ficiency for megablast. Proceedings of the 2005 11th International Con-
ference on Parallel and Distributed Systems (ICPADS’05), 2005.
[26] Ian H. Witten, Alistair Moffat, and Timothy C. Bell. Managing giga-
bytes: compressing and indexing documents and images. Morgan Kauf-
mann Publishers, 1999.
18