Parallelization of Smith-Waterman Algorithm Using MPI

Anes Luckin, Irfan Mehanovic, Haris Spahic, Kenan Mahmutovic, Mesud Klisura

Faculty of Electrical Engineering
Sarajevo, Bosnia and Herzegovina

Abstract— This paper describes the implementation of a parallel version of the Smith-Waterman algorithm for sequence alignment on a small Linux-based cluster using OpenMPI. The code is written in C. Measurements of parallel execution time on the cluster with different numbers of processes, together with slight tuning of the algorithm, are described and compared. The main aim of these measurements is to push our cluster to its limits and to achieve a significant speedup of the parallel execution over the sequential one.

Keywords—communication, MPI, Smith-Waterman, parallel processing

I. INTRODUCTION

Computational biology refers to the hypothesis-driven investigation of a specific biological problem using computers, carried out with experimental or simulated data, with the primary goal of discovery and the advancement of biological knowledge. Put more simply, bioinformatics is concerned with the information, while computational biology is concerned with the hypotheses. Gene sequencing is one of the major problems for which researchers try to devise an optimized system model that allows efficient processing without introducing overheads in terms of memory and time. Algorithms that assess similarity, i.e. that align sequences, play a key role in bioinformatics. Their task is the alignment of two or more sequences of DNA, RNA or proteins and an assessment of similarity based on that alignment; their importance lies in the fact that the results may point to functional, structural and evolutionary relationships between the given sequences. A sequence can be seen as a string of characters over an alphabet. Algorithms that align and score sequences of DNA, RNA and proteins make up one of the central pillars of bioinformatics, for example in the search for genes.

We chose the Smith-Waterman algorithm for our research primarily because it is perhaps the most widely used local sequence alignment algorithm. It was first proposed by Temple F. Smith and Michael S. Waterman in 1981. The algorithm performs local sequence alignment: it finds conserved regions between two sequences, it can align two moderately overlapping sequences, and it can align a sub-sequence of a sequence to itself. These are the major advantages of local sequence alignment. Our cluster architecture is created using open source technologies, based on the Linux operating system, and the implementation is based on the Message Passing Interface (MPI) standard.
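For reference, the dynamic-programming matrix H that the algorithm fills is defined by the standard Smith-Waterman recurrence. The paper does not spell out its scoring scheme, so the version below assumes a substitution score s(a_i, b_j) and a linear gap penalty g:

    H_{i,j} = \max \{\, 0,\; H_{i-1,j-1} + s(a_i, b_j),\; H_{i-1,j} - g,\; H_{i,j-1} - g \,\}

with H_{i,0} = H_{0,j} = 0; the best local alignment score is the maximum entry of H.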
II. MEASURING PROCEDURES

We set up a small cluster consisting of four personal computers connected in a LAN. One of them was set up as the master node and the other three as slaves. Each computer has a dual-core i5 processor with two threads per core and 4 GB of RAM. We used Ubuntu 16.04 as the operating system. The table below lists the machine configurations.

       Processor   RAM (GB)   No. of cores
PC1    i5-2520M    4          2
PC2    i5-2520M    4          2
PC3    i5-3337U    4          2
PC4    i5-4310M    4          2

Fig. 1. Cluster configuration

Communication between the nodes is achieved with OpenMPI. With OpenMPI, each process usually resides on a different host with its own memory, so there is the added overhead of distributing the required data to all participating nodes. The master node was responsible for sending chunks of data to the other nodes and later combining the results from all nodes into one final score.
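The paper does not list the corresponding code, but combining the per-node results into one final score maps naturally onto an MPI reduction. A minimal sketch of that final step, assuming each rank has already computed its best local alignment score, could look like this:

#include <mpi.h>
#include <stdio.h>

/* Sketch: combine the best local-alignment score of every process
 * into one final score on the master (rank 0). */
void report_final_score(int local_best) {
    int rank, global_best = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    /* MPI_MAX keeps the largest score found by any rank. */
    MPI_Reduce(&local_best, &global_best, 1, MPI_INT,
               MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("Final Smith-Waterman score: %d\n", global_best);
}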
We tested each case 10 times and took the average execution time as the result. Our goal was to exploit the parallelization potential of this algorithm and to beat the sequential execution, while trying to achieve the best time by tuning different parameters of the algorithm. As a starting point of our research, here are the results that we got by executing the sequential implementation. We kicked off the execution with a sequence length of 1000 and incrementally increased it to 10000.

[Chart: average execution time (s), up to about 3.5 s, versus sequence length, 0 to 12000.]
Fig. 2. Average execution time of sequential code
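The paper does not show how the runs were timed; one plausible way to reproduce the 10-run averaging described above, using only standard MPI calls, is:

#include <mpi.h>

/* Sketch: time one parallel run. align() stands in for the actual
 * Smith-Waterman computation, whose source the paper does not list. */
double timed_run(void (*align)(void)) {
    MPI_Barrier(MPI_COMM_WORLD);   /* start all ranks together */
    double start = MPI_Wtime();
    align();
    MPI_Barrier(MPI_COMM_WORLD);   /* wait for the slowest rank */
    return MPI_Wtime() - start;    /* elapsed wall-clock seconds */
}

/* Average over the 10 repetitions used for every measurement. */
double average_time(void (*align)(void)) {
    double total = 0.0;
    for (int i = 0; i < 10; i++)
        total += timed_run(align);
    return total / 10.0;
}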
III. PARALLELIZATION

We were interested in parallelizing the computation of the H matrix. Our approach was to divide the H matrix into blocks, with each process responsible for one row of blocks. As soon as process P_x finishes the computation of one block, it sends the last row of that block to process P_x+1, which is responsible for the next row of blocks. In this way we achieve a pipeline-like model of execution, where process P_x can start working as soon as process P_x-1 finishes its first block, and P_x+1 can start when P_x finishes its first block. Next we considered different block sizes at the column level: the matrix H is divided into blocks of N/P rows and B columns, where N is the number of rows and P is the number of processes. Each process remains responsible for only one row of blocks during the whole execution.
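A minimal sketch of this pipeline in C with MPI is shown below. It assumes, as described above, that each rank owns N/P consecutive rows of H and sweeps over them B columns at a time; compute_block is a placeholder for the cell computation, since the paper does not list its source code:

#include <mpi.h>
#include <stdlib.h>

/* Placeholder for the per-block Smith-Waterman computation: fills this
 * rank's rows of H for columns [col, col+width), using boundary (the
 * last H row of the rank above) for the dependencies along the top
 * edge of the block. */
void compute_block(int *H, const int *boundary, int col, int width);

/* Pipelined computation of H. The local slice holds N/P rows and
 * M+1 columns; column 0 is the zero boundary column of the matrix. */
void pipeline(int N, int M, int B, int *H) {
    int rank, P;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &P);

    int rows = N / P;                            /* rows owned by this rank */
    int *boundary = calloc(M + 1, sizeof(int));  /* row above our first row */

    for (int col = 1; col <= M; col += B) {
        int width = (col + B <= M + 1) ? B : (M + 1 - col);

        /* Pipeline step: wait until the rank above has finished this
         * column block and sent down its last row. Rank 0 keeps the
         * all-zero top boundary of the Smith-Waterman matrix. */
        if (rank > 0)
            MPI_Recv(boundary + col, width, MPI_INT, rank - 1, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        compute_block(H, boundary, col, width);

        /* Forward our last row so the next rank can start this block. */
        if (rank < P - 1)
            MPI_Send(H + (rows - 1) * (M + 1) + col, width, MPI_INT,
                     rank + 1, 0, MPI_COMM_WORLD);
    }
    free(boundary);
}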

IV. ANALYSIS AND RESULTS

In this section we present the experimental results of the parallel execution of the Smith-Waterman algorithm on our cluster. The number of processes, the sequence lengths and the block sizes can all influence the performance of the parallel execution, so this study was made with varying values of these parameters. We used two DNA sequences of the same size; the loading time of these sequences is not included in the execution time, since our focus was on computation time only. All execution times are shown in seconds.

We started by executing the algorithm on one machine using MPI and compared it to the result we got when we executed the algorithm sequentially. The results are shown below.

[Chart: sequential and parallel average execution time (s) versus sequence length, 0 to 12000.]
Fig. 3. Sequential versus parallel average execution time

We can see above that both versions performed well when matching sequences shorter than 4000 elements, whereas MPI takes a clear lead as the sequences get longer. Sequential execution on sequences of 10000 elements took 2.73 seconds, while parallel execution took 1.99 seconds, a reduction in execution time of (2.73 - 1.99) / 2.73 ≈ 27%, i.e. almost 30%. It is clear that running MPI on one machine with multiple processes beats running the sequential code in this case.

The different configurations of the nodes in our cluster clearly affected execution time, so to quantify that we put the machines to the test: each node executed the parallel algorithm on sequences 10000 elements long, while we changed the number of processes from run to run. After searching for experiments done previously on similar projects, we found that the majority of previous researchers recommended setting the sequence block size somewhere around 600 or 700 nucleobases.

[Chart: average execution time (s) versus number of processes, 0 to 80, for the i5-4310M 2.70GHz, i5-2520M 2.50GHz and i5-3337U 1.80GHz nodes.]
Fig. 4. OpenMPI execution times on different processors

In Fig. 4 we can see that each node behaved quite differently in this test. The execution time may differ a lot as the number of processes changes, which is to be expected once we take into account all the communication between processes that goes on during execution. The node with the i5-4310M processor did best of the three, with minimal oscillations in execution time; its best time was 2.01 s.

In the next three figures we show how the average execution time varies on our master processor with the 45 processes used in the parallel executions above. This number of processes was taken from Fig. 4 because it gave the best average execution times of all the configurations we tested. From that examination we also extracted the sequence block sizes that fit our case best and picked three suitable values: 700, 800 and 900 nucleobases.

[Chart: average execution time (s) versus sequence length, 0 to 12000.]
Fig. 5. Average execution time for different sequence lengths (block size 700)
[Chart: average execution time (s) versus sequence length, 0 to 12000.]
Fig. 6. Average execution time for different sequence lengths (block size 800)

[Chart: average execution time (s) versus sequence length, 0 to 12000.]
Fig. 7. Average execution time for different sequence lengths (block size 900)

After finding that an optimal sequence block size of 700 nucleobases suited our scenario, we started investigating the best number of processes for our parallel OpenMPI setup. After examining our previous execution results, we picked the three best candidates for the optimal number of processes: 40, 45 and 50. The next three figures show the obtained results.

[Chart: average execution time versus sequence block size, 0 to 5000.]
Fig. 8. Average execution time for different sequence block sizes (40 processes)

[Chart: average execution time versus sequence block size, 0 to 5000.]
Fig. 9. Average execution time for different sequence block sizes (45 processes)

[Chart: average execution time versus sequence block size, 0 to 5000.]
Fig. 10. Average execution time for different sequence block sizes (50 processes)

Having carefully chosen the granularity we were going to use, we moved to our full cluster, which contains 4 computers connected with Ethernet cables in a star topology with a router in the middle. We used OpenMPI for the communication between the computers and optimized our code as much as we could for the best setup. Much of the information that we gathered for building this cluster was obtained from experienced engineers and people involved in this kind of work.
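The paper does not show how the jobs were launched across the four machines. With OpenMPI this is typically done with a hostfile; the host names and program name below are ours, for illustration only. A hostfile with one slot per machine:

pc1 slots=1
pc2 slots=1
pc3 slots=1
pc4 slots=1

would let the 4-machine, 4-process configuration reported below be launched as:

mpirun -np 4 --hostfile hostfile ./smith_waterman seq1.txt seq2.txt

With slots=1 per host, OpenMPI places one process on each machine before wrapping around, giving one process per node.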
First we analyzed the execution times for different numbers of computers connected together, to determine whether the size of the cluster really plays a role in achieving better performance and efficiency. The first scenario used one computer, then two, and so on until all four computers were in use. The last scenario showed that more computers give better results than fewer computers, but still worse results than a single computer on its own. As we can see in Fig. 11 below, four computers did better than three as the sequence length increased.
[Chart: average execution time versus sequence length, 0 to 12000, for 1, 2, 3 and 4 computers.]
Fig. 11. Average execution time depending on the number of computers

After executing the parallel code on this cluster we obtained different results than before, when we tested on the single processor that showed the best results. As the next figure shows, the optimal number of processes changed from 45 to just 4 on the cluster. For detailed information see Fig. 12 below.

[Chart: average execution time versus sequence block size, 0 to 1200, on 4 processes.]
Fig. 12. Average execution time with different sequence block sizes on 4 processes

[Chart: average execution time versus sequence length, 0 to 12000.]
Fig. 13. Average execution time running on 4 machines with 4 processes (optimal block size)

The curve in Fig. 13 shows how the Smith-Waterman algorithm behaves when we set the optimal block size B and the optimal number of processes on the four machines, increasing the sequence length in steps of 500 nucleobases up to a length of 10000 nucleobases. Even with these optimized parameters we did not get the "amazing" results we had expected.

V. CONCLUSION

In this paper we analyzed the speedup obtained by a parallel implementation of the Smith-Waterman algorithm for sequence alignment compared to sequential execution, based on the relation between the number of processes and the amount of work. The comparison was done by executing the code on a single machine and on several machines running OpenMPI. The results show that the performance and efficiency of the parallel implementation are much better than those of the sequential implementation, but the results on multiple machines running the parallel code were disappointing in our case. We achieved a speedup on a single machine through parallelization, but when we connected more and more machines to expand our cluster we did not get better results than those obtained on a single machine.

In order to gain a larger speedup, it is important that the work increases with the number of processes; that is, increasing the number of processes only makes sense with a larger dataset. Finally, the results show that it is possible to achieve a speedup by parallelizing code for a machine with a 4-core processor, but that does not mean we will get speedups by connecting more machines into a cluster. To obtain the ideal speedup, it is necessary to find the optimum between the number of processes and the size of the data being processed. Sometimes it is better to use a single machine: if the tasks are not huge and complicated, the single machine can come out ahead, because a lot of time is otherwise lost in communication between processing units. In the end, it all depends on what you need your program for; clusters allow so much user choice that it really comes down to your specific needs.
