
IJCSI International Journal of Computer Science Issues, Vol. 9, Issue 1, No 1, January 2012
ISSN (Online): 1694-0814
www.IJCSI.org 288

Fault Tolerance - Challenges, Techniques and Implementation in Cloud Computing

Anju Bala (1), Inderveer Chana (2)

(1) Computer Science and Engineering Department, Thapar University, Patiala-147004, Punjab, India
(2) Computer Science and Engineering Department, Thapar University, Patiala-147004, Punjab, India

 
Abstract
Fault tolerance is a major concern in guaranteeing the availability and reliability of critical services as well as application execution. To minimize the impact of failures on the system and on application execution, failures should be anticipated and proactively handled. Fault tolerance techniques are used to predict these failures and take appropriate action before they actually occur. This paper discusses the existing fault tolerance techniques in cloud computing based on their policies, the tools used and the research challenges. A cloud virtualized system architecture is proposed, in which autonomic fault tolerance has been implemented. The experimental results demonstrate that the proposed system can deal with various software faults for server applications in a cloud virtualized environment.

Keywords: Cloud Computing; Virtual Machine; Fault Tolerance; Replication

1. Introduction

Cloud computing is a style of computing where service is provided across the Internet using different models and layers of abstraction [4]. It refers to applications delivered as services [5] to the masses, ranging from end-users hosting their personal documents on the Internet to enterprises outsourcing their entire IT infrastructure to external data centers. Simple examples of cloud computing services are Yahoo email and Gmail.

Although cloud computing has been widely adopted by industry, many research issues remain to be fully addressed, such as fault tolerance, workflow scheduling, workflow management and security. Fault tolerance is one of the key issues among these. It is concerned with all the techniques necessary to enable a system to tolerate software faults remaining in the system after its development. When a fault occurs, these techniques provide mechanisms to the software system to prevent system failure from occurring [18]. The main benefits of implementing fault tolerance in cloud computing include failure recovery, lower cost and improved performance metrics. This paper aims to provide a better understanding of fault tolerance challenges and identifies various tools and techniques used for fault tolerance. When multiple instances of an application are running on several virtual machines and one of the servers goes down, an autonomic fault tolerance technique is needed that can handle these types of faults. To address this issue, a cloud virtualized system architecture has been proposed and implemented using HAProxy. The proposed architecture has also been validated through experimental results.

The rest of the paper is organized as follows. Section 2 discusses fault tolerance techniques based on their policies. Section 3 presents the challenges of implementing fault tolerance in cloud computing. Section 4 compares various tools used for implementing fault tolerance techniques, with a comparison table. Section 5 presents the proposed cloud virtualized architecture and its implementation with experimental results. Section 6 concludes the paper.

2. Background

Various faults can occur in cloud computing. Based on fault tolerance policies, various fault tolerance techniques can be used, at either the task level or the workflow level.

2.1 Reactive fault tolerance

Reactive fault tolerance policies reduce the effect of failures on application execution when the failure effectively occurs. There are various techniques

Copyright (c) 2012 International Journal of Computer Science Issues. All Rights Reserved.

which are based on these policies, such as Checkpoint/Restart, Replay and Retry.

• Checkpointing/Restart - When a task fails, it is restarted from the most recently checkpointed state rather than from the beginning. It is an efficient task-level fault tolerance technique for long-running applications [2].

• Replication - Several replicas of a task are run on different resources, so that execution succeeds as long as not all replicas have crashed. It can be implemented using tools such as HAProxy, Hadoop and Amazon EC2.

• Job Migration - When a task fails, it can be migrated to another machine. This technique can be implemented using HAProxy.

• SGuard - It is less disruptive to normal stream processing and makes more resources available. SGuard is based on rollback recovery [18] and can be implemented in Hadoop and Amazon EC2.

• Retry - The simplest task-level technique; it retries the failed task on the same cloud resource [20].

• Task Resubmission - The most widely used fault tolerance technique in current scientific workflow systems [25]. Whenever a failed task is detected, it is resubmitted either to the same or to a different resource at runtime.

• User-defined exception handling - The user specifies the particular treatment of a task failure for workflows.

• Rescue workflow - This technique [20] allows the workflow to continue even if a task fails, until it becomes impossible to move forward without handling the failed task.

2.2 Proactive Fault Tolerance

The principle of proactive fault tolerance policies is to avoid recovery from faults, errors and failures by predicting them and proactively replacing the suspected components with other working components. Some of the techniques based on these policies are Preemptive Migration and Software Rejuvenation.

• Software Rejuvenation - A technique that designs the system for periodic reboots; it restarts the system with a clean state [5].

• Proactive Fault Tolerance using Self-Healing - When multiple instances of an application are running on multiple virtual machines, failures of application instances are handled automatically.

• Proactive Fault Tolerance using Preemptive Migration - Preemptive migration relies on a feedback-loop control mechanism in which the application is constantly monitored and analyzed.

3. Challenges of Implementing Fault Tolerance in Cloud Computing

Providing fault tolerance requires careful consideration and analysis because of the complexity and inter-dependability of the systems involved, and for the following reasons:

• There is a need to implement an autonomic fault tolerance technique for multiple instances of an application running on several virtual machines [12].

• Different technologies from competing vendors of cloud infrastructure need to be integrated to establish a reliable system [15].

• A new approach needs to be developed that integrates these fault tolerance techniques with existing workflow scheduling algorithms [14].

• A benchmark-based method can be developed in the cloud environment for evaluating the performance of fault tolerance components in comparison with similar ones [21].

• To ensure high reliability and availability, multiple cloud computing providers with independent software stacks should be used [22] [23].

• Autonomic fault tolerance must react to synchronization among various clouds [15].

4. Tools Used For Implementing Fault Tolerance

Fault tolerance techniques have been implemented using various tools. Table 1 compares these tools based on their programming framework, environment and application type, along with the fault tolerance techniques they support. HAProxy is used for server failover in the cloud [13]. SHelp [12] is a lightweight runtime system that can survive software failures in the framework of virtual machines. It may also work in a cloud environment for implementing checkpointing. ASSURE [9] introduces rescue points for handling programmer-anticipated failures. Hadoop [7] is used for data-intensive applications but can also be used to implement fault tolerance techniques in a cloud environment. Amazon Elastic Compute Cloud (EC2) [8] provides a virtual computing environment to run Linux-based applications for fault tolerance.
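Checkpointing, mentioned above in connection with SHelp and listed in Section 2.1, can be illustrated with a small sketch. This is not the mechanism of SHelp or of any tool in Table 1; it is a minimal Python illustration under stated assumptions. The function name `run_with_checkpoints`, the JSON checkpoint file and the `fail_at` parameter (used only to simulate a fault) are all hypothetical.

```python
import json
import os
import tempfile


def run_with_checkpoints(items, checkpoint_path, fail_at=None):
    """Sum `items`, saving progress after each item so a restart can resume.

    `fail_at` is a hypothetical knob that simulates a fault at that index.
    """
    # Resume from the last saved state if a checkpoint file exists.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)
    else:
        state = {"next_index": 0, "total": 0}

    for i in range(state["next_index"], len(items)):
        if fail_at is not None and i == fail_at:
            raise RuntimeError(f"simulated fault at item {i}")
        state["total"] += items[i]
        state["next_index"] = i + 1
        # Write to a temporary file first, then rename, so that a crash
        # during the write cannot leave a half-written checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, checkpoint_path)
    return state["total"]


# Demo: the first run fails part-way, the second run resumes from the checkpoint.
demo_path = os.path.join(tempfile.mkdtemp(), "task.ckpt")
data = [1, 2, 3, 4, 5]
try:
    run_with_checkpoints(data, demo_path, fail_at=3)
except RuntimeError as err:
    print("task failed:", err)
print("resumed total:", run_with_checkpoints(data, demo_path))
```

The second call does not redo the first three items; it continues from the saved index. That is the property that makes the technique attractive for long-running tasks.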
Table 1: Tools Used To Implement Existing Fault Tolerance Techniques

| Fault Tolerance Techniques | Policies | System | Programming Framework | Environment | Fault Detected | Application Type |
|---|---|---|---|---|---|---|
| Self Healing, Job Migration, Replication | Reactive/Proactive | HAProxy [13] | Java | Virtual Machine | Process/node failures | Load balancing, fault tolerance |
| Check pointing | Reactive | SHelp [12] | SQL, Java | Virtual Machine | Application failure | Fault tolerance |
| Check pointing, Retry, Self Healing | Reactive/Proactive | Assure [9] | Java | Virtual Machine | Host, network failure | Fault tolerance |
| Job Migration, Replication, Sguard, Resc | Reactive/Proactive | Hadoop [7] | Java, HTML, CSS | Cloud Environment | Application/node failures | Data intensive |
| Replication, Sguard, Task Resubmission | Reactive/Proactive | Amazon EC2 [8] | Amazon Machine Image, Amazon Map | Cloud Environment | Application/node failures | Load balancing, fault tolerance |
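The difference between Retry and Task Resubmission, both of which appear in Table 1 and in Section 2.1, can be made concrete with a short sketch. It is an illustration only, not code from any of the tools in the table. The function `run_with_retry_and_resubmit`, the resource names `vm-1` and `vm-2` and the `flaky_task` helper are hypothetical.

```python
def run_with_retry_and_resubmit(task, resources, attempts_per_resource=2):
    """Try `task` on each resource in order.

    Repeated attempts on the same resource correspond to Retry; moving on
    to the next resource corresponds to Task Resubmission.
    Returns (result, resource_used, total_attempts).
    """
    attempts = 0
    last_error = None
    for resource in resources:
        for _ in range(attempts_per_resource):
            attempts += 1
            try:
                return task(resource), resource, attempts
            except Exception as err:  # any failure counts as a task fault here
                last_error = err
    raise RuntimeError(
        f"task failed on all resources after {attempts} attempts"
    ) from last_error


def flaky_task(resource):
    """Hypothetical task that always fails on vm-1 and succeeds elsewhere."""
    if resource == "vm-1":
        raise ConnectionError("vm-1 is down")
    return f"done on {resource}"


result, used, attempts = run_with_retry_and_resubmit(flaky_task, ["vm-1", "vm-2"])
print(result, used, attempts)
```

In this run the task is attempted twice on `vm-1` (retry) and then succeeds on its first attempt on `vm-2` (resubmission), three attempts in total.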

5. Proposed Cloud Virtualized System Architecture And Implementation

5.1 Cloud Virtualized System Architecture

A few techniques currently exist for autonomic fault tolerance in cloud environments. SHelp can survive software faults for server applications running in a virtual machine environment [12]. There is a need to implement autonomic fault tolerance in the cloud environment: if any one of the servers breaks down, the system should automatically redirect user requests to the backup server. The cloud virtualized system architecture has therefore been proposed and implemented using HAProxy. Application availability and reliability can be maintained using the proposed cloud virtualized system architecture, shown in Figure 1. The server virtualized system consists of VMs (server 1 and server 2) on which an Ubuntu 10.04 OS and a database application are running. Server 2 is a backup server in case of failure. HAProxy is configured on the third virtual machine to be used for fault tolerance. The availability of the servers is continuously monitored by the HAProxy statistics tool on the fault tolerant server. HAProxy runs on the web server to handle requests from the web. When one of the servers goes down unexpectedly, connections are automatically redirected to the other server.
[Figure 1 shows three VMs hosted on a virtual machine server (VMware) within the cloud virtualized system: Server 1 and Server 2, each running a web server, the application and a MySQL database, and the fault tolerant system running the HAProxy server.]

Figure 1: Cloud Virtualized System Architecture
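The redirection behaviour of the architecture in Figure 1 can be sketched in a few lines. This is a toy model in Python, not HAProxy itself and not the authors' configuration. The class `FailoverProxy` and the server names `server1` and `server2` are hypothetical; the sketch assumes that server 1 is preferred and server 2 is used only while server 1 is marked down.

```python
class FailoverProxy:
    """Toy front end that routes each request to the first healthy server."""

    def __init__(self, primary, backup):
        # List order encodes preference: primary first, then backup.
        self.servers = [primary, backup]
        self.healthy = {primary: True, backup: True}

    def mark(self, server, is_up):
        """Record the result of a health check for `server`."""
        self.healthy[server] = is_up

    def route(self):
        """Return the server that should receive the next request."""
        for server in self.servers:
            if self.healthy[server]:
                return server
        raise RuntimeError("no healthy server available")


proxy = FailoverProxy(primary="server1", backup="server2")
print(proxy.route())          # both servers up: primary is used
proxy.mark("server1", False)  # server 1 goes down
print(proxy.route())          # requests are redirected to the backup
proxy.mark("server1", True)   # server 1 recovers
print(proxy.route())          # traffic returns to the primary
```

A real proxy would set the health flags from periodic checks rather than from explicit calls to `mark`; the routing decision is the part this sketch is meant to show.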

5.2 Implementation and Experimental Results

The fault tolerant system has been implemented using HAProxy and MySQL. HAProxy is used to handle server failures in the fault tolerant cloud environment. It provides a web interface for statistics known as HAProxy statistics. The implementation includes two virtual machines as web servers, server 1 and server 2, hosting Apache Tomcat 6.0.32. HAProxy version 1.3.15.2 is configured on the third virtual machine. A simple database application written in Java is installed on the web servers. Xampp for Linux Windows XP (SP2) is used to install MySQL. Data in MySQL is replicated using the replication technique for local backup. Replication enables data from one MySQL database server (the master) to be replicated to one or more MySQL database servers. The application can be accessed on either web server. Data consistency is also maintained through MySQL replication. The experimental results show that HAProxy can assist server applications in recovering from server failures in just a few milliseconds with minimum performance overhead.
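The master-to-replica flow described above can be sketched as a small in-memory model. It only loosely mirrors how MySQL replication works (a master records changes in a log and replicas apply them in order) and it is not the configuration used in the experiments. The classes `Master` and `Replica` and the sample rows are hypothetical.

```python
class Master:
    """Toy master: applies writes locally and appends them to an ordered log."""

    def __init__(self):
        self.rows = {}
        self.log = []  # ordered (key, value) writes

    def write(self, key, value):
        self.rows[key] = value
        self.log.append((key, value))


class Replica:
    """Toy replica: replays the master's log from its last applied position."""

    def __init__(self):
        self.rows = {}
        self.applied = 0  # number of log entries already applied

    def sync(self, master):
        """Apply all log entries not yet seen; return how many were applied."""
        pending = master.log[self.applied:]
        for key, value in pending:
            self.rows[key] = value
        self.applied = len(master.log)
        return len(pending)


master, replica = Master(), Replica()
master.write("id1", "alice")
master.write("id2", "bob")
print(replica.sync(master), replica.rows)
master.write("id1", "carol")
print(replica.sync(master), replica.rows == master.rows)
```

Because the replica tracks its position in the log, repeated calls to `sync` are cheap and the replica converges to the master's state, which is what allows server 2 to serve the same data after a failover.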
MySQL database server (the master) to be replicated

Case 1: Figure 2 shows the statistics when both servers are working; the green lines indicate that the servers are up.

Figure 2: HAProxy Statistics


Case 2: Figure 3 shows the statistics when server 1 goes down and server 2 is still up. The red line in this figure indicates that the server is down.

Figure 3: Server 1 is down


Case 3: Figure 4 shows the stats when server 2 goes down and server 1 is still up.

Figure 4: Server 2 is down
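The up and down states reported in Cases 1 to 3 come from periodic health checks. A common way to avoid flapping is to change a server's state only after several consecutive checks disagree with it. The sketch below illustrates that idea in Python. The thresholds are named `rise` and `fall` after HAProxy's server options of the same names, but the paper does not state which values were used, so the numbers here, along with the class `HealthTracker`, are assumptions.

```python
class HealthTracker:
    """Track one server's state from a stream of health-check results."""

    def __init__(self, rise=2, fall=3):
        self.rise = rise    # consecutive successes needed to go DOWN -> UP
        self.fall = fall    # consecutive failures needed to go UP -> DOWN
        self.is_up = True
        self.streak = 0     # consecutive results contradicting current state

    def record(self, check_ok):
        """Feed one check result; return the server state after it."""
        if check_ok == self.is_up:
            self.streak = 0  # result agrees with current state
        else:
            self.streak += 1
            threshold = self.fall if self.is_up else self.rise
            if self.streak >= threshold:
                self.is_up = check_ok
                self.streak = 0
        return self.is_up


tracker = HealthTracker(rise=2, fall=3)
checks = [True, False, False, False, True, True]
print([tracker.record(ok) for ok in checks])
```

With these assumed thresholds, the server is reported down only after the third consecutive failed check and up again only after the second consecutive successful one.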


Case 4: Replication enables data from one MySQL database server to be replicated to one or more MySQL database servers (the slaves). Figure 5 shows the database table of server 1, viewed using SQLyog, and the connection established with server 2 as the slave.


Figure 5: Database entries of server 1

Case 5: Figure 6 shows the replicated database table of server 2. When server 1 fails, data is replicated to server 2.

Figure 6: Replicated database entries of server 2

6. Conclusion

Fault tolerance is concerned with all the techniques necessary to enable a system to tolerate software faults remaining in the system after its development. This paper discussed fault tolerance techniques, covering their research challenges and the tools used for implementing them in cloud computing. A cloud virtualized system architecture is also proposed based on HAProxy. Autonomic fault tolerance is implemented to deal with various software faults for server applications in a cloud virtualized environment. When one of the servers goes down unexpectedly, connections are automatically redirected to the other server. A data replication technique is implemented in the virtual machine environment. The experimental results obtained validate the system's fault tolerance.
References

[1] Antonina Litvinova, Christian Engelmann and Stephen L. Scott, "A Proactive Fault Tolerance Framework for High Performance Computing", 2009.

[2] Golam Moktader Nayeem, Mohammad Jahangir Alam, "Analysis of Different Software Fault Tolerance Techniques", 2006.

[3] Steven Y. Ko, Imranul Hoque, Brian Cho and Indranil Gupta, "On Availability of Intermediate Data in Cloud Computations", 2010.

[4] L. M. Vaquero, L. Rodero-Merino, J. Caceres and M. Lindner, "A break in the clouds: towards a cloud definition", SIGCOMM Computer Communication Review, vol. 39, pp. 50–55, December 2008.

[5] M. Armbrust, A. Fox, R. Griffith, et al., "A view of cloud computing", Communications of the ACM, vol. 53, no. 4, pp. 50–58, 2010.

[6] R. Buyya, S. Pandey and C. Vecchiola, "Cloudbus toolkit for market-oriented cloud computing", in Proceedings of the 1st International Conference on Cloud Computing (CloudCom 2009), Beijing, China, December 2009.

[7] Hadoop MapReduce Tutorial. http://hadoop.apache.org/core/docs/current/mapred_tutorial.html.

[8] Amazon Elastic Compute Cloud (EC2). http://www.amazon.com/ec2/

[9] S. Sidiroglou, O. Laadan, C. Perez, N. Viennot, J. Nieh, and A. D. Keromytis, "ASSURE: Automatic Software Self-healing Using REscue points", Proceedings of the 14th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'09), ACM Press, March 7-11, 2009, Washington, DC, USA, pp. 37-48.

[10] B. Buck and J. K. Hollingsworth, "An API For Runtime Code Patching", International Journal of High Performance Computing Applications, Vol. 14, No. 4, November 2000, pp. 317-329.

[11] S. Osman, D. Subhraveti, G. Su, and J. Nieh, "The Design and Implementation of Zap: A System For Migrating Computing Environments", Proceedings of the 5th Symposium on Operating Systems Design and Implementation (OSDI'02), USENIX Association, December 9-11, 2002, Boston, Massachusetts, USA, pp. 361-376.

[12] Gang Chen, Hai Jin, Deqing Zou, Bing Bing Zhou, Weizhong Qiang, Gang Hu, "SHelp: Automatic Self-healing for Multiple Application Instances in a Virtual Machine Environment", IEEE International Conference on Cluster Computing, 2010.

[13] http://haproxy.1wt.eu/download/1.3/doc/configuration.txt.

[14] Yang Zhang, Anirban Mandal, Charles Koelbel and Keith Cooper, "Combined Fault Tolerance and Scheduling Techniques for Workflow Applications on Computational Grids", in 9th IEEE/ACM international symposium on clustering and grid, 2010.

[15] Imad M. Abbadi, "Self-Managed Services Conceptual Model in Trustworthy Clouds' Infrastructure", 2010.

[16] Manish Pokharel and Jong Sou Park, "Increasing System Fault Tolerance with Software Rejuvenation in E-government System", IJCSNS International Journal of Computer Science and Network Security, Vol. 10, No. 5, May 2010.

[17] Zaipeng Xie, Hongyu Sun and Kewal Saluja, "A Survey of Software Fault Tolerance Techniques".

[18] Geoffroy Vallee, Kulathep Charoenpornwattana, Christian Engelmann, Anand Tikotekar, Stephen L. Scott, "A Framework for Proactive Fault Tolerance".

[19] Anglano C, Canonico M, "Fault-tolerant scheduling for bag-of-tasks grid applications", in: Advances in grid computing - EGC 2005. Lecture notes in computer science, vol 3470/2005. Springer, Berlin/Heidelberg. ISSN: 0302-9743 Print. doi:10.1007/b137919, ISBN: 978-3-540-26918-2, pp 630–639.

[20] Elvin Sindrilaru, Alexandru Costan, Valentin Cristea, "Fault Tolerance and Recovery in Grid Workflow Management Systems", 2010 International Conference on Complex, Intelligent and Software Intensive Systems.

[21] S. Hwang, C. Kesselman, "Grid Workflow: A Flexible Failure Handling Framework for the Grid", 12th IEEE International Symposium on High Performance Distributed Computing (HPDC'03), Seattle, Washington, USA, IEEE CS Press, Los Alamitos, CA, USA, June 22 - 24, 2003.

[22] Michael Armbrust, Armando Fox, Rean Griffith, "Above the Clouds: A Berkeley View of Cloud Computing", Electrical Engineering and Computer Sciences, University of California at Berkeley, 2009.

[23] Wenbing Zhao, P. M. Melliar-Smith and L. E. Moser, "Fault Tolerance Middleware for Cloud Computing", 2010 IEEE 3rd International Conference on Cloud Computing.

[24] Kassian Plankensteiner, Radu Prodan, Thomas Fahringer, "A New Fault Tolerance Heuristic for Scientific Workflows in Highly Distributed Environments based on Resubmission Impact", Fifth IEEE International Conference on e-Science, Austria, 2009.

Anju Bala completed her B.Tech in Computer Science (1999) and M.Tech. in Information Technology from Pbi.University (2004) and is pursuing a Ph.D. in Cloud Computing from Thapar University, Patiala (2009), and has over eleven years of teaching experience. She is working as Assistant Professor in the Computer Science and Engineering Department of Thapar University, Patiala. Her research interests lie in Fault Tolerance in Cloud Computing and Workflow Management in Cloud Computing. She has over 14 publications in Conferences and International Journals of repute.

Dr. Inderveer Chana completed her B.Tech in Computer Science (1997), M.E. in Software Engineering from TIET (2002) and Ph.D. in Resource Management in Grid Computing from Thapar University, Patiala (2009), and has over thirteen years of teaching and research experience. She is working as Assistant Professor in the Computer Science and Engineering Department of Thapar University, Patiala. Her research interests include Grid Computing, Cloud Computing and resource management challenges in Grids and Clouds. She has over 30 publications in International Journals and Conferences of repute. More than 20 Masters theses have been completed so far under her supervision, and she is currently supervising 9 Doctoral candidates in the area of Grid and Cloud Computing.
