Fault Tolerance-Challenges Techniques and Implemen
All content following this page was uploaded by Anju Bala on 31 July 2015.
1
Computer Science and Engineering Department, Thapar University
Patiala-147004, Punjab, India
2
Computer Science and Engineering Department, Thapar University
Patiala-147004, Punjab, India
Copyright (c) 2012 International Journal of Computer Science Issues. All Rights Reserved.
IJCSI International Journal of Computer Science Issues, Vol. 9, Issue 1, No 1, January 2012
ISSN (Online): 1694-0814
www.IJCSI.org 289

Abstract
Fault tolerance is a major concern in guaranteeing the availability and reliability of critical services as well as application execution. In order to minimize the impact of failures on the system and on application execution, failures should be anticipated and proactively handled. Fault tolerance techniques are used to predict these failures and take appropriate action before the failures actually occur. This paper discusses the existing fault tolerance techniques in cloud computing based on their policies, the tools used and the research challenges. A cloud virtualized system architecture has been proposed. In the proposed system, autonomic fault tolerance has been implemented. The experimental results demonstrate that the proposed system can deal with various software faults for server applications in a cloud virtualized environment.

Keywords: Cloud Computing; Virtual Machine; Fault Tolerance; Replication

1. Introduction

Cloud computing is a style of computing where service is provided across the Internet using different models and layers of abstraction [4]. It refers to applications delivered as services [5] to the masses, ranging from end-users hosting their personal documents on the Internet to enterprises outsourcing their entire IT infrastructure to external data centers. A simple example of a cloud computing service is Yahoo email or Gmail.

Although cloud computing has been widely adopted by the industry, many research issues have yet to be fully addressed, such as fault tolerance, workflow scheduling, workflow management and security. Fault tolerance is one of the key issues among them. It is concerned with all the techniques necessary to enable a system to tolerate software faults remaining in the system after its development. When a fault occurs, these techniques provide mechanisms to the software system to prevent system failure from occurring [18]. The main benefits of implementing fault tolerance in cloud computing include failure recovery, lower cost and improved performance metrics. This paper aims to provide a better understanding of fault tolerance challenges and identifies various tools and techniques used for fault tolerance. When multiple instances of an application are running on several virtual machines and one of the servers goes down, there is a need for an autonomic fault tolerance technique that can handle these types of faults. To address this issue, a cloud virtualized system architecture has been proposed and implemented using HAProxy. The proposed architecture has also been validated through experimental results.

The rest of the paper is organized as follows. Section 2 discusses fault tolerance techniques based on their policies. Section 3 presents the challenges of implementing fault tolerance in cloud computing. Section 4 compares various tools used for implementing fault tolerance techniques and presents a comparison table. Section 5 presents the proposed cloud virtualized architecture and its implementation with experimental results. Section 6 concludes the paper.

2. Background

Various faults can occur in cloud computing. Based on fault tolerance policies, various fault tolerance techniques can be used, at either the task level or the workflow level.

2.1 Reactive Fault Tolerance
Reactive fault tolerance policies reduce the effect of failures on application execution once a failure has actually occurred. There are various techniques
which are based on these policies, such as Checkpoint/Restart, Replay and Retry.

Checkpointing/Restart - When a task fails, it is restarted from the most recently checkpointed state rather than from the beginning. It is an efficient task-level fault tolerance technique for long-running applications [2].

Replication - Several replicas of a task are run on different resources, so that execution succeeds as long as not all of the replicas have crashed. It can be implemented using tools like HAProxy, Hadoop and Amazon EC2.

Job Migration - When a task fails, it can be migrated to another machine. This technique can be implemented using HAProxy.

SGuard - It is less disruptive to normal stream processing and makes more resources available. SGuard is based on rollback recovery [18] and can be implemented in Hadoop and Amazon EC2.

Retry - It is the simplest task-level technique: the failed task is retried on the same cloud resource [20].

Task Resubmission - It is the most widely used fault tolerance technique in current scientific workflow systems [25]. Whenever a failed task is detected, it is resubmitted at runtime either to the same resource or to a different one.

User-defined exception handling - The user specifies the particular treatment of a task failure for workflows.

Rescue workflow - This technique [20] allows the workflow to continue even if a task fails, until it becomes impossible to move forward without catering to the failed task.

2.2 Proactive Fault Tolerance
The principle of proactive fault tolerance policies is to avoid recovery from faults, errors and failures by predicting them and proactively replacing the suspected components with other working components. Some of the techniques based on these policies are Preemptive Migration and Software Rejuvenation.

Software Rejuvenation - It is a technique that designs the system for periodic reboots. It restarts the system with a clean state [5].

Proactive Fault Tolerance using Self-Healing - When multiple instances of an application are running on multiple virtual machines, failures of application instances are handled automatically.

Proactive Fault Tolerance using Preemptive Migration - Preemptive Migration relies on a feedback-loop control mechanism in which the application is constantly monitored and analyzed.

3. Challenges of Implementing Fault Tolerance in Cloud Computing

Providing fault tolerance requires careful consideration and analysis because of its complexity and inter-dependability, and for the following reasons.

There is a need to implement an autonomic fault tolerance technique for multiple instances of an application running on several virtual machines [12].

Different technologies from competing vendors of cloud infrastructure need to be integrated to establish a reliable system [15].

A new approach needs to be developed that integrates these fault tolerance techniques with existing workflow scheduling algorithms [14].

A benchmark-based method can be developed in the cloud environment for evaluating the performance of fault tolerance components in comparison with similar ones [21].

To ensure high reliability and availability, multiple cloud computing providers with independent software stacks should be used [22] [23].

Autonomic fault tolerance must react to synchronization among various clouds [15].

4. Tools Used For Implementing Fault Tolerance

Fault tolerance challenges and techniques have been implemented using various tools. Table 1 compares these tools based on their programming framework, environment and application type, along with the fault tolerance techniques they support. HAProxy is used for server failover in the cloud [13]. SHelp [12] is a lightweight runtime system that can survive software failures in the framework of virtual machines. It may also work in a cloud environment for implementing checkpointing. ASSURE [9] introduces rescue points for handling programmer-anticipated failures. Hadoop [7] is used for data-intensive applications but can also be used to implement fault tolerance techniques in a cloud environment. Amazon Elastic Compute Cloud (EC2) [8] provides a virtual computing environment to run Linux-based applications for fault tolerance.
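The Retry and Task Resubmission techniques of Section 2.1 can be illustrated with a minimal sketch. This is not the paper's implementation: the names `retry`, `resubmit` and `TaskFailed`, and the idea of identifying a resource by a plain string, are assumptions made for illustration only.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


class TaskFailed(Exception):
    """Hypothetical error: the task could not be completed."""


def retry(task: Callable[[str], T], resource: str, attempts: int = 3) -> T:
    """Retry: re-run the failed task on the SAME resource."""
    last_error: Optional[Exception] = None
    for _ in range(attempts):
        try:
            return task(resource)
        except Exception as exc:  # a real system would catch narrower errors
            last_error = exc
    raise TaskFailed(f"{attempts} attempt(s) failed on {resource}") from last_error


def resubmit(task: Callable[[str], T], resources: Sequence[str],
             attempts_per_resource: int = 1) -> T:
    """Task resubmission: on failure, submit the task to the next resource."""
    for resource in resources:
        try:
            return retry(task, resource, attempts_per_resource)
        except TaskFailed:
            continue  # this resource is exhausted, try the next one
    raise TaskFailed("task failed on every available resource")


if __name__ == "__main__":
    def flaky(resource: str) -> str:
        # Assumed failure model: the task always fails on vm-1.
        if resource == "vm-1":
            raise RuntimeError("vm-1 is down")
        return f"done on {resource}"

    print(resubmit(flaky, ["vm-1", "vm-2"], attempts_per_resource=2))
```

Retry stays on one resource, which helps only with transient faults; resubmission widens the search to other resources, at the cost of needing a list of candidates.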
Table 1: Tools Used To Implement Existing Fault Tolerance Techniques

Fault Tolerance Techniques | Policies | System | Programming Framework | Environment | Fault Detected | Application Type
Self Healing, Job Migration, Replication | Reactive/Proactive | HAProxy [13] | Java | Virtual Machine | Process/node failures | Load balancing, fault tolerance
Check pointing | Reactive | SHelp [12] | SQL, Java | Virtual Machine | Application failure | Fault tolerance
Check pointing, Retry, Self Healing | Reactive/Proactive | Assure [9] | Java | Virtual Machine | Host, network failure | Fault tolerance
Job Migration, Replication, SGuard, Resc | Reactive/Proactive | Hadoop [7] | Java, HTML, CSS | Cloud Environment | Application/node failures | Data intensive
Replication, SGuard, Task Resubmission | Reactive/Proactive | Amazon EC2 [8] | Amazon Machine Image, Amazon Map | Cloud Environment | Application/node failures | Load balancing, fault tolerance
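Checkpoint/restart, one of the techniques compared in Table 1, can be sketched as follows. This is an illustrative toy under stated assumptions, not how SHelp or ASSURE implement it: the task is a running sum, the checkpoint is a small JSON file, and the names `run_with_checkpoints`, `save_checkpoint` and the `fail_at` crash switch are hypothetical.

```python
import json
import os
from typing import List, Optional, Tuple


def save_checkpoint(path: str, state: dict) -> None:
    """Write the state to a temporary file, then swap it in atomically."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as fh:
        json.dump(state, fh)
    os.replace(tmp_path, path)


def run_with_checkpoints(items: List[int], path: str,
                         fail_at: Optional[int] = None) -> Tuple[int, int]:
    """Sum `items`, checkpointing after every step.

    Returns (total, steps executed in this run). If a checkpoint file exists,
    the run resumes from it instead of starting from the beginning.
    `fail_at` simulates a crash just before the given step.
    """
    state = {"next_index": 0, "total": 0}
    if os.path.exists(path):
        with open(path) as fh:
            state = json.load(fh)

    steps_run = 0
    for i in range(state["next_index"], len(items)):
        if fail_at is not None and i == fail_at:
            raise RuntimeError(f"simulated crash before step {i}")
        state["total"] += items[i]
        state["next_index"] = i + 1
        save_checkpoint(path, state)
        steps_run += 1

    os.remove(path) if os.path.exists(path) else None  # finished: drop checkpoint
    return state["total"], steps_run
```

If a run over five items crashes before step 3, a second call with the same checkpoint path only executes the two remaining steps, which is the saving the technique offers for long-running applications.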
[Figure: Cloud Virtualized System. Two VMs each host a web server, the application and a MySQL database; a third VM hosts the HAProxy server. All three VMs run on a virtual machine server (VMware).]
5.2 Implementation and Experimental Results

The fault tolerant system has been implemented using HAProxy and MySQL. HAProxy is used to handle server failures in the fault tolerant cloud environment. It provides a web interface for statistics, known as HAProxy statistics. The implementation includes two virtual machines as web servers, server 1 and server 2, hosting Apache Tomcat 6.0.32. HAProxy version 1.3.15.2 is configured on a third virtual machine. A simple database application written in Java is installed on the web servers. Xampp for Linux Windows XP (SP2) is used to install MySQL. Data in MySQL is replicated using the replication technique for local backup. Replication enables data from one MySQL database server (the master) to be replicated to one or more MySQL database servers. The application can be accessed on either of the web servers. Data consistency is also maintained through MySQL replication. The experimental results show that HAProxy can help server applications recover from server failures in just a few milliseconds with minimum performance overhead.
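The failover behaviour described above can be approximated by a small simulation. This sketch does not use or reproduce HAProxy itself: `Backend`, `FailoverBalancer`, the health-probe callables and the round-robin policy are assumptions chosen to show the general idea of health-checked failover between two web servers.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass
class Backend:
    name: str
    is_up: Callable[[], bool]  # hypothetical health probe for this server


class FailoverBalancer:
    """Round-robin over the backends that passed the last health check."""

    def __init__(self, backends: Iterable[Backend]) -> None:
        self.backends: List[Backend] = list(backends)
        self.healthy: Dict[str, bool] = {b.name: True for b in self.backends}
        self._next = 0

    def run_health_checks(self) -> None:
        """Probe every backend and record whether it is up."""
        for backend in self.backends:
            self.healthy[backend.name] = backend.is_up()

    def pick(self) -> str:
        """Return the next healthy backend, skipping any that are down."""
        count = len(self.backends)
        for offset in range(count):
            index = (self._next + offset) % count
            backend = self.backends[index]
            if self.healthy[backend.name]:
                self._next = (index + 1) % count
                return backend.name
        raise RuntimeError("no healthy backend available")


if __name__ == "__main__":
    status = {"server1": True, "server2": True}
    balancer = FailoverBalancer(
        Backend(name, lambda n=name: status[n]) for name in status
    )
    print([balancer.pick() for _ in range(2)])  # both servers take requests
    status["server1"] = False                   # simulate failure of server 1
    balancer.run_health_checks()
    print([balancer.pick() for _ in range(2)])  # all traffic goes to server 2
```

In this model the recovery delay is bounded by how often `run_health_checks` is called; the few-millisecond figure reported above is the paper's measurement and is not something this sketch reproduces.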
Case 1: Figure 2 shows the statistics when both servers are working; the green lines indicate that the servers are up.
Case 5: Figure 6 shows the replicated database table of server 2. When server 1 fails, data is replicated to server 2.
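The master-to-replica data flow behind this case can be sketched with an in-memory model. It is only an illustration of log-based replication in general: the `Master` and `Replica` classes and the `sync` method are hypothetical and do not correspond to MySQL's actual replication protocol or configuration.

```python
from typing import Dict, List, Tuple


class Master:
    """Accepts writes and records them in an ordered change log."""

    def __init__(self) -> None:
        self.data: Dict[str, str] = {}
        self.log: List[Tuple[str, str]] = []

    def write(self, key: str, value: str) -> None:
        self.data[key] = value
        self.log.append((key, value))


class Replica:
    """Replays the master's change log from the last applied position."""

    def __init__(self) -> None:
        self.data: Dict[str, str] = {}
        self.applied = 0  # number of log entries already replayed

    def sync(self, master: Master) -> int:
        """Apply any new log entries; return how many were applied."""
        pending = master.log[self.applied:]
        for key, value in pending:
            self.data[key] = value
        self.applied += len(pending)
        return len(pending)


if __name__ == "__main__":
    server1, server2 = Master(), Replica()
    server1.write("user:1", "alice")
    server1.write("user:2", "bob")
    server2.sync(server1)
    print(server2.data == server1.data)  # the replica now mirrors the master
```

Because the replica tracks its position in the log, repeated calls to `sync` are safe and only transfer changes made since the previous call.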
(ASPLOS'09), ACM Press, March 7-11, 2009, Washington, DC, USA, pp. 37-48.

[10] B. Buck and J. K. Hollingsworth, "An API For Runtime Code Patching", International Journal of High Performance Computing Applications, Vol. 14, No. 4, November 2000, pp. 317-329.

[11] S. Osman, D. Subhraveti, G. Su, and J. Nieh, "The Design and Implementation of Zap: A System For Migrating Computing Environments", Proceedings of the 5th Symposium on Operating Systems Design and Implementation (OSDI'02), USENIX Association, December 9-11, 2002, Boston, Massachusetts, USA, pp. 361-376.

[12] Gang Chen, Hai Jin, Deqing Zou, Bing Bing Zhou, Weizhong Qiang, Gang Hu, "SHelp: Automatic Self-healing for Multiple Application Instances in a Virtual Machine Environment", IEEE International Conference on Cluster Computing, 2010.

[13] http://haproxy.1wt.eu/download/1.3/doc/configuration.txt.

[14] Yang Zhang, Anirban Mandal, Charles Koelbel and Keith Cooper, "Combined Fault Tolerance and Scheduling Techniques for Workflow Applications on Computational Grids", in 9th IEEE/ACM International Symposium on Clustering and Grid, 2010.

[15] Imad M. Abbadi, "Self-Managed Services Conceptual Model in Trustworthy Clouds' Infrastructure", 2010.

[16] Manish Pokharel and Jong Sou Park, "Increasing System Fault Tolerance with Software Rejuvenation in E-government System", IJCSNS International Journal of Computer Science and Network Security, Vol. 10, No. 5, May 2010.

[17] Zaipeng Xie, Hongyu Sun and Kewal Saluja, "A Survey of Software Fault Tolerance Techniques".

[18] Geoffroy Vallee, Kulathep Charoenpornwattana, Christian Engelmann, Anand Tikotekar, Stephen L. Scott, "A Framework for Proactive Fault Tolerance".

[19] Anglano C, Canonico M, "Fault-tolerant scheduling for bag-of-tasks grid applications", In: Advances in Grid Computing - EGC 2005. Lecture Notes in Computer Science, vol. 3470/2005. Springer, Berlin/Heidelberg. ISSN: 0302-9743 Print. doi:10.1007/b137919, ISBN: 978-3-540-26918-2, pp. 630–639.

[20] Elvin Sindrilaru, Alexandru Costan, Valentin Cristea, "Fault Tolerance and Recovery in Grid Workflow Management Systems", 2010 International Conference on Complex, Intelligent and Software Intensive Systems.

[21] S. Hwang, C. Kesselman, "Grid Workflow: A Flexible Failure Handling Framework for the Grid", 12th IEEE International Symposium on High Performance Distributed Computing (HPDC'03), Seattle, Washington, USA, IEEE CS Press, Los Alamitos, CA, USA, June 22-24, 2003.

[22] Michael Armbrust, Armando Fox, Rean Griffith, "Above the Clouds: A Berkeley View of Cloud Computing", Electrical Engineering and Computer Sciences, University of California at Berkeley, 2009.

[23] Wenbing Zhao, P. M. Melliar-Smith and L. E. Moser, "Fault Tolerance Middleware for Cloud Computing", 2010 IEEE 3rd International Conference on Cloud Computing.

[24] Kassian Plankensteiner, Radu Prodan, Thomas Fahringer, "A New Fault Tolerance Heuristic for Scientific Workflows in Highly Distributed Environments based on Resubmission Impact", Fifth IEEE International Conference on e-Science, Austria, 2009.

Anju Bala completed her B.Tech in Computer Science (1999) and M.Tech. in Information Technology from Pbi.University (2004) and pursuing Ph.D. in Cloud Computing from Thapar University, Patiala (2009) and has over eleven years of teaching experience. She is working as Assistant Professor in Computer Science and Engineering Department of Thapar University, Patiala. Her research interests lie in Fault Tolerance in Cloud Computing and Workflow management in Cloud Computing. She has over 14 publications in Conferences and International Journals of repute.

Dr. Inderveer Chana completed her B.Tech in Computer Science (1997) and M.E. in Software Engineering from TIET (2002) and Ph.D. in Resource Management in Grid Computing from Thapar University, Patiala (2009) and has over thirteen years of teaching and research experience. She is working as Assistant Professor in Computer Science and Engineering Department of Thapar University, Patiala. Her research interests include Grid computing, Cloud Computing and resource management challenges in Grids and Clouds. She has over 30 publications in International Journals and Conferences of repute. More than 20 Masters have been completed so far under her supervision and is currently supervising 9 Doctoral candidates in the area of Grid and Cloud Computing.