Assignment 1: CT074-3-M-RELM Reliability Management
CT074-3-M-RELM
Reliability Management
Assignment 1
Evaluating major issues regarding Reliability Management for Cloud-
based Application of Mass Media Company
CT074-3-M-RELM Reliability Management TP063111
Abstract: Cloud-based applications, or cloud computing, as a new computing model, are attracting wide attention from mass media
companies and other industries. Based on resource virtualization technology, cloud computing can provide users with infrastructure,
platforms, software, and other services on an on-demand, pay-per-use basis. As a result, more and more companies are choosing cloud
computing to host their scientific or commercial applications. However, as the number of users grows, data centers are expanding
rapidly and their architecture is becoming more complex, so frequent failures of cloud computing systems cause huge losses.
Ensuring the reliability of large-scale cloud computing systems with complex architectures has therefore become a very challenging
problem. In the context of cloud computing reliability issues, this paper presents an overview of the cloud computing architecture
and service models, and describes cloud computing reliability, its design principles, and the factors that affect it. It then
briefly describes how the reliability of cloud computing systems is evaluated and presents future challenges to the reliability of
cloud computing systems.
quick action and libertine system for their end-users or clients. Cloud computing management already depends on the service provider, which gives end users substantial support. In simple terms, cloud computing is a service that gives end-users or clients access to resources (tools or applications) over the Internet (Frankenfield, 2020).

Cloud computing's on-demand self-service, broad network access, resource pooling, rapid elasticity, and measured service make it extremely useful for end users. Many mass media industries support this model because it enables end-users to obtain resources such as hardware and software via the Internet in a measurable and virtual manner (Lewis, 2010). As a result, cloud computing not only provides practically unlimited operation for its end users but also brings economic benefits to the industries that use it. Cloud computing therefore offers a significant economic benefit that best suits clients who are concerned with cost-effectiveness.

Service) model provides access to application software and is often referred to as an on-demand service. SaaS provides an application to many users, regardless of their location, instead of the traditional model of one application per desktop. With SaaS, users do not need to install, set up, or run the application. These activities are carried out centrally by the service provider.

Platform as a Service (PaaS): PaaS provides a platform for program and application development without the need to install any software. Applications are created with programming languages and tools that are supported by the PaaS provider.

Infrastructure as a Service (IaaS): IaaS provides a computing environment for sharing resources such as servers, data storage, networks, and operating systems. IaaS lets users set up and run their applications on shared resources. Clients purchase those resources as fully outsourced services on demand.
platform as a service (PaaS), and software as a service (SaaS). Therefore, the reliability of cloud computing systems is closely linked to the specific service model. For cloud creation, the services provided by computing systems are more reliable. Each service model places responsibility for service reliability on the service provider, and the respective responsibilities vary with the service model. For example, an IaaS provider should ensure that the hardware facilities are always in a stable operational condition and that hardware failure does not affect the quality of the services running on the infrastructure, while a SaaS provider must ensure that the software system contains no serious software errors, to avoid software failures that affect the user's service experience.

Figure 2: Cloud computing service model (Wikipedia contributors, 2021)

3.2 Design principles of reliable cloud computing systems

Gill and Buyya (2020) propose three design principles to avoid cloud computing system failures during operation, to restore cloud computing services if the system fails, and to ensure the reliability of the cloud computing system.
Design for resilience: Cloud computing services can tolerate the failure of system components without human intervention. The system detects failures and takes corrective action automatically, so that users are not aware of service interruptions. When a service fails, the system provides reduced functionality instead of crashing completely.
Design for data integrity: If a system fails, the service can still manipulate, store, and discard data as in normal operation, so that the integrity of user-managed data is maintained.
Design for recoverability: When an abnormal situation occurs in the system, the service is automatically restored as soon as possible.

Following these design principles helps improve the reliability of cloud computing systems and reduces the impact of failures. In addition to these principles, if an application service instance fails, the unfinished part of the service and any delayed service should eventually be able to complete as required. Once the system fails, the service provider or user should take appropriate steps to recover the service so that the failure does not reduce the effectiveness of the service. At the same time, manual effort should be reduced as much as possible during the service recovery process.

3.3 Factors that affect the reliability of cloud computing systems

According to Javadi et al. (2012), a system failure can be defined as an event in which the system cannot function properly as agreed. When the system departs from normal operation, it can be considered to have failed. Therefore, system failure is a major factor affecting system reliability. In a cloud computing system, the resources are large, the architecture is very complex, the number of services is very high, and the service models depend on one another, so failures occur more often than in conventional systems. Failure is therefore the main threat to the reliability of cloud computing systems.

In software testing, a failure is an unwanted or unacceptable external behavior that occurs during the operation of the software. A fault is an unwanted or unacceptable internal state while the software is running, such as entering the wrong conditional branch during execution. Errors are unwanted or unacceptable defects, such as coding mistakes that exist in software programs, data, or documents. This paper extends the software testing definition of failure to any incorrect behavior of hardware, software, or other components or systems. For example, if aging memory causes read and write problems so that the computer system does not work properly, the system can be said to have a memory failure. This section provides an overview of common failures and major causes of failure in cloud computing systems.

Based on the characteristics of failure, there are three main types of failures in cloud computing systems: resource failure, service failure, and other failures. Resource failure refers to the failure of a physical resource, such as hardware failure, software error, power outage, or network outage. Currently, most fault tolerance work focuses on resource failure. A service failure occurs when the service provider cannot meet the quality of service specified in the service level agreement. Resource failure usually results in service failure when fault tolerance measures are not taken, but service failure can still occur when physical resources function normally. Other failures of cloud computing systems usually refer to unforeseen failures with natural or man-made causes, such as high-energy particles, cyber-attacks, and human error. Understanding the reasons for failure is crucial to ensuring the reliability and availability of cloud computing services. Figure 2 below gives an overview of the common causes of failure in cloud computing systems.

3.3.1 Hardware failure: Hardware failure refers to system failures caused by the failure of hardware such as hard disks, memory, and other devices. About 50% of all data center failures are caused by hardware. As the size and age of the data center increase, so does the frequency of hard disk failures. Vishwanath et al. show that 78% of hardware failures are caused by disk drives, and that the number of hard drive failures increases rapidly with usage time. Therefore, regular disk
replacement or the use of redundant disk arrays can significantly reduce the chances of disk failure and increase system reliability.

3.3.2 Software failure: As systems and software in cloud computing systems become more complex, software failure has become a major cause of system crashes. Software failures are mainly caused by software design errors, update failures, and potential operating failures due to system reboots. According to research by AppDynamics, software failures cost Fortune 1000 businesses between $1.25 billion and $2.5 billion every year. Sometimes, an unexpected error during a software update or upgrade can cause the entire system to crash. According to the Propitt survey, about 20 percent of restarts fail due to data inconsistencies. Memory leaks, indeterminate threads, data mishandling, storage space fragmentation, and similar problems may also cause other system failures or performance degradation.

3.3.3 Power failure: In cloud computing data centers, about 33 percent of service disruptions are caused by power outages, which can easily occur during a natural disaster or war. In 2012, six of the 27 major power outages at cloud computing data centers were caused by Hurricane Sandy, and all customer service was disrupted. Another major cause of power failure is the failure of uninterruptible power supply (UPS) systems, which causes about 25% of power outages, and a single failure causes a loss of about $1,000.

3.3.4 Network Failure: In distributed computing architectures, especially cloud computing systems, all services are supported by communication networks, and all information is exchanged between servers. Disruption of the underlying network can therefore disrupt cloud computing services. For some cloud-based real-time applications, network performance often plays a key role. Even brief network congestion can cause transmission delays that breach the Service Level Agreement (SLA) provided by the system. Of all service failures, some are due to network disconnections, and network service interruptions may have physical or logical causes.

3.3.5 Service Failure: In cloud computing systems, service failures can occur whether or not resource failures occur. Bai et al.'s research shows that the occurrence of service failures is closely related to the stage a submitted job has reached. On the one hand, at the job request stage, a job submitted by the user with a specific service requirement is stored in a ready queue. At this stage, resource overload, for example at peak times for service requests, may cause service requests to time out, and the user will not be able to access the service. In this case, although the system's underlying resources are working well, they may not be able to accommodate all requests, leading to service failures. On the other hand, at the job execution stage, the job is submitted to the underlying physical resources, and the service may therefore be disrupted due to resource constraints.

3.3.6 Soft Errors: With the continuous development of CMOS technology and the continual scaling of processor voltage, soft errors have become an important concern in modern computer systems. Due to high-energy particles, noise, and hardware aging, transient and intermittent faults can occur in hardware circuits, which can lead to soft errors. As soft errors spread throughout the system, they can manifest themselves as various forms of system failure, such as incorrect output or a system crash. At the cloud computing system level, this problem can get worse. However, system designers can mitigate the effects of soft errors on services through error prediction and fault tolerance.

3.3.7 Cyber Attack: In recent years, network attacks have become one of the main reasons for the rapid growth of data center failures. According to a report released by the Ponemon Institute, companies have greatly improved their cyber resilience since 2015: the percentage of companies that have achieved a high level of cyber resilience increased from 35% in 2015 to 53% in 2020, a relative increase of 51%. Companies are feeling more confident despite the increase in the volume and severity of attacks in the last 12 months, reported by 67% and 64% of respondents, respectively.

3.3.8 Human Error: Like cyberattacks, human factors account for a significant proportion of cloud computing system failures (22%), with the average cost of human-induced failures being $489. Zhao et al.'s research indicates that inexperience and operational error are the main causes of human error, and that human error was responsible for a significant proportion of failures in the early days of cloud computing infrastructure. Therefore, as cloud computing system managers gain experience, the likelihood of human mistakes decreases.

4. Reliability evaluation of cloud computing systems

To ensure the reliability of cloud computing systems, researchers need to design appropriate methods and techniques to mitigate the effects of system failures. Before designing a particular approach, researchers must first determine evaluation criteria for system reliability based on actual requirements. This section introduces three ways of evaluating system reliability.

4.1 Time indicator for measuring system reliability

In general, for cloud computing service providers, highly reliable cloud computing systems are characterized by a small number of failures, long working hours, and short repair times after failures. According to Alavian et al. (2019), there are three time indicators for measuring the reliability of cloud computing systems that reflect the system's ability to maintain normal activity over a period.

Mean-time-to-failure (MTTF): MTTF is the average time between the start of normal operation and the occurrence of a failure, that is, the average time the system runs without failure. The longer the MTTF, the higher the reliability of the system.

Mean-time-to-repair (MTTR): MTTR is the average interval from the time the system fails to the time the repair is complete and the system can work again. The shorter the MTTR, the better the recovery performance of the system.

Mean-time-between-failures (MTBF): MTBF is the average interval between two
adjacent failures. The greater the MTBF, the higher the reliability of the system and the better it sustains correct operation.

4.2 Reliability of cloud service system

A cloud service system is designed as a cloud management system (CMS), which performs four different functions: 1) managing the request queue consisting of job requests from different users, 2) managing computing resources on the Internet (e.g. personal computers, clusters, supercomputers), 3) managing data resources on the network (e.g. databases, web pages), and 4) assigning subtasks to multiple computing resources that can access the data resources. When a user requests a cloud service, the CMS first uses workflows to describe the subtasks that the cloud service contains, the data resources that the subtasks require, and the dependencies between the resources, and then distributes the subtasks to each computing resource node that can access the data resources.

In the process of managing cloud services, various factors can affect the reliability of cloud services, including request queue overflow, request timeout, data resource loss, loss of computing resources, software failure, database failure, hardware failure, and network failure. According to Zhou et al., evaluating reliability in cloud computing is not an easy task, and they present a tool that helps evaluate and improve cloud service reliability, named FTCloudSim (Fat-Tree cloud simulation). FTCloudSim is a cloud simulation-based tool that provides an extensible mechanism for enhancing cloud service reliability. FTCloudSim handles failures with a checkpoint process.

FTCloudSim simulates the cloud computing system in 5 steps: 1) fat-tree data center network construction, 2) failure and repair event triggering, 3) checkpoint image generation and storage, 4) checkpoint-based cloudlet recovery, and 5) result generation. The results are reported with three metrics. The first measures reliability enhancement and includes total execution time and average lost time. The second is network resource usage, measured as the checkpoint image data transferred by all available switches. The last is the disk usage for storing checkpoint images.

4.3 Reliability of cloud resource systems

In a cloud computing environment, service providers need to manage a variety of resource components such as processors, memory modules, storage units, and network switches. The more components there are, the more likely the cloud computing system is to fail. If service providers understand the failure characteristics of different resource components, they can better manage computing resources to make their systems fault-tolerant and provide high-performance services. The reliability of cloud computing systems is studied from the interrelationship between software component failures (including process, hypervisor, and management program failures) and hardware component (server node) failures.

To allocate cloud computing resources efficiently and reliably, scheduling in a cloud computing environment becomes a highly complex task, because many alternative computing resources with different capabilities are available in the market. Effective task scheduling methods can fulfill the requirements of users and enhance resource usage. According to recent research (M., 2021), cloud service providers simultaneously receive huge numbers of computing requests from users with different needs and preferences. The requested tasks differ from one another: some require less cost and fewer computing resources, while others require higher computing capacity, more bandwidth, and more computing resources. When service providers receive tasks from users, the tasks can be compared pairwise using comparison matrix techniques. Service providers negotiate work requirements with users, including network bandwidth, completion time, task costs, and task reliability. In a cloud computing environment, computing and storage resources can be allocated to the task at hand according to each job's requirements.

5. Cloud computing reliability challenge

Currently, many new applications have been created and deployed on cloud computing, and more and more applications are shifting from traditional computing platforms to cloud-based platforms. In the future, with the further development of science and technology, cloud computing will become more widespread. However, because cloud computing is built on virtualization technology and features multi-tenancy, large scale, and complex architecture, it is very difficult to properly manage software and hardware resources in a cloud computing system. At the same time, the development of data protection, edge computing, and other technologies, and their deep integration with cloud computing, has also brought many challenges to the reliability of cloud computing. Taken together, future research on cloud computing reliability could continue in five areas:

5.1 Service reliability problems of virtualization technology in case of hardware resource failure. Virtualization technology is the key to the realization of cloud computing. Virtualization failure and resource contention are two main problems in computing systems, and both increase response time. Improving the reliability of cloud computing systems can reduce such problems for real-time applications such as video broadcasting and video conferencing, and can thereby reduce delays in data transfer (Gill and Buyya, 2020).

5.2 Data security problems due to system reliability. With the rapid growth of the cloud computing market, users' reliance on cloud computing is increasing and more and more users are storing personal data in the cloud. Vulnerabilities in a cloud computing system allow attackers to access other people's personal data, and network intrusions can cause widespread security issues and hidden threats to services. Thus, how to improve the reliability of cloud computing systems to ensure user data protection has become a major challenge for cloud computing service providers (Wenxue et al., 2019).
5.3 Reliability leads to increased service costs. To improve the reliability of services provided by cloud computing systems, many service providers use redundant resources to improve service fault tolerance. While adding redundant resources can significantly improve the reliability of services, it also increases the cost of services accordingly. This not only increases the cost for users of cloud computing services but can also reduce the rate of return for service providers. Thus, when studying how to improve system reliability, designing a reasonable resource allocation strategy to reduce the cost of using cloud computing services is also a problem that future researchers will need to consider (Wenxue et al., 2019).

computing. This paper provides some reliability evaluation of cloud computing. In addition, how to solve the problem of cost growth and service performance degradation caused by improving system reliability, and how to find and prevent the potential information security problems brought about by improving system reliability, will be great challenges. At the same time, how to avoid reducing the reliability of application services when hardware resources fail is also a challenge. In the future, with the deep integration of cloud computing with edge computing, the internet of things, and other emerging technologies, cloud computing reliability will face more and greater challenges.
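As a worked illustration of the time indicators in Section 4.1, the component-count observation in Section 4.3, and the redundancy cost trade-off in Section 5.3, the following minimal Python sketch ties the three together. It rests on assumptions that are not taken from the cited sources: failures are independent, MTBF is treated as MTTF + MTTR for a repairable unit, availability is approximated by the steady-state ratio MTTF / (MTTF + MTTR), and cost grows linearly with the number of replicas. All function names and numbers are hypothetical.

```python
from math import prod
from typing import Iterable, Optional


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time a repairable unit is up."""
    return mttf_hours / (mttf_hours + mttr_hours)


def mtbf(mttf_hours: float, mttr_hours: float) -> float:
    """Interval between adjacent failures: one failure-free run plus one repair."""
    return mttf_hours + mttr_hours


def series(availabilities: Iterable[float]) -> float:
    """Every component is required, so the service is up only if all are up."""
    return prod(availabilities)


def parallel(unit_availability: float, replicas: int) -> float:
    """Redundant replicas: the service is down only if every replica is down."""
    return 1.0 - (1.0 - unit_availability) ** replicas


def min_replicas(unit_availability: float, target: float,
                 max_replicas: int = 20) -> Optional[int]:
    """Smallest replica count reaching the target, or None if max_replicas is too few."""
    for n in range(1, max_replicas + 1):
        if parallel(unit_availability, n) >= target:
            return n
    return None


if __name__ == "__main__":
    # Hypothetical node: runs 950 h between failures on average, 50 h to repair.
    node = availability(mttf_hours=950.0, mttr_hours=50.0)
    print(f"MTBF = {mtbf(950.0, 50.0):.0f} h, availability = {node:.4f}")

    # Section 4.3: the more required components, the lower the system availability.
    for count in (1, 5, 10):
        print(f"{count:2d} nodes in series -> {series([node] * count):.4f}")

    # Section 5.3: redundancy raises availability, and cost rises with it.
    unit_cost = 100.0  # hypothetical cost per replica
    for target in (0.99, 0.999, 0.9999):
        n = min_replicas(node, target)
        print(f"target {target}: {n} replicas, cost {n * unit_cost:.0f}")
```

Under these assumptions, each additional required component lowers the availability of the whole chain, while each additional replica raises it at a proportional cost, which is the tension that a resource allocation strategy has to balance.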
https://appdevelopermagazine.com/2371/2015/2/11/New-AppDynamics-Study-Shows-Critical-Failures-Can-Cost-$1-Million-Per-Hour/
[11] F. (2013, February 27). Lessons Learned from Recent Cloud Outages. Flexera Blog. https://www.flexera.com/blog/cloud/lessons-learned-from-recent-cloud-outages/
[12] Bai, Y., Zhang, H., & Fu, Y. (2016). Reliability modeling and analysis of cloud service based on complex network. 2016 Prognostics and System Health Management Conference (PHM-Chengdu). https://doi.org/10.1109/phm.2016.7819907
[13] Bai, Y., Zhang, H., & Fu, Y. (2016). Reliability modeling and analysis of cloud service based on complex network. 2016 Prognostics and System Health Management Conference (PHM-Chengdu). https://doi.org/10.1109/phm.2016.7819907
[14] Ponemon, L. (2020, July 7). The 2020 Cyber Resilient Organization: Preparation and Technology Differentiate High Performers. Security Intelligence. https://securityintelligence.com/posts/2020-cyber-resilient-organization-preparation-technology-differentiate-high-performers/
[15] Zhao, E., & Wu, C. (2021). Long-term safety assessment of large-scale arch dam based on non-probabilistic reliability analysis. Structures, 32, 298–312. https://doi.org/10.1016/j.istruc.2021.03.012
[16] Zhou, A., Wang, S., Yang, C., Sun, L., Sun, Q., & Yang, F. (2015). FTCloudSim: support for cloud service reliability enhancement simulation. International Journal of Web and Grid Services, 11(4), 347. https://doi.org/10.1504/ijwgs.2015.072804
[17] M., B. K. (2021). Hybrid Evolutionary Algorithm based Task Scheduling Mechanism for Resource Allocation in Cloud Environment. Revista Gestão Inovação e Tecnologias, 11(4), 194–209. https://doi.org/10.47059/revistageintec.v11i4.2101
[18] Gill, S. S., & Buyya, R. (2020). Failure Management for Reliable Cloud Computing: A Taxonomy, Model, and Future Directions. Computing in Science & Engineering, 22(3), 52–63. https://doi.org/10.1109/mcse.2018.2873866
[19] Botta, A., de Donato, W., Persico, V., & Pescapé, A. (2016). Integration of Cloud computing and Internet of Things: A survey. Future Generation Computer Systems, 56, 684–700. https://doi.org/10.1016/j.future.2015.09.021
[20] Alazie Dagnaw, G., & Ebabye Tsige, S. (2019). Challenges and Opportunities of Cloud Computing in Social Network; Survey. Internet of Things and Cloud Computing, 7(3), 73. https://doi.org/10.11648/j.iotcc.20190703.13
[21] Duan Wenxue, Hu Ming, Zhou Qiong, Wu Tingming, Zhou Junlong, Liu Xiao, Wei Tongquan, & Chen Mingsong (2020). Reliability in Cloud Computing System: A Review. Journal of Computer Research and Development, 57(1), 102–123. https://crad.ict.ac.cn/EN/Y2020/V57/I1/102
[22] Yu, X. X., & Bian, J. (2017). Reliability Analysis of Cloud Computing Service System. DEStech Transactions on Computer Science and Engineering, aice-ncs. https://doi.org/10.12783/dtcse/aice-ncs2016/5746
[24] Alazie Dagnaw, G., & Ebabye Tsige, S. (2019). Challenges and Opportunities of Cloud Computing in Social Network; Survey. Internet of Things and Cloud Computing, 7(3), 73. https://doi.org/10.11648/j.iotcc.20190703.13
[25] Alavian, P., Eun, Y., Liu, K., Meerkov, S. M., & Zhang, L. (2019). The (α, β)-Precise Estimates of MTBF and MTTR: Definitions, Calculations, and Induced Effect on Machine Efficiency Evaluation. IFAC-PapersOnLine, 52(13), 1004–1009. https://doi.org/10.1016/j.ifacol.2019.11.326