Project FInal Report

ABSTRACT
With big data growth in biomedical and healthcare communities, accurate analysis of medical data
benefits early disease detection, patient care and community services. However, the analysis
accuracy is reduced when the quality of medical data is incomplete. Moreover, different regions
exhibit unique characteristics of certain regional diseases, which may weaken the prediction of
disease outbreaks. In this paper, we streamline machine learning algorithms for effective prediction
of chronic disease outbreak in disease-frequent communities. We experiment the modified prediction
models over real-life hospital data collected from central China in 2013-2015. To overcome the
difficulty of incomplete data, we use a latent factor model to reconstruct the missing data. We
experiment on a regional chronic disease of cerebral infarction. We propose a new convolutional
neural network based multimodal disease risk prediction (CNN-MDRP) algorithm using structured
and unstructured data from hospital. To the best of our knowledge, none of the existing work focused
on both data types in the area of medical big data analytics. Compared to several typical prediction
algorithms, the prediction accuracy of our proposed algorithm reaches 94.8% with a convergence
speed which is faster than that of the CNN-based unimodal disease risk prediction (CNN-UDRP)
algorithm.
1
1. INTRODUCTION
This chapter introduces BigData briefly. It includes definition, attributes and formats of BigData.
Brief introduction to Healthcare applications is also included.
1.1. INTRODUCTION TO BIGDATA

“Big data” is a term used to describe a collection of data sets with the following three characteristics:
 Volume- Large amounts of data generated.
 Velocity-Frequency and speed of which data are generated, captured and shared
 Variety-Diversity of data types and formats from various sources.
The size and complexity of big data makes it difficult to use traditional database management and
data processing tools. Data is being created in much shorter cycles from hours to milliseconds. There
is also a trend underway to create larger databases by combining smaller data sets so that data
correlations can be discovered.
Fig 1.1. Pictorial representation of BigData
Big data has become the new frontier of information management given the amount of data today’s
systems are generating and consuming. It has driven the need for technological infrastructure and
tools that can capture, store, analyze and visualize vast amounts of disparate structured and
unstructured data. These data are being generated at increasing volumes from data intensive
2
technologies including, but not limited to, the use of the Internet for activities such as accesses to
information, social networking, mobile computing and commerce. Corporations and governments
have begun to recognize that there are unexploited opportunities to improve their enterprises that can
be discovered from these data.
1.2. CHARACTERISTICS OF BIGDATA
1. Velocity
Velocity refers to the speed at which vast amounts of data are being generated, collected and
analyzed. Every day the number of emails, twitter messages, photos, video clips, etc. increases at
lighting speeds around the world. Every second of every day data is increasing. Not only must it be
analyzed, but the speed of transmission, and access to the data must also remain instantaneous to
allow for real-time access to website, credit card verification and instant messaging. Big data
technology allows us now to analyze the data while it is being generated, without ever putting it into
databases.
2. Volume
Volume refers to the incredible amounts of data generated each second from social media, cell
phones, cars, credit cards, M2M sensors, photographs, video, etc. The vast amounts of data have
become so large in fact that we can no longer store and analyze data using traditional database
technology. We now use distributed systems, where parts of the data is stored in different locations
and brought together by software. With just Facebook alone there are 10 billion messages, 4.5
billion times that the “like” button is pressed, and over 350 million new pictures are uploaded every
day. Collecting and analyzing this data is clearly an engineering challenge of immensely vast
proportions.
3. Value
When we talk about value, we’re referring to the worth of the data being extracted. Having endless
amounts of data is one thing, but unless it can be turned into value it is useless. While there is a clear
link between data and insights, this does not always mean there is value in BigData. The most
important part of embarking on a big data initiative is to understand the costs and benefits of
collecting and analyzing the data to ensure that ultimately the data that is reaped can be monetized.
3
4. Variety
Variety is defined as the different types of data we can now use. Data today looks very different than
data from the past. We no longer just have structured data (name, phone number, address, financials,
etc.) that fits nice and neatly into a data table. Today’s data is unstructured. In fact, 80% of all the
world’s data fits into this category, including photos, video sequences, social media updates, etc.
New and innovative big data technology is now allowing structured and unstructured data to be
harvested, stored, and used simultaneously.
5. Veracity
Last, but certainly not least there is veracity. Veracity is the quality or trustworthiness of the data.
Just how accurate is all this data? For example, think about all the Twitter posts with hash tags,
abbreviations, typos, etc., and the reliability and accuracy of all that content. Gleaning loads and
loads of data is of no use if the quality or trustworthiness is not accurate. Another good example of
this relates to the use of GPS data. Often the GPS will “drift” off course as you peruse through an
urban area. Satellite signals are lost as they bounce off tall buildings or other structures. When this
happens, location data has to be fused with another data source like road data, or data from an
accelerometer to provide accurate data.
Fig 1.2. Characteristics of BigData
1.3. BIG DATA ANALYTICS
Big data has become the new frontier of information management given the amount of data today’s
systems are generating and consuming. It has driven the need for technological infrastructure and
tools that can capture, store, analyze and visualize vast amounts of disparate structured and
unstructured data. These data are being generated at increasing volumes from data intensive
4
technologies including, but not limited to, the use of the Internet for activities such as accesses to
information, social networking, mobile computing and commerce. Corporations and governments
have begun to recognize that there are unexploited opportunities to improve their enterprises that can
be discovered from these data.
Analytics when applied in the context of big data is the process of examining large amounts of data,
from a variety of data sources and in different formats, to deliver insights that can enable decisions in
real or near real time. Various analytical concepts such as data mining, natural language processing,
artificial intelligence and predictive analytics can be employed to analyze, contextualize and
visualize the data. Big data analytical approaches can be employed to recognize inherent patterns,
correlations and anomalies which can be discovered as a result of integrating vast amounts of data
from different data sets.
Fig 1.3.BigData Analytics
Big data analytics requires the use of new frameworks, technologies and processes to manage it. Yet
its arrival in the enterprise software space has created some confusion as business leaders try to
understand the differences between it and traditional data warehousing (DW) and business
intelligence (BI) tools.
There are important distinctions and sufficient differentiating value between BDA and DW/BI
systems which make BDA unique.
Gartner defines a data warehouse as “a storage architecture designed to hold data extracted from
transaction systems, operational data stores and external sources. The warehouse then combines that
data in an aggregate, summary form suitable for enterprise-wide data analysis and reporting for
predefined business needs.”
5
Forrester Research has defined business intelligence as “a set of methodologies, processes,
architectures, and technologies that transform raw data into meaningful and useful information used
to enable more effective strategic, tactical, and operational insights and decision-making."
BDA solutions will not replace DW/BI, rather they will co-exist side-by-side to unlock hidden value
in the massive amount of data that exists within and outside the enterprise.
BDA functions are unique because they:
 Handle open ended "how and why" type questions whereas BI tools are designed to query
specific "what and where".
 Process unstructured data to find patterns, whereas DW systems process structured and
mostly aggregated data.
The term “Analytics” refers to the logic and algorithms, both deduction and inference, performed on
BD to derive value, insights and knowledge from it. Analytical methods such as data mining, natural
language processing, artificial intelligence and predictive analytics are employed to analyze,
contextualize and visualize the data. These computerized analytical methods recognize inherent
patterns, correlations and anomalies which are discovered as a result of integrating vast amounts of
data from different datasets. Together, the term “Big Data Analytics” represents, across all
industries, new data-driven insights which are being used for competitive advantage over peer
organizations to more effectively market products and services to targeted consumers. Examples
include real-time purchasing patterns and recommendations back to consumers, and gaining better
understandings and insights into consumer preferences and perspectives through affinity to certain
social groups.
The origin of BDA comes from web-based search engines such as Google and Yahoo, the popularity
of social media and social networking services such as Facebook and Twitter, and data-generating
sensors, telehealth and mobile devices. All have increased and generated new data and opportunities
for new insights on customer behaviors and trends. While BDA frameworks have been in operation
since 2005, they have just recently moved into other industries and sectors including financial
services firms and banks, online retailers and health care.
6
BIG DATA COMPUTING
The rising importance of big-data computing stems from advances in many different technologies.
Sensors: Digital data are being generated by many different sources, including digital imagers
(telescopes, video cameras, MRI machines), chemical and biological sensors), and even the millions
of individuals and organizations generating web pages. Computer networks: Data from the many
different sources can be collected into massive data sets via localized sensor networks, as well as the
Internet.
Fig 1.4. BigData Computing
Data storage: Advances in magnetic disk technology have dramatically decreased the cost of storing
data. For example, a one-terabyte disk drive, holding one trillion bytes of data, costs around $100. As
a reference, it is estimated that if all of the text in all of the books in the Library of Congress could be
converted to digital form, it would add up to only around 20 terabytes.
Cluster computer systems: A new form of computer systems, consisting of thousands of "nodes,"
each having several processors and disks, connected by high-speed local-area networks, has become
the chosen hardware configuration for data-intensive computing systems. These clusters provide both
the storage capacity for large data sets, and the computing power to organize the data, to analyze it,
and to respond to queries about the data from remote users. Compared with traditional high-
performance computing (e.g., supercomputers), where the focus is on maximizing the raw computing
power of a system, cluster computers are designed to maximize the reliability and efficiency with
which they can manage and analyze very large data sets. The "trick" is in the software algorithms –
cluster computer systems are composed of huge numbers of cheap commodity hardware parts, with
scalability, reliability, and programmability achieved by new software paradigms.
7
Cloud computing facilities: The rise of large data centers and cluster computers has created a new
business model, where businesses and individuals can rent storage and computing capacity, rather
than making the large capital investments needed to construct and provision large-scale computer
installations. For example, Amazon Web Services (AWS) provides both network-accessible storage
priced by the gigabyte-month and computing cycles priced by the CPU-hour. Just as few
organizations operate their own power plants, we can foresee an era where data storage and
computing become utilities that are ubiquitously available.
1.4. BIG DATA ANALYTICS IN CLOUD ENVIRONMENT
Most corporate enterprises face significant challenges in fully leveraging their data. Frequently, data
is locked away in multiple databases and processing systems throughout the enterprise, and the
questions customers and analysts ask require an aggregate view of all data, sometimes totaling
hundreds of terabytes.
Cerri et al proposed ‘Knowledge in the cloud’ in place of ‘data in the cloud’ to support collaborative
tasks which are computationally intensive and facilitate distributed, heterogeneous knowledge. This
is termed as “Utility Computing” derived from required data in and out of Cloud the utilities like
electricity, gas for which we only pay for what we use from a shared resource. With the growing
interest in cloud, analytics is a challenging task. In general, Business Intelligence applications such
as image processing, web searches, understanding customers and their buying habits, supply chains
and ranking and Bio-informatics (e.g. gene structure prediction) are data intensive applications.
Cloud can be a perfect match for handling such analytical services. For example, Google’s
MapReduce can be leveraged for analytics as it intelligently chunks the data into smaller storage
units and distributes the computation among low-cost processing units. Several research teams have
started working on creating Analytic frameworks and engines which help them provide Analytics as
a Service. For example, Zementis launched the ADAPA predictive analytics decision engine on
Amazon EC2, allowing its users to deploy, integrate, and execute statistical scoring models like
neural networks, support vector machine (SVM), decision tree, and various regression models.
Booz Allen’s IT professionals, equipped with extensive expertise in the application of cloud
computing technology has described a way for setting a course for mastering your big data. Cloud
technology combines the best practices of virtualization, grid computing, utility computing, and web
technologies. The result is a technology that inherits the agility of virtualization, the scalability of
grid computing, and simplicity of Web 2.0. Cloud computing is an evolutionary step in computing
that unifies the resources of many computers to function as one entity, allowing the construction of
8
massively scalable systems that can take in and store, process and analyze all of your enterprise’s
data. The definitive application of cloud technology is as a large-scale data storage, development and
processing system, allowing your enterprise to master big data. But the agility of cloud computing
has applications beyond effective use of data. Because all data is now maintained in a centralized
system, we can help develop and implement a centralized security policy that can be easily enforced,
allowing precise and well-documented control of sensitive data. In addition, the cloud provides an
environment in which to prototype, test, and deploy new applications in a fraction of the time and
cost of traditional systems.
The benefits continue to accrue as your “cloud” grows. As more datasets are aggregated, the cloud
gains a critical mass of data across an enterprise, becoming “the place” to put data. As each dataset is
added, and potentially analyzed with the other datasets, there is an exponential increase in benefit to
the enterprise. We can enable your enterprise with simplified programming and data models, which,
combined with easy access to a wide range of data, results in an explosion of innovation from across
your enterprise in the form of data mashups, data-mining applications. Few decades back, the
problem was the shortage in information or data. In recent past, this problem has been overcome with
the advent of Internet and reduced Storage Memory cost. But a new challenge is how to analyze the
data. Data is getting generated at a much faster pace than the speed at which it can be processed with
the current infrastructure. Huge and dedicated servers were developed to solve this problem. But the
problem is with the cost of such an infrastructure which is not affordable to all the companies for
Availability of data and accessing each and every specific purpose. So today, these companies are
looking it is the key success factor for Cloud computing which makes feasible for all these
companies to hire on a temporary basis, the computational power and storage space value-based
analytics. Migrating for a specific purpose.
MOVING BIG DATA INTO CLOUD
Big Data is a data analysis methodology enabled by recent advances in technologies and architecture.
However, big data entails a huge commitment of hardware and processing resources, making
adoption costs of big data technology prohibitive to small and medium sized businesses. Cloud
computing offers the promise of big data implementation to small and medium sized businesses.
Big Data processing is performed through a programming paradigm known as MapReduce.
Typically, implementation of the MapReduce paradigm requires networked attached storage and
9
parallel processing. The computing needs of MapReduce programming are often beyond what small
and medium sized business are able to commit.
Fig 1.5. Moving BigData into cloud
The defacto approach of hard drive shipping is not flexible or secure. This work studies timely, cost-
minimizing upload of massive, dynamically-generated, geo-dispersed data into the cloud, for
processing using a Map Reduce-like framework.
Targeting at a cloud encompassing disparate data centers, we model a cost-minimizing data

migration problem, and propose two online algorithms: an online lazy migration (OLM) algorithm
and a randomized fixed horizon control (RFHC) algorithm, for optimizing at any given time the
choice of the data center for data aggregation and processing, as well as the routes for transmitting
data there. Careful comparisons among these online and offline algorithms in realistic settings are
conducted through extensive experiments, which demonstrate close-to-offline-optimum performance
of the online algorithms.
KEY TECHNOLOGIES FOR EXTRACTING BUSINESS VALUE FROM BIG DATA
Big data technologies describe a new generation of technologies and architectures, designed to
economically extract value from very large volumes of a wide variety of data by enabling high-
velocity capture, discovery and/or analysis. Storage and processing technologies have been designed
specifically for large data volumes. Computing models such as parallel processing, clustering,
virtualization, grid environments and cloud computing, coupled with high-speed connectivity, have
redefined what is possible. Here are three key technologies that can help you get a handle on big data
– and even more importantly, extract meaningful business value from it. Information management
for big data: Manage data as a strategic, core asset, with ongoing process control for big data
10
analytics. High-performance analytics for big data: Gain rapid insights from big data and the ability
to solve increasingly complex problems using more data. Big data will also intensify the need for
data quality and governance, for embedding analytics into operational systems, and for issues of
security, privacy and regulatory compliance. Everything that was problematic before will just grow
larger. Unified data management capabilities, including data governance, data integration, data
quality and metadata management. Complete analytics management, including model management,
model deployment, monitoring and governance of the analytics information asset. Effective decision
management capabilities to easily embed information and analytical results directly into business
processes while managing the necessary business rules, workflow and event logic.
11
2. LITERATURE SURVEY
1) Enabling real-time information service on telehealth system over cloud-based big data
platform
Authors: J. Wang, M. Qiu, and B. Guo
Abstract
A telehealth system covers both clinical and nonclinical uses, which not only provides store-and-
forward data services to be offline studied by relevant specialists, but also monitors the real-time
physiological data through ubiquitous sensors to support remote telemedicine. However, the current
telehealth systems don’t consider the velocity and veracity of the big–data system in the medical
context. Emergency events generate a large amount of the real-time data, which should be stored in
the data center, and forwarded to remote hospitals. Furthermore, patients’ information is scattered on
the distributed data center, which cannot provide a high-efficient remote real-time service. In this
paper, we proposes a probability-based bandwidth model in a telehealth cloud system, which helps
cloud broker to provide a high performance allocation of computing nodes and links. This brokering
mechanism considers the location protocol of Personal Health Record (PHR) in cloud and schedules
the real–time signals with a low information transfer between different hosts. The broker uses several
bandwidth evaluating methods to predict the near future usage of bandwidth in a telehealth context.
The simulation results show that our model is effective at determining the best performing service,
and the inserted service validates the utility of approach.
Introduction
Global demographic trends to demonstrate a clear rise in the proportion of elderly and chronically ill
individuals. At the same time, the strong demand for various medical and public health care services
from customer needs the creation of powerful individual–oriented personalized health care service
systems. Fortunately, over the recent years, the development of wireless sensor network and cloud
computing provides an important solution for delivering patient’s Personal Health Record (PHR) and
real–time physiological information. In order to collect the physiological information of patient
bodies, ubiquitous Body Sensor Network (BSN) is widely used in many remote health care
applications, such as wireless wearable electrocardiography (ECG). Furthermore, based on the
scalable computing capability of Data-Center (DC), it is possible to provide the remote diagnosis and
medical services to public with high–privacy policy. Current PHR services, such as Microsoft
HealthVault and Google Health, provide a foundation of adoption decisions and serve as a starting
point of requirements analysis for more complicated telehealth systems.
12
Fig 2.1.1. The architecture of the cloud platform.
The current telehealth systems have two orthogonal rules to be classified: uses and modes. In order
to support the activities of medical and patient, telehealth system should provide both clinical uses
and nonclinical uses. For clinical uses system, the teleconference between patient and healthcare
provider from some cloud services, e.g., transmission of medical images for diagnosis, individuals
exchanging health services, and health advice guided in emergent cases (referred to as tele triage).
For nonclinical uses system, some distance educating and information managing tasks are performed
based on telehealth platforms, e.g., meetings between supervision and presentations, healthcare
system integration, and medical/patient education. However, the above uses are the viewpoint of the
users (medical and patient), which cannot reveal the key techniques invoked when implementing.
Thus, we divide it into store-and-forward mode and real-time mode. For store-and-forward mode,
digital image, video, audio, observations of daily living and clinical data are captured and stored on
cloud or mobile devices; then at a convenient time, they are securely forwarded to a clinic at another
location, where they are studied by relevant specialists. For real-time mode, a telecommunications
link allows instantaneous interaction. Some typical real-time clinical tele health include tele
audiology, tele cardiology, tele dentistry, and tele behavioral health. Based on the BSN and PHR
technologies, DCs and cloud computing are supporting the information storage and delivering
services for medicals and patients. For example, the high level medical services, such as disease
early warming and preliminary decision making, could be implemented on cloud end or hospital end.
In addition, the QoS and data security can be both guaranteed with cloud, where, privacy data are
redundantly stored with encryption algorithms. All of these applications constitute a big–data system
to serve the remote individuals. The efficiency of big–data within neonatal intensive care units is
reported to have a great potential to support new wave of clinical discovery, leading to earlier
detection and prevention of wide range of deadly medical condition. In a distributed data coherence
system, when source node is executing a service, it might continually fetch the latest PHR
13
information from target nodes. As a distributed system, telehealth services meet two kernel problems
for data coherence between remote nodes. Firstly, the computing source nodes may not be the home
node of the PHR, which means locating data is a slow procedure by inquiring to the home node.
Secondly, the coherence protocol fails to guarantee the data synchronization of multiple copies of
one PHR in the distributed system, which cannot help its cache system to increase the performance
and hit-rate. The traditional memory solution may lead to significant increasing of the queue delay
time of the waiting jobs with normal memory requirement, which slows down the execution of each
job and decreases the system throughput. In this work, we adopt the store-and-forward and real-time
modes as the basic division to the telehealth services. The former includes all medical applications
that have to refer the patient’s personal information and disease history. Indeed, the PHR information
could reside in any location of the cloud. The later always collect the patient’s current physiological
signal, such as ECG, from local BSN, and transfer the information to cloud. Hence, one copy of the
real–time signal is given to the remote medical for monitoring through external network, and another
copy, as the historical record, should be saved in the host DC of the patient. In a distributed
computing and storing system, cloud broker takes charge of both the service request from remote
BSN and the resources allocation on cloud. As one of the most important components, broker
introduces the incoming cloud consumers to health service providers so that the resource could be
utilized with a higher efficiency and balance. Unfortunately, the most of distributed brokers only
monitor the resource information, such as utilization of CPU, memory, bandwidth, and disk space in
each node. The location information of the multiple PHR copies is transparent to brokers, which
blocks brokers to make a wise scheduling to allocate computing task near the available PHRs. This
invisible of PHR information results in that broker cannot choose the high–performance node and
link for the PHR data transferring between the home and target node. Traditional data coherence
protocol on application layer, such as MESI protocol, also ignores the real location of specific PHR
data, which also failed to suggest broker which node should be chosen. The work in this paper aims
to adding a content–aware service to cloud broker so that the location of PHR copies can help broker
to allocate proper bandwidth to DCs with high predictability and networking efficiency. Our main
contributions are: 1. We design a data coherence protocol for the distributed system, where the PHRs
can be stored and accessed by distributed data centers with a high performance. 2. Based on the
protocol, a flow estimating algorithm is proposed to evaluate the bandwidth consumption on each
data center by the real–time signal from the remote BSNs. 3. We design several predicting methods
for the future bandwidth consumption, based on which, bandwidth allocating algorithms can be
applied to various applications with different features in the bandwidth leasing system.
Related Work
14
To provide cloud–based health care system has triggered considerable researches both in industry
and academia in these years. In October 2011, Microsoft released a new version Web– based
platform, HealthVault, to help users to gather, store, use, and share health information for them and
their family. The authors described the prototype for remote health care monitoring, and considered
usability, power supplies, and expensive issues in the context of localized multi–sensory wearable
networks and presented a method to generate low–power sampling schedules which were resilient to
sensor faults while achieving high diagnostic fidelity. Paper provided an approach involving
telehealth applications, many of which were based on sensor technologies for unobtrusive
monitoring. The importance of these applications was considerable in light of the global
demographic trends and the resultant rose in the occurrence of injurious falls and the decrease of
physical activity. The authors presented PHISP (a Public-oriented Health care Information Service
Platform) which supported numerous health care tasks, provided individuals with many intelligent
and personalized services, and supported basic remote health care and guardianship. Paper proposed
on–line dynamic resource allocation algorithms for the IaaS cloud system with pre-emptible tasks,
which adjusted the resource allocation dynamically based on the updated information of the actual
task executions.
Fig. 2.1.2. The work flow of the cloud system for telehealth
To provide a security representation of PHR, XML schema is adopted in work. Also, the PHR
representation supports some data mining techniques for further clinical studies. The previous
solution cannot strain there performance with the BSN workload because they invoked negative
response at the serving end. Despite sensor network in the area of telehealth service, much efforts
have specifically focused on the traffic load balance for power, storage space and energy saving
which introduces some control structures in the process models for cloud. In the work of, each virtual
15
node itself decided whether to replicate, migrate or suicide by weighing up the pros and cons, which
was based on the evaluation of traffic load of all nodes, and selected among physical nodes with the
most traffic to replicate or migrate on. After that, it takes into account blocking probability to achieve
quicker response and better load balance performance. the authors presented the real– time
virtualised cloud infrastructure that was developed in the context of the IRMOS European Project.
The work proposed an interface design of a low-power programmable system on chip for intelligent
wireless sensor nodes to reduce the overall power consumption of the heart disease monitoring
system. Those work didn’t provide service with the requirement quantities, and no model was
applied within them to guide a proper performance configuration.
Conclusion:
It design a distributed data managing framework for telehealth system, which includes BSN, cloud
system, and remote hospital end. By analyzing the features of data processing with medical
applications, we provide a decentralized data coherence protocol to solve the performance problems
by current design. Our model measures the bandwidth consumption between any node pair in cloud
so that the bandwidth can be calculated in each interval. The experimental results show that the
bandwidth predicting error is limited in 10%, which provides cloud with a flexible methods (4 types
of predicting algorithms) to estimate the bandwidth resources to nodes. Furthermore, a case study
shows that our method is able to support finding the most appropriate bandwidth-estimating
algorithm for underlining telehealth applications.
16
2) Optimal Big Data Sharing Approach for Tele-health in Cloud Computing
Authors: L. Qiu, K. Gai, and M. Qiu
Abstract: The rapid development of tele-health systems have received driving engagements from
various emerging techniques, such as big data and cloud computing. Sharing data among multiple
tele-health systems is an adaptive approach for improving service quality via the network-based
technologies. However, current implementations of data sharing in cloud computing is still facing the
restrictions caused by the networking capacities and virtual machine switches. In this paper, we focus
on the problem of data sharing obstacles in cloud computing and propose an approach that uses
dynamic programming to produce optimal solutions to data sharing mechanisms. The proposed
approach is called Optimal Telehealth Data Sharing Model (OTDSM), which considers transmission
probabilities, maximizing network capacities, and timing constraints. Our experimental results have
proved the flexibility and adoptability of the proposed method.
Introduction
Recent rapid development of tele-health systems has been enabled by various updated techniques,
such as cloud computing and big data. A broad application of using tele-health systems results in a
great improvement of health care by increasing the existing services or creating new health care
services. One of the benefits of applying tele-health systems is that it provides possibilities for
multiple healthcare organizations and share the data for the purpose of healthcare upgrades. A few
emerging techniques related to distributed data sharing include Internet-of-Things (IoT), Mass
Distributed Storage (MDS), Data Deduplication (DD), and Backup Storage Compression (BSC).
Fig 2.2.1. High level architecture of Optimal Tele-health Data Sharing model.
However, contemporary implementations of cloud data sharing is dealing with the restrictions that
are caused by the networking capacities and limitations of Virtual Machine (VM) switches. The main
problem is that the data sharing capabilities are usually restricted by the performance of the networks
when a cloud-based solution is applied. The amount of the telehealth sensors is dramatically
17
increasing for a wide data collection. The demands of the data collections are various as well. For
instance, real-time data collections require a higher-level guarantee for data transmissions but those
data for statistical purposes can adapt to a lower performance of networking connections. The
distinct demands in telehealth systems lead to the needs of the dynamic operation manner. Moreover,
the networking traffics can strongly impact on healthcare applications, since it is tightly associated
with the networking capacities. The distributed data storage supporting multiple applications requires
a stable and efficient network connections for carrying real-time services. The impact of the
networking capacity grows up when the size of the data increases. The main challenge is that it is
difficult to have a dynamic file transmission working manner, by which the data delivery can depend
on each networking connection’s conditions and data package sizes. The problem is a NP-hard
problem when considering transmission probability, maximizing network capacities, and timing
constraints. To address this issue, we propose a novel approach that dynamically selects the
networking connections for transferring data packages between tele-health applications and cloud
servers. The proposed approach is called Optimal Tele-health Data Sharing Model (OTDSM). A
high level architecture of applying OTDSM model in telehealth systems. From distributed cloud
servers, different tele-health applications can access to the data via uploading and downloading
operations. We use solid lines to represent the uploading operations implemented by Tele-Health
App2 in the figure. We use broken lines to represent the downloading operations implemented by
Tele-Health App 1, as shown in the figure. The connection conditions’ can be varied due to the
different physical networking circumstances. Our approach is designed to create an optimal solution
to obtaining the maximum network capacity, as well as consider the transmission probabilities, under
timing constraints. All timing constraints are within a timing constraint range, which covers the time
period from the shortest required execution time to the longest required execution time.
The main contributions of this work are twofold: 1) We propose an optimal approach that is designed
for optimizing the implementations of cloud-based telehealth systems in a big data environment. The
proposed issue addresses the data sharing in tele-health systems that is restricted by the networking
capacities. 2) The solved issue is a NP-Hard problem and the developed solution can be applied in
other deducible problems.
Conclusion:
It proposed an approach of creating optimal solutions to big data sharing in cloud-based tele-health
systems. We considered three vital factors that highly impacted on the real-time tele-health services,
including the transmission time, network capacity, and transmission success probability. The
proposed approach was called OTDSM that was supported by a major optimal algorithm, DtGDP
18
algorithm. The experimental results had shown that our solution was strongly superior to the greedy
algorithm.
3) Health-CPS: Healthcare Cyber-Physical System Assisted by Cloud and Big Data

Authors: Y.Zhang, M.Qiu, C.-W.Tsai, M. M.Hassan, and A.Alamri,
Abstract
The advances in information technology have witnessed great progress on healthcare technologies in
various domains nowadays. However, these new technologies have also made healthcare data not
only much bigger but also much more difficult to handle and process. Moreover, because the data are
created from a variety of devices within a short time span, the characteristics of these data are that
they are stored in different formats and created quickly, which can, to a large extent, be regarded as a
big data problem. To provide a more convenient service and environment of healthcare, this paper
proposes a cyber-physical system for patient-centric healthcare applications and services, called
Health-CPS, built on cloud and big data analytics technologies. This system consists of a data
collection layer with a unified standard, a data management layer for distributed storage and parallel
computing, and a data-oriented service layer. The results of this study show that the technologies of
cloud and big data can be used to enhance the performance of the healthcare system so that humans
can then enjoy various smart healthcare applications and services.
1. Introduction
In the past two decades, information technology has been widely utilized in medicine. Electronic
health records (EHRs), biomedical database, and public health have been enhanced not only on the
availability and traceability but also on the liquidity of data. As healthcare-related data are
consistently explosive, there are challenges for data management, storage, and processing, as
follows.
1) Large Scale: With the improvement of medical information, particularly the development of
hospital information systems, the volume of medical data has been increasing. In addition, the
promotion of wearable health devices accelerates the explosion of healthcare data.
2) Rapid Generation: Most medical equipment, particularly wearable devices, continuously collects
data. The rapidly generated data need to be processed promptly for responding to emergency
immediately.
3) Various Structure: Clinical examination, treatment, monitoring, and other healthcare devices
generate complex and heterogeneous data (e.g., text, image, audio, or video) that are either
structured, semi-structured, or non-structured at all.
19
4) Deep Value: The value hidden in an isolated source is limited. However, through data fusion of
EHR and electronic medical records (EMRs), we can maximize the deep value from healthcare data,
such as personalized health guidance and public health warning. With the assistance of cloud
computing and big data, healthcare data from cyber-physical systems (CPS) (e.g., huge files,
complex structures, and different features) can be efficiently managed. Lin proposed a NoSQL based
approach for rapid processing, storage, indexing, and analysis of healthcare data, to overcome the
limitation of a rational database. Nie et al. also designed a multilingual health search engine to return
one multifaceted answer that is well structured and precisely extracted from multiple heterogeneous
healthcare sources. In another study, Takeuchi and Kodama presented a personal dynamic health
system based on cloud computing and big data to store daily healthcare related information collected
by mobile devices. In addition, they also proposed a health data mining algorithm to find out the
correlation between a health condition and lifestyle. Although the innovations are in the healthcare
field, there are some issues that need to be solved, particularly the heterogeneous data fusion and the
open platform for data access and analysis. For example, although studies are focused on the
interconnection between body area networks (BANs) and the cooperation between BANs and
medical institutions, it is difficult to fuse the multisource heterogeneous data and the corresponding
managements without unified standards and systems. Thus, the healthcare data stored together on the
physical layer are still logically separated. It can be easily recognized that several previous works
were focused on the analysis of healthcare data, or on how to deploy and implement cloud
computing, for healthcare systems. However, the greatest challenge of building a comprehensive
healthcare system is in the handling of the heterogeneous healthcare data captured from multiple
sources. That is why a healthcare CPS using technologies of cloud and big data (Health-CPS) is
presented in this, the contributions of which can be summarized as follows: 1) A unified data
collection layer for the integration of public medical resources and personal health devices is
presented; 2) a cloud-enabled and data-driven platform for the storage and analysis of multisource
heterogeneous healthcare data is established; and 3) a healthcare application service cloud is
designed, which provides a unified application programming interface (API) for the developers and a
unified interface for the users.
HEALTH-CPS Architecture
Design Issues For the healthcare industry
Cloud and big data not only are important techniques but also are gradually becoming the trend in
healthcare innovation. Nowadays, medicine is relying much more on specific data collection and
analysis, whereas medical knowledge is explosively growing. Therefore, medical knowledge
published and shared via cloud is popular in practice. Patients typically will know more than a
20
doctor. As such, the information and knowledge base can be enriched and shared by the doctors over
the cloud [23]. The patients can also actively participate in medical activities assisted by big data.
Through smart phones, cloud computing, 3-D printing, gene sequencing, and wireless sensors, the
medical right returns to the patients, and the role of a doctor is as a consultant to provide decision
support to the patients [24]. The revolutions of cloud and big data may have a strong impact on the
healthcare industry, which has even been reformed as a new complex ecosystem. Fig. 1 illustrates the
extended healthcare ecosystem, including traditional roles (e.g., patients and medical institutions)
and other newcomers. That is why we need to design a more suitable healthcare system to cope with
the following challenges in this novel revolution.
1) Multisource Heterogeneous Data Management With Unified Standards. A variety of data types
and heterogeneity of homogeneous data make healthcare data hard to harness. On one hand, the
system must support various healthcare equipment devices to ensure scalability. On the other hand,
the data formats should be converted with the unified standards, to improve the efficiency of data
storage, query, retrieval, processing, and analysis.
2) Diversified Data Analysis Modules With Unified Programming Interface. Diversified healthcare
data include structured, semi-structured, and other unstructured data. According to these different
data structures, it is necessary to deploy suitable methods for efficient online or offline analysis, such
as stream processing, batch processing, iterative processing, and interactive query. To reduce system
complexity and improve development and access efficiency, a unified programming interface will be
a fundamental component.
3) Application Service Platform With Unified Northbound Interface. As shown in Fig. 1, the system
is expected to provide various applications and services for different roles. To provide available and
reliable healthcare services, the application service platform of this system is essential for resource
optimization, technical support, and data sharing.
B. Cloud- and Big-Data-Assisted
It illustrates the architecture of Health-CPS, which consists of three layers, namely, data collection
layer, data management layer, and application service layer.
21
Fig 2.3.1. Health-CPS Architecture
1) Data collection layer: This layer consists of data nodes and adapters, provides a unified system
access interface for multisource heterogeneous data from hospitals, Internet, or user-generated
content. Through an adapter, raw data in a variety of structures and formats can be preprocessed to
ensure the availability and security of the data transmission to the data management layer.
2) Data management layer: This layer consists of a distributed file storage (DFS) module and a
distributed parallel computing (DPC) module. Assisted by bigdata related technologies, DFS will
enhance the performance of the healthcare system by providing efficient data storage and I/O for
heterogeneous healthcare data. According to the timeliness of data and priority of analysis task, DPC
provides the corresponding processing and analysis methods.
3) Application service layer: This layer provides users the basic visual data analysis results. It also
provides an open unified API for the developers aiming at user-centric application to provide rich,
professional, and personalized healthcare services.
Conclusion:
Nowadays, space (from hospital to home and carry) and time (from discrete sampling to continuous
tracking and monitoring) are no longer a stumbling stone for modern healthcare by using more
powerful analysis technologies. Medical diagnosis is evolving to patient centric prevention,
prediction, and treatment. The big data technologies have been developed gradually and will be used
everywhere. Consequently, healthcare will also enter the big data era. More precisely, the big data
analysis technologies can be used as guide in lifestyle, as a tool to support in the decision-making,
and as a source of innovation in the evolving healthcare ecosystem. This paper has presented a smart
health system assisted by cloud and big data, which includes 1) a unified data collection layer for the
integration of public medical resources and personal health devices, 2) a cloud-enabled and data-
22
driven platform for multisource heterogeneous healthcare data storage and analysis, and 3) a unified
API for developers and a unified interface for users. Supported by Health-CPS, various personalized
applications and services are developed to address the challenges in the traditional healthcare,
including centralized resources, Information Island, and patient passive participation. In the future,
we will focus on developing various applications based on the Health-CPS to provide a better
environment to humans.
4) Risk factors and risk assessment tools for falls in hospital in-patients: a systematic review
Authors: D. Oliver, F. Daly, F. C. Martin, and M. E. McMurdo
Abstract
The objective is to identify all published papers on risk factors and risk assessment tools for falls in
hospital inpatients. To identify clinical risk assessment tools or individual clinical risk factors
predictive of falls, with the ultimate aim of informing the design of effective fall prevention
strategies. Design: systematic literature review (Cochrane methodology). Independent assessment of
quality against agreed criteria. Calculation of odds ratios and 95% confidence intervals for risk
factors and of sensitivity, specificity, negative and positive predictive value for risk assessment tools
(with odds ratios and confidence intervals), where published data sufficient. Results: 28 papers on
risk factors were identified, with 15 excluded from further analysis. Despite the identification of 47
papers purporting to describe falls risk assessment tools, only six papers were identified where risk
assessment tools had been subjected to prospective validation, and only two where validation had
been performed in two or more patient cohorts. Conclusions: a small number of significant falls risk
factors emerged consistently, despite the heterogeneity of settings namely gait instability, agitated
confusion, urinary incontinence/frequency, falls history and prescription of ‘culprit’ drugs (especially
sedative/hypnotics). Simple risk assessment tools constructed of similar variables have been shown
to predict falls with sensitivity and specificity in excess of 70%, although validation in a variety of
settings and in routine clinical use is lacking. Effective falls interventions in this population may
require the use of better-validated risk assessment tools, or alternatively, attention to common
reversible falls risk factors in all patients.
Introduction
23
Falls are common among hospital inpatients. Rates from 2.9–13 falls per 1,000 bed days have been
reported. Up to 30% of such falls may result in injury, including fracture, head and soft tissue
trauma, all of which may in turn lead to impaired rehabilitation and co-morbidity. Falls are also
associated with higher anxiety and depression scores, loss of confidence and post-fall syndrome.
They are associated with increased length of hospital stay and higher rates of discharge to long-term
institutional care. Not only are they costly for individual patients and for hospitals, but they may
result in anxiety or guilt among staff, complaints or litigation from patients’ families. There may be a
feeling that something should have been done to prevent the fall and that someone is accountable.
We know that many hospital patients recovering from acute illness may go through a period of
transient risk and that others, with chronic gait instability and cognitive impairment, may be at risk of
falling throughout admission. Moreover, effective rehabilitation entails an inevitable risk of falls as
patients are encouraged to regain independent mobility. It seems intuitively likely however, that
some falls are both predictable and preventable. Systematic review of the literature on falls
prevention in hospitals has found no consistent evidence for single or multiple interventions to
prevent falls. More definitive work in this Weld has been recognized as a key falls research priority.
There is better evidence for falls prevention in older people dwelling in the community. However,
such individuals are likely to have different characteristics from patients admitted to hospital. Whilst
we know that falls are the result of multiple synergistic pathologies and risk factors, we do not know
to what extent the nature and prevalence of these risk factors is different among hospital inpatients,
and therefore whether successful interventions can be extrapolated from the community. Moreover,
as patients may only be in hospital for a short time, long-term interventions (e.g. exercise programs)
are unlikely to be effective. It does seem likely, however, that any successful intervention to prevent
falls in hospital inpatients might rest both on a knowledge of the reversible risk factors for falls in
this group and on an ability to predict high risk of falling in individual patients. With regard to risk
prediction, there are a number of clinical risk assessment tools in the literature whose derivation,
weighting, validation and usefulness are obscure. Wyatt and Altman laid down ‘gold standard’
criteria for the use of such tools. Essentially, they should be validated prospectively, using
sensitivity/specificity analyses, in more than one population, with good face validity, inter-rater
reliability and adherence from staff and transparent, simple calculation of the score. A better
knowledge of the nature and prevalence of risk factors for falls in hospital inpatients and of our
ability to identify high-risk patients is an important step in the design of future falls prevention
interventions in this group. They may also be applicable to other facilities, which provide care for
post-acute patients, such as Intermediate Care units in the UK or skilled nursing facilities in the US.
24
Conclusion
A small number of significant falls risk factors emerged consistently, despite the heterogeneity of
settings namely gait instability, agitated confusion, urinary incontinence/frequency, falls history and
prescription of ‘culprit’ drugs (especially sedative/hypnotics). Simple risk assessment tools
constructed of similar variables have been shown to predict falls with sensitivity and specificity in
excess of 70%, although validation in a variety of settings and in routine clinical use is lacking.
Effective falls interventions in this population may require the use of better-validated risk assessment
tools, or alternatively, attention to common reversible falls risk factors in all patients.
3. TECHNOLOGIES USED
3.1. HADOOP
Introduction
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on
commodity hardware. It has many similarities with existing distributed file systems. However, the
differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is
designed to be deployed on low-cost hardware. HDFS provides high throughput access to application
data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX
requirements to enable streaming access to file system data. HDFS was originally built as
infrastructure for the Apache Nutch web search engine project. HDFS is part of the Apache Hadoop
Core project. The project URL is http://hadoop.apache.org/.
Assumptions and Goals
Hardware Failure
Hardware failure is the norm rather than the exception. An HDFS instance may consist of hundreds
or thousands of server machines, each storing part of the file system’s data. The fact that there are a
huge number of components and that each component has a non-trivial probability of failure means
that some component of HDFS is always non-functional. Therefore, detection of faults and quick,
automatic recovery from them is a core architectural goal of HDFS.
Streaming Data Access
25
Applications that run on HDFS need streaming access to their data sets. They are not general purpose
applications that typically run on general purpose file systems. HDFS is designed more for batch
processing rather than interactive use by users. The emphasis is on high throughput of data access
rather than low latency of data access. POSIX imposes many hard requirements that are not needed
for applications that are targeted for HDFS. POSIX semantics in a few key areas has been traded to
increase data throughput rates.
Large Data Sets
Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes
in size. Thus, HDFS is tuned to support large files. It should provide high aggregate data bandwidth
and scale to hundreds of nodes in a single cluster. It should support tens of millions of files in a
single instance.
Simple Coherency Model
HDFS applications need a write-once-read-many access model for files. A file once created, written,
and closed need not be changed except for appends and truncates. Appending the content to the end
of the files is supported but cannot be updated at arbitrary point. This assumption simplifies data
coherency issues and enables high throughput data access. A MapReduce application or a web
crawler application fits perfectly with this model.
“Moving Computation is Cheaper than Moving Data”
A computation requested by an application is much more efficient if it is executed near the data it
operates on. This is especially true when the size of the data set is huge. This minimizes network
congestion and increases the overall throughput of the system. The assumption is that it is often
better to migrate the computation closer to where the data is located rather than moving the data to
where the application is running. HDFS provides interfaces for applications to move themselves
closer to where the data is located.
Portability Across Heterogeneous Hardware and Software Platforms
HDFS has been designed to be easily portable from one platform to another. This facilitates
widespread adoption of HDFS as a platform of choice for a large set of applications.
NameNode and DataNodes
26
HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master
server that manages the file system namespace and regulates access to files by clients. In addition,
there are a number of DataNodes, usually one per node in the cluster, which manage storage attached
to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be
stored in files. Internally, a file is split into one or more blocks and these blocks are stored in a set of
DataNodes. The NameNode executes file system namespace operations like opening, closing, and
renaming files and directories. It also determines the mapping of blocks to DataNodes. The
DataNodes are responsible for serving read and write requests from the file system’s clients. The
DataNodes also perform block creation, deletion, and replication upon instruction from the
NameNode.
Fig.3.1.1. HDFS Architecture
The NameNode and DataNode are pieces of software designed to run on commodity machines.
These machines typically run a GNU/Linux operating system (OS). HDFS is built using the Java
language; any machine that supports Java can run the NameNode or the DataNode software. Usage
of the highly portable Java language means that HDFS can be deployed on a wide range of machines.
A typical deployment has a dedicated machine that runs only the NameNode software. Each of the
other machines in the cluster runs one instance of the DataNode software. The architecture does not
27
preclude running multiple DataNodes on the same machine but in a real deployment that is rarely the
case.
The existence of a single NameNode in a cluster greatly simplifies the architecture of the system.
The NameNode is the arbitrator and repository for all HDFS metadata. The system is designed in
such a way that user data never flows through the NameNode.
The File System Namespace
HDFS supports a traditional hierarchical file organization. A user or an application can create
directories and store files inside these directories. The file system namespace hierarchy is similar to
most other existing file systems; one can create and remove files, move a file from one directory to
another, or rename a file. HDFS does not yet implement user quotas or access permissions. HDFS
does not support hard links or soft links. However, the HDFS architecture does not preclude
implementing these features.
The NameNode maintains the file system namespace. Any change to the file system namespace or its
properties is recorded by the NameNode. An application can specify the number of replicas of a file
that should be maintained by HDFS. The number of copies of a file is called the replication factor of
that file. This information is stored by the NameNode.
Data Replication
HDFS is designed to reliably store very large files across machines in a large cluster. It stores each
file as a sequence of blocks; all blocks in a file except the last block are the same size. The blocks of
a file are replicated for fault tolerance. The block size and replication factor are configurable per file.
An application can specify the number of replicas of a file. The replication factor can be specified at
file creation time and can be changed later. Files in HDFS are write-once and have strictly one writer
at any time.
The NameNode makes all decisions regarding replication of blocks. It periodically receives a
Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat
implies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on a
DataNode.
28
Fig.3.1.2. Replica Placement: The First Baby Steps
The placement of replicas is critical to HDFS reliability and performance. Optimizing replica
placement distinguishes HDFS from most other distributed file systems. This is a feature that needs
lots of tuning and experience. The purpose of a rack-aware replica placement policy is to improve
data reliability, availability, and network bandwidth utilization. The current implementation for the
replica placement policy is a first effort in this direction. The short-term goals of implementing this
policy are to validate it on production systems, learn more about its behavior, and build a foundation
to test and research more sophisticated policies.
Large HDFS instances run on a cluster of computers that commonly spread across many racks.
Communication between two nodes in different racks has to go through switches. In most cases,
network bandwidth between machines in the same rack is greater than network bandwidth between
machines in different racks.
The NameNode determines the rack id each DataNode belongs to via the process outlined in Hadoop
Rack Awareness. A simple but non-optimal policy is to place replicas on unique racks. This prevents
losing data when an entire rack fails and allows use of bandwidth from multiple racks when reading
data. This policy evenly distributes replicas in the cluster which makes it easy to balance load on
component failure. However, this policy increases the cost of writes because a write needs to transfer
blocks to multiple racks.
For the common case, when the replication factor is three, HDFS’s placement policy is to put one
replica on one node in the local rack, another on a different node in the local rack, and the last on a
different node in a different rack. This policy cuts the inter-rack write traffic which generally
29
improves write performance. The chance of rack failure is far less than that of node failure; this
policy does not impact data reliability and availability guarantees. However, it does reduce the
aggregate network bandwidth used when reading data since a block is placed in only two unique
racks rather than three. With this policy, the replicas of a file do not evenly distribute across the
racks. One third of replicas are on one node, two thirds of replicas are on one rack, and the other
third are evenly distributed across the remaining racks. This policy improves write performance
without compromising data reliability or read performance.
The current, default replica placement policy described here is a work in progress.
Replica Selection
To minimize global bandwidth consumption and read latency, HDFS tries to satisfy a read request
from a replica that is closest to the reader. If there exists a replica on the same rack as the reader
node, then that replica is preferred to satisfy the read request. If angg/ HDFS cluster spans multiple
data centers, then a replica that is resident in the local data center is preferred over any remote
replica.
Safemode
On startup, the NameNode enters a special state called Safemode. Replication of data blocks does
not occur when the NameNode is in the Safemode state. The NameNode receives Heartbeat and
Blockreport messages from the DataNodes. A Blockreport contains the list of data blocks that a
DataNode is hosting. Each block has a specified minimum number of replicas. A block is considered
safely replicated when the minimum number of replicas of that data block has checked in with the
NameNode. After a configurable percentage of safely replicated data blocks checks in with the
NameNode (plus an additional 30 seconds), the NameNode exits the Safemode state. It then
determines the list of data blocks (if any) that still have fewer than the specified number of replicas.
The NameNode then replicates these blocks to other DataNodes.
The Persistence of File System Metadata
The HDFS namespace is stored by the NameNode. The NameNode uses a transaction log called the
EditLog to persistently record every change that occurs to file system metadata. For example,
creating a new file in HDFS causes the NameNode to insert a record into the Edit Log indicating
this. Similarly, changing the replication factor of a file causes a new record to be inserted into the
EditLog. The NameNode uses a file in its local host OS file system to store the EditLog. The entire
30
file system namespace, including the mapping of blocks to files and file system properties, is stored
in a file called the FsImage. The FsImage is stored as a file in the NameNode’s local file system too.
The NameNode keeps an image of the entire file system namespace and file Blockmap in memory.
This key metadata item is designed to be compact, such that a NameNode with 4 GB of RAM is
plenty to support a huge number of files and directories. When the NameNode starts up, it reads the
FsImage and EditLog from disk, applies all the transactions from the EditLog to the in-memory
representation of the FsImage, and flushes out this new version into a new FsImage on disk. It can
then truncate the old EditLog because its transactions have been applied to the persistent FsImage.
This process is called a checkpoint. In the current implementation, a checkpoint only occurs when
the NameNode starts up. Work is in progress to support periodic checkpointing in the near future.
The DataNode stores HDFS data in files in its local file system. The DataNode has no knowledge
about HDFS files. It stores each block of HDFS data in a separate file in its local file system. The
DataNode does not create all files in the same directory. Instead, it uses a heuristic to determine the
optimal number of files per directory and creates subdirectories appropriately. It is not optimal to
create all local files in the same directory because the local file system might not be able to
efficiently support a huge number of files in a single directory. When a DataNode starts up, it scans
through its local file system, generates a list of all HDFS data blocks that correspond to each of these
local files and sends this report to the NameNode: this is the Blockreport.
The Communication Protocols
All HDFS communication protocols are layered on top of the TCP/IP protocol. A client establishes a
connection to a configurable TCP port on the NameNode machine. It talks the ClientProtocol with
the NameNode. The DataNodes talk to the NameNode using the DataNode Protocol. A Remote
Procedure Call (RPC) abstraction wraps both the Client Protocol and the DataNode Protocol. By
design, the NameNode never initiates any RPCs. Instead, it only responds to RPC requests issued by
DataNodes or clients.
Robustness
The primary objective of HDFS is to store data reliably even in the presence of failures. The three
common types of failures are NameNode failures, DataNode failures and network partitions.
Data Disk Failure, Heartbeats and Re-Replication
31
Each DataNode sends a Heartbeat message to the NameNode periodically. A network partition can
cause a subset of DataNodes to lose connectivity with the NameNode. The NameNode detects this
condition by the absence of a Heartbeat message. The NameNode marks DataNodes without recent
Heartbeats as dead and does not forward any new IO requests to them. Any data that was registered
to a dead DataNode is not available to HDFS anymore. DataNode death may cause the replication
factor of some blocks to fall below their specified value. The NameNode constantly tracks which
blocks need to be replicated and initiates replication whenever necessary. The necessity for re-
replication may arise due to many reasons: a DataNode may become unavailable, a replica may
become corrupted, a hard disk on a DataNode may fail, or the replication factor of a file may be
increased.
Cluster Rebalancing
The HDFS architecture is compatible with data rebalancing schemes. A scheme might automatically
move data from one DataNode to another if the free space on a DataNode falls below a certain
threshold. In the event of a sudden high demand for a particular file, a scheme might dynamically
create additional replicas and rebalance other data in the cluster. These types of data rebalancing
schemes are not yet implemented.
Data Integrity
It is possible that a block of data fetched from a DataNode arrives corrupted. This corruption can
occur because of faults in a storage device, network faults, or buggy software. The HDFS client
software implements checksum checking on the contents of HDFS files. When a client creates an
HDFS file, it computes a checksum of each block of the file and stores these checksums in a separate
hidden file in the same HDFS namespace. When a client retrieves file contents it verifies that the
data it received from each DataNode matches the checksum stored in the associated checksum file. If
not, then the client can opt to retrieve that block from another DataNode that has a replica of that
block.
Metadata Disk Failure
The FsImage and the EditLog are central data structures of HDFS. A corruption of these files can
cause the HDFS instance to be non-functional. For this reason, the NameNode can be configured to
support maintaining multiple copies of the FsImage and EditLog. Any update to either the FsImage
or EditLog causes each of the FsImages and EditLogs to get updated synchronously. This
synchronous updating of multiple copies of the FsImage and EditLog may degrade the rate of
32
namespace transactions per second that a NameNode can support. However, this degradation is
acceptable because even though HDFS applications are very data intensive in nature, they are not
metadata intensive. When a NameNode restarts, it selects the latest consistent FsImage and EditLog
to use.
The NameNode machine is a single point of failure for an HDFS cluster. If the NameNode machine
fails, manual intervention is necessary. Currently, automatic restart and failover of the NameNode
software to another machine is not supported.
Snapshots
Snapshots support storing a copy of data at a particular instant of time. One usage of the snapshot
feature may be to roll back a corrupted HDFS instance to a previously known good point in time.
HDFS does not currently support snapshots but will in a future release.
Data Organization
Data Blocks
HDFS is designed to support very large files. Applications that are compatible with HDFS are those
that deal with large data sets. These applications write their data only once but they read it one or
more times and require these reads to be satisfied at streaming speeds. HDFS supports write-once-
read-many semantics on files. A typical block size used by HDFS is 64 MB. Thus, an HDFS file is
chopped up into 64 MB chunks, and if possible, each chunk will reside on a different DataNode.
Staging
A client request to create a file does not reach the NameNode immediately. In fact, initially the
HDFS client caches the file data into a temporary local file. Application writes are transparently
redirected to this temporary local file. When the local file accumulates data worth over one HDFS
block size, the client contacts the NameNode. The NameNode inserts the file name into the file
system hierarchy and allocates a data block for it. The NameNode responds to the client request with
the identity of the DataNode and the destination data block. Then the client flushes the block of data
from the local temporary file to the specified DataNode. When a file is closed, the remaining un-
flushed data in the temporary local file is transferred to the DataNode. The client then tells the
NameNode that the file is closed. At this point, the NameNode commits the file creation operation
into a persistent store. If the NameNode dies before the file is closed, the file is lost.
33
The above approach has been adopted after careful consideration of target applications that run on
HDFS. These applications need streaming writes to files. If a client writes to a remote file directly
without any client side buffering, the network speed and the congestion in the network impacts
throughput considerably. This approach is not without precedent. Earlier distributed file systems, e.g.
AFS, have used client side caching to improve performance. A POSIX requirement has been relaxed
to achieve higher performance of data uploads.
Replication Pipelining
When a client is writing data to an HDFS file, its data is first written to a local file as explained in the
previous section. Suppose the HDFS file has a replication factor of three. When the local file
accumulates a full block of user data, the client retrieves a list of DataNodes from the NameNode.
This list contains the DataNodes that will host a replica of that block. The client then flushes the data
block to the first DataNode. The first DataNode starts receiving the data in small portions, writes
each portion to its local repository and transfers that portion to the second DataNode in the list. The
second DataNode, in turn starts receiving each portion of the data block, writes that portion to its
repository and then flushes that portion to the third DataNode. Finally, the third DataNode writes the
data to its local repository. Thus, a DataNode can be receiving data from the previous one in the
pipeline and at the same time forwarding data to the next one in the pipeline. Thus, the data is
pipelined from one DataNode to the next.
Accessibility
HDFS can be accessed from applications in many different ways. Natively, HDFS provides a
FileSystem Java API for applications to use. A C language wrapper for this Java API is also
available. In addition, an HTTP browser can also be used to browse the files of an HDFS instance.
Work is in progress to expose HDFS through the WebDAV protocol.
FS Shell
HDFS allows user data to be organized in the form of files and directories. It provides a
commandline interface called FS shell that lets a user interact with the data in HDFS. The syntax of
this command set is similar to other shells (e.g. bash, csh) that users are already familiar with. Here
are some sample action/command pairs:
Action Command
Create a directory named /foodir bin/hadoop dfs -mkdir /foodir
34
Action Command
Remove a directory named /foodir bin/hadoop fs -rm -R /foodir
View the contents of a file named bin/hadoop dfs -cat

/foodir/myfile.txt /foodir/myfile.txt
FS shell is targeted for applications that need a scripting language to interact with the stored data.
DFS Admin
The DFS Admin command set is used for administering an HDFS cluster. These are commands that
are used only by an HDFS administrator. Here are some sample action/command pairs:
Action Command
Put the cluster in Safemode bin/hdfs dfsadmin -safemode enter
Generate a list of Data Nodes bin/hdfs dfsadmin -report
Recommission or decommission DataNode(s) bin/hdfs dfsadmin –refresh Nodes
Browser Interface
A typical HDFS install configures a web server to expose the HDFS namespace through a
configurable TCP port. This allows a user to navigate the HDFS namespace and view the contents of
its files using a web browser.
Space Reclamation
File Deletes and Undeletes
When a file is deleted by a user or an application, it is not immediately removed from HDFS.
Instead, HDFS moves it to a trash directory (each user has its own trash directory under
/user/<username>/.Trash). The file can be restored quickly as long as it remains in trash. Most recent
deleted files are moved to the current trash directory (/user/<username>/.Trash/Current), and in a
configurable interval, HDFS creates checkpoints (under /user/<username>/.Trash/<date>) for files in
current trash directory and deletes old checkpoints when they are expired. After the expiry of its life
35
in trash, the NameNode deletes the file from the HDFS namespace. The deletion of a file causes the
blocks associated with the file to be freed. Note that there could be an appreciable time delay
between the time a file is deleted by a user and the time of the corresponding increase in free space in
HDFS.
Currently, the trash feature is disabled by default (deleting files without storing in trash). User can
enable this feature by setting a value greater than zero for parameter fs.trash.interval (in core-
site.xml). This value tells the NameNode how long a checkpoint will be expired and removed from
HDFS. In addition, user can configure an appropriate time to tell NameNode how often to create
checkpoints in trash (the parameter stored as fs.trash.checkpoint.interval in core-site.xml), this value
should be smaller or equal to fs.trash.interval.
Decrease Replication Factor
When the replication factor of a file is reduced, the NameNode selects excess replicas that can be
deleted. The next Heartbeat transfers this information to the DataNode. The DataNode then removes
the corresponding blocks and the corresponding free space appears in the cluster. Once again, there
might be a time delay between the completion of the setReplication API call and the appearance of
free space in the cluster.
4. INPUT DESIGN & OUTPUT DESIGN
INPUT DESIGN
The input design is the link between the information system and the user. It comprises the
developing specification and procedures for data preparation and those steps are necessary to put
transaction data in to a usable form for processing can be achieved by inspecting the computer to
read data from a written or printed document or it can occur by having people keying the data
directly into the system. The design of input focuses on controlling the amount of input required,
controlling the errors, avoiding delay, avoiding extra steps and keeping the process simple. The input
is designed in such a way so that it provides security and ease of use with retaining the privacy. Input
Design considered the following things:
 What data should be given as input?
36
 How the data should be arranged or coded?
 The dialog to guide the operating personnel in providing input.
 Methods for preparing input validations and steps to follow when error occur.
Objectives
1. Input Design is the process of converting a user-oriented description of the input into a computer-
based system. This design is important to avoid errors in the data input process and show the correct
direction to the management for getting correct information from the computerized system.
2. It is achieved by creating user-friendly screens for the data entry to handle large volume of data.
The goal of designing input is to make data entry easier and to be free from errors. The data entry
screen is designed in such a way that all the data manipulates can be performed. It also provides
record viewing facilities.
3. When the data is entered it will check for its validity. Data can be entered with the help of screens.
Appropriate messages are provided as when needed so that the user will not be in maize of instant.
Thus the objective of input design is to create an input layout that is easy to follow
OUTPUT DESIGN
A quality output is one, which meets the requirements of the end user and presents the information
clearly. In any system results of processing are communicated to the users and to other system
through outputs. In output design it is determined how the information is to be displaced for
immediate need and also the hard copy output. It is the most important and direct source information
to the user. Efficient and intelligent output design improves the system’s relationship to help user
decision-making.
1. Designing computer output should proceed in an organized, well thought out manner; the right
output must be developed while ensuring that each output element is designed so that people will
find the system can use easily and effectively. When analysis design computer output, they should
Identify the specific output that is needed to meet the requirements.
2. Select methods for presenting information.
3. Create document, report, or other formats that contain information produced by the system.
The output form of an information system should accomplish one or more of the following
objectives.
37
 Convey information about past activities, current status or projections of the
 Future.
 Signal important events, opportunities, problems, or warnings.
 Trigger an action.
 Confirm an action.
5. IMPLEMENTATION
MODULES:
1. Data imputation module,

2. Training Word Embedding,
3. KNN algorithm,
4. Computation module.
MODULES DESCRIPTION:
Data imputation module:
38
For patient’s examination data, there is a large number of missing data due to human error. Thus, we
need to fill the structured data. Before data imputation, we first identify uncertain or incomplete
medical data and then modify or delete them to improve the data quality. Then, we use data
integration for data pre-processing.
Training Word Embedding:
Word vector training requires pure corpus, the purer the better, that is, it is better to use a
professional corpus. In this paper, we extracted the text data of all patients in the hospital from the
medical large data center.
KNN algorithm:
K-nearest Neighbor (KNN) to predict the risk of cerebral infarction disease. For T-data, we propose
KNN-based unimodal disease risk prediction (KNN-UDRP) algorithm to predict the risk of cerebral
infarction disease. In the remaining of the paper, let KNN (T-data) denote the KNN algorithm used
for T-data. For S&T data, we predict the risk of cerebral infarction disease by the use of KNN
algorithm, which is denoted by KNN (S&T-data) for the sake of simplicity.
Computation module:
The work performs a processing phase followed by two MapReduce jobs. This method also only
uses n Map tasks to compute the distances and the number of final process. Since this method uses a
distance based partitioning method, the size of varies, depending on the number of cells required to
perform the computation and the number of replications required by each phase.
Source Code:
Disease.java:
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
39
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
public class DriverConf {
@SuppressWarnings("deprecation")
public static void main(String[] args) throws IOException {
// TODO Auto-generated method stub
JobClient client = new JobClient();
//Configurations for Job set in this variable
JobConf conf = new JobConf(DriverConf.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
//Name of the Job
conf.setJobName("NGSensorData");
conf.setMapperClass(Map.class);
// conf.setCombinerClass(Red.class);
conf.setReducerClass(Red.class);
//Formats of the Data Type of Input and Output
conf.setInputFormat(KeyValueTextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
40
//Data type of Output key and value
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
client.setConf(conf);
try{
//Running the job with Configurations set in the conf
JobClient.runJob(conf);
} catch(Exception e){
e.printStackTrace();
Knn.java:
import java.util.Iterator;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
41
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextOutputFormat;
public class Knn {
public static class Map extends MapReduceBase implements Mapper<Text, Text, Text,
IntWritable> {
public void map(Text key, Text value, OutputCollector<Text, IntWritable> output,

Reporter reporter) throws IOException {
String splitword = key.toString();
String[] part = splitword.split(",");
output.collect(new Text( part[9]), new IntWritable(1));
public static class Red extends MapReduceBase implements Reducer<Text, IntWritable,

Text, IntWritable> {
public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text,

IntWritable> output, Reporter reporter) throws IOException {
int count = 0;
42
while (values.hasNext()) {
values.next();
++count;
System.out.println(key + " " + count);
output.collect(key, new IntWritable(count));
public static void main(String[] args) throws IOException {
JobClient client = new JobClient();
JobConf conf = new JobConf(Knn.class);
FileInputFormat.addInputPath((JobConf)conf, new Path(args[0]));
FileOutputFormat.setOutputPath((JobConf)conf, new Path(args[1]));
conf.setJobName("combiner_partitioner");
conf.setMapperClass(Map.class);
conf.setReducerClass(Red.class);
conf.setInputFormat(KeyValueTextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setNumReduceTasks(1);
client.setConf((Configuration)conf);
try {
43
JobClient.runJob((JobConf)conf);
catch (Exception e) {
e.printStackTrace();
Map.java:
import java.util.StringTokenizer;
import org.apache.hadoop.mapred.Mapper;
public class Map extends MapReduceBase implements
Mapper<Text, Text, Text, IntWritable>{
String returnValidString(String parseString)
String subString="";
String match = "]";
44
for(int index=42;index<parseString.length();index++)
if(parseString.charAt(index) !=match.charAt(0) )
subString += parseString.charAt(index);
else
break;
return subString;
@Override
public void map(Text key, Text value, OutputCollector<Text, IntWritable> output,
Reporter reporter) throws IOException {
String splitword=key.toString();
String[] part=splitword.split(",");
int humidity = 1;
Double h = Double.parseDouble(part[2]);
humidity = h.intValue();
output.collect(new Text( Integer.toString(humidity)), new IntWritable(1));
45
Reduce.java:
import java.util.Iterator;
import org.apache.hadoop.mapred.Reducer;
public class Red extends MapReduceBase implements
Reducer<Text, IntWritable, Text, IntWritable>{
@Override
public void reduce(Text key, Iterator<IntWritable> values,
OutputCollector<Text, IntWritable> output, Reporter reporter) throws

IOException {
//to get the count
int count = 0;
while(values.hasNext())
values.next();
count++;
46
}
output.collect(key, new IntWritable(count));
MapReduce.java:
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Job;
public class Mapreduce {
static class Mapper1 extends TableMapper<ImmutableBytesWritable, IntWritable> {
private static final IntWritable one = new IntWritable(1);
@Override
47
public void map(ImmutableBytesWritable row, Result values, Context context) throws
IOException {
String Data=new String(row.get());
String Split_data[]=Data.split(":");
switch(Split_data[0])
case "09":Split_data[1]="10";break;
ImmutableBytesWritable userKey = new

ImmutableBytesWritable(Bytes.toBytes(Split_data[0]+" to "+Split_data[1]));
try {
context.write(userKey, one);
} catch (InterruptedException e) {
throw new IOException(e);
48
}
public static class Reducer1 extends TableReducer<ImmutableBytesWritable, IntWritable,

ImmutableBytesWritable> {
public void reduce(ImmutableBytesWritable key, Iterable<IntWritable> values,

Context context)
throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
Put put = new Put(key.get());
put.add(Bytes.toBytes("details"), Bytes.toBytes("total"),
Bytes.toBytes(String.valueOf(sum)));
System.out.println(String.format("stats : key : %d, count : %d", Bytes.toInt(key.get()),

sum));
context.write(key, put);
public static void main(String[] args) throws Exception {
HBaseConfiguration conf = new HBaseConfiguration();
49
Job job = new Job(conf, "Hbase_FreqCounter1");
job.setJarByClass(Mapreduce.class);
Scan scan = new Scan();
String columns = "station";
scan.addFamily(Bytes.toBytes(columns));
scan.setFilter(new FirstKeyOnlyFilter());
TableMapReduceUtil.initTableMapperJob(args[0], scan, Mapper1.class,

ImmutableBytesWritable.class,
IntWritable.class, job);
TableMapReduceUtil.initTableReducerJob(args[1], Reducer1.class, job);
System.exit(job.waitForCompletion(true) ? 0 : 1);
Result Analysis:
50
Fig. 5.1 Bar Graph
Fig. 5.2 Pie Chart
6.SYSTEM DESIGN
SYSTEM ARCHITECTURE
Fig. 6.1. System Architecture

DATA FLOW DIAGRAM:
1. The DFD is also called as bubble chart. It is a simple graphical formalism that can be used to
represent a system in terms of input data to the system, various processing carried out on this
data, and the output data is generated by this system.
51
2. The data flow diagram (DFD) is one of the most important modeling tools. It is used to model
the system components. These components are the system process, the data used by the
process, an external entity that interacts with the system and the information flows in the
system.
3. DFD shows how the information moves through the system and how it is modified by a series
of transformations. It is a graphical technique that depicts information flow and the
transformations that are applied as data moves from input to output.
4. DFD is also known as bubble chart. A DFD may be used to represent a system at any level of
abstraction.
Fig.6.2. Dataflow diagram
UML DIAGRAMS
UML stands for Unified Modeling Language. UML is a standardized general-purpose

modelling language in the field of object-oriented software engineering. The standard is managed,
and was created by, the Object Management Group.
The goal is for UML to become a common language for creating models of object oriented
computer software. In its current form UML is comprised of two major components: a Meta-model
and a notation. In the future, some form of method or process may also be added to; or associated
with, UML.
The Unified Modeling Language is a standard language for specifying, Visualization,
Constructing and documenting the artifacts of software system, as well as for business modelling and
other non-software systems.
The UML represents a collection of best engineering practices that have proven successful in
the modelling of large and complex systems.
52
The UML is a very important part of developing objects oriented software and the software
development process. The UML uses mostly graphical notations to express the design of software
projects.
GOALS:
The Primary goals in the design of the UML are as follows:
1. Provide users a ready-to-use, expressive visual modelling Language so that they can develop
and exchange meaningful models.
2. Provide extendibility and specialization mechanisms to extend the core concepts.
3. Be independent of particular programming languages and development process.
4. Provide a formal basis for understanding the modelling language.
5. Encourage the growth of OO tools market.
6. Support higher level development concepts such as collaborations, frameworks, patterns and
components.
7. Integrate best practices.
USE CASE DIAGRAM:
A use case diagram in the Unified Modeling Language (UML) is a type of behavioral
diagram defined by and created from a Use-case analysis. Its purpose is to present a graphical
overview of the functionality provided by a system in terms of actors, their goals (represented as use
cases), and any dependencies between those use cases. The main purpose of a use case diagram is to
show what system functions are performed for which actor. Roles of the actors in the system can be
depicted.
53
Fig.6.3. Use case Diagram
CLASS DIAGRAM:
In software engineering, a class diagram in the Unified Modeling Language (UML) is a type of static
structure diagram that describes the structure of a system by showing the system’s classes, their
attributes, operations (or methods), and the relationships among the classes. It explains which class
contains information.
Fig.6.4.Class Diagram
SEQUENCE DIAGRAM:
54
A sequence diagram in Unified Modeling Language (UML) is a kind of interaction diagram that
shows how processes operate with one another and in what order. It is a construct of a Message
Sequence Chart. Sequence diagrams are sometimes called event diagrams, event scenarios, and
timing diagrams.
Fig.6.5.Sequence Diagram
ACTIVITY DIAGRAM:
Activity diagrams are graphical representations of workflows of stepwise activities and actions with
support for choice, iteration and concurrency. In the Unified Modeling Language, activity diagrams
can be used to describe the business and operational step-by-step workflows of components in a
system. An activity diagram shows the overall flow of control.
55
Fig.6.6.Activity Diagram
56
7. ANALYSIS
EXISTING SYSTEM:
 Prediction using traditional disease risk models usually involves a machine learning
algorithm (e.g., logistic regression and regression analysis, etc.), and especially a supervised
learning algorithm by the use of training data with labels to train the model.
 In the test set, patients can be classified into groups of either high-risk or low-risk. These
models are valuable in clinical situations and are widely studied.
 Chen et.al proposed a healthcare system using smart clothing for sustainable health
monitoring.
 Qiu et al. had thoroughly studied the heterogeneous systems and achieved the best results for
cost minimization on tree and simple path cases for heterogeneous systems. Patients’
statistical information, test results and disease history are recorded in the EHR, enabling us to
identify potential data-centric solutions to reduce the costs of medical case studies.
 Qiu et al. proposed an efficient flow estimating algorithm for the telehealth cloud system and
designed a data coherence protocol for the PHR (Personal Health Record)-based distributed
system.
 Bates et al. proposed six applications of big data in the field of healthcare.
 Qiu et al. proposed an optimal big data sharing algorithm to handle the complicate data set in
telehealth with cloud techniques. One of the applications is to identify high-risk patients
which can be utilized to reduce medical cost since high-risk patients often require expensive
healthcare.
DISADVANTAGES OF EXISTING SYSTEM:
 In the existing system the data set is typically small, for patients and diseases with specific
conditions, the characteristics are selected through experience. However, these pre-selected
characteristics maybe not satisfy the changes in the disease and its influencing factors.
 Lower Accuracy
 More Time Consuming
PROPOSED SYSTEM:
 We combine the structured and unstructured data in healthcare field to assess the risk of
disease. First, we used latent factor model to reconstruct the missing data from the medical
57
records collected from a hospital in central China. Second, by using statistical knowledge, we
could determine the major chronic diseases in the region. Third, to handle structured data, we
consult with hospital experts to extract useful features.
 For unstructured text data, we select the features automatically using CNN algorithm. Finally,
we propose a novel CNN-based multimodal disease risk prediction (CNN-MDRP) algorithm
for structured and unstructured data.
 The disease risk model is obtained by the combination of structured and unstructured
features. Through the experiment, we draw a conclusion that the performance of CNN-
MDPR is better than other existing methods.
ADVANTAGES OF PROPOSED SYSTEM:
 Higher Accuracy.
 We leverage not only the structured data but also the text data of patients based on the
proposed CNN-MDPR algorithm.
 We find that by combining these two data, the accuracy rate can reach 94.80%, so as to better
evaluate the risk of cerebral infarction disease.
 To the best of our knowledge, none of the existing work focused on both data types in the
area of medical big data analytics.
FEASIBILITY STUDY
The feasibility of the project is analyzed in this phase and business proposal is put
forth with a very general plan for the project and some cost estimates. During system analysis the
feasibility study of the proposed system is to be carried out. This is to ensure that the proposed
system is not a burden to the company. For feasibility analysis, some understanding of the major
requirements for the system is essential.
Three key considerations involved in the feasibility analysis are
 Economical Feasibility
 Technical Feasibility
 Social Feasibility
ECONOMICAL FEASIBILITY
This study is carried out to check the economic impact that the system will have on the
organization. The amount of fund that the company can pour into the research and development of
58
the system is limited. The expenditures must be justified. Thus the developed system as well within
the budget and this was achieved because most of the technologies used are freely available. Only
the customized products had to be purchased.
TECHNICAL FEASIBILITY
This study is carried out to check the technical feasibility, that is, the technical
requirements of the system. Any system developed must not have a high demand on the available
technical resources. This will lead to high demands on the available technical resources. This will
lead to high demands being placed on the client. The developed system must have a modest
requirement, as only minimal or null changes are required for implementing this system.
SOCIAL FEASIBILITY
The aspect of study is to check the level of acceptance of the system by the user. This includes
the process of training the user to use the system efficiently. The user must not feel threatened by the
system, instead must accept it as a necessity. The level of acceptance by the users solely depends on
the methods that are employed to educate the user about the system and to make him familiar with it.
His level of confidence must be raised so that he is also able to make some constructive criticism,
which is welcomed, as he is the final user of the system.
59
8. SYSTEM TESTING
The purpose of testing is to discover errors. Testing is the process of trying to discover
every conceivable fault or weakness in a work product. It provides a way to check the functionality
of components, sub-assemblies, assemblies and/or a finished product It is the process of exercising
software with the intent of ensuring that the
Software system meets its requirements and user expectations and does not fail in an unacceptable
manner. There are various types of test. Each test type addresses a specific testing requirement.
TYPES OF TESTS
Unit testing
Unit testing involves the design of test cases that validate that the internal program logic is
functioning properly, and that program inputs produce valid outputs. All decision branches and
internal code flow should be validated. It is the testing of individual software units of the
application .it is done after the completion of an individual unit before integration. This is a structural
testing, that relies on knowledge of its construction and is invasive. Unit tests perform basic tests at
component level and test a specific business process, application, and/or system configuration. Unit
tests ensure that each unique path of a business process performs accurately to the documented
specifications and contains clearly defined inputs and expected results.
Integration testing
Integration tests are designed to test integrated software components to determine if they
actually run as one program. Testing is event driven and is more concerned with the basic outcome
of screens or fields. Integration tests demonstrate that although the components were individually
satisfaction, as shown by successfully unit testing, the combination of components is correct and
consistent. Integration testing is specifically aimed at exposing the problems that arise from the
combination of components.
Functional test
Functional tests provide systematic demonstrations that functions tested are available as
specified by the business and technical requirements, system documentation, and user manuals.
Functional testing is centered on the following items:
60
Valid Input : identified classes of valid input must be accepted.
Invalid Input : identified classes of invalid input must be rejected.
Functions : identified functions must be exercised.
Output : identified classes of application outputs must be exercised.
Systems/Procedures : interfacing systems or procedures must be invoked.
Organization and preparation of functional tests is focused on requirements, key functions, or

special test cases. In addition, systematic coverage pertaining to identify Business process flows;
data fields, predefined processes, and successive processes must be considered for testing. Before
functional testing is complete, additional tests are identified and the effective value of current tests is
determined.
System Test
System testing ensures that the entire integrated software system meets requirements. It tests a
configuration to ensure known and predictable results. An example of system testing is the
configuration oriented system integration test. System testing is based on process descriptions and
flows, emphasizing pre-driven process links and integration points.
White Box Testing
White Box Testing is a testing in which in which the software tester has knowledge of the inner
workings, structure and language of the software, or at least its purpose. It is purpose. It is used to
test areas that cannot be reached from a black box level.
Black Box Testing
Black Box Testing is testing the software without any knowledge of the inner workings, structure or
language of the module being tested. Black box tests, as most other kinds of tests, must be written
from a definitive source document, such as specification or requirements document, such as
specification or requirements document. It is a testing in which the software under test is treated, as a
black box .you cannot “see” into it. The test provides inputs and responds to outputs without
considering how the software works.
UNIT TESTING
61
Unit testing is usually conducted as part of a combined code and unit test phase of the software
lifecycle, although it is not uncommon for coding and unit testing to be conducted as two distinct
phases.
Test strategy and approach

Field testing will be performed manually and functional tests will be written in detail.
Test objectives
 All field entries must work properly.
 Pages must be activated from the identified link.
 The entry screen, messages and responses must not be delayed.
Features to be tested
 Verify that the entries are of the correct format
 No duplicate entries should be allowed
 All links should take the user to the correct page.
INTEGRATION TESTING
Software integration testing is the incremental integration testing of two or more integrated
software components on a single platform to produce failures caused by interface defects.
The task of the integration test is to check that components or software applications, e.g.
components in a software system or – one step up – software applications at the company level –
interact without error.
Test Results: All the test cases mentioned above passed successfully. No defects encountered.
ACCEPTENCE TESTING
User Acceptance Testing is a critical phase of any project and requires significant
participation by the end user. It also ensures that the system meets the functional requirements.
Test Results: All the test cases mentioned above passed successfully. No defects encountered.
62
9. SYSTEM REQUIREMENTS
HARDWARE REQUIREMENTS:
 System : i3 Processor
 Hard Disk : 500 GB.
 Monitor : 15’’ LED
 Input Devices : Keyboard, Mouse
 Ram : 4GB.
SOFTWARE REQUIREMENTS:
 Operating system : Windows 7/UBUNTU.

 Coding Language : Java 1.7 ,Hadoop 0.8.1 (for Mapper and Reducer)
 Back End : Hadoop Cluster
 Tool : Virtual Box Oracle tool
 Evaluation : PHP, JavaScript (Intelligent Graph)
63
10. CONCLUSION
In this paper, we proposed a new convolutional neural network based multimodal disease risk
prediction (CNN-MDRP) algorithm using structured and unstructured data from hospital. To the best
of our knowledge, none of the existing work focused on both data types in the area of medical big
data analytics. Compared to several typical prediction algorithms, the prediction accuracy of our
proposed algorithm reaches 94.8% with a convergence speed which is faster than that of the CNN-
based unimodal disease risk prediction (CNN-UDRP) algorithm.
64
11. REFERENCES
[1] P. Groves, B. Kayyali, D. Knott, and S. V. Kuiken, “The big data revolution in healthcare:
Accelerating value and innovation,” 2016.
[2] M. Chen, S. Mao, and Y. Liu, “Big data: A survey,” Mobile Networks and Applications, vol. 19,
no. 2, pp. 171–209, 2014.
[3] P. B. Jensen, L. J. Jensen, and S. Brunak, “Mining electronic health records: towards better
research applications and clinical care,” NatureReviews Genetics, vol. 13, no. 6, pp. 395–405, 2012.
[4] D. Tian, J. Zhou, Y. Wang, Y. Lu, H. Xia, and Z. Yi, “A dynamic and self-adaptive network
selection method for multimode communications in heterogeneous vehicular telematics,” IEEE
Transactions on Intelligent Transportation Systems, vol. 16, no. 6, pp. 3033–3049, 2015.
[5] M. Chen, Y. Ma, Y. Li, D. Wu, Y. Zhang, C. Youn, “Wearable 2.0: Enable Human-Cloud
Integration in Next Generation Healthcare System,” IEEECommunications, Vol. 55, No. 1, pp. 54–
61, Jan. 2017.
[6] M. Chen, Y. Ma, J. Song, C. Lai, B. Hu, ”Smart Clothing: Connecting Human with Clouds and
Big Data for Sustainable Health Monitoring,”ACM/Springer Mobile Networks and Applications,
Vol. 21, No. 5, pp.825C845, 2016.
[7] M. Chen, P. Zhou, G. Fortino, “Emotion Communication System,” IEEEAccess, DOI:

10.1109/ACCESS.2016.2641480, 2016.
[8] M. Qiu and E. H.-M. Sha, “Cost minimization while satisfying hard/soft timing constraints for
heterogeneous embedded systems,” ACM Transactions on Design Automation of Electronic Systems
(TODAES), vol. 14,no. 2, p. 25, 2009.
65
[9] J. Wang, M. Qiu, and B. Guo, “Enabling real-time information service on telehealth system over
cloud-based big data platform,” Journal of Systems Architecture, vol. 72, pp. 69–79, 2017.
[10] D. W. Bates, S. Saria, L. Ohno-Machado, A. Shah, and G. Escobar, “Bigdata in health care:
using analytics to identify and manage high-risk and high-cost patients,” Health Affairs, vol. 33, no.
7, pp. 1123–1131, 2014.
[11] L. Qiu, K. Gai, and M. Qiu, “Optimal big data sharing approach for tele-health in cloud
computing,” in Smart Cloud (SmartCloud), IEEEInternational Conference on. IEEE, 2016, pp. 184–
189.
[12] Y. Zhang, M. Qiu, C.-W. Tsai, M. M. Hassan, and A. Alamri, “Healthcps:Healthcare cyber-
physical system assisted by cloud and big data,”IEEE Systems Journal, 2015.
[13] K. Lin, J. Luo, L. Hu, M. S. Hossain, and A. Ghoneim, “Localization based on social big data
analysis in the vehicular networks,” IEEE Transactions on Industrial Informatics, 2016.
[14] K. Lin, M. Chen, J. Deng, M. M. Hassan, and G. Fortino, “Enhanced fingerprinting and
trajectory prediction for iot localization in smart buildings,” IEEE Transactions on Automation
Science and Engineering,vol. 13, no. 3, pp. 1294–1307, 2016.
[15] D. Oliver, F. Daly, F. C. Martin, and M. E. McMurdo, “Risk factors and risk assessment tools
for falls in hospital in-patients: a systematic review,” Age and ageing, vol. 33, no. 2, pp. 122–130,
2004.
[16] S. Marcoon, A. M. Chang, B. Lee, R. Salhi, and J. E. Hollander, “Heart score to further risk
stratify patients with low time scores,” Critical path ways in cardiology, vol. 12, no. 1, pp. 1–5,
2013.
[17] S. Bandyopadhyay, J. Wolfson, D. M. Vock, G. Vazquez-Benitez,G. Adomavicius, M. Elidrisi,

P. E. Johnson, and P. J. O’Connor, “Datamining for censored time-to-event data: a bayesian network
model for predicting cardiovascular risk from electronic health record data,” DataMining and
Knowledge Discovery, vol. 29, no. 4, pp. 1033–1069, 2015.
66
[18] B. Qian, X. Wang, N. Cao, H. Li, and Y.-G. Jiang, “A relative similarity based method for
interactive patient risk prediction,” Data Mining and Knowledge Discovery, vol. 29, no. 4, pp. 1070–
1093, 2015.
[19] A. Singh, G. Nadkarni, O. Gottesman, S. B. Ellis, E. P. Bottinger, and J. V. Guttag,

“Incorporating temporal EHR data in predictive models for risk stratification of renal function
deterioration,” Journal of biomedical informatics, vol. 53, pp. 220–228, 2015.
[20] J. Wan, S. Tang, D. Li, S. Wang, C. Liu, H. Abbas and A. Vasilakos, “A Manufacturing Big
Data Solution for Active Preventive Maintenance”, IEEE Transactions on Industrial Informatics,
DOI: 10.1109/TII.2017.2670505, 2017.
[21] W. Yin and H. Schutze, “Convolutional neural network for paraphrase identification.” in HLT-
NAACL, 2015, pp. 901–911.
[22] N. Nori, H. Kashima, K. Yamashita, H. Ikai, and Y. Imanaka, “Simultaneous modeling of

multiple diseases for mortality prediction in acute hospital care,” in Proceedings of the 21th ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2015,pp. 855–
864.
[23] S. Zhai, K.h. Chang, R. Zhang, and Z. M. Zhang, “Deep intent: Learning attentions for online
advertising with recurrent neural networks,” in Proceedings of the 22nd ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining. ACM, 2016, pp. 1295–1304.
67

Project FInal Report

Uploaded by

Copyright:

Available Formats

Project FInal Report

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Project FInal Report

Uploaded by

Copyright:

Available Formats

ABSTRACT

1.1. INTRODUCTION TO BIGDATA

Fig 1.1. Pictorial representation of BigData

1.2. CHARACTERISTICS OF BIGDATA

Fig 1.2. Characteristics of BigData

1.3. BIG DATA ANALYTICS

Fig 1.3.BigData Analytics

Fig 1.4. BigData Computing

1.4. BIG DATA ANALYTICS IN CLOUD ENVIRONMENT

MOVING BIG DATA INTO CLOUD

Fig 1.5. Moving BigData into cloud

Targeting at a cloud encompassing disparate data centers, we model a cost-minimizing data

KEY TECHNOLOGIES FOR EXTRACTING BUSINESS VALUE FROM BIG DATA

3) Health-CPS: Healthcare Cyber-Physical System Assisted by Cloud and Big Data

Assumptions and Goals

Streaming Data Access

Large Data Sets

Simple Coherency Model

“Moving Computation is Cheaper than Moving Data”

Portability Across Heterogeneous Hardware and Software Platforms

NameNode and DataNodes

Fig.3.1.1. HDFS Architecture

The File System Namespace

The Persistence of File System Metadata

The Communication Protocols

Data Disk Failure, Heartbeats and Re-Replication

Metadata Disk Failure

Create a directory named /foodir bin/hadoop dfs -mkdir /foodir

Remove a directory named /foodir bin/hadoop fs -rm -R /foodir

View the contents of a file named bin/hadoop dfs -cat

Put the cluster in Safemode bin/hdfs dfsadmin -safemode enter

Generate a list of Data Nodes bin/hdfs dfsadmin -report

Recommission or decommission DataNode(s) bin/hdfs dfsadmin –refresh Nodes

File Deletes and Undeletes

Decrease Replication Factor

4. INPUT DESIGN & OUTPUT DESIGN

 What data should be given as input?

2. Select methods for presenting information.

1. Data imputation module,

Data imputation module:

Training Word Embedding:

public class DriverConf {

public static void main(String[] args) throws IOException {

// TODO Auto-generated method stub

// TODO Auto-generated method stub

JobClient client = new JobClient();

//Configurations for Job set in this variable

JobConf conf = new JobConf(DriverConf.class);

FileInputFormat.setInputPaths(conf, new Path(args[0]));

FileOutputFormat.setOutputPath(conf, new Path(args[1]));

//Name of the Job

//Formats of the Data Type of Input and Output

//Running the job with Configurations set in the conf

public class Knn {

public void map(Text key, Text value, OutputCollector<Text, IntWritable> output,

String splitword = key.toString();

String[] part = splitword.split(",");

output.collect(new Text( part[9]), new IntWritable(1));

public static class Red extends MapReduceBase implements Reducer<Text, IntWritable,