Critical Study of Hadoop Implementation and Performance Issues
Madhavi Vaidya
Asst. Professor, Dept of Computer Sc.
Vivekanand College, Mumbai, India
Dr. Shriniwas Deshpande
Associate Professor, Head of PG Dept of
Computer Science & Technology, DCPE,
HVPM, Amravati, India
Abstract
The MapReduce model has become an
important parallel processing model for large-scale
data-intensive applications such as data
mining and web indexing. Hadoop, an open-source
implementation of MapReduce, is widely
applied to support cluster computing jobs
requiring low response time. This paper discusses
several issues of Hadoop, together with the
solutions proposed for them in the papers
studied by the author. Hadoop is not an
easy environment to manage. The current
Hadoop implementation assumes that the computing
nodes in a cluster are homogeneous, and
recent Hadoop research has ignored network
delays due to data movement at run time.
Unfortunately, both the homogeneity and
data-locality assumptions are optimistic at best
and unachievable at worst, and they introduce
performance problems in virtualized data centers.
One of the studied papers analyzes the single
point of failure (SPOF) existing in the critical
nodes of Hadoop and proposes a metadata
replication based solution to give Hadoop high
availability. Heterogeneity can be addressed by a
data placement scheme which distributes and
stores data across multiple heterogeneous nodes
based on their computing capacities. Analysts have
noted that using the technology to aggregate and
store data from multiple sources can create a
whole slew of problems related to access control
and ownership; applications analyzing merged
data in a Hadoop environment can produce new
datasets that may also need to be protected.
Keywords: Fault, Distributed, HDFS, NameNode
Introduction
The phenomenal growth of internet-based
applications and web services in the last decade
has brought a change in the mindset of
researchers, and the traditional techniques for
storing and analyzing voluminous data have been
improved. Organizations are ready to acquire
solutions which are highly reliable. [1]
The behavior of web users is concealed in
web logs; web log mining can discover the
characteristics and rules of users' visiting
behavior in order to improve the quality of
service offered to them. Clustering is one of the
data mining technologies applied in web log
mining: applying clustering to the analysis of
users' visiting behavior groups users according
to their interests, which in turn helps improve a
web site's structure. [2] Several system
architectures have been implemented for
data-intensive computing and large-scale data
analysis, such as parallel and distributed
relational database management systems. As a
platform for computing and storage, the
availability of Hadoop is the foundation of the
availability of the applications running on it, so
it is necessary to keep the platform fully
available in a production environment. Hadoop
uses some methods to enhance the availability of
applications running on it, e.g. maintaining
multiple replicas of application data and
redeploying application tasks on failure, but it
does not provide high availability for itself.
In the architecture of Hadoop there exists a
single point of failure (SPOF): the whole system
stops working when a critical node of which only
a single copy is kept fails. [1,2] MapReduce,
proposed by Google, is a programming model
and an associated implementation for large-scale
data processing on distributed clusters. In the
first stage a Map function is applied in parallel
to each partition of the input data, performing
the grouping operations; in the second stage a
Reduce function is applied in parallel to each
group produced in the first stage, performing the
final aggregation. The MapReduce model allows
users to easily develop data analysis programs
that scale to thousands of nodes without
worrying about the details of parallelism. Its
popular open-source implementation, Hadoop,
has been used by many companies (such as
Yahoo and Facebook) in production for large-scale
data analysis in cloud computing. Thus, it
is essential to monitor distributed cluster status
through MapReduce-based data analysis using
Hadoop.
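The two-stage model described above can be illustrated with a short, self-contained sketch. This is a toy in-process simulation of the Map, shuffle and Reduce stages (word counting), not Hadoop itself; all names are illustrative.

```python
from itertools import groupby

def map_fn(line):
    # Map stage: emit (word, 1) for every word in one input record.
    return [(word, 1) for word in line.split()]

def reduce_fn(key, values):
    # Reduce stage: final aggregation of one key group.
    return key, sum(values)

def mapreduce(partitions, map_fn, reduce_fn):
    # Apply map_fn to every record of every input partition.
    intermediate = []
    for part in partitions:
        for record in part:
            intermediate.extend(map_fn(record))
    # Shuffle: sort by key so equal keys become adjacent, then group.
    intermediate.sort(key=lambda kv: kv[0])
    grouped = groupby(intermediate, key=lambda kv: kv[0])
    # One reduce_fn call per key group.
    return dict(reduce_fn(k, [v for _, v in vs]) for k, vs in grouped)

counts = mapreduce([["a b a"], ["b a"]], map_fn, reduce_fn)
```

In real Hadoop the partitions live in HDFS and the stages run on different nodes, but the data flow is the same.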
Hadoop Distributed File System
HDFS is the file system component of
Hadoop (refer Figure 1). While the interface to
HDFS is patterned after the UNIX file system,
faithfulness to standards was sacrificed in favor
of improved performance for the applications at
hand. [3]
Architecture of Hadoop
A. NameNode
The NameNode maintains the namespace tree
and the mapping of file blocks to DataNodes
(the physical location of file data). An HDFS
client wanting to read a file first contacts the
NameNode for the locations of data blocks
comprising the file and then reads block contents
from the DataNode closest to the client. When
writing data, the client requests the NameNode
to nominate a suite of three DataNodes to host
the block replicas. The client then writes data to
the DataNodes in a pipeline fashion.
Fig 1 : Hadoop Architecture
The persistent record of the image stored in the
local host's native file system is called a
checkpoint. The NameNode also stores the
modification log of the image, called the journal,
in the local host's native file system. For
improved durability, redundant copies of the
checkpoint and journal can be made at other
servers.
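The checkpoint/journal idea can be sketched as follows: the namespace image is saved periodically (checkpoint), every later modification is appended to a journal, and recovery loads the checkpoint and replays the journal. This is a hypothetical in-memory model, not the NameNode's actual code; all class and path names are illustrative.

```python
class NamespaceImage:
    def __init__(self):
        self.files = {}        # path -> block list (in-memory image)
        self.journal = []      # modification log since last checkpoint
        self.checkpoint = {}   # persistent copy of the image

    def apply(self, op, path, blocks=None):
        # Apply one namespace operation to the in-memory image.
        if op == "create":
            self.files[path] = blocks or []
        elif op == "delete":
            self.files.pop(path, None)

    def modify(self, op, path, blocks=None):
        # Journal first, then apply (write-ahead logging).
        self.journal.append((op, path, blocks))
        self.apply(op, path, blocks)

    def take_checkpoint(self):
        # Persist the image and truncate the journal.
        self.checkpoint = dict(self.files)
        self.journal = []

    def recover(self):
        # Restart: load the checkpoint, then replay the journal.
        self.files = dict(self.checkpoint)
        for op, path, blocks in self.journal:
            self.apply(op, path, blocks)

ns = NamespaceImage()
ns.modify("create", "/a", ["blk_1"])
ns.take_checkpoint()
ns.modify("create", "/b", ["blk_2"])
ns.modify("delete", "/a")
ns.recover()                   # simulated NameNode restart
```

After recovery the image reflects both the checkpoint and the journaled modifications, which is why redundant copies of both files improve durability.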
B. Data Nodes
During startup each DataNode connects to the
NameNode and performs a handshake, the
purpose of which is to verify the namespace ID
and the software version of the DataNode. If
either does not match, the DataNode
automatically shuts down. After the handshake
the DataNode registers with the NameNode.
During normal operation DataNodes
send heartbeats to the NameNode; the default
heartbeat interval is three seconds. If the
NameNode does not receive a heartbeat from a
DataNode within ten minutes, it considers that
DataNode to be out of service and schedules the
creation of new replicas of its blocks on other
DataNodes. [3]
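The heartbeat rule just described can be sketched in a few lines. This is a minimal simulation of the timeout bookkeeping (times are plain seconds, names are illustrative), not the NameNode's actual monitoring code.

```python
HEARTBEAT_INTERVAL = 3      # seconds between heartbeats (default)
STALE_TIMEOUT = 10 * 60     # 10 minutes of silence marks a node dead

class NameNodeMonitor:
    def __init__(self):
        self.last_heartbeat = {}      # datanode id -> last time seen

    def heartbeat(self, node, now):
        # Record a heartbeat arrival from one DataNode.
        self.last_heartbeat[node] = now

    def dead_nodes(self, now):
        # Nodes silent longer than the timeout; their blocks would be
        # re-replicated onto other DataNodes.
        return [n for n, t in self.last_heartbeat.items()
                if now - t > STALE_TIMEOUT]

mon = NameNodeMonitor()
mon.heartbeat("dn1", now=0)
mon.heartbeat("dn2", now=0)
mon.heartbeat("dn1", now=300)     # dn1 keeps reporting
dead = mon.dead_nodes(now=700)    # dn2 silent for 700 s > 600 s
```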
Fig 2 : Map Reduce Framework
Hadoop MapReduce is a framework for
executing applications that process vast amounts
of data (terabytes) in parallel on large clusters
of nodes in a reliable and fault-tolerant
manner. Though it can be executed on a single
machine, its true power lies in its ability to scale
to several thousand systems, each with
several processor cores. Hadoop is designed to
distribute data efficiently across the various
nodes in the cluster: it includes a distributed file
system that takes care of spreading the huge
data sets efficiently across the cluster's nodes.
The MapReduce framework (refer Figure 2) splits
the job into a number of chunks which
the Map tasks process in parallel. The outputs
of the map tasks are sorted by the framework
and given to the Reduce tasks as input. Both the
input and the output of the tasks are stored in a
file system. The framework takes care of
scheduling the tasks, monitoring them and
re-executing the failed ones.
Each cluster has only one JobTracker, a
daemon service for submitting and tracking
MapReduce jobs in Hadoop. It is therefore a
single point of failure for the MapReduce
service: if it goes down, all running jobs are
halted. The slaves are configured with the node
location of the JobTracker and perform tasks as
directed by it. Each slave node has only one
TaskTracker (refer Figure 3), which keeps track
of task instances and notifies the JobTracker
about their status.
Applications specify the input and output
functions and supply the Map and Reduce
functions by implementing appropriate interfaces
and abstract classes; these and other parameters
comprise the job configuration. The Hadoop job
client submits the job and its configuration to the
JobTracker, which distributes the configuration
to the slaves, schedules the tasks and monitors
them. It then submits a job report, consisting of
status and diagnostic information about the
tasks, to the job client.
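The control flow just described can be sketched as a toy model: one JobTracker hands tasks to per-node TaskTrackers and collects their status into a job report. This is an illustrative simulation, not the real Hadoop daemons; class and node names are assumptions.

```python
class TaskTracker:
    def __init__(self, node):
        self.node = node

    def run(self, task):
        # Pretend every task instance succeeds and report its status.
        return {"task": task, "node": self.node, "status": "SUCCEEDED"}

class JobTracker:
    def __init__(self, trackers):
        self.trackers = trackers        # one TaskTracker per slave node

    def submit(self, tasks):
        # Distribute the tasks round-robin over the slave nodes and
        # gather the status reports for the job client.
        reports = []
        for i, task in enumerate(tasks):
            tracker = self.trackers[i % len(self.trackers)]
            reports.append(tracker.run(task))
        return reports

jt = JobTracker([TaskTracker("slave1"), TaskTracker("slave2")])
report = jt.submit(["map_0", "map_1", "reduce_0"])
```

Because there is exactly one JobTracker object here, losing it loses all scheduling state, which mirrors the single-point-of-failure argument above.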
Fig 3 : Role of Job Tracker and Task Tracker
Related Work
The paper [4] proposes a metadata
replication based solution to enable Hadoop high
availability by removing the single point of
failure in Hadoop. A key component of Hadoop
is the Hadoop Distributed File System (HDFS),
which is used to store all input and output data
for applications [4]. In the initialization phase,
each standby/slave node is registered to the
active/primary node and its initial metadata
(such as the version file and file system image)
are caught up with those of the active/primary
node; in the replication phase, the runtime
metadata (such as outstanding operations and
lease states) needed for a future failover are
replicated; in the failover phase, the standby/newly
elected primary node takes over all
communications. [3,4] Hadoop uses some
methods to enhance the availability of
applications running on it, e.g. maintaining
multiple replicas of application data and
redeploying application tasks on failure, but it
does not provide high availability for itself. In
the architecture of Hadoop there exists a SPOF
(Single Point of Failure): the whole system stops
working when a critical node of which only a
single copy is kept fails. The SPOF of Hadoop is
thus a huge threat to its availability.
To provide high availability for Hadoop, there
are several challenges:
(1) SPOF identification: the NameNode and
JobTracker are SPOFs in Hadoop, and
identifying exactly which critical components
and state information must be handled to remove
these SPOFs is not an easy job.
(2) Low overhead: achieving high availability
requires additional time for runtime
synchronization among different nodes, so a
performance-optimized solution for
implementing high availability is necessary.
(3) Flexible configuration: to implement high
availability for Hadoop, many configurable
options should be considered in order to meet
the performance requirements of different
workloads in different execution environments
(e.g. network bandwidth and latency). [4]
The execution environment for high availability
consists of the critical node and one or more
nodes used for its backup. The paper proposes
two types of node topology for the execution
environment: one is the active-standby topology,
which consists of one active critical node and
one standby node; the other is the primary-slaves
topology, which consists of one primary critical
node and several slave nodes.
The paper analyzes the SPOF existing in the
critical nodes of Hadoop and proposes a
metadata replication based solution to enable
Hadoop high availability. The solution involves
three major phases: in the initialization phase,
each standby/slave node is registered to the
active/primary node and its initial metadata
(such as the version file and file system image)
are caught up with those of the active/primary
node.
1. Replication: this is the core phase of the
suggested solution, in which the runtime
metadata (such as outstanding operations and
lease states) needed for failover are replicated;
in the failover phase, the standby/newly elected
primary node takes over all communications.
To reduce the performance penalty of
replication, the MapR white paper suggests
replicating only the metadata that constitute the
most valuable management information for
failover, instead of a complete copy of the data
stored on the active/primary critical node. Note
that all management information contained in
the JobTracker is stored persistently in HDFS
and can be recovered for JobTracker failover, so
it is unnecessary to design a specific metadata
replication mechanism for the JobTracker.
2. Metadata: metadata are the most important
management information replicated for
NameNode failover. The initial metadata include
two types of files: the version file, which
contains the version information of the running
HDFS, and the file system image (fsimage) file,
which is a persistent checkpoint of the file
system.
3. Initialization: the main tasks of the
initialization phase are node registration, to
register the slave nodes, and initial metadata
synchronization, to make the initial metadata
consistent between the primary node and the
slave nodes. [4]
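The three phases of the metadata replication scheme summarized above can be sketched as a small state model: initialization (slaves catch up with the initial metadata), replication (runtime metadata mirrored to every slave), failover (a slave takes over with the replicated state). All names and metadata keys here are illustrative, not those of the actual system.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.metadata = {}

class ReplicatedCluster:
    def __init__(self, primary, slaves):
        self.primary, self.slaves = primary, slaves

    def initialize(self):
        # Initialization phase: each slave catches up with the initial
        # metadata (e.g. version file, fsimage) of the primary.
        for s in self.slaves:
            s.metadata = dict(self.primary.metadata)

    def replicate(self, key, value):
        # Replication phase: runtime metadata (e.g. lease states,
        # outstanding operations) are mirrored to every slave.
        self.primary.metadata[key] = value
        for s in self.slaves:
            s.metadata[key] = value

    def failover(self):
        # Failover phase: a slave becomes the new primary and takes
        # over with the replicated metadata.
        self.primary = self.slaves.pop(0)
        return self.primary

cluster = ReplicatedCluster(Node("primary"),
                            [Node("slave1"), Node("slave2")])
cluster.primary.metadata = {"version": "fsimage-17"}
cluster.initialize()
cluster.replicate("lease:/a", "client-1")
new_primary = cluster.failover()
```

Only the small metadata dictionary is copied, never the bulk application data, which reflects the low-overhead argument made above.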
AvatarNode, developed by Facebook, makes
it possible for an administrator to switch a live
Hadoop cluster's NameNode from one node to
another so that maintenance can be performed on
the node. The failover must be initiated
manually by an administrator, so it does not
provide protection from software or hardware
failures. In MapR, by contrast, if a node fails the
metadata that was on that node is quickly
re-replicated to other nodes in the cluster so that
the replication factor quickly reaches the
configured level again; this is what makes
MapR's HA self-healing. [5]
HDFS clients are configured to access the
AvatarNode via a virtual IP (VIP). When the
primary node goes down, the Standby
AvatarNode takes over: it ingests all committed
transactions by reopening the edits log and
consuming all transactions until the end of the
file, finishes ingesting all transactions from the
shared NFS, and then leaves safemode. The VIP
then switches from the AvatarNode to the
Standby AvatarNode. [6]
In particular, Hadoop has a single
NameNode, where the metadata about the
Hadoop cluster is stored. Unfortunately, there is
only one of them, which means that the
NameNode is a single point of failure for the
entire environment. One may go with a different
distribution of Hadoop, such as MapR, which
fixes the NameNode problem. There are also
companies such as ZettaSet that have built
additional tooling around Hadoop, including
NameNode high availability, without forking the
Apache distribution. Or, since the NameNode
issue is specific to HDFS (the Hadoop
distributed file system), one could replace HDFS
with IBM's GPFS-SNC, which similarly averts
the problem. [7]
Some findings are observed here [8].
Hadoop is willing to wait for non-responsive
nodes for a long time (on the order of 10
minutes). This conservative design allows
Hadoop to tolerate non-responsiveness caused
by network congestion or compute-node
overload. A completed map task whose output
data is inaccessible is re-executed very
conservatively; this makes sense if the
inaccessibility of the data is rooted in congestion
or overload, but it stands in stark contrast to the
much more aggressive speculative re-execution
of straggler tasks that are still running. The
health of a reducer is a function of the progress
of the shuffle phase (i.e. the number of
successfully copied map outputs). In Hadoop,
information about failures is shared neither
among the different tasks of a job nor among the
different code-level objects belonging to the
same task. At the task level, when a failure is
encountered, this information is not shared with
the other tasks; therefore, tasks may be impacted
by a failure even if the same failure has already
been encountered by other tasks. In particular, a
task can encounter the same failure that
previously affected the initial task. The reason
for this lack of task-level information sharing is
that HDFS is designed with scalability in mind:
to avoid placing an excessive burden on the
NameNode, much of the functionality, including
failure detection and recovery, is relegated to the
compute nodes. Inside a task, information about
failures is not shared among the objects
composing the task; rather, failure information is
stored and used on a per-object basis. [8]
Hadoop's fault tolerance focuses on two failure
levels and uses replication to avoid data loss.
The first is the node level: a node failure should
not affect the data integrity of the cluster. The
second is the rack level: the data is safe even if a
whole rack of nodes fails. In traditional Hadoop,
a data node contacts the namenode and reports
its status, including information on the size of
the disk on the remote node and how much of it
is available for Hadoop to store. The namenode
determines which data files should be stored on
the node by the location of the node, using rack
awareness, and by the percentage of the space
already used by Hadoop. Rack awareness
provides both load balancing and improved fault
tolerance for the file system. It is designed to
separate nodes into physical failure domains and
to balance load; it assumes that the bandwidth
inside a rack is much larger than the bandwidth
between racks, so the namenode uses rack
awareness to place data closer to the source. For
fault tolerance, the namenode uses rack
awareness to put data on the source rack and on
one other rack to guard against whole-rack
failure. An entire rack could fail, and it is even
possible that a whole site could fail; here the
author suggests a data placement and replication
policy which takes site failure into account when
placing data blocks. [8]
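The rack-awareness policy just described (one replica on the writer's rack, at least one on a different rack) can be sketched as a simplified placement function. This is an illustrative policy, not Hadoop's actual block placement code; the rack and node names are assumptions.

```python
def place_replicas(block, writer_rack, racks, replication=3):
    """racks: dict mapping rack name -> list of node names."""
    placements = []
    # First replica: a node on the writer's own rack (data locality).
    placements.append((writer_rack, racks[writer_rack][0]))
    # Remaining replicas: spread over the other racks, so a whole-rack
    # failure cannot lose every copy of the block.
    other = [r for r in racks if r != writer_rack]
    for i in range(replication - 1):
        rack = other[i % len(other)]
        placements.append((rack, racks[rack][i // len(other)]))
    return placements

racks = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"], "rack3": ["n5"]}
placed = place_replicas("blk_42", "rack1", racks)
```

The same idea extends to the site level: treating a whole site as one failure domain gives the site-aware policy the author suggests.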
Another issue of Hadoop is the
heterogeneous cluster. MapReduce enjoys wide
adoption and is often used for short jobs where
low response time is critical. Hadoop's
performance is closely tied to its task scheduler,
which implicitly assumes that cluster nodes are
homogeneous and that tasks make progress
linearly, and uses these assumptions to decide
when to speculatively re-execute tasks that
appear to be stragglers. [2]
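The straggler heuristic just described can be sketched as follows: assuming linear progress, the scheduler compares per-task progress rates and speculatively re-executes tasks that lag far behind the average. The threshold and field names are illustrative assumptions, not Hadoop's actual scheduler code.

```python
def pick_speculative(tasks, now, slow_factor=0.5):
    """tasks: list of dicts with 'name', 'start', 'progress' (0..1)."""
    # Linear-progress assumption: rate = progress / elapsed time.
    rates = {t["name"]: t["progress"] / max(now - t["start"], 1e-9)
             for t in tasks}
    avg = sum(rates.values()) / len(rates)
    # Speculatively re-execute tasks progressing well below average.
    return [name for name, rate in rates.items()
            if rate < slow_factor * avg]

tasks = [
    {"name": "map_0", "start": 0, "progress": 0.9},
    {"name": "map_1", "start": 0, "progress": 0.8},
    {"name": "map_2", "start": 0, "progress": 0.1},   # straggler
]
laggards = pick_speculative(tasks, now=100)
```

On a heterogeneous cluster this heuristic misfires: a merely slow (but healthy) node looks like a straggler, which is exactly the problem the cited work addresses.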
A key benefit of MapReduce is that it
automatically handles failures, hiding the
complexity of fault tolerance from the
programmer. If a node crashes, MapReduce
reruns its tasks on a different machine. Equally
importantly, if a node is available but
performing poorly, a condition called a
straggler, MapReduce runs a speculative copy of
its task (also called a "backup task") on another
machine to finish the computation faster.
Without this mechanism of speculative
execution, a job would be as slow as its
misbehaving task. Stragglers can arise for many
reasons, including faulty hardware and
misconfiguration. HDFS enables Hadoop
MapReduce applications to move processing
operations toward the nodes storing the
application data those operations will process. In
a heterogeneous cluster, the computing
capacities of the nodes may vary significantly: a
high-speed node can finish processing the data
stored on its local disk faster than its low-speed
counterparts. After a fast node completes the
processing of its local input data, it must support
load sharing by handling unprocessed data
located on one or more remote slow nodes.
When the amount of data transferred due to load
sharing is very large, the overhead of moving
unprocessed data from slow nodes to fast nodes
becomes a critical issue affecting Hadoop's
performance. To boost the performance of
Hadoop in heterogeneous clusters, this paper
aims at minimizing data movement between
slow and fast nodes. This goal can be achieved
by a data placement scheme that distributes and
stores data across multiple heterogeneous nodes
based on their computing capacities. Data
movement can be reduced if the number of file
fragments placed on the disk of each node is
proportional to the node's data processing speed.
To achieve the best I/O performance, one may
make replicas of an input data file of a Hadoop
application in such a way that each node in the
Hadoop cluster has a local copy of the input
data. Such a data replication scheme can, of
course, minimize data transfer among slow and
fast nodes during the execution of the Hadoop
application, but it has several limitations. First, it
is very expensive to create the replicas in a
large-scale cluster. Second, distributing a large
number of replicas can wastefully consume
scarce network bandwidth in Hadoop clusters.
Third, storing the replicas requires an
unreasonably large amount of disk capacity,
which in turn increases the cost of Hadoop
clusters. [2,3] There is a single master managing
a number of slaves. The input file, which resides
on a distributed filesystem throughout the
cluster, is split into even-sized chunks replicated
for fault tolerance. Hadoop divides each
MapReduce job into a set of tasks. Each chunk
of input is first processed by a map task, which
outputs a list of key-value pairs generated by a
user-defined map function. Map outputs are split
into buckets based on key. When all maps have
finished, reduce tasks apply a reduce function to
the list of map outputs with each key.
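The capacity-proportional placement idea above can be sketched directly: the number of file fragments stored on each node is proportional to that node's relative processing speed, so fast nodes rarely need to pull unprocessed data from slow ones. The capacities and node names are illustrative assumptions.

```python
def place_fragments(n_fragments, capacities):
    """capacities: dict mapping node name -> relative processing speed."""
    total = sum(capacities.values())
    placement, assigned = {}, 0
    for node in sorted(capacities):            # deterministic order
        # Each node's share is proportional to its capacity.
        share = round(n_fragments * capacities[node] / total)
        placement[node] = share
        assigned += share
    # Hand any rounding remainder to the fastest node.
    fastest = max(capacities, key=capacities.get)
    placement[fastest] += n_fragments - assigned
    return placement

# A node three times faster than the slowest gets three times the data.
placement = place_fragments(100, {"fast": 3.0, "mid": 2.0, "slow": 1.0})
```

Unlike full replication on every node, this keeps exactly one copy of each fragment, so it avoids the bandwidth and disk-capacity costs listed above.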
Although several approaches have been
proposed to solve the resource allocation
problem in a heterogeneous cloud, most of them
focus on allocating resources to a single job or
overlook resource constraints. In practice,
however, the problem is more complex, since
multiple jobs are requested by users
simultaneously. This paper first formulates the
optimization problem of allocating the limited
resources to multiple jobs according to job
features and node capability, with the objective
of maximizing the aggregate resulting utility.
Moreover, the node Capability-Aware Resource
Provisioner (CARP) is proposed, based on
Apache Hadoop [8,9,10,11], to show its
feasibility for solving the above optimization
problem.
By default, Hadoop adopts a FIFO
scheduler, which is absolutely unfair in a cloud
with multiple jobs. Thus, the fair scheduler was
proposed to share the resources equally among
all jobs. [11] However, in a heterogeneous
cloud, because each node has distinct capability
and workload, the nodes with high capability or
low workload must wait for the nodes with low
capability or high workload before the
intermediate results output by these nodes can be
integrated. Consequently, the job execution time
is prolonged. Hence, other intelligent schedulers,
which achieve better resource provisioning, are
required to improve the system utility and
minimize the execution time of submitted jobs,
especially in a heterogeneous cloud with
resource constraints. For example, the capacity
scheduler in Hadoop, which supports multiple
queues and job priority, is more flexible and
suitable for heterogeneous clouds with various
job types.
In Hadoop, jobs are treated identically whether
the default FIFO scheduler or the fair scheduler
is adopted: the FIFO scheduler is based on
best-effort resource allocation, while the fair
scheduler employs uniform resource allocation.
Neither takes the distinct capability and
workload of each node into account, so in a
heterogeneous cloud the job execution time is
prolonged as described above. [12] Schedulers
that achieve better resource provisioning, such
as the capacity scheduler with its multiple
queues and job priorities, are therefore better
suited to heterogeneous clouds with various job
types. [10]
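The contrast between the two allocation styles can be sketched with a toy slot allocator: FIFO gives the free slots to the oldest job first (best effort), while fair scheduling splits the slots equally across running jobs. This is purely illustrative; job names and slot counts are assumptions.

```python
def fifo_allocate(jobs, slots):
    # Best-effort: the oldest job takes as many slots as it demands,
    # then the next job gets what is left, and so on.
    alloc = {name: 0 for name, _ in jobs}
    for name, demand in jobs:            # jobs in submission order
        take = min(demand, slots)
        alloc[name] = take
        slots -= take
    return alloc

def fair_allocate(jobs, slots):
    # Uniform: an equal share per job, capped at each job's demand.
    share = slots // len(jobs)
    return {name: min(demand, share) for name, demand in jobs}

jobs = [("job1", 8), ("job2", 4), ("job3", 4)]   # (name, slot demand)
fifo = fifo_allocate(jobs, slots=8)
fair = fair_allocate(jobs, slots=8)
```

Under FIFO the first job starves the others; under fair sharing everyone runs, but neither policy weights the allocation by node capability, which is the gap capability-aware provisioning targets.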
Security is the next issue of Hadoop,
handled here with support from the research
papers noted. Open source Hadoop technology
allows companies to collect, aggregate, share
and analyze huge volumes of structured and
unstructured data from enterprise data stores as
well as from weblogs, online transactions and
social media interactions.
A growing number of firms are using
Hadoop and related technologies such as Hive,
Pig and HBase to analyze data in ways that
cannot easily or affordably be done using
traditional relational database technologies.
JPMorgan Chase, [11] for instance, is using
Hadoop to improve fraud detection, while eBay
is using Hadoop and the HBase open source
database to build a new search engine for its
auction site. Analysts said that IT operations
using Hadoop for such applications must be
aware of potential security problems: using the
technology to aggregate and store data from
multiple sources can create a whole slew of
problems related to access control and
management as well as data entitlement and
ownership, and applications analyzing merged
data in a Hadoop environment can result in the
creation of new datasets that may also need to be
protected. Several agencies will not put sensitive
data into Hadoop databases because of data
access concerns, and some are simply building
firewalls to protect their Hadoop environments.
For many Hadoop users, the most effective
security approach is to encrypt data at the
individual record level, both while it is in transit
and while it is stored in the Hadoop
environment. [13]
Recently Hadoop has been used in the cloud,
and there are numerous security issues for cloud
computing since it encompasses many
technologies, including networks, databases,
operating systems, resource scheduling, load
balancing, concurrency control and memory
management. For example, the network that
interconnects the systems in a Hadoop cluster
has to be secure. Data security involves
encrypting the data as well as ensuring that
appropriate policies are enforced for data
sharing. In addition, the resource allocation and
memory management algorithms have to be
secure. Finally, data mining techniques may be
applicable. Hadoop is increasingly useful, and
these are its security issues: Hadoop holds data
in HDFS, but this file system has no read and
write control, so any user can access the input
files and results. All jobs run as the Hadoop
user, which can execute applications without
any permission; for example, a user with limited
access to the jobs they can run can potentially
run a job on any data set on the cluster, since
any job running on a Hadoop cluster can access
any data on that cluster. Hadoop can set an
access control, but it is held at the client level.
Access control list checks are performed at the
start of a read or write, when they should be
performed at the file system level. User
authentication should use a more secure method,
such as a password or RSA key
authentication. [14]
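The file-system-level check the text argues for can be sketched as a toy model: every read and write is validated against a per-path access control list before any data is served, rather than trusting the client. The model, user names and permission letters are hypothetical.

```python
class SecureFS:
    def __init__(self):
        self.acl = {}      # path -> {user: set of permission chars}
        self.data = {}     # path -> stored value

    def set_acl(self, path, user, perms):
        # Grant a user permissions ("r", "w") on one path.
        self.acl.setdefault(path, {})[user] = set(perms)

    def _check(self, user, path, perm):
        # Enforced at the file system level on EVERY operation,
        # not only once at the client.
        if perm not in self.acl.get(path, {}).get(user, set()):
            raise PermissionError(f"{user} lacks {perm} on {path}")

    def write(self, user, path, value):
        self._check(user, path, "w")
        self.data[path] = value

    def read(self, user, path):
        self._check(user, path, "r")
        return self.data[path]

fs = SecureFS()
fs.set_acl("/jobs/out", "alice", "rw")
fs.set_acl("/jobs/out", "bob", "r")
fs.write("alice", "/jobs/out", "results")
ok = fs.read("bob", "/jobs/out")      # read-only access succeeds
try:
    fs.write("bob", "/jobs/out", "tampered")
    denied = False
except PermissionError:
    denied = True                     # write is rejected server-side
```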
Distributed systems are becoming more
prominent nowadays. For a large cluster it is
very important to detect system anomalies,
including erroneous behavior and unexpectedly
long response times, which often result in
system crashes. These anomalies may be caused
by hardware problems, network communication
congestion or software bugs in distributed
system components. Owing to the large scale
and complexity of distributed systems, it is
impossible to detect anomalies by manually
checking the printed system logs; tools for
automatically monitoring and detecting system
anomalies are eagerly required by many
distributed systems. Although many log analysis
tools exist, most of them were developed for a
single node, and it is very time consuming to
diagnose the great quantity of log messages
produced by a large-scale distributed system on
just one node. Therefore, there is a great demand
for a distributed approach to anomaly detection
techniques based on log analysis. [15] With the
rapid development of the Internet, e-commerce
websites have brought unprecedented huge
records from users. The behavior of web users is
concealed in web logs; web log mining can
discover the characteristics and rules of users'
visiting behavior in order to improve the quality
of service offered to them. Clustering is one of
the data mining technologies applied in web log
mining: applying clustering to the analysis of
users' visiting behavior groups users according
to their interests, which helps improve a web
site's structure, and the information can also be
applied in recommender systems. However, this
information is buried in log files that are up to a
few terabytes in size, and processing such huge
datasets consumes a large amount of
computation. In general, distributed computing
is a good solution: computing tasks are assigned
in parallel to multiple machines to improve
processing speed. [16]
Conclusion
The author has tried to identify the performance
issues of HDFS on heterogeneous clusters.
Motivated by the performance degradation
caused by heterogeneity, a data placement
mechanism in HDFS is suggested; the new
mechanism distributes the fragments of an input
file to heterogeneous nodes based on their
computing capacities. Regarding the problems
of access control and ownership in terms of
security, applications analyzing merged data in a
Hadoop environment can result in the creation
of new datasets that may also need to be
protected; in this manner security can be
maintained on the data needed for processing.
To boost the performance of Hadoop in
heterogeneous clusters, the solution is to
minimize data movement between slow and fast
nodes, a goal that can be achieved by a data
placement scheme that distributes and stores
data across multiple heterogeneous nodes based
on their computing capacities.
References
[1] Feng Wang, Bo Dong, Jie Qiu, Xinhui Li, Jie Yang, Ying Li, "Hadoop High Availability through Metadata Replication", ACM 978-1-60558-802-5/09, 2009, pp. 37-44.
[2] Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy Katz, Ion Stoica, "Improving MapReduce Performance in Heterogeneous Environments".
[3] Harcharan Jit Singh, V. P. Singh, "High Scalability of HDFS using Distributed Namespace", International Journal of Computer Applications (0975-8887), Volume 52, No. 17, August 2012.
[4] Jeffrey Shafer, Scott Rixner, Alan L. Cox, "The Hadoop Distributed Filesystem: Balancing Portability and Performance".
[5] MapR White Paper, "High Availability: No Single Points of Failure", 2011.
[6] http://www.slideshare.net/PhilippeJulio/hadooparchitecture
[7] http://www.itdirector.com/technology/data_mgmt/content.php?cid=13041
[8] hadoop.apache.org/
[9] Florin Dinu, T. S. Eugene Ng, "Analysis of Hadoop's Performance under Failures", Rice University.
[10] B. Thirumala Rao, N. V. Sridevi, V. Krishna Reddy, L. S. S. Reddy, "Performance Issues of Heterogeneous Hadoop Clusters in Cloud Computing", Global Journal of Computer Science and Technology, Volume 11, Issue 8, Version 1.0, May 2011.
[11] Wei-Tsung Su, Sun-Ming Wu, "Node Capability Aware Resource Provisioning in a Heterogeneous Cloud", 978-1-4673-2815-9, 2012 IEEE International Conference on Communications in China: Advanced Internet and Cloud (AIC).
[12] http://www.computerworld.com/s/article/9221652/IT_must_prepare_for_Hadoop_security_issues
[13] Jiong Xie, "Improving Performance of Hadoop Clusters", dissertation, Graduate Faculty of Auburn University.
[14] Mahout, http://mahout.apache.org/.
[15] Mingyue Luo, Gang Liu, "Distributed Log Information Processing with Map-Reduce: A Case Study from Raw Data to Final Models", IEEE 978-1-4244-6943-7, 2010.
[16] Yan Liu, "System Anomaly Detection in Distributed Systems through MapReduce-Based Log Analysis", Advanced International Conference on Advanced Computer Theory and Engineering (ICACTE).