Sudhakar Singh et al, / (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 6 (5) , 2015, 4633-4639
Big Data: Technologies, Trends and Applications
Sudhakar Singh a,*, Pankaj Singh b, Rakhi Garg c, P K Mishra a
a
Department of Computer Science, Faculty of Science,
Banaras Hindu University, Varanasi 221005, India
b
Faculty of Education,
Banaras Hindu University, Varanasi 221005, India
c
Mahila Maha Vidyalaya,
Banaras Hindu University, Varanasi 221005, India
Abstract-Big Data is an excessive amount of imprecise data in
variety of formats generated from variety of sources with
rapid speed. It is most buzzed terms among researcher,
industry and academia. Big Data is not only limited to data
perspective but it has been emerged as a stream that includes
associated technologies, tools and real word applications. The
objective of this paper is to provide a simple, comprehensive
and brief introduction of Big Data to the beginners in subject.
In this paper, we provide an overview of Hadoop and its subprojects and a brief review of various developed technologies
for Big Data. We also discuss some recent trends and eminent
applications in Big Data. Although this paper does not touch
each and every dimension of Big Data as it is not possible to
make it in a single paper but essential aspects are covered,
which may benefit to the people new in Big Data world.
Keywords: Big Data;
Technology; Eco-System
Hadoop;
MapReduce;
Yarn;
1.
INTRODUCTION
Big Data is one of the most buzzed and hyped phrase
nowadays. Before come to define the Big Data, we would
like to first explore the sources generating excessive data.
Data may be generated either by human or by machine.
Human generates data as documents, emails, images,
videos, posts on facebook or tweeter etc. Data comes into
machine generated category are sensor data and logs data
i.e. web logs, click logs, email logs. Machine generated
data are of larger size than human generated data. After the
invention of big data technologies, machine generated data
came into play in order to process them. Major sources of
Big Data are purchase transaction records, web data, social
media data, click stream data, cell phone GPS signals, and
sensor data [1-2]. Social networking sites like facebook,
Twitter, LinkedIn generates a large volume of social media
data. Online advertising and E-Commerce companies
always looking for user navigation data i.e. users click
stream on a website. Sensors embedded in machines
generate large amount of data. As the real word examples
of Big Data, facebook has 40 PB data captures 100 TB of
data per day, Yahoo has 60 PB data and Twitter captures 8
TB data per day [1].
Large scale data processing or analysis and mining
intelligence from it is always being a centre of attraction.
Typical data analytical tools cannot support large scale
www.ijcsit.com
data. We have to use some different distributed tools and
techniques to analyze such large scale data since traditional
storage systems do not have analytical power and
traditional data analysis tools are unable to handle Big
Data. There may be a reasonable doubt that in spite of well
known distributed system like MPI (Message Passing
Interface), do we need another distributed system. We need
different distributed system since typical distributed system
has some problems as follows. First, it is highly dependent
on network and requires huge bandwidth. Second, partial
hardware or job failures are difficult to handle. Third, it
wastes a lot of processing power in movement and
distribution of data.
In case of analysis of Big Data, the complex characteristics
of Big Data are the major challenges in the way of
processing and managing it. A new distributed system
Hadoop has been developed for processing large and
excessive data in distributed and parallel fashion. We
define Big Data in section 2. Section 3 discusses the
evolution of Hadoop and describes various components and
daemons of Hadoop. Other associated Big Data
technologies are described in section 4. Section 5 discusses
the recent trends in Big Data. Section 6 enumerates a
number of applications of Big Data and technologies.
Finally we conclude paper in section 7.
2.
BIG DATA
Big Data is that extent of data, which cannot be stored and
processed by a single machine. Big Data do not refers to
the data only big in size. Most well known definition of Big
Data jointly given by Gartner and IBM [2-4] is a four Vs
concept: Volume, Velocity, Variety and Veracity. So data
possesses large volume, comes with high velocity, from
variety of sources and formats and having great uncertainty
is referred as Big Data. Volume- represents scale of data
i.e. Big Data has massive volume. Velocity- refers speed of
generation and processing of data i.e. rate of entering
streaming data in the system is really fast. Variety- refers
different form of data i.e. unstructured or semi-structured
data (text, sensor data, audio, video, click stream, log file,
XML) originated from different sources. Veracity- refers
uncertainty of data i.e. quality of data being captured. Data
like posts on social networking sites are imprecise [5-6].
4633
Sudhakar Singh et al, / (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 6 (5) , 2015, 4633-4639
3.
APACHE HADOOP
In this section, we focus on the evolution of Hadoop and
architecture of Hadoop components. Daemons of Hadoop
as well as its versions are also discussed.
3.1.
Evolution of Hadoop
Hadoop was created by Doug Cutting in 2005 [7]. It is
consequent result of Nutch search engine project of Dough
Cutting. Google published two papers on GFS (Google File
System) [8] and MapReduce [9] in 2003 and 2004
respectively. Nutch project was then rewritten to use
MapReduce. Cutting jointly with a team at Yahoo! Started
a new project and named it after his son's toy elephant. In
2006, Apache Hadoop project was started for the
development of HDFS (Hadoop Distributed File System)
and Hadoop MapReduce, Now Hadoop is top level project
of the Apache software foundation [10]. In 2008, a Hadoop
Cluster at Yahoo! has won Terabyte Sort Benchmark [1011].
3.2.
Core Components of Hadoop
Hadoop is a large-scale distributed batch processing
infrastructure for parallel processing of big data on large
cluster of commodity computers [12]. Hadoop consists of
three core components: HDFS, MapReduce and YARN.
HDFS and MapReduce design are based on Google’s File
System and MapReduce. YARN framework is a NextGen
MapReduce also called MapReduce 2.0, was added in
Hadoop-2.x version for job scheduling and resource
management of Hadoop cluster. Hadoop is extremely
scalable distributed system and requires minimum networks
bandwidth. Hadoop infrastructure automatically handles
fault tolerance, data distribution, parallelization and load
balancing tasks. In traditional parallel and distributed
system, data are moved to the node for computation which
can never be feasible in case of Big Data. Hadoop is a joint
system providing computational power i.e. MapReduce and
distributed storage i.e. HDFS at one place. Its design is
based on distributing computational power to where the
data is; instead of moving data [13].
3.2.1. HDFS Architecture
HDFS is a distributed file system, which provides
unlimited storage, scalable and fast access to stored data. It
supports horizontal scalability. Thousands of nodes in a
cluster hold petabyte scale of data and if there is a
requirement of more storage, one needs to just add more
nodes only [1]. It uses block-structured file system and
stores the files in a replicated manner after breaking the file
into fixed size blocks. Default block size is 64 MB and
each block is replicated at three nodes by default. Storing
data in this way provides high fault tolerance and
availability during execution of Big Data applications on
Hadoop cluster [12-13].
Hadoop is designed on Master-Slave architecture. There is
single master node known as NameNode and multiple slave
nodes known as DataNodes. Master node coordinates all
slave nodes. DataNodes are the workhorses and stores all
data. NameNode is the administrator of file system
operations i.e. file creation, permissions etc. Without
NameNode no one can operate cluster and write/read data.
NameNode is called a single point failure [1]. Fig. 1 shows
the functionality of NameNode and DataNode in HDFS.
NameNode assigns a block id to each block of a file and
stores all the metadata of the files in its memory in order to
be fast accessed. Metadata are the file name, permission,
replication and location of each block of the file.
DataNodes store all the files as replicated blocks and
retrieve them whenever required.
3.2.2. MapReduce Framework
MapReduce is an efficient, scalable and simplified
programming model for large scale distributed data
processing on a large cluster of commodity computers [12]
[14-15]. It works on the data residing in HDFS.
MapReduce is a programming framework, which provides
generic templates that can be customized by programmer’s
requirements. It process large volumes of data in parallel
by breaking the computation job into independent tasks
across a large number of machines. It distributes the tasks
across machines in Hadoop cluster and put together the
results of computations from each machine. It takes care of
the hardware and network failure. A failed task is assigned
to other node to re-execute itself without re-executing other
tasks. It balances the workload and increase the throughput
by assigning work of slower or busy nodes to idle nodes [1]
[16].
Fig. 1. NameNode and DataNodes in HDFS [14]
www.ijcsit.com
4634
Sudhakar Singh et al, / (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 6 (5) , 2015, 4633-4639
Fig. 2. Work flow in MapReduce framework [14]
Map Reduce programs can run on Hadoop in multiple
languages like Java, Python, Ruby and C++ [17].
MapReduce program consists of two functions Mapper and
Reducer which runs on all machines in parallel fashion.
The input and output of these functions must be in form of
(key, value) pairs. Fig. 2 illustrates the work flow in
MapReduce framework.
Map Function: Mapper is applied in parallel on input data
set. The Mapper takes the input (k1, v1) pairs from HDFS
and produces a list of intermediate (k2, v2) pairs. Mapper
output are partitioned per reducer i.e. the number of reduce
task for that job.
Reduce Function: The Reducer takes (k2, list (v2)) values as
input, make sum of the values in list (v2) and produce new
pairs (k3, v3) as final result.
Combiner Function: It is optional and also known as Mini
Reducer. It is applied to reduce the communication cost of
transferring intermediate outputs of mappers to reducers.
Shuffle and exchange is the single point of communication
in MapReduce. MapReduce framework shuffle the
intermediate output pairs of mappers and exchange them
between reducers to send all pairs with the same key to a
single reducer [12].
3.2.3. Daemon Processes in Hadoop
Hadoop has five daemons that are the processes running in
background. These are NameNode (NN), Secondary
NameNode (SNN), DataNodes (DN), JobTracker and
TaskTrackers and described as follows [12] [19-20].
NameNode: Each Hadoop cluster has exactly one
NameNode which runs on master machine. NameNode
manages metadata and access control of the file system.
Secondary NameNode: There is also a backup NameNode
named as Secondary NameNode which periodically wakes
www.ijcsit.com
up and process check points and downloads updates from
NameNode. It can be used latter to restore failed
NameNode, providing fault tolerance.
DataNodes: DataNode runs on each slave machines in
cluster and holds file system. Each DataNode manages
blocks of the file system assigned to it.
JobTracker: Exactly one JobTracker runs in a cluster. All
running tasks are halted if JobTracker goes down. Initially
jobs are submitted to JobTracker. Then it talks to the
NameNode to determine the location of data and talks to
TaskTrackers to submit the tasks.
TaskTrackers: TaskTracker runs on each slave node and
accepts map & reduce tasks and shuffle operations from
JobTracker.
3.3.
Hadoop-1.x vs. Hadoop-2.x
Apache releases a new version of Hadoop after fixing bugs
of previous releases and incorporating new functionality
and performance improvements. It introduced MapReduce
2.0, an improved and optimized framework in Hadoop-2.
The major difference between Hadoop-1.x and Hadoop-2.x
is the computational framework, NextGen MapReduce
(YARN) or MapReduce 2.0 (MRv2). Hadoop-1.x uses MRv1
which have two daemon process JobTracker on Master and
TaskTracker on Slave. While Hadoop 2.x uses MRv2
(YARN), which has ResourceManager (RM) on master
machine and NodeManager (NM) on slave machines and a
Application Specific ApplicationMaster (AM) [21].
Hadoop YARN (Yet Another Resource Negotiator) is a
framework for job scheduling and cluster resource
management [10]. In Hadoop-2.x, the functionality of
JobTracker of Hadoop-1.x splits into separate daemons,
global
ResourceManager
and
per-application
4635
Sudhakar Singh et al, / (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 6 (5) , 2015, 4633-4639
ApplicationMaster for resource management among all
applications in the system and job scheduling/monitoring.
Data-computation framework is formed by the
ResourceManager and per-node slave, the NodeManager.
The ApplicationMaster is a framework specific library
which negotiates resources from the ResourceManager and
works with NodeManagers to execute and monitor the tasks
[21].
4.
HADOOP ECO-SYSTEM
Apache Software Foundation supports a number of other
Hadoop related projects [10]. Each project deals with a
certain aspect of Big Data and provides complementary
services to Hadoop. The Hadoop related projects come
under umbrella of Hadoop Eco-System [22]. We describe
each one by one as follows.
1) HBase: HBase is the Hadoop database, inspired by
Google's BigTable [23]. It is a scalable, distributed and
non-relational database that supports storage for big
tables of structured data. It uses HDFS as its
underlying storage. HBase is used when there is a need
of random and real time read/write access of Big Data.
It provides BigTable like capabilities on top of Hadoop
[24].
2) Cassandra: Cassandra is a scalable database provides
high availability and supports multi-master to avoid
single points of failure. MapReduce can retrieve data
from Cassandra. It is a BDDB i.e. Big Data Data Base,
which can run without HDFS. Its supporting systems
are derived from Google Big Table [23] and Google
File System [8] [25].
3) Hive: Hive is data warehouse infrastructure that
provides data summarization, ad-hoc querying and
analysis of large datasets residing in HDFS. It provides
a mechanism to project structure on this data and also a
query language HiveQL based on SQL. It also
provides flexibility to plug in custom mappers and
reducers when logic could not be efficiently expressed
in HiveQL [26].
4) Pig: Pig is a high level data-flow language and also an
execution framework for parallel computation. A pig
program is amenable to substantial parallelization,
which enables them to handle big datasets. Pig's
underlying infrastructure consists of a compiler that
generates sequences of MapReduce programs whose
parallel implementations already exist. Pig's language,
Pig Latin express data flow sequences and also
provides ability to the users to develop their own
function for reading, writing and processing data [27].
5) Tez: Tez is a generalized data flow programming
framework, currently built on top of Hadoop YARN. It
provides a powerful and flexible engine for executing a
complex DAG (directed acyclic graph) of tasks to
process data in batch or interactive way. It makes
MapReduce paradigm to more powerful by expressing
computations in data flow graph. Hive, Pig and other
framework of Hadoop eco-system is adopting Tez to
replace MapReduce jobs [28].
6) Chukwa: Chukwa is a data collection system for
monitoring large distributed clusters. It is built on top
www.ijcsit.com
7)
8)
9)
10)
11)
of HDFS & MapReduce framework and provides large
scale log aggregation and analytics. It has a flexible
and powerful toolkit for displaying, monitoring and
analyzing the results to apply on the collected data
[29].
Zookeeper: Zookeeper makes high performance
coordination among distributed applications. Several
Hadoop projects use Zookeeper to coordinate the
cluster and provide highly available distributed
services. It gives a centralized service for maintaining
configuration
information,
naming,
providing
distributed synchronization and providing group
services [30].
Ambari: Ambari is a web-based tool for making
Hadoop management simpler. It provision the Hadoop
cluster by providing a step-by-step wizard for
installing services e.g. Hive, HBase, Pig, Zookeeper
etc. on Hadoop cluster and also handles configuration
of these services. It provides central management to
start, stop and reconfigure the Hadoop services over
cluster. It monitors the health and status of Hadoop
cluster [31].
Avro: Avro is a data serialization system. It provides
rich data structures; a compact and fast binary data
format; a container file to store persistent data; and
remote procedure call (RPC). It does not require code
generation to read or write data nor to use or
implement RPC protocols [32].
Mahout: Mahout is a machine learning, data mining
and math library on top of MapReduce. The goal of
this project is to provide scalable and fast machine
learning and data mining algorithms [33].
Spark: Spark is a fast and general engine for
processing large scale data. Spark provide an easier to
use alternative to MapReduce and run programs up to
100 time faster than Hadoop MapReduce in memory or
10 time faster on disk. It has an advanced directed
acyclic graph (DAG) execution engine that supports
cyclic data flow and fast in-memory computation.
Spark runs on Hadoop and can access HDFS,
Cassandra, and HBase [34].
5.
BIG DATA TRENDS
Big Data opens new opportunities in research and
development and is not only limited to Hadoop and its ecosystem. A number of tools and projects dedicated to
customized requirements are being developed to deploy on
top of Hadoop. Many enterprises are launching their own
Hadoop distributions. Cloud computing is using Hadoop to
provide data processing and storage services. Computation
framework of Hadoop is being efficient and flexible. This
section gives a brief description of some trends of Big
Data.
5.1.
Big Data Eco-System
Big Data Eco-system is even bigger than Hadoop EcoSystem and growing rapidly. We can categorize the
projects and tools of Big Data Eco-System on the basis of
their core functionality for which they are developed. Table
1 summarizes the Big Data related projects.
4636
Sudhakar Singh et al, / (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 6 (5) , 2015, 4633-4639
Table 1. Summary of Big Data related Projects [1]
Sl. No.
Core Functionality
Tools/Projects
1
Getting Data into HDFS
Flume, Chukwa, Scoop, Kafka, Scribe
2
Compute Frameworks
3
Querying Data in HDFS
4
Real Time Data Access
MapReduce, YARN, Cloudera SDK, Weave
Pig, Hive, Cascading Lingual, Stinger, Hadapt, Greenplum HAWQ, Cloudera
Search
HBase, Apache Drill, Citus Data, Impala, Phoenix, Accumulo, Spire
5
Big Data Database
HBase, Cassandra, Amazon SimpleDB, Redis, Voldermort
6
Hadoop in the Cloud
Amazon Elastic MapReduce (EMR), Whirr
7
Work Flow Tools
Oozie, Cascading, Scalding, Lipstick
8
Serialization Framework
Avro, Protobuf, Trevni
9
Monitoring Systems
Hue, Ganglia, Open, Nagios
10
Applications
Mahout, Giraph
11
Stream Processing
Storm, Apache S4, Samza, Malhar
12
Business Intelligence Tools
Datameer, Tableau, Pentaho, SiSense, SumoLogic
5.2.
Hadoop Distributions
A distribution provides easy installation and packages
multiple components to work together. It is tested and
patched with works & improvements. Hadoop is an open
source project of Apache. Like Linux distributions as
RedHat, Ubuntu and Suse some enterprises launched their
own Hadoop distributions with tools to manage and
administer the cluster and also with a free/premium policy.
Cloudera [35] is an oldest distribution of Hadoop.
HortonWorks [36] is a newer distribution very close to
Apache Hadoop. MapR [37] provides its distribution with
their own file system alternative to HDFS. Intel [38]
provides its distribution with encryption support.
5.3.
Hadoop in the Cloud and Virtualized
Environment
Hadoop is originally designed to process on cluster of
physical machines but now it is also used in cloud and
virtual machines [39-40]. Hadoop clusters can be set up in
public and private cloud. Amazon offers on demand
Hadoop cluster. Google provides Hadoop on Google
Compute Engine. Hadoop can be launched as a service in
the public cloud like AWS, Rackspace, MS Azure, IBM
Smart Cloud etc. Amazon's EMR (Elastic MapReduce)
offers a quick and easy way to run MapReduce jobs
without installing Hadoop clusters on its cloud. MapR is
the only commercial distribution available through the
EMR service. Amazon EC2 (Elastic Compute Cloud)
service also provides option to independently deploy
MapR. Hadoop can be run using Amazon's S3 (Simple
Storage Service) instead of HDFS [41-45]. Hadoop clusters
deployed in virtual infrastructures have their own benefits.
A single image can be cloned save operation costs. Cluster
can be set up on demand and physical infrastructure can be
reused. Also cluster size can be enlarged or reduced on
demand [46].
5.4.
Hadoop as a Big Data Operating System
Hadoop is turning into a general purpose data operating
system. Its distributed analytic frameworks MapReduce 2.0
i.e. YARN is a now functioning as distributed resource
manager. YARN provides the daemons and APIs to
www.ijcsit.com
develop generic distributed applications of real world and
also handles and schedule resources. Different data
analytics operations i.e. graph analytics, streaming data
analysis etc. can be plugged in with Hadoop to use storage
and computation framework [47-48].
5.5.
Big Data Security and Privacy Issues
Big Data characteristics volume, velocity, variety have
magnified the security and privacy issues. Security and
privacy issues become more critical due to data hosted in
large scale cloud infrastructures, diversity of data format
and sources, streaming data and high volume inter-cloud
migration. Large scale cloud infrastructures use a diversity
of software platforms and are spread across large networks
of computers, which provide more opportunities to
attackers [49]. A. C. Mora et. al surveyed and drafted a list
of top ten Big Data security and privacy challenges
6.
BIG DATA APPLICATIONS
Big data technologies have wide and long list of their
applications. It is used for Search Engine, Log Processing,
Recommender System, Data Warehousing, Video and
Image Analysis, Banking & Financial, Telecom, Retail,
Manufacturing, Web & Social Media, Medicine,
Healthcare, Science & Research and Social Life. We are
discussing some of the eminent applications here.
6.1.
Politics
Big Data analytics help Mr. Barack Obama to win the US
presidential election in 2012 [51]. His campaign was built
of 100-strong analytics staff to shake dozens of terabyte
scale data. They used a combination of the HP Vertica
massively parallel processing analytical database [52] and
predictive models with R [53] and Stata [54] tools.
6.2.
National Security
Babak Akhgar et. al [55] authored a book on Application
of Big Data for National Security. Authors relate the Big
Data technologies to national security and crime detection
and prevention. They present strategic approaches to
deploy Big Data technologies for preventing terrorism and
reducing crime.
4637
Sudhakar Singh et al, / (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 6 (5) , 2015, 4633-4639
6.3.
Health Care and Medicine
Big Data technologies can be used for storing and
processing medical records. Streaming data can be captured
from sensors or machines attached to patients, stored in
HDFS and analyzed quickly [1]. With Big Data tools and
human genome mapping, there may be a commonplace for
people to have their genes mapped as the part of their
medical record. Genetic determinants that cause a disease
will be easy to find, which help in the development of
personalized medicine [56].
6.4.
Science and Research
Science and research are now driven by technologies. Big
Data adds new possibilities to them. CERN, the European
Organization for Nuclear Research have started the world’s
largest and most powerful particle accelerator, Large
Hadron Collider (LHC). The experiment generated
excessive amount of data. Data center at CERN has 65,000
processors, which analyzed 30 petabytes of data. Its
computing powers of thousands of computers are
distributed across 150 data centers worldwide [57-58].
6.5.
Social Media Analysis
IBM provides a social media analytics, a powerful SaaS
solution to discover hidden insights from millions of web
sources. It is used by businesses to gain a better
understanding of their customers, market and competition.
It captures consumer data from social media, predicts
customer behavior and creates customized campaign [59].
7.
CONCLUSION
Big Data is not only concerned to data big in volume but
also data with big velocity, big variety and big veracity. Big
Data has introduced a new attitude in data processing and
analysis and new opportunities to provide solutions of real
world problems, which are considered infeasible as before.
Apache Hadoop is the most revolutionary technology
which opened the door of infinite possibilities in Big Data.
Initially Hadoop developed with two core components
HDFS and MapReduce. YARN, the NextGen MapReduce
framework turns Hadoop as a general purpose data
operating system. Apache supports a number of subprojects providing specific services and works on top of
Hadoop. Apache is not the only organization that develops
tools and projects for Big Data, many other organizations
are also contributing and some provides their own Hadoop
distributions. Hadoop can also be set up and configured in
cloud and virtualization infrastructures. Cloud provides
Hadoop services without having our own cluster while
virtualization enables us to set up on demand Hadoop
clusters. Hadoop is adopted in wide areas from science &
engineering to social life and has changed the way of
thinking and solving problems.
[1]
[2]
[3]
REFERENCES
Mark Kerzner and Sujee Maniyam, "Hadoop Illuminated,"
https://github.com/hadoop-illuminated/hadoop-book
,
2013,
Accessed on Sept. 20, 2015.
L. Douglas, "3d data management: Controlling data volume, velocity
and variety," Gartner, Retrieved 6 (2001).
IBM What is big data? - Bringing big data to the enterprise.
http://www-01.ibm.com/software/in/data/bigdata/, Accessed on
Sept. 20, 2015.
www.ijcsit.com
[4]
[5]
[6]
[7]
[8]
[9]
[10]
[11]
[12]
[13]
[14]
[15]
[16]
[17]
[18]
[19]
[20]
[21]
[22]
[23]
[24]
[25]
[26]
[27]
[28]
[29]
[30]
[31]
[32]
[33]
[34]
[35]
[36]
[37]
M. A. Beyer and L. Douglas, “The importance of big data: A
definition,” Stamford, CT: Gartner, 2012.
IBM
Big
Data
&
Analytics
Hub,
http://www.ibmbigdatahub.com/infographic/four-vs-big-data,
Accessed on Sept. 20, 2015.
J. S. Ward and A. Barker, “Undefined By Data: A Survey of Big
Data Definitions,” http://arxiv.org/abs/1309.5821v1.
Tom White, “Hadoop: The definitive guide,” O'Reilly Media, Inc.,
2012.
S. Ghemawat, H. Gobioff and ST Leung, "The Google file system,"
in ACM SIGOPS operating systems review, vol. 37, no. 5, ACM,
2003.
J. Dean and S. Ghemawat, "MapReduce: simplified data processing
on large clusters," in Proc. 6th Symposium on Opearting Systems
Design & Implementation, 2004.
Apache Hadoop, http://hadoop.apache.org
Sort Benchmark, http://sortbenchmark.org/
Yahoo!
Hadoop
Tutorial,
http://developer.yahoo.com/hadoop/tutorial/index.html, Accessed on
Sept. 20, 2015.
HDFS
Architecture
Guide,
https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html, Accessed
on Sept. 20, 2015.
Sudhakar Singh, Rakhi Garg and P K Mishra, “Review of Apriori
Based Algorithms on
MapReduce Framework,” in Proc.
International Conference on Communication and Computing (ICC 2014), Elsevier Science and Technology Publications, 2014.
MapReduce Tutorial, http://hadoop.apache.org/docs/current/hadoopmapreduce-client/hadoop-mapreduce-clientcore/MapReduceTutorial.html, Accessed on Sept. 20, 2015.
K-H. Lee, Y-J. Lee, H. Choi, Y. D. Chung and B. Moon, “Parallel
Data Processing with
MapReduce: A Survey,” in ACM
SIGMOD Record, vol. 40, no. 4, pp. 11–20, (2011).
Hadoop
Tutorials,
http://hadooptutorials.co.in/tutorials/hadoop/understanding-hadoopecosystem.html, Accessed on Sept. 20, 2015.
Hadoop
Architecture
Overview,
http://ercoppa.github.io/HadoopInternals/HadoopArchitectureO
verview.html, Accessed
on Sept. 20, 2015.
IBM
developerWorks,
http://www.ibm.com/developerworks/library/l-hadoop-1/,
Accessed on Sept. 20, 2015.
R. P. Padhy, "Big Data Processing with Hadoop-MapReduce in
Cloud Systems," in
International
Journal
of
Cloud
Computing and Services Science (IJ-CLOSER), vol. 2, no. 1, pp.
16-27, 2013.
Apache
Hadoop
NextGen
MapReduce
(YARN),
http://hadoop.apache.org/docs/stable/hadoop-yarn/hadoopyarn-site/YARN.html,
Accessed on Sept. 20, 2015.
The
Hadoop
Ecosystem
Table,
https://hadoopecosystemtable.github.io/, Accessed on Sept. 20,
2015.
F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M.
Burrows, T. Chandra,
A.
Fikes and R. E. Gruber,
"Bigtable: A distributed storage system for structured data," in
ACM Transactions on Computer Systems (TOCS), vol. 26,
no.2, (2008): 4.
Apache HBase, http://hbase.apache.org/
Apache Cassandra, http://cassandra.apache.org/
Apache HIVE, http://hive.apache.org/
Apache Pig, http://pig.apache.org/
Apache TEZ, http://tez.apache.org/
Apache Chukwa, http://chukwa.apache.org/
Apache Zookeeper, http://zookeeper.apache.org/
Apache Ambari, http://ambari.apache.org/
Apache Avro, http://avro.apache.org/docs/current/
Apache Mahout, http://mahout.apache.org/
Apache Spark, http://spark.apache.org/
Cloudera
CDH,
http://www.cloudera.com/content/cloudera/en/products-andservices/cdh.html
HortonWorks, http://hortonworks.com/
MapR, https://www.mapr.com/
4638
Sudhakar Singh et al, / (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 6 (5) , 2015, 4633-4639
[38] The Intel Distribution for Apache Hadoop Software,
http://www.intel.in/content/www/in/en/big-data/big-data-inteldistribution-for-apache-hadoop.html
[39] Wen-Chung Shih, Shian-Shyong Tseng and Chao-Tung Yang,
"Performance study of parallel
programming
on
cloud
computing environments using mapreduce," in Porc. International
conference on Information Science and Applications (ICISA),
IEEE, 2010.
[40] Maryam Kontagora, and Horacio Gonzalez-Velez, "Benchmarking a
MapReduce environment on
a full virtualisation platform,"
in Proc. International Conference on Complex, Intelligent and
Software Intensive Systems (CISIS), IEEE, 2010.
[41] Google Cloud Platform, https://cloud.google.com/solutions/hadoop/
[42] Running
Hadoop
in
Cloud,
http://www.ibmbigdatahub.com/blog/running-hadoop-cloud
[43] MapR in Cloud, https://www.mapr.com/products/hadoop-as-aservice
[44] Amazon EMR, https://aws.amazon.com/elasticmapreduce/
[45] HDInsight, http://azure.microsoft.com/en-in/services/hdinsight/
[46] Virtual Hadoop, https://wiki.apache.org/hadoop/Virtual%20Hadoop
[47] 8
big
trends
in
big
data
analytics,
http://www.computerworld.com/article/2690856/big-data/8-bigtrends-in-big-data-analytics.html, Accessed on Sept. 20, 2015.
[48] YARN to Spin Hadoop into Big Data Operating System,
http://www.datanami.com/2013/05/28/yarn_to_spin_hadoop_into_a_
big_data_operating_system_/, Accessed on Sept. 20, 2015.
[49] Yong Yu, Yi Mu and Giuseppe Ateniese, "Recent advances in
security and privacy in big data,"
(2015): 365.
[50] A. C. Mora et. al, "Top ten big data security and privacy challenges,"
Cloud Security Alliance
(2012).
[51] InfoWorld, http://www.infoworld.com/article/2613587/big-data/thereal-story-of-how-big-data-analytics-helped-obama-win.html,
Accessed on Sept. 20, 2015.
[52] HP Vertica, https://www.vertica.com/
[53] The R Project for Statistical Computing, https://www.r-project.org/
[54] Stata: Data Analysis and Statistical Software, http://www.stata.com/
[55] Babak Akhgar, G. B. Saathoff, H. R. Arabnia, R. Hill, A. Staniforth
and P. S. Bayerl,
"Application of Big Data for National
Security," 1st Edition, Elsevier Store, 2015.
[56] Ten
Practical
Big
Data
Benefits,
http://datascienceseries.com/stories/ten-practical-big-data-benefits,
Accessed on Sept. 20, 2015.
[57] CERN Computing, http://home.web.cern.ch/about/computing
[58] Bernard Marr, The Awesome Ways Big Data Is Used Today To
Change
Our
World,
https://www.linkedin.com/pulse/20131113065157-64875646-theawesome-ways-big-data-is-used-today-to-change-ourworld#notifications, Accessed on Sept. 20, 2015.
[59] IBM
Social
Media
Analytics,
http://www01.ibm.com/software/analytics/solutions/customer-analytics/socialmedia-analytics/, Accessed on Sept. 20, 2015.
www.ijcsit.com
4639