International Journal of Advanced Research in Engineering and Technology (IJARET)
Volume 11, Issue 5, May 2020, pp.624-646, Article ID: IJARET_11_05_066
Available online at https://iaeme.com/Home/issue/IJARET?Volume=11&Issue=5
ISSN Print: 0976-6480 and ISSN Online: 0976-6499
DOI: 10.34218/IJARET.11.5.2020.065
© IAEME Publication
Scopus Indexed
BIG DATA: A COMPREHENSIVE SURVEY
Wasim Haidar SK1, Surendra Pal Singh2, Prashant Johri3
1Department of Computer Science & IT, NIMS University, Rajasthan, India
2Department of Computer Science & IT, NIMS University, Rajasthan, India
3Department of Computer Science and Engineering, Galgotias University, Uttar Pradesh, India
ABSTRACT
In the modern economy, data is everything and everything is data. Yet data arises from a world that is chaotic, unpredictable, and messy. The rise of so-called big data, and the evolution of tools and techniques able to record our every move, desire, and habit, has exposed the tension between the unstable reality we live in and the need to capture it as data. From ancient writing systems to today's data centres, humankind has always collected knowledge and facts. Advances in tools and technology have produced a flood of data that demands more mature data management systems. This excessive growth of data poses great difficulties for human beings, even though enormous and extremely useful value lies hidden within it. Big data can significantly increase productivity and efficiency in business, enable revolutionary discoveries in scientific disciplines, and open up great opportunities in many other fields. At the same time, big data brings many difficulties and challenges, which are the other side of the coin. This paper presents a close view of big data, including its brief history, applications, opportunities and challenges, and the current tools and techniques for dealing with big data problems. We also discuss several underlying technologies related to big data, such as cloud computing, the Internet of Things, and NoSQL databases.
Key words: Big Data, Hadoop, MapReduce, NoSQL
Cite this Article: Wasim Haidar SK, Surendra Pal Singh, Prashant Johri, Big Data: A
Comprehensive Survey, International Journal of Advanced Research in Engineering and
Technology, 11(5), 2020, pp.624-646.
https://iaeme.com/Home/issue/IJARET?Volume=11&Issue=5
1. INTRODUCTION
In 2010, Gartner declared that "information will be the oil of the 21st century", and it is true that today data is everything and everything is data. As we live in a digital world, data has increased rapidly in every field. According to a 2011 report by the International Data Corporation (IDC), the overall volume of data created and copied in the world was 1.8 ZB, having grown by nearly nine times within five years [1].
Amid this explosive growth of data in the global digital world, the term big data is used to describe these massive datasets. Compared with traditional databases, big data also includes unstructured data that requires more real-time analysis. Big data refers to data that previously was not feasible to store, manage, and analyze [2]. Big data offers new insights to businesses and organizations by changing the way new value is derived from data; it helps to uncover in-depth knowledge of the hidden value in large volumes of data, but it also brings new challenges, such as how to store and manage such large amounts of data. With regard to sheer size, commercial relational systems in fact perform quite well: most analytics vendors, such as Greenplum, Netezza, and Teradata, report being able to handle multi-petabyte datasets [3]. It is on the "too fast" and "too hard" fronts that database systems do not fit; databases cannot handle streaming data. What about platforms such as Hadoop and MapReduce? Like DBMSs, they can scale to handle large amounts of data by adding more physical machines, but they too are limited: they provide a low-level infrastructure for processing data rather than managing it. To add data management capabilities, developers can build data administration tools on top of these platforms, and many have been built, such as Hive and HBase, but these in essence seem to be re-implementations of DBMSs rather than solutions to the new problems and challenges at the heart of big data. Moreover, these platforms provide poor support for the "too fast" problem because they process large blocks of data, which makes it difficult to achieve low response times.
Data sets are growing ever more swiftly because of increasingly cheap and effective tools and technologies such as mobile devices, cameras, microphones, remote sensing devices, radio frequency identification (RFID), wireless sensor networks, and the Internet of Things. Traditional relational databases and data management and visualization tools cannot handle big data and often struggle with such large volumes using existing technology. What is required instead is a new class of tools capable of running parallel software on tens, hundreds, or even thousands of servers to effectively manage and process this large, complex data [4]. Big data usually includes data sets so high in volume and complexity that they are beyond the ability of existing software and hardware tools to store, manage, and process within a reasonable elapsed time. In a 2001 META Group research report, analyst Doug Laney described how the growing amount of data raises challenges and opportunities along three dimensions: increasing volume, the speed of data in and out (velocity), and variety [5]. Gartner and much of the industry use these three Vs to describe big data, and additional Vs, such as veracity and value, have since been added to its characteristics. Big data usually includes stacks of unstructured data that need more real-time analysis. Although big data opens new opportunities for realizing value from large amounts of data and enables us to gain profound knowledge of hidden values, it also incurs new challenges, such as how to integrate, store, manage, and process such large amounts of data. Nowadays, companies that manage and process such large amounts of data are growing swiftly: Google processes hundreds of petabytes of data, Facebook generates more than 10 petabytes of data per month, and YouTube receives 72 hours of video every minute. The swift growth of new technologies such as cloud computing and the Internet of Things further accelerates the growth of data. Such data, in quantity and complexity, will exceed the capacity of existing IT infrastructure and architecture, and its real-time requirements will significantly strain current computing capabilities. This ever-growing data raises the problem of how to capture and manage such heterogeneous data. In the face of the volume, variety, complexity, velocity, veracity, and security of big data, we must efficiently mine such datasets at different levels of processing, management, visualization, and forecasting, so as to effectively disclose their hidden values and properties and improve decision making.
1.1. Big Data: A Current and Future Frontier
There are many definitions of big data in the existing literature, but in common terms, data that cannot be handled by a single machine is termed big data. With the swift advancement of tools and technology, big data is growing quickly in every trade and science. Big data has important characteristics beyond volume that distinguish it from merely huge or very large amounts of data. Even now, when the significance of big data is recognized across trades and industries, people still hold diverse views on its definition. In general, big data is the amount of data that cannot feasibly be stored, managed, and processed by existing tools. Definitions differ because different groups, research scholars, scientific and technological enterprises, data analysts, and technical practitioners, have different concerns. The following definitions help us better understand the economic, social, and technological nuances of big data. In 2010, the Apache Hadoop project defined big data as "datasets which could not be stored, supervised and processed by existing general tools and technologies within a tolerable span". In 2011, McKinsey & Company announced big data as the next frontier for innovation, competition, and productivity. Big data refers to data that is too big in terms of volume, too fast in terms of velocity, and too hard in terms of variety. The existing literature describes big data as the next immense thing in innovation [6]. Big data is the quantity of data that was previously not practicable to store, process, and manage with existing tools. Big data technologies are a new generation of technologies and architectures intended to extract value efficiently from high-volume, multivariate data sets by enabling high-speed acquisition, discovery, and analysis [7]. Big data was defined as early as 2001, when Doug Laney, an analyst at META Group, depicted in a research report the challenges and opportunities brought by the growing amount of data with a 3V model: high Volume, Velocity, and Variety [5]. Although this model was not initially used to define big data, many enterprises, including IBM, Microsoft, and Gartner, still used the 3V model to define big data over the subsequent ten years. In the 3V model, volume refers to the production and collection of high quantities of data; variety refers to the diversity of data types, such as traditional structured data, semi-structured data, and unstructured data; and velocity refers to the timeliness of data: data capture and analysis must be conducted swiftly so as to properly and efficiently exploit the commercial value of big data [3]. A 2011 IDC report describes big data as "a new generation of technologies designed to extract value from large amounts and a wide variety of data by enabling high-speed capture, discovery and analysis". With this definition, big data can be characterized by four Vs: Volume, Velocity, Variety, and Value. This characterization has been widely recognized since it highlights the meaning and necessity of big data: discovering the immense hidden values in data. It also exposes the most vital dilemma of big data, namely how to expose value from such large and diverse datasets. An additional V, Veracity, is also added to the definition of big data and refers to the trustworthiness of data. These five Vs are defined as follows:
Figure 1 The five Vs of big data: volume (data at rest), velocity (production rate of data), variety (diversity of data), value (turning data into benefits), and veracity (trustworthiness of data)
Volume
Volume describes the size aspect of big data and represents the amount of data generated. Volume signifies data at rest; it describes the sheer mass of data.
Velocity
Velocity describes the speed of incoming data, as well as the need to stream fast-moving data into bulk storage for later batch processing. Because the production rate is conspicuously high, data must also be analyzed promptly. The rapid generation of data is what velocity signifies.
Variety
Variety describes the diversity of data types, since data comes from many different sources. Data can be of three types: structured, semi-structured, and unstructured. Structured data is all data that can be stored in a relational database, in tables with rows and columns. Semi-structured data cannot be stored in a relational database but has some organizational properties that make it simpler to analyze, such as email, social media feeds, and web logs. Unstructured data includes text and multimedia data, sensor data, photographs and images, video and audio files, and so on. Although these types of files may have an internal structure, they are still regarded as unstructured because the data they contain does not fit methodically into a database.
Value
Value describes our ability to turn data into value: how to discover the great hidden value in data so as to realize the economic and commercial benefit of big data. Holding a large amount of data that cannot be used effectively for some purpose makes it useless, so value is the most important V of big data.
Veracity
Veracity refers to the credibility, quality, and accuracy of data: how correct is our data? With new techniques and technological advancements, we can now work with data that is less reliable or accurate, such as Twitter posts with hashtags, abbreviations, and colloquial speech.
2. BIG DATA: A BRIEF HISTORY
The journey of big data began many years before the present buzzword of "big data". It is already more than seventy years since we first confronted a high growth rate in the volume of data, traditionally known as the "information explosion", a term first used in 1941. The scale of the problem was first recognized in 1944 by Fremont Rider, the Wesleyan University librarian, who estimated that American university libraries were doubling in size every sixteen years [8]. At this rate of increase, Rider hypothesized, the Yale library would hold over 200,000,000 volumes by 2040. In 1949 Claude Shannon, known as the "father of information theory", carried out research on high-capacity storage items such as punched cards and photographic data, in which one of the largest items was the Library of Congress, measured at over 100 trillion bits of data. In 1961, Derek Price concluded in his research on scientific knowledge that the number of new publications had grown exponentially rather than linearly, doubling every fifteen years and growing by a factor of ten every half century. In November 1967 a paper titled "Automatic data compression" by B. A. Marron and P. A. D. de Maine was published in Communications of the ACM, stating that "the information explosion noted in recent years makes it essential that storage requirements for all datasets be kept to a minimum" [9]. The paper describes an automatic compression technique that can be used to decrease storage requirements and increase the transmission rate of information. In 1971 Arthur R. Miller wrote the book "The Assault on Privacy", observing that too many information system managers seem to measure a man by the number of bits of storage capacity his database will occupy [10]. The Ministry of Posts and Telecommunications in Japan began conducting the Information Flow Census in 1975, tracking the quantity of information circulating in Japan [11]. The census introduced "amount of words" as the unified unit of measurement across all media. The 1975 census already found that the supply of information was increasing much faster than its consumption, and in 1978 it reported that "the demand for information provided by mass media, which are regarded as one-way communication, has become dormant, and the demand for information provided by personal telecommunications media, which are regarded as two-way communications, has significantly increased." In 1981 the Hungarian Central Statistics Office began a research project accounting for the country's information industries, including measuring information volume in bits; in 1993 Istvan Dienes, chief scientist of the Hungarian Central Statistics Office, compiled a manual for a standard system of national information accounts [12]. In 1983 Ithiel de Sola Pool published "Tracking the Flow of Information" [13], in which, reviewing the growth trends in 17 major communications media from 1960 to 1977, he concluded that the flow of information had grown at about 2.9% during that period, driven by broadcast media. In July 1986 Hal B. Becker published a paper estimating that "semiconductor RAM (random access memory) should be storing 1.25 * 10^11 bytes per cubic inch by the year 2000". By 1996, according to R. Morris and B. J. Truskowski, digital storage had become more cost-effective than paper for storing data [14]. In October 1997 Michael Cox and David Ellsworth published a paper, "Application-controlled demand paging for out-of-core visualization", in the Proceedings of the IEEE 8th Conference on Visualization [15]. They concluded that visualization poses an interesting challenge for computer systems, since data sets are generally quite large, taxing the capacities of main memory, local disk, and even remote disk, and they called this the problem of big data. It is the first article in the ACM digital library to use the term "big data".
In April 1998 John R. Mashey, Chief Scientist at SGI, presented a paper titled "Big Data and the Next Wave of Infrastress" at a USENIX meeting. In October 1998 K. G. Coffman and Andrew Odlyzko concluded in their paper that the growth rate of traffic on the public Internet was much higher than that of other networks [16]. Odlyzko later established the Minnesota Internet Traffic Studies (MINTS), tracking the growth of Internet traffic from 2002 to 2009. In October 1999 Bryson, Kenwright and Haimes joined Robert van Liere, David Banks and Sam Uselton on a panel titled "Automation or interaction: what's best for big data?" at the IEEE 1999 Conference on Visualization [17]. In October 2000 Peter Lyman and Hal R. Varian published "How Much Information?", the first comprehensive study to quantify, in terms of computer storage, the total amount of new and unique information created in the world annually; the study found that in 1999 the world produced about 1.5 exabytes of unique information. In February 2001 Doug Laney, an analyst with the META Group, published a research note titled "3D Data Management" in which he explained how to control data volume, velocity, and variety; a decade later, these "3Vs" had become the generally accepted three dimensions of big data. In September 2005 Tim O'Reilly published the article "What is Web 2.0?", in which he claims that "data is the next Intel Inside" and "SQL is the new HTML". In March 2007 IDC (International Data Corporation) published a white paper that was the first study to estimate and forecast the amount of digital data created and replicated each year; IDC forecast that between 2006 and 2010 the information added annually to the digital universe would double roughly every 18 months [18]. In January 2008 Bret Swanson and George Gilder published "Estimating the Exaflood" [19], in which they projected that U.S. IP traffic could reach one zettabyte by 2015 and that the U.S. Internet of 2015 would be at least 50 times larger than it was in 2006. In December 2009 the researchers Roger E. Bohn and James E. Short found in their study that Americans consumed information for about 1.3 trillion hours, an average of almost 12 hours a day. In February 2011 Martin Hilbert and Priscila Lopez estimated that between 1986 and 2007 the world's information storage capacity grew at a compound annual rate of 25% per year; they also estimated that 99.2% of all storage capacity was analog in 1986, whereas by 2007, 94% of storage capacity was digital, a complete reversal of roles. From 2013 to the present, businesses have begun to put into practice new in-memory technologies such as SAP HANA to analyze and optimize large volumes of data. Industries are becoming ever more dependent on exploiting data as a business asset to realize economic advantages, with big data leading the charge as perhaps the most important new technology to understand and make use of in order to stay relevant in today's swiftly changing market.
Table 1 Evolution of Big Data
1941 – First confrontation with a high growth rate in the volume of data, known as the "information explosion".
1944 – Fremont Rider recognizes the problem for the first time, predicting that American university libraries were doubling in size every sixteen years.
1949 – Claude Shannon researches high-capacity storage of data items such as punched cards and photographic data; one of the largest items is the Library of Congress, measured at over 100 trillion bits of data.
1961 – Derek Price concludes that the number of new scientific publications has grown exponentially rather than linearly, doubling every fifteen years and growing by a factor of ten every half century.
1967 – B. A. Marron and P. A. D. de Maine publish "Automatic data compression" in Communications of the ACM, stating that the information explosion makes it essential to keep storage requirements for all datasets to a minimum.
1971 – Arthur R. Miller writes "The Assault on Privacy", observing that too many information system managers seem to measure a man by the number of bits of storage capacity his database will occupy.
1975 – The Ministry of Posts and Telecommunications in Japan begins the Information Flow Census, tracking the quantity of information circulating in Japan.
1978 – The census reports that demand for information provided by one-way mass media has become dormant, while demand for information provided by two-way personal telecommunications media has increased significantly.
1981 – The Hungarian Central Statistics Office begins a research project accounting for the country's information industries, including measuring information volume in bits.
1983 – Ithiel de Sola Pool publishes "Tracking the Flow of Information", reviewing growth trends in 17 major communications media from 1960 to 1977 and concluding that the flow of information grew at about 2.9% during that period, driven by broadcast media.
1986 – Hal B. Becker publishes a paper estimating that semiconductor RAM should be storing 1.25 * 10^11 bytes per cubic inch by the year 2000.
1996 – Digital storage becomes more cost-effective than paper for storing data, according to R. Morris and B. J. Truskowski (The Evolution of Storage Systems, 2003).
1997 – Michael Cox and David Ellsworth publish "Application-controlled demand paging for out-of-core visualization" at the IEEE 8th Conference on Visualization, the first article in the ACM digital library to use the term "big data".
1998 (Apr.) – John R. Mashey, Chief Scientist at SGI, presents "Big Data and the Next Wave of Infrastress" at a USENIX meeting.
1998 (Oct.) – K. G. Coffman and Andrew Odlyzko conclude that the growth rate of traffic on the public Internet is much higher than that of other networks.
1999 – Bryson, Kenwright, Haimes, van Liere, Banks and Uselton hold the panel "Automation or interaction: what's best for big data?" at the IEEE 1999 Conference on Visualization.
2000 – Peter Lyman and Hal R. Varian publish "How Much Information?", the first comprehensive study to quantify the total amount of new and unique information created annually; they find that the world produced about 1.5 exabytes of unique information in 1999.
2001 – Doug Laney of the META Group publishes the research note "3D Data Management", explaining how to control data volume, velocity, and variety; a decade later the "3Vs" become the widely accepted dimensions of big data.
2005 – Tim O'Reilly publishes "What is Web 2.0?", claiming that "data is the next Intel Inside" and "SQL is the new HTML".
2007 – IDC publishes the first study to estimate and forecast the amount of digital data created and replicated each year, forecasting that between 2006 and 2010 the information added annually to the digital universe will double roughly every 18 months.
2008 – Bret Swanson and George Gilder publish "Estimating the Exaflood", projecting that U.S. IP traffic could reach one zettabyte by 2015 and that the U.S. Internet of 2015 will be at least 50 times larger than it was in 2006.
2009 – Roger E. Bohn and James E. Short find that Americans consumed information for about 1.3 trillion hours, an average of almost 12 hours a day.
2011 – Martin Hilbert and Priscila Lopez estimate that the world's information storage capacity grew at a compound annual rate of 25% between 1986 and 2007, and that 99.2% of all storage capacity was analog in 1986 whereas 94% was digital in 2007, a complete reversal of roles.
2013–present – Businesses begin adopting in-memory technologies such as SAP HANA to analyze and optimize large volumes of data, with big data leading the charge as perhaps the most important new technology for staying relevant in today's rapidly changing market.
3. CONTRIBUTION IN THE AREA OF BIG DATA
Data velocity has been one of the major forces driving the development of big data technology, as the processing and operational speed of legacy databases and data warehouses has proved too slow to meet many business needs. Data velocity will become even more important as organizations shift their focus from merely storing and managing data to actively using it. This section presents recent developments and progress in the big data era. We subdivide it into five parts, classical big data technology, big data in cloud computing, data engineering and benchmarking approaches, mobile big data technology, and real-time decision making with big data, covering the period between 2011 and 2016.
3.1. Classical Big Data Technology
In classical big data research, the challenges and opportunities for storage, that is, for databases in the presence of big data, have been examined [1]. Virtualization planning and cloud computing methods in IBM data centres have also been introduced [20]. From a platform-architecture perspective, Ferguson reported on advances in accelerating big data analysis [21]. In more recent work, Dittrich contributed optimizations of big data processing performance in Hadoop and MapReduce [22]. Herodotos Herodotou, a researcher at Duke University, proposed Starfish, a self-tuning system for big data analytics [23]. Yongqiang He, a member of the Facebook infrastructure team, proposed RCFile, a fast and space-efficient data placement structure for MapReduce-based warehouses [24]. H. Zou designed a framework for flexible data analytics that also improves I/O performance [25]. N. Rahman, a researcher at the University of New Haven, US, built a hybrid data centre architecture for big data that reduces delay by approximately 39% and also cuts the cost of cooling systems from 49.57% to 27% [26].
3.2. Big Data in Cloud Computing
Notable progress on big data has also been reported in the era of cloud computing. Agrawal discussed the current state and future opportunities of cloud computing and big data [27]. In his 2013 thesis, Lakew introduced resource management and allocation techniques for multi-cluster clouds [28]. Jinquan Dai of Intel Asia-Pacific Research and Development Ltd presented HiTune, an analyzer for Hadoop that performs dataflow-based performance analysis for big data clouds [29]. Sifei Lu presented a framework for cloud-based large-scale data analytics and visualization, together with case studies on climate data at different scales [30]. A data-centric approach to reducing cooling energy costs for cloud-based big data analytics was presented by T. Kaushik in 2012 [31]. Y. Zhao introduced the idea of AaaS (Analytics as a Service) on cloud computing and focused on scheduling cloud resources for Big Data Analytics Applications (BDAA) to satisfy quality-of-service requirements [32]. R. Buyya showed that the most cost-effective way of performing big data analytics is to apply machine learning on cloud computing [33].
3.3. Data Engineering and Benchmarking Approaches
Alongside these methodologies, several interesting data engineering and benchmarking achievements have been reported for big data. Wanling Gao presented a big data benchmarking project based on the analysis of search engine workloads, together with its benchmark methodology and a tool for generating scalable big data [34]. Juha K. Laurila, a researcher at Nokia Research Centre, presented a mobile data gathering challenge introduced by Nokia, which marks an important step towards mobile big data computing [35]. L. Wang presented a big data benchmark suite that covers broad application scenarios, diverse datasets, data sources, algorithms, and software stacks, and that effectively measures and evaluates big data systems and architectures [36]. R. Han surveyed the state of the art in big data benchmarking in both academia and industry and provided a foundation for building a successful benchmark [37]; he also gave a brief introduction to the data generation and workload implementation techniques used to characterize big data benchmarks.
3.4. Mobile Big Data Technology
Mobile computing is becoming an ever more popular and important counterpart to big data and the traditional Internet. Shekhar, a faculty member at the University of Minnesota, described spatial big data challenges at the intersection of mobility and cloud computing [38]. In June 2011, Chittaranjan presented research investigating large-scale smartphone data for personality studies [39]. From a big data application perspective, Zaslavsky introduced challenges in the Internet of Things and presented an interesting application of Sensing as a Service and big data [40]. A. Alsheikh presented an overview of deep learning for mobile big data analytics and proposed a scalable framework built on Apache Spark [41].
3.5. Real Time Decision Making with Big Data
Early big data development focused on storage. Organizations now focus on the speed of data rather than merely the amount of data being managed. In 2014, raw data was typically stored in its natural form in object-based storage repositories. In 2015, organizations moved from batch to real-time processing to improve their ability to make real-time decisions. It is no longer only about storing large amounts of data and supporting bigger queries and reports; the big trend in 2015 was continuous access to, and processing of, data and records in real time, to maintain constant awareness and take instant decisions. Many tools exist for processing real-time data. Hortonworks DataFlow [42], built on Apache NiFi, provides a way to capture and curate data in motion; it is designed primarily for streams of information coming from Internet of Things devices such as sensors, GPS units, and other machines, and even from social networks. Apache Storm, Apache Flink, Apache Spark, Apache Kafka, SAP HANA, and many other tools also target real-time data processing.
4. BIG DATA OPPORTUNITIES AND CHALLENGES
The ever-increasing surge of data in every field leads to various challenges in big data, such as data acquisition, storage, administration, and analysis. Earlier data management and reasoning systems can process only structured data, as they are based on relational database management systems, and so cannot exploit the volume and diversity of big data. The research community has proposed many solutions, one of which is the cloud, which satisfies the needs of big data by providing infrastructure in a cost-effective manner with easy scaling up and down. For storing large and complex data effectively and permanently, distributed file systems and NoSQL databases are the finest options. Because the data is large and diverse in nature, there are many challenges that are important to consider, such as data representation, data redundancy, data analysis, data confidentiality, data integration and interoperability, data visualization, and data management.
4.1. In Enterprise
At present, big data primarily comes from, and is used in, enterprises. By using big data, an enterprise can enhance its productivity and competitiveness in many respects. In marketing, associative analysis of data can enhance productivity by predicting consumer behaviour and discovering new selling plans. In business planning, by comparing large amounts of data, an enterprise can optimize its goods and service prices and increase its productivity and competence in a competitive world. In operations, an enterprise can improve operational efficiency and customer satisfaction, accurately estimate the resources and production capacity required, and reduce labour costs. In finance, big data improves the total revenue of an enterprise by analysing customer sales data and predicting which commodities are bought frequently or together. In the supply chain, an enterprise can control budgets and improve the services offered through inventory optimization, logistics optimization, and supplier coordination; it can also help to lessen the gap between supply and demand.
4.2. IoT based Big Data
The IoT is an important source of big data. It includes a wide variety of devices and objects such as smart cars, mobile phones, and GPS units. Logistics enterprises have experienced IoT big data most strongly, because at present nearly all vehicles and other mechanical devices are equipped with sensors, GPS, and wireless adapters that generate data rapidly, so that one can track an object's position and prevent many abnormal circumstances such as engine failures. In a smart city, the data generated can be used to support decision making about managing water resources and electricity grids, reducing traffic jams, and improving public welfare.
4.3. Social Network
Online social network services allow individuals to communicate with each other via messages over an information network. Big data from social networks essentially comes from instant messages, blogs, shared spaces, and so on. The analysis of social network data, carried out with computational analytical methods drawing on informatics, mathematics, management, and sociology, helps us understand relations in human society. Social network data analysis can be performed along various dimensions, such as content-based and structure-based analysis. In content-based applications, language and text are the two most important forms of representation; by analysing them, user preferences, interests, demands, and emotions are revealed. In structure-based applications, the social relations, interests, and hobbies of users are gathered as relations among users and represented as a clustered structure in which users are the nodes. Applications of big data from online social networks may help to understand individual behaviour and uncover the laws of social and economic activity, by providing early warning of abnormal activity or abnormal use of electronic services and goods, and by real-time monitoring of events and usage.
4.4. Medical and Health Care
At present, medical data is growing swiftly all over the world. Because the data is diverse and complex in nature, it has to be efficiently stored and effectively processed, analyzed, and queried. Big data has great potential to handle medical data and thereby improve healthcare. Using big data from the medical industry, one can predict the recovery of patients, the treatment of a particular disease, the average percentage of patients with the same syndrome, and so on. In 2007 Microsoft launched HealthVault, a medical application with which one can manage an individual's medical records or information from family medical devices.
4.5. Science
Big data has huge applications in science and research. Starting in 2000, the Sloan Digital Sky Survey (SDSS) gathered astronomical data at a rate of about 200 GB per night, collecting about 140 terabytes in total, more than all the data gathered in the previous history of astronomy [43]. Google's DNAStack, part of Google Genomics, collects and organizes DNA samples of genetic data from around the world to identify diseases and other medical conditions, and allows scientists worldwide to use the sample data. The NASA Center for Climate Simulation stores about 32 petabytes of climate data [44].
4.6. Government
The use of big data in government is advantageous, enabling gains in profit, productivity, and efficiency. In 2012, big data analysis played a major role in Barack Obama's successful re-election campaign [45]. In India, big data analysis helped the NDA win the 2014 Indian general election. Although big data has applications in all sectors of government, easy and timely capture and analysis of data is essential for different government agencies to meet and improve their mission requirements. Government organizations are starting to deploy big data tools and technologies to analyze massive amounts of data in order to prevent fraud, abuse, and waste.
4.7. Private Sector
Big data has been used in retail, banking, real estate, and other private-sector industries. Walmart, an American multinational retail corporation, handles more than 1 million customer transactions every hour, which in turn create more than 2.5 petabytes of data. Windermere Real Estate, one of the largest real estate companies in the Pacific Northwest with 300 offices and 7,000 agents, uses anonymized GPS signals from roughly 100 million drivers to help new home buyers determine their typical drive times to and from work at different times of the day. The FICO Falcon fraud detection system protects credit and debit cards, as well as users' bank accounts, worldwide, and handles about 2.1 billion active accounts.
Figure 2 Adoption of big data in different areas (enterprise, IoT-based, social network, medical and healthcare, science, government, and private sector)
4.8. Data Volume
The amount of data is one of the biggest challenges of big data and gives rise to the other important challenges. With big data, you should be able to scale very swiftly and flexibly whenever needed; it is not actually the volume itself that creates the problem but the need to scale whenever the data generation rate is notably high. The scaling problem of big data gives rise to further challenges, such as how to represent huge amounts of data so that analytics produce meaningful results. The data life cycle management challenge also stems from data volume: which data should be stored and which data should be discarded. Big data therefore comes with a great deal of nuisance that has to be considered carefully in order to take advantage of big data technology.
4.9. Data Diversity
Because data is diverse in nature, coming from different sources, rigid schemas are not beneficial; instead we need a more flexible and efficient design to handle our data. The technology must be flexible enough to handle data swiftly and effectively, so that we can perform transactions on the data in real time and run analytics as fast as possible, regardless of the type of data we have. Diversity in data poses a serious challenge to analytics and hence to business decision making, and coordinating data from different sources is another serious problem that degrades performance.
4.10. Data Security
Security is a major and vital challenge of big data technology. Big data contains messy as well as sensitive data, such as credit card information, personal information, and other confidential records, so the data has to be secured against leakage, breaches, and other security threats. Most NoSQL databases provide only a few security mechanisms for maintaining security on a big data platform, so protection against threats is one of the fundamental and critical challenges of big data, although many frameworks have been designed to provide security on big data platforms.
4.11. Performance
As we live in a digital world where even microsecond delays can be costly, big data must be processed at exceedingly high velocity no matter how much data, or what type of data, your database must handle. The data handling techniques of both RDBMS and NoSQL database solutions can place a serious strain on performance.
4.12. Data Management
Managing big data with an RDBMS is costly, time consuming, and often ineffective, while most NoSQL solutions are burdened by complex operation and hard configuration. It is therefore hard to manage data of high volume, variety, velocity, and complexity, and managing such large amounts of data efficiently is the foremost fundamental problem of big data.
4.13. Integration
Because data comes from multiple sources and is analysed to gain useful insights, integration is needed at multiple levels. Consider a scenario in which two organizations with different data management systems want to merge: here, integration plays a vital role. Integration may involve the integration of processes, of databases, or of administrative tasks. Data has to be integrated and optimized to ensure the smooth operation of business applications, and this is hard to achieve in a real-time scenario.
4.14. Interoperability
Interoperability is a challenge that arises from integration; the two are two sides of the same coin. Interoperability depends heavily on integration: if integration is not applied effectively, interoperability cannot be achieved, because two different databases must first be integrated before interoperability between them is possible. It is a major challenge of big data.
4.15. Continuous Availability
When you rely on big data for your business applications, even conventional "high availability" is sometimes not sufficient. Your data can never go down, as it is the most vital asset of your application, yet a certain amount of downtime is inherent in both RDBMS and NoSQL systems.
4.16. Data Visualization
The main purpose of data visualization is to represent knowledge more intuitively and effectively through graphical views of data. To harvest big data correctly and effectively, the next big challenge is figuring out how to display the information in a way that is useful for decision making. Helping organizations find outliers, interpret hidden values and patterns, and understand their fast-changing datasets is where visualization provides real value. Traditional data analytics and reporting tools cannot sufficiently handle big data; the greatest challenges lie in the variety and velocity of big data. Another challenge for visualization-based data discovery tools is IoT-based data, such as mobile user data.
5. BIG DATA TOOLS AND TECHNIQUES
5.1. Big Data Techniques
Big data needs techniques that can process massive amounts of data while at the same time creating business value. In practice, big data techniques are driven by specific application areas. They draw on a number of domains, including data mining, machine learning, social network analysis, statistics, neural networks, sentiment analysis, and genetic algorithms.
5.1. Data Mining
Data mining is a technique for extracting valuable patterns from data. It includes techniques such as association rule mining, cluster analysis, classification, and regression. Data mining on big data is more challenging than traditional data mining. Take cluster analysis as an example: an intuitive way to cluster big data is to extend existing methods so that they can cope with the massive workload. Most extensions generally work on a sample of the big data, and the sample-based results are used to derive a partition for the overall dataset [46], as illustrated in the sketch below.
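The following is a minimal sketch of this sample/batch-based clustering idea, assuming the scikit-learn and NumPy libraries are available; the synthetic dataset, cluster count, and batch size are illustrative only, not part of the survey.

```python
# Sketch: cluster a large dataset by processing small batches, then assign
# every point using the centroids learned from those batches.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Simulate a dataset too large to cluster comfortably in one pass.
rng = np.random.default_rng(seed=42)
big_dataset = rng.normal(size=(1_000_000, 8))

# MiniBatchKMeans fits on small random batches instead of the full data,
# mirroring the "cluster a sample, then partition the rest" strategy.
model = MiniBatchKMeans(n_clusters=5, batch_size=10_000, random_state=0)
model.fit(big_dataset)

# The learned centroids are then used to label the overall dataset.
labels = model.predict(big_dataset)
print(labels[:10])
```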
5.2. Machine Learning
Machine learning is an essential part of artificial intelligence, concerned with designing algorithms that let machines learn from observed data. Its most important feature is the ability to discover knowledge and make decisions automatically [46]. To handle big data efficiently, traditional machine learning methods must be scaled up, and deep learning has become a new research frontier in artificial intelligence [47]. Machine learning algorithms are used, for example, to distinguish spam from non-spam emails, to learn user preferences and activity patterns and make recommendations based on them, to estimate the probability of winning, to determine the best material with which to engage a customer, and so on. A toy spam-filter sketch follows.
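Below is a deliberately tiny spam/non-spam classifier sketch, assuming scikit-learn; the training messages and labels are invented purely for illustration and are not taken from the survey.

```python
# Sketch: learn to separate spam from non-spam using bag-of-words features
# and a Naive Bayes classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

messages = ["win a free prize now", "meeting moved to 3pm",
            "cheap loans click here", "lunch tomorrow?"]
labels = ["spam", "ham", "spam", "ham"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(messages, labels)

# The learned model then labels previously unseen messages.
print(model.predict(["claim your free prize"]))
```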
5.3. Social Network Analysis
Social network analysis is a technique first used by the telecommunications industry and now a key technique in modern sociology. It examines social relationships with the help of network theory, in which nodes represent the individuals in a network and ties represent the relationships between them. Social network analysis includes social system design, human behaviour modelling, social network visualization, social network evolution analysis, and graph query and mining [46]. Today online social media has become so popular that analysing a network of millions or billions of connected people is usually difficult and costly. A tiny example of the node-and-tie model appears below.
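The sketch below illustrates the node-and-tie model on a toy graph, assuming the networkx library; the people and relationships named are invented for illustration.

```python
# Sketch: individuals as nodes, relationships ("ties") as edges.
import networkx as nx

g = nx.Graph()
g.add_edges_from([("alice", "bob"), ("bob", "carol"),
                  ("carol", "alice"), ("carol", "dave")])

# Degree centrality: who is most connected in this tiny network.
print(nx.degree_centrality(g))

# Groups of mutually reachable people (trivial in this small example).
print(list(nx.connected_components(g)))
```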
5.4. Statistics
Statistics is the science of collecting, organizing, and interpreting data. Statistical techniques are used to extract correlations and associations between different objects in the form of numerical descriptions. Traditional statistical algorithms are not well suited to big data; data-driven statistical analysis, a branch of statistics, focuses on the scaling and parallelization of statistical algorithms. A minimal example of a numerical association measure is shown below.
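As a minimal illustration of extracting a numerical association between two variables, the sketch below computes a correlation coefficient with NumPy; the data is synthetic and purely illustrative.

```python
# Sketch: a simple statistical association between two synthetic series.
import numpy as np

rng = np.random.default_rng(0)
ad_spend = rng.uniform(10, 100, size=1000)
sales = 3.0 * ad_spend + rng.normal(0, 20, size=1000)

# Pearson correlation coefficient between the two variables.
corr = np.corrcoef(ad_spend, sales)[0, 1]
print(f"correlation between ad spend and sales: {corr:.3f}")
```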
5.5. Neural Networks
Neural networks are another important part of artificial intelligence. They are a mature technique with a wide range of applications, such as pattern recognition, speech recognition, image analysis, time series prediction, and computer numerical control. A neural network is used to approximate functions that may depend on a large number of inputs. Generally it consists of an interconnected network of neurons that exchange messages with each other. The connections carry numerical values, or weights, which are adjusted on the basis of experience, making the network robust to inputs and capable of learning. The toy example below illustrates this weight-adjustment idea.
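The following is a toy sketch of the idea that connection weights are adjusted from experience; it trains a single artificial neuron (a perceptron) on synthetic data and is not a production neural network.

```python
# Sketch: a single neuron whose weights are repeatedly nudged by its errors.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))                 # input signals
y = (X @ np.array([1.5, -2.0, 0.5]) > 0) * 1  # hidden rule the neuron must learn

w = np.zeros(3)   # connection weights, initially zero
lr = 0.1          # learning rate

for _ in range(20):                     # repeated exposure to the data
    for xi, target in zip(X, y):
        pred = 1 if xi @ w > 0 else 0
        w += lr * (target - pred) * xi  # adjust weights based on the error

print("learned weights:", w)
```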
5.6. Sentiment Analysis
Sentiment analysis extracts subjective information from source material. It generally estimates the polarity of a document, sentence, or entity, that is, whether the expressed review or opinion is positive, negative, or neutral. The more advanced "beyond polarity" sentiment classification also classifies opinions into emotional states such as happy, angry, and sad. A deliberately simple lexicon-based sketch follows.
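The sketch below uses a tiny hand-made word list to estimate polarity; real sentiment systems use learned models and far larger lexicons, so this is only an illustration of the positive/negative/neutral idea.

```python
# Sketch: lexicon-based polarity scoring of short texts.
POSITIVE = {"good", "great", "love", "happy", "excellent"}
NEGATIVE = {"bad", "terrible", "hate", "angry", "sad"}

def polarity(text: str) -> str:
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(polarity("I love this product, it is great"))   # positive
print(polarity("terrible service, I hate waiting"))   # negative
```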
5.7. Genetic Algorithm
Genetic algorithms belong to the class of evolutionary algorithms, which produce solutions to optimization and search problems using techniques inspired by natural evolution, such as inheritance, crossover, mutation, and selection, thereby imitating the process of natural selection. A compact sketch of these steps is given below.
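The following compact sketch shows selection, crossover, and mutation on a toy problem (maximizing the number of 1s in a bit string); population size, string length, and mutation rate are illustrative parameters.

```python
# Sketch: a minimal genetic algorithm with selection, crossover and mutation.
import random

def fitness(bits):
    return sum(bits)  # toy goal: maximize the number of 1s

def evolve(pop_size=20, length=16, generations=50, mutation_rate=0.05):
    pop = [[random.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]          # selection: keep the fitter half
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, length)   # crossover point
            child = a[:cut] + b[cut:]
            child = [bit ^ 1 if random.random() < mutation_rate else bit
                     for bit in child]          # mutation
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

print(evolve())  # the best individual found
```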
BIG DATA TOOLS FOR BATCH PROCESSING
5.8. Hadoop and MapReduce
Hadoop is the most widely accepted and used tool in big data applications. It is open source, batch-oriented, data- and I/O-intensive, general-purpose software that provides reliable, efficient, distributed computing. Hadoop provides a framework for the distributed storage and processing of large amounts of data across clusters of computers, and it is designed to scale up from a single server to thousands of machines, each offering local storage and computation. The Hadoop project includes four modules, as follows:
Hadoop Common
It includes common functionalities that support other Hadoop modules.
Hadoop Distributed File System (HDFS)
HDFS provides high-throughput access to high-volume data. It stores data in chunks across many nodes in the cluster and replicates the data across nodes for durability. It has a master-slave architecture in which one node in the cluster is the NameNode and the others are DataNodes. The NameNode, the master process running on a single node, directs user access to files in HDFS; the DataNodes, the slaves running on all other nodes, are responsible for block creation, replication, and deletion.
Hadoop YARN
YARN stands for Yet Another Resource Negotiator. It is a framework that provides job scheduling and cluster resource management. The full functionality of YARN is available in the upgraded version of Hadoop (Hadoop 2.0).
Hadoop MapReduce
Hadoop MapReduce provides parallel processing of very large data sets. Hadoop is used in applications such as spam filtering, social recommendation, and network search. At present the biggest Hadoop clusters include about 4,000 nodes, and the number of nodes can be increased to 10,000 with the newer Hadoop 2.7.1 release. In November 2012, Facebook stated that its Hadoop cluster could process 100 petabytes of data, growing by about 0.5 petabytes per day. Hadoop thus provides a cost-effective framework and a most convenient service for acquiring and processing large amounts of data. MapReduce comprises two logical functions, the mapper and the reducer, and Hadoop handles distributing the Map and Reduce tasks across the cluster. The Map function takes data in the form of key-value pairs and outputs a set of key-value pairs; the Reduce function takes a key and an associated array of values and outputs a set of key-value pairs. The sketch below shows this structure for a simple word count.
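The following is a simplified, single-process sketch of the mapper/reducer structure for word counting; in a real Hadoop Streaming job the mapper and reducer would be separate scripts reading from standard input, and the sample lines here are invented for illustration.

```python
# Sketch: word count in the map -> shuffle/sort -> reduce style.
from itertools import groupby

def mapper(lines):
    # Map: emit a (word, 1) pair for every word in the input.
    for line in lines:
        for word in line.strip().split():
            yield word.lower(), 1

def reducer(pairs):
    # Reduce: sum the counts associated with each key (word).
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    sample = ["big data is big", "data needs big tools"]
    for word, total in reducer(mapper(sample)):
        print(word, total)
```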
5.9. Apache Mahout
Apache Mahout is a project governed by the Apache Software Foundation to produce scalable, business-friendly machine learning algorithms for constructing intelligent and efficient applications [48]. Machine learning is an essential part of artificial intelligence and is used to improve a computer's output based on past experience. There are several approaches to machine learning, the two most basic and important being supervised and unsupervised learning, both of which Mahout supports. Collaborative filtering, clustering, and categorization are the three main machine learning tasks that Mahout currently implements, and they are heavily used in real-world applications [48]. The stable version of Apache Mahout at the time of writing is 0.11.1, released in November 2015, which introduces a new math environment called Mahout Samsara with additional features such as support for the Spark and H2O environments, a distributed algebraic optimizer, an R-like Scala DSL API, and easy integration with compatible libraries such as MLlib [49].
5.10. Dryad
Dryad is a Microsoft Research project [50] providing a general-purpose distributed execution engine for data-parallel applications. The project examined a programming model for writing parallel and distributed programs that scale from a very small cluster to a large one. An application written for Dryad is represented as a directed acyclic graph that defines the flow of data in the application; the vertices of the graph signify the operations to be executed on the data. The Dryad runtime parallelizes the dataflow graph by distributing the computational vertices across the nodes of a cluster. The flow of data between computational vertices is implemented through communication channels between them, and the scheduling of vertices onto the available hardware is handled by the Dryad runtime without intervention from the developer or administrator. However, in 2011, on the Windows HPC team blog, Microsoft announced with a minor update to the latest Dryad release that it would be the final preview; Microsoft dropped its Dryad big data processing tool and focused its development on a Windows implementation of Hadoop [51].
BIG DATA TOOLS FOR STREAM PROCESSING
5.11. Storm
Apache Storm [52] is a free and open source distributed real-time computation system for processing streams of data. It is simple, can be used with any programming language, and is specifically designed for real-time processing, in contrast to Hadoop, which targets batch processing. It is scalable, fault tolerant, and reliable, provides great performance, and is easy to set up and operate. Storm is fast, processing roughly a million tuples per second per node, and it integrates with existing databases and queuing systems. In a Storm cluster, jobs run as topologies and, unlike Hadoop jobs, never finish on their own until you kill them. Because Storm jobs are based on topologies, the user must create a different topology for each real-time computation task; a topology is a graph that describes how the computation is to be performed in real time. Like Hadoop, Storm uses a master-slave architecture in which one node is the master and the others are worker nodes. The master node runs a process called Nimbus, which is responsible for distributing code among the nodes in the cluster and assigning tasks to machines. Each worker node runs a process called the Supervisor, which listens for jobs assigned to it. Each node in a topology encloses processing logic, and the links between nodes signify how data should be passed between them. The whole computational topology is partitioned and distributed among a number of worker nodes. A further process, ZooKeeper, plays an important role: it coordinates between Nimbus and the Supervisors by recording their states on local disk.
5.12. Apache Kafka
Apache Kafka [53] is a distributed, high-throughput, partitioned commit-log service written in Scala, providing a high-performance messaging system with a unique design. It was originally developed at LinkedIn and open sourced in early 2011. It was built to provide a unified, high-performance, low-latency platform for handling real-time data feeds, and it serves as a tool for managing streaming and operational data with in-memory analytical techniques, thereby supporting real-time decision making. Kafka is a publish-subscribe messaging system with four main characteristics: messages are persisted and replicated on disk and are therefore durable; throughput is high, so it is fast; processing is distributed; and data can be loaded into Hadoop in parallel. A minimal publish/subscribe sketch follows.
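The sketch below shows the publish-subscribe pattern using the kafka-python client; it assumes a broker running at localhost:9092, and the topic name "events" and the messages are illustrative, not part of the survey.

```python
# Sketch: publish a few messages to a topic, then consume them.
from kafka import KafkaProducer, KafkaConsumer

# Producer side: publish messages to the "events" topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
for i in range(3):
    producer.send("events", value=f"event-{i}".encode("utf-8"))
producer.flush()

# Consumer side: subscribe to the same topic and read messages as they arrive.
consumer = KafkaConsumer("events",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest",
                         consumer_timeout_ms=5000)
for message in consumer:
    print(message.value.decode("utf-8"))
```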
5.13. SAP Hana
SAP Hana [54] is an in-memory data analytics platform which can be deployed on premises or in the cloud. It provides real-time analytics such as operational reporting, data warehousing, and predictive and text analysis for real-time applications such as sense-and-response applications, planning and optimization, and core process accelerators. The SAP Hana database is the core of this real-time platform and differs from other database systems: owing to its hybrid structure for processing both transactional and analytical data fully in memory, SAP Hana combines the best of both worlds.
5.14. Apache Spark
Apache Spark [55] is an open source cluster computing framework for large datasets, originally developed at the University of California, Berkeley AMPLab and later donated to the Apache Software Foundation. It provides an interface centred on the resilient distributed dataset (RDD) for programming an entire cluster with implicit data parallelism and fault tolerance. The availability of RDDs eases the implementation of both iterative algorithms and interactive data analysis. Apache Spark can run programs up to 100x faster than Hadoop MapReduce when data is held in memory. Spark supports a rich set of libraries including Spark SQL, Spark Streaming, MLlib for machine learning, and GraphX.
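The following sketch illustrates the RDD programming model with a word count written against the Spark Java API; the HDFS input and output paths are hypothetical placeholders. Each transformation is recorded in the RDD lineage, which is what allows Spark to recompute lost partitions and thereby provide fault tolerance.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class WordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("WordCount");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Transformations are lazy; each one adds a step to the RDD lineage.
            JavaRDD<String> lines = sc.textFile("hdfs:///data/input.txt");  // hypothetical path
            JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey(Integer::sum);
            // The action below triggers execution of the whole lineage.
            counts.saveAsTextFile("hdfs:///data/output");  // hypothetical path
        }
    }
}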
5.15. Apache Flink
Apache Flink [56] is an open source distributed framework for batch and stream data processing. The core of Flink is a distributed streaming dataflow engine. It essentially aims to bridge the gap between MapReduce-like systems and shared-nothing parallel databases, so Flink executes arbitrary dataflow programs in a parallel and pipelined manner. Flink's runtime natively supports the execution of iterative algorithms. Flink programs, written in Java or Scala, are automatically compiled and optimized into dataflow programs. Flink does not provide any storage medium of its own; data must be stored in distributed storage systems such as HDFS or HBase. For streaming data, Flink can consume input from message queues such as Kafka.
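A minimal sketch of a Flink streaming dataflow in Java is given below, assuming the DataStream API and a hypothetical socket source on localhost:9999; it maintains a running word count over the incoming stream.

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class StreamingWordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<Tuple2<String, Integer>> counts = env
            .socketTextStream("localhost", 9999)  // hypothetical text source
            .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                @Override
                public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                    // Emit one (word, 1) tuple per word in the incoming line.
                    for (String word : line.split(" ")) {
                        out.collect(new Tuple2<>(word, 1));
                    }
                }
            })
            .keyBy(value -> value.f0)  // partition the stream by word
            .sum(1);                   // running count per word

        counts.print();
        env.execute("Streaming Word Count");
    }
}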
BIG DATA TOOLS FOR INTERACTIVE ANALYSIS
5.16. Dremel
Dremel [57] is a distributed interactive querying system developed at Google in 2010. It is a scalable, interactive, ad-hoc query system for the analysis of read-only nested data, capable of running aggregation queries over trillion-row tables in seconds. At Google the system has thousands of users and scales to thousands of CPUs and petabytes of data.
5.17. Apache Drill
Apache Drill [58] is an open source SQL query system for big data exploration. It provides high-performance interactive analysis of semi-structured and rapidly evolving data. It is similar to Google's Dremel and is particularly designed to make good use of nested data; Drill can be viewed as an open-sourced version of Dremel. It can scale to 10,000 servers or more and can process petabytes of data in seconds. Low-latency SQL queries, dynamic queries on self-describing data, nested data support, and easy integration with Apache Hive are some of Drill's key features.
6. UNDERLYING TECHNOLOGIES
6.1. Cloud Computing
Cloud computing is strongly related to big data. The main intention of cloud computing is to provide high computing and storage resources under condensed administration so that big data applications can use them efficiently [3]. The maturity of cloud computing technology provides solutions, such as storage and processing, for big data; conversely, the evolution of big data also speeds up the growth of cloud computing. The parallel computing and distributed storage capabilities of cloud computing help big data applications to store, analyze, and efficiently manage large amounts of data. Although the two technologies are closely related, the main difference between them is that cloud computing transforms the information technology architecture while big data affects industry decision making. The cloud provides three types of delivery models and three types of deployment models, as follows:
6.1.1. Delivery Model
Cloud computing provides three types of delivery model through which it delivers services to consumers.
Infrastructure as a Service (IaaS)
This service model of cloud computing provides computing resources such as virtual machines, storage, networking services, servers, and application programming interfaces on a pay-per-use basis and lets users place their workloads on the cloud.
Platform as a Service (PaaS)
Platform as a Service provides a cloud-based environment with the essential requisites to support the complete development and delivery of web-based applications, without the expenditure and complexity of buying and administering the required hardware, software, provisioning, and hosting. In brief, it provides an application development environment for developers.
Software as a Service (SaaS)
Software as a Service provides software applications on remote computers in the cloud, used by consumers who connect to the cloud through an Internet link or simply a web browser.
6.1.2. Deployment Model
There are three forms in which a cloud environment can be set up, differing in size, ownership, and access.
Public Cloud
A public cloud provides a publicly accessible cloud environment operated by a third-party cloud provider. In this type of deployment, all assets and resources are managed, administered, and maintained by the cloud provider.
Private Cloud
A private cloud provides a cloud environment owned and operated by a single organization. A private cloud enables the organization to use cloud computing as a means of centrally accessing and managing its IT resources. Internal or outsourced staff actually manage and administer the cloud environment. Risk and security challenges are lower than in the public cloud because a single body is responsible for management and maintenance.
Hybrid Cloud
The hybrid cloud model is a combination of the public and private cloud models. When a cloud consumer deploys sensitive data services on a private cloud and other, less sensitive data services on a public cloud, the result of this combination is a hybrid deployment model.
6.2. Internet of Things (IoT)
The Internet of Things is the network of devices or objects embedded with sensors and network connectivity, which enables these objects to collect and exchange data via the Internet. The big data collected from these devices and machines is different in nature: it is heterogeneous, highly varied, noisy, redundant, and largely unstructured. As predicted
by HP, “Although IoT is not currently the prevailing part of big data, by 2030 the number of networked sensors will reach one trillion and IoT data will then be the most dominant part of big data”. The Internet of Things lets devices and objects in the physical world be sensed, collect and exchange data, and be controlled remotely, enabling more efficient integration between computer-based systems and the physical world and resulting in improved economic benefit. In the IoT, each device or object is identifiable through its embedded sensor or actuator and is able to interoperate with the existing Internet infrastructure. As of 2013, the Internet of Things had evolved from the convergence of multiple technologies such as embedded systems, wireless sensor networks, control systems, and automation. The notion of smart devices able to sense data appeared as early as 1982 with a modified Coke machine at Carnegie Mellon University, which was the first appliance connected to the Internet [59]. Big data and the Internet of Things are two sides of the same coin: the number of physical objects connected to the Internet is estimated to rise from about 13 billion today to 50 billion by 2020, and because these connected devices generate large amounts of data, the IoT's Internet-connected objects are closely tied to big data.
Figure 3 Data generation sources in the IoT (e.g. GPS, smart cars, microwaves, mobile computing, lighting/heating/refrigerators, alarms, windows, sensors, mobile phones, PDAs)
6.3. NoSQL Database
NoSQL refers to non-relational databases, which provide the storage and retrieval of data other than relational data. These databases became popular and have been in use since the early twenty-first century, when the need was elicited by Web 2.0 companies such as Google, Facebook, and Amazon. The most dominant characteristics of these databases are that they can be scaled out simply by horizontally adding clusters of computers, their simplicity of design, and, last but not least, greater control over high availability. NoSQL databases are increasingly used in big data and real-time applications. NoSQL is now often read as “Not only SQL” to highlight the fact that such systems may also support SQL-like query languages. There is an assortment of NoSQL database types, such as key-value, column-oriented, document-oriented, graph, and multi-model. Examples of NoSQL databases currently used in science and technology are Cassandra, CouchDB, HBase, MarkLogic, MongoDB, BigTable, Dynamo, and so on.
Key-value Database
A key-value database is the simplest type of NoSQL database from an API perspective: data is stored as key-value pairs. These databases store data in a schema-less manner and are therefore considered less complex, simple, and easy to handle. In a key-value data store, all data is stored as an indexed key with an associated value, hence the name. Since key-value data stores use primary-key access, they usually offer great performance and can easily be scaled up or down. Some exemplars of key-value NoSQL databases are Cassandra, BerkeleyDB, DynamoDB, Redis, Riak, CouchDB, Voldemort, and so on.
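As an illustration of how small the API surface of a key-value store is, the sketch below uses Redis through the Jedis Java client; the server address and the key/value shown are purely illustrative.

import redis.clients.jedis.Jedis;

public class KeyValueExample {
    public static void main(String[] args) {
        // Assumes a Redis server reachable at localhost:6379 (hypothetical).
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // Data is addressed purely by key; there is no schema to define.
            jedis.set("user:42:name", "Alice");
            String name = jedis.get("user:42:name");
            System.out.println(name);  // prints "Alice"
        }
    }
}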
Column Oriented
A column-oriented database stores data in columns instead of rows; such databases are designed to store data tables as sections of columns rather than rows. A column-oriented database stores data as column families: a row has many columns associated with a row key, and column families group data that is similar in nature or accessed together. Each column family can be considered a container of rows, where a key identifies the row and the row has multiple columns associated with it. Some examples of column-oriented databases are HBase, BigTable, HyperTable, and so on.
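A minimal sketch of writing and reading a single cell through the HBase Java client is shown below; the table name ("users"), column family ("info"), and row key are hypothetical, and a reachable HBase instance with that table already created is assumed.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ColumnStoreExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("users"))) {
            // A row is addressed by its row key; each cell lives inside a column family.
            Put put = new Put(Bytes.toBytes("row-1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Read the cell back by row key, column family, and column qualifier.
            Result result = table.get(new Get(Bytes.toBytes("row-1")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}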
Document Oriented
Document-oriented databases are built around the notion of document-type data, in which data is encapsulated and encoded in documents in some standard format such as XML, JSON, or BSON. The database stores data in the form of documents and retrieves a document via a document key which is associated with each document and uniquely identifies it. A document-oriented database is an extension of the key-value store in which the document is more complex: it contains data, and each document is assigned a key which is used to retrieve it. These databases are used for storing, retrieving, and managing document-oriented information, which is mostly semi-structured in nature. Examples of such databases are MongoDB, CouchDB, MarkLogic, HyperDex, and OrientDB.
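The following sketch illustrates the document model using the MongoDB Java driver; the connection URI, database name ("shop"), and collection name ("orders") are hypothetical. Each inserted document is a self-describing record that is later retrieved by querying on one of its fields.

import java.util.Arrays;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import org.bson.Document;

public class DocumentStoreExample {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> orders =
                client.getDatabase("shop").getCollection("orders");

            // Each document is a schema-less, self-describing record (stored as BSON).
            orders.insertOne(new Document("orderId", 1001)
                    .append("customer", "Alice")
                    .append("items", Arrays.asList("book", "pen")));

            // Documents are retrieved by querying on their fields.
            Document found = orders.find(Filters.eq("orderId", 1001)).first();
            System.out.println(found.toJson());
        }
    }
}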
Graph Databases
Graph databases are designed for data whose relations are more efficiently represented as a graph. Such data includes network topologies, social relations, road maps, and so on. Graph databases allow you to store entities and the relations between them: entities can be considered as nodes, which have properties, and the edges between them can be considered as the relations between the entities. Edges have directional significance, and nodes are structured by their relationships, which lets you find interesting patterns. The organization of the graph lets the data be stored once and interpreted in different ways based on its relationships. Examples of such databases are Neo4j, FlockDB, etc.
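As an illustration, the sketch below creates two nodes and a relationship between them and then traverses that relationship, using the Neo4j Java (Bolt) driver and Cypher; the connection details, credentials, and the data itself are hypothetical.

import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Session;

public class GraphExample {
    public static void main(String[] args) {
        try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                                                  AuthTokens.basic("neo4j", "password"));
             Session session = driver.session()) {
            // Nodes (entities with properties) connected by a directed relationship (edge).
            session.run("CREATE (a:Person {name:'Alice'})-[:FRIEND_OF]->(b:Person {name:'Bob'})");

            // Traversing relationships answers questions like "who are Alice's friends?".
            session.run("MATCH (:Person {name:'Alice'})-[:FRIEND_OF]->(f) RETURN f.name")
                   .forEachRemaining(r -> System.out.println(r.get("f.name").asString()));
        }
    }
}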
Figure 4 Evaluation of NoSQL database types (key-value, column-oriented, document-oriented, graph) against relational databases in terms of performance, scalability, flexibility, and complexity
7. CONCLUSION
We have entered a period of data deluge in which data is everything and is present everywhere. Big data and its analysis influence the way individuals think and do business; big data is the next revolution of the information technology age. While big data opens up new opportunities to gain fine-grained new insights into business, it also brings several difficulties that have to be considered in order to achieve real-time benefits. This paper has defined the concept and characteristics of big data and its challenges, with a focus on the features of big data, and has discussed those challenges together with the related technologies. There is no doubt that big data technology is still developing, since existing big data techniques remain very limited in their ability to solve big data problems completely. From hardware to software, we still need more sophisticated storage and I/O techniques to truly solve the big data problems.
REFERENCES
[1] S. Madden, From Databases to Big Data, Massachusetts Institute of Technology, IEEE Internet Computing, 2012.
[2] O. Tene, and J. Polonetsky, Privacy in the Age of Big Data: A Time for Big Decisions, Symposium Issue, 64 Stan. L. Rev. Online 63, Feb. 2012.
[3] M. Chen, S. Mao, and Y. Liu, Big Data: A Survey, Springer Science+Business Media, New York, 2014.
[4] A. Jacobs, The Pathologies of Big Data, ACM Queue, 6 Jul. 2009.
[5] D. Laney, 3D Data Management: Controlling Data Volume, Velocity and Variety, Gartner, 6 Feb. 2001.
[6] S. F. Wamba, S. Akter, D. A. Edwards, G. Chopin, and D. Gnanzou, How 'big data' can make big impact: Findings from a systematic review and a longitudinal case study, International Journal of Production Economics, vol. 165, pp. 234-246, 2015.
[7] H. Ozkose, E. S. Ari, and C. Gencer, Yesterday, Today and Tomorrow of Big Data, Procedia - Social and Behavioural Sciences, World Conference on Technology, Innovation and Entrepreneurship, vol. 195, pp. 1042-1050, 2015.
[8] G. Press, How Data Became Big, Smart Data Collective, Jun. 2012.
[9] B. A. Marron, and P. A. D. de Maine, Automatic Data Compression, Communications of the ACM, vol. 10, no. 11, pp. 711-715, Nov. 1967.
[10] A. R. Miller, and C. R. Ashman, The Assault on Privacy, 20 DePaul L. Rev. 1062, 1971.
[11] T. Akiyama, The continued growth of text information: from an analysis of information flow censuses taken during the past twenty years, Keio Communication Review, no. 25, 2003.
[12] I. Dienes, A Meta-Study of 26 "How Much Information" Studies: Sine Qua Nons and Solutions, Hungarian Central Statistical Office (HCSO) 1979-1997, International Journal of Communication, vol. 6, pp. 874-906, 2012.
[13] I. de S. Pool, Tracking the Flow of Information, Science, vol. 221, no. 4611, pp. 609-613, 12 Aug. 1983, DOI: 10.1126/science.221.4611.609.
[14] R. J. T. Morris, The evolution of storage systems, IBM Systems Journal, vol. 42, no. 2, pp. 205-217, ISSN: 0018-8670.
[15] M. Cox and D. Ellsworth, Application-Controlled Demand Paging for Out-of-Core Visualization, Report NAS-97-010, Jul. 1997.
[16] K. Coffman, and A. Odlyzko, The work of the encyclopedia in the age of electronic reproduction, vol. 3, no. 10-5, Oct. 1998.
[17] D. Kenwright, Automation or interaction: what's best for big data?, NASA Ames Research Center, Visualization '99 Proceedings, 24-29 Oct. 1999, pp. 491-495, ISSN: 1070-2385.
[18] IDC, The Expanding Digital Universe: A Forecast of Worldwide Information Growth Through 2010, An IDC White Paper, Mar. 2007. (http://www.emc.com/collateral/analystreports/expandingdigital-idc-white-paper.pdf)
[19] B. Swanson, and G. Gilder, Estimating the Exaflood: The Impact of Video and Rich Media on the Internet - A "zettabyte" by 2015?, Discovery Institute, Seattle, Washington, Jan. 2008.
[20] M. Friedman, M. Girola, M. Lewis, and A. M. Tarenzio, IBM Data Centre Networking: Planning for Virtualization and Cloud Computing, IBM Redbooks.
[21] M. Ferguson, Architecting a Big Data Platform for Analytics, Intelligent Business Strategies, whitepaper prepared for IBM, Oct. 2012.
[22] J. Dittrich, Efficient Big Data Processing in Hadoop MapReduce, Information Systems Group, Saarland University (http://infosys.cs.uni-saarland.de), Proceedings of the VLDB Endowment, vol. 5, no. 12, 2014.
[23] H. Herodotou, H. Lim, G. Luo, N. Borisov, L. Dong, F. B. Cetin, and S. Babu, Starfish: A Self-tuning System for Big Data Analytics, Department of Computer Science, Duke University, CIDR, vol. 11, 2011.
[24] Y. He, RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems, ICDE IEEE Conference, 2011.
[25] H. Zou, Y. Yu, W. Tang, and H.-W. M. Chen, FlexAnalytics: A Flexible Data Analytics Framework for Big Data Applications with I/O Performance Improvement, Special Issue on Scalable Computing for Big Data, Big Data Research, Elsevier, vol. 1, pp. 4-13, Aug. 2014.
[26] M. N. Rahman, and A. Esmailpour, A Hybrid Data Center Architecture for Big Data, Big Data Research, Elsevier, Feb. 2016.
[27] D. Agrawal, Big Data and Cloud Computing: Current State and Future Opportunities, EDBT 2011, Uppsala, Sweden, ACM 978-1-4503-0528-0/11/0003, Mar. 22-24, 2011.
[28] E. B. Lakew, Managing Resource Usage and Allocation in Multi-Cluster Clouds, Umea University, Licentiate Thesis, May 2013; published in IEEE Computer Society, 2012, ISSN 0348-0542.
[29] J. Dai, Data-Flow Based Performance Analysis for Big Data Cloud, Intel Asia-Pacific Research and Development Ltd, 2011.
[30] S. Lu, A Framework for Cloud-Based Large-Scale Data Analytics and Visualization: A Case Study on Climate Data, IEEE Cloud Computing Technology and Science (CloudCom), Nov. 29 - Dec. 1, 2011, pp. 618-622, 2011.
[31] R. T. Kaushik, A Data-Centric Cooling Energy Costs Reduction Approach for Big Data Analytics Cloud, SC12, Nov. 10-16, 2012, Salt Lake City, Utah, USA, 978-1-4673-0806-9/12, IEEE, 2012.
[32] Y. Zhao, R. N. Calheiros, G. Gange, K. Ramamohanarao, and R. Buyya, SLA-Based Resource Scheduling for Big Data Analytics as a Service in Cloud Computing Environments, 44th International Conference on Parallel Processing, IEEE, ISSN: 0190-3918, pp. 510-519, 2015.
[33] C. Wu, R. Buyya, and K. Ramamohanarao, Big Data Analytics = Machine Learning + Cloud Computing, arXiv preprint arXiv:1601.03115, arxiv.org, 2016.
[34] W. Gao, BigDataBench: a Big Data Benchmark Suite from Web Search Engines, arXiv preprint arXiv:1307.0320, 2013.
[35] J. K. Laurila, The Mobile Data Challenge: Big Data for Mobile Computing Research, Nokia Research Centre, Lausanne, Switzerland, Workshop on the Nokia Mobile Data Challenge in conjunction with the 10th International Conference on Pervasive Computing, 2012.
[36] L. Wang, J. Zhan, C. Luo, Y. Zhu, Q. Yang, Y. He, W. Gao, Z. Jia, Y. Shi, S. Zhang, C. Zheng, G. Lu, K. Zhan, X. Li, and B. Qiu, BigDataBench: a Big Data Benchmark Suite from Internet Services, arXiv:1401.1406v2, 22 Feb. 2014.
[37] R. Han, Z. Jia, W. Gao, X. Tian, and L. Wang, Benchmarking Big Data Systems: State-of-the-Art and Future Directions, arXiv:1506.01494v1, 4 Jun. 2015.
[38] S. Shekhar, Spatial Big-Data Challenges Intersecting Mobility and Cloud Computing, ACM International Workshop on Data Engineering for Wireless and Mobile Access, ACM, 2012.
[39] G. Chittaranjan, Mining Large-Scale Smartphone Data for Personality Studies, Personal and Ubiquitous Computing, vol. 17, no. 3, pp. 433-450, 2013.
[40] A. Zaslavsky, Sensing as a Service and Big Data, arXiv preprint arXiv:1301.0159, 2013.
[41] M. A. Alsheikh, D. Niyato, S. Lin, H.-P. Tan, and Z. Han, Mobile Big Data Analytics Using Deep Learning and Apache Spark, arXiv:1602.07031v1, arXiv.org, Feb. 2016.
[42] Hortonworks, Hortonworks DataFlow, Hortonworks.com/hdf/.
[43] The Economist, Data, data everywhere, 25 Feb. 2010.
[44] W. Phil, Supercomputing the Climate: NASA's Big Data Mission, CSC World, Computer Sciences Corporation, 2013.
[45] L. Andrew, The real story of how big data analytics helped Obama win, InfoWorld, 31 May 2014.
[46] C. L. P. Chen, and C.-Y. Zhang, Data-intensive applications, challenges, techniques and technologies: A survey on Big Data, Information Sciences, vol. 275, pp. 314-347, 2014.
[47] I. Arel, D. C. Rose, and T. P. Karnowski, Deep machine learning - a new frontier in artificial intelligence research, IEEE Computational Intelligence Magazine, vol. 5, no. 4, pp. 13-18, 2010.
[48] G. Ingersoll, Introducing Apache Mahout, Member of Technical Staff, Lucid Imagination, IBM Corporation, Sep. 2009.
[49] D. Lyubimov and A. Palumbo, Apache Mahout: Beyond MapReduce, February 2016.
[50] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks, Microsoft Research, Silicon Valley, 2007.
[51] M. J. Foley, Microsoft drops Dryad; puts its big-data bets on Hadoop, All About Microsoft, Nov. 16, 2011.
[52] Apache Storm, www.storm.apache.org/index.htm (Retrieved on Jun. 20, 2020).
[53] A. Auradkar, C. Botev, S. Das, D. DeMaagd, A. Feinberg, P. Ganti, B. G. L. Gao, K. Gopalakrishna, B. Harris, J. Koshy, K. Krawez, J. Kreps, S. Lu, S. Nagaraj, N. Narkhede, S. Pachev, I. Perisic, L. Qiao, T. Quiggle, J. Rao, B. Schulman, A. Sebastian, O. Seeliger, A. Silberstein, B. Shkolnik, C. Soman, R. Sumbaly, K. Surlaker, S. Topiwala, C. Tran, B. Varadarajan, J. Westerman, Z. White, D. Zhang, and J. Zhang, Data infrastructure at LinkedIn, 2012 IEEE 28th International Conference on Data Engineering (ICDE), pp. 1370-1381, 2012.
[54] SAP Community Network, What is SAP HANA?, saphana.com, Scn.sap.com/docs/DOC-60338, Sep. 2012.
[55] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, Spark: Cluster Computing with Working Sets, USENIX Workshop on Hot Topics in Cloud Computing (HotCloud).
[56] Apache Flink, Apache Flink: Scalable Batch and Stream Data Processing, flink.apache.org.
[57] S. Melnik, A. Gubarev, J. J. Long, G. Romer, S. Shivakumar, M. Tolton, and T. Vassilakis, Dremel: Interactive Analysis of Web-Scale Datasets, Google, Inc., Proceedings of the 36th International Conference on Very Large Data Bases, pp. 330-339, 2010.
[58] M. Hausenblas and J. Nadeau, Apache Drill: Interactive Ad-Hoc Analysis at Scale, MapR Technologies, Mary Ann Liebert, Inc., vol. 1, no. 2, Jun. 2013.
[59] Carnegie Mellon University, The "Only" Coke Machine on the Internet, 10 Nov. 2014.