International Journal of Advanced Research in Engineering and Technology (IJARET)
Volume 11, Issue 5, May 2020, pp.624-646, Article ID: IJARET_11_05_066
Available online at https://iaeme.com/Home/issue/IJARET?Volume=11&Issue=5
ISSN Print: 0976-6480 and ISSN Online: 0976-6499
DOI: 10.34218/IJARET.11.5.2020.065
© IAEME Publication
Scopus Indexed
BIG DATA: A COMPREHENSIVE SURVEY
Wasim Haidar SK1, Surendra Pal Singh2, Prashant Johri3
1Department of Computer Science & IT, NIMS University, Rajasthan, India
2Department of Computer Science & IT, NIMS University, Rajasthan, India
3Department of Computer Science and Engineering, Galgotias University, Uttar Pradesh, India
ABSTRACT
In the modern economy, data is everything and everything is data. Yet data arises from a world that is chaotic, unpredictable, and messy. The rise of so-called big data, and the evolution of tools and techniques able to record our every move, desire, and habit, has exposed the tension between the unstable reality we live in and the need to capture it as data. From ancient writing systems to today's data centres, humankind has always collected knowledge and facts. Advances in tools and technology have produced a flood of data that demands more mature data management systems. This excessive growth of data poses great difficulties for human beings, even though enormous and extremely useful value lies hidden within it. Big data can significantly increase productivity and efficiency in business, enable revolutionary discoveries in scientific disciplines, and open up great opportunities in many other fields. At the same time, big data brings many difficulties and challenges, which are the other side of the coin. This paper presents a close view of big data, including its brief history, applications, opportunities and challenges, and the current tools and techniques for dealing with big data problems. We also discuss several underlying technologies related to big data, such as cloud computing, the Internet of Things, and NoSQL databases.
Key words: Big Data, Hadoop, MapReduce, NoSQL
Cite this Article: Wasim Haidar SK, Surendra Pal Singh, Prashant Johri, Big Data: A
Comprehensive Survey, International Journal of Advanced Research in Engineering and
Technology, 11(5), 2020, pp.624-646.
https://iaeme.com/Home/issue/IJARET?Volume=11&Issue=5
1. INTRODUCTION
In 2010, Gartner declared that "information will be the oil of the 21st century", and it is true that today data is everything and everything is data. As we live in a digital world, data has increased rapidly in every field. According to a 2011 report by the International Data Corporation (IDC), the overall volume of data created and copied in the world was 1.8 ZB, having grown by nearly nine times within five years [1].
Amid this explosive growth of data in the global digital world, the term big data is used to describe these massive datasets. Compared with traditional databases, big data also includes unstructured data that requires more real-time analysis. Big data refers to data that previously was not feasible to store, manage, and analyze [2]. Big data offers new insights to businesses and organizations by changing the way new value is derived from data; it helps to uncover in-depth knowledge of the hidden value in large volumes of data, but it also brings new challenges, such as how to store and manage such large amounts of data. With regard to sheer size, commercial relational systems in fact perform quite well: most analytics vendors, such as Greenplum, Netezza, and Teradata, report being able to handle multi-petabyte datasets [3]. It is on the "too fast" and "too hard" fronts that database systems do not fit; databases cannot handle streaming data. What about platforms such as Hadoop and MapReduce? Like DBMSs, they can scale to handle large amounts of data by adding more physical machines, but they too are limited: they provide a low-level infrastructure for processing data rather than managing it. To add data management capabilities, developers can build data administration tools on top of these platforms, and many have been built, such as Hive and HBase, but these in essence seem to be re-implementations of DBMSs rather than solutions to the new problems and challenges at the heart of big data. Moreover, these platforms provide poor support for the "too fast" problem because they process large blocks of data, which makes it difficult to achieve low response times.
Data sets are growing ever more swiftly because of increasingly cheap and effective tools and technologies such as mobile devices, cameras, microphones, remote sensing devices, radio frequency identification (RFID), wireless sensor networks, and the Internet of Things. Traditional relational databases and data management and visualization tools cannot handle big data and often struggle with such large volumes using existing technology. What is required instead is a new class of tools capable of running parallel software on tens, hundreds, or even thousands of servers to effectively manage and process this large, complex data [4]. Big data usually includes data sets so high in volume and complexity that they are beyond the ability of existing software and hardware tools to store, manage, and process within a reasonable elapsed time. In a 2001 META Group research report, analyst Doug Laney described how the growing amount of data raises challenges and opportunities along three dimensions: increasing volume, the speed of data in and out (velocity), and variety [5]. Gartner and much of the industry use these three Vs to describe big data, and additional Vs, such as veracity and value, have since been added to its characteristics. Big data usually includes stacks of unstructured data that need more real-time analysis. Although big data opens new opportunities for realizing value from large amounts of data and enables us to gain profound knowledge of hidden values, it also incurs new challenges, such as how to integrate, store, manage, and process such large amounts of data. Nowadays, companies that manage and process such large amounts of data are growing swiftly: Google processes hundreds of petabytes of data, Facebook generates more than 10 petabytes of data per month, and YouTube receives 72 hours of video every minute. The swift growth of new technologies such as cloud computing and the Internet of Things further accelerates the growth of data. Such data, in quantity and complexity, will exceed the capacity of existing IT infrastructure and architecture, and its real-time requirements will significantly strain current computing capabilities. This ever-growing data raises the problem of how to capture and manage such heterogeneous data. In the face of the volume, variety, complexity, velocity, veracity, and security of big data, we must efficiently mine such datasets at different levels of processing, management, visualization, and forecasting, so as to effectively disclose their hidden values and properties and improve decision making.
1.1. Big Data: A Current and Future Frontier
There are many definitions of big data in the existing literature, but in common terms, data that cannot be handled by a single machine is termed big data. With the swift advancement of tools and technology, big data is growing quickly in every trade and science. Big data has important characteristics beyond volume that distinguish it from merely huge or very large amounts of data. Even now, when the significance of big data is recognized across trades and industries, people still hold diverse views on its definition. In general, big data is the amount of data that cannot feasibly be stored, managed, and processed by existing tools. Definitions differ because different groups, research scholars, scientific and technological enterprises, data analysts, and technical practitioners, have different concerns. The following definitions help us better understand the economic, social, and technological nuances of big data. In 2010, the Apache Hadoop project defined big data as "datasets which could not be stored, supervised and processed by existing general tools and technologies within a tolerable span". In 2011, McKinsey & Company announced big data as the next frontier for innovation, competition, and productivity. Big data refers to data that is too big in terms of volume, too fast in terms of velocity, and too hard in terms of variety. The existing literature describes big data as the next immense thing in innovation [6]. Big data is the quantity of data that was previously not practicable to store, process, and manage with existing tools. Big data technologies are a new generation of technologies and architectures intended to extract value efficiently from high-volume, multivariate data sets by enabling high-speed acquisition, discovery, and analysis [7]. Big data was defined as early as 2001, when Doug Laney, an analyst at META Group, depicted in a research report the challenges and opportunities brought by the growing amount of data with a 3V model: high Volume, Velocity, and Variety [5]. Although this model was not initially used to define big data, many enterprises, including IBM, Microsoft, and Gartner, still used the 3V model to define big data over the subsequent ten years. In the 3V model, volume refers to the production and collection of high quantities of data; variety refers to the diversity of data types, such as traditional structured data, semi-structured data, and unstructured data; and velocity refers to the timeliness of data: data capture and analysis must be conducted swiftly so as to properly and efficiently exploit the commercial value of big data [3]. A 2011 IDC report describes big data as "a new generation of technologies designed to extract value from large amounts and a wide variety of data by enabling high-speed capture, discovery and analysis". With this definition, big data can be characterized by four Vs: Volume, Velocity, Variety, and Value. This characterization has been widely recognized since it highlights the meaning and necessity of big data: discovering the immense hidden values in data. It also exposes the most vital dilemma of big data, namely how to expose value from such large and diverse datasets. An additional V, Veracity, is also added to the definition of big data and refers to the trustworthiness of data. These five Vs are defined as follows:
Figure 1 The five Vs of big data: volume (data at rest), velocity (production rate of data), variety (diversity of data), value (turning data into benefits), and veracity (trustworthiness of data)
Volume
Volume describes the size aspect of big data and represents the amount of data generated. Volume signifies data at rest; it describes the sheer mass of data.
Velocity
Velocity describes the speed of incoming data, as well as the need to stream fast-moving data into bulk storage for later batch processing. Because the production rate is conspicuously high, data must also be analyzed promptly. The rapid generation of data is what velocity signifies.
Variety
Variety describes the diversity of data types, since data comes from many different sources. Data can be of three types: structured, semi-structured, and unstructured. Structured data is all data that can be stored in a relational database, in tables with rows and columns. Semi-structured data cannot be stored in a relational database but has some organizational properties that make it simpler to analyze, such as email, social media feeds, and web logs. Unstructured data includes text and multimedia data, sensor data, photographs and images, video and audio files, and so on. Although these types of files may have an internal structure, they are still regarded as unstructured because the data they contain does not fit methodically into a database.
Value
Value describes our ability to turn data into value: how to discover the great hidden value in data so as to realize the economic and commercial benefit of big data. Holding a large amount of data that cannot be used effectively for some purpose makes it useless, so value is the most important V of big data.
Veracity
Veracity refers to the credibility, quality, and accuracy of data: how correct is our data? With new techniques and technological advancements, we can now work with data that is less reliable or accurate, such as Twitter posts with hashtags, abbreviations, and colloquial speech.
2. BIG DATA: A BRIEF HISTORY
The journey of big data began many years before the present buzzword of "big data". It is already more than seventy years since we first confronted a high growth rate in the volume of data, traditionally known as the "information explosion", a term first used in 1941. The scale of the problem was first recognized in 1944 by Fremont Rider, the Wesleyan University librarian, who estimated that American university libraries were doubling in size every sixteen years [8]. At this rate of increase, Rider hypothesized, the Yale library would hold over 200,000,000 volumes by 2040. In 1949 Claude Shannon, known as the "father of information theory", carried out research on high-capacity storage items such as punched cards and photographic data, in which one of the largest items was the Library of Congress, measured at over 100 trillion bits of data. In 1961, Derek Price concluded in his research on scientific knowledge that the number of new publications had grown exponentially rather than linearly, doubling every fifteen years and growing by a factor of ten every half century. In November 1967 a paper titled "Automatic data compression" by B. A. Marron and P. A. D. de Maine was published in Communications of the ACM, stating that "the information explosion noted in recent years makes it essential that storage requirements for all datasets be kept to a minimum" [9]. The paper describes an automatic compression technique that can be used to decrease storage requirements and increase the transmission rate of information. In 1971 Arthur R. Miller wrote the book "The Assault on Privacy", observing that too many information system managers seem to measure a man by the number of bits of storage capacity his database will occupy [10]. The Ministry of Posts and Telecommunications in Japan began conducting the Information Flow Census in 1975, tracking the quantity of information circulating in Japan [11]. The census introduced "amount of words" as the unified unit of measurement across all media. The 1975 census already found that the supply of information was increasing much faster than its consumption, and in 1978 it reported that "the demand for information provided by mass media, which are regarded as one-way communication, has become dormant, and the demand for information provided by personal telecommunications media, which are regarded as two-way communications, has significantly increased." In 1981 the Hungarian Central Statistics Office began a research project accounting for the country's information industries, including measuring information volume in bits; in 1993 Istvan Dienes, chief scientist of the Hungarian Central Statistics Office, compiled a manual for a standard system of national information accounts [12]. In 1983 Ithiel de Sola Pool published "Tracking the Flow of Information" [13], in which, reviewing the growth trends in 17 major communications media from 1960 to 1977, he concluded that the flow of information had grown at about 2.9% during that period, driven by broadcast media. In July 1986 Hal B. Becker published a paper estimating that "semiconductor RAM (random access memory) should be storing 1.25 * 10^11 bytes per cubic inch by the year 2000". By 1996, according to R. Morris and B. J. Truskowski, digital storage had become more cost-effective than paper for storing data [14]. In October 1997 Michael Cox and David Ellsworth published a paper, "Application-controlled demand paging for out-of-core visualization", in the Proceedings of the IEEE 8th Conference on Visualization [15]. They concluded that visualization poses an interesting challenge for computer systems, since data sets are generally quite large, taxing the capacities of main memory, local disk, and even remote disk, and they called this the problem of big data. It is the first article in the ACM digital library to use the term "big data".
In April 1998 John R. Mashey, Chief Scientist at SGI, presented a paper titled "Big Data and the Next Wave of Infrastress" at a USENIX meeting. In October 1998 K. G. Coffman and Andrew Odlyzko concluded in their paper that the growth rate of traffic on the public Internet was much higher than that of other networks [16]. Odlyzko later established the Minnesota Internet Traffic Studies (MINTS), tracking the growth of Internet traffic from 2002 to 2009. In October 1999 Bryson, Kenwright and Haimes joined Robert van Liere, David Banks and Sam Uselton on a panel titled "Automation or interaction: what's best for big data?" at the IEEE 1999 Conference on Visualization [17]. In October 2000 Peter Lyman and Hal R. Varian published "How Much Information?", the first comprehensive study to quantify, in terms of computer storage, the total amount of new and unique information created in the world annually; the study found that in 1999 the world produced about 1.5 exabytes of unique information. In February 2001 Doug Laney, an analyst with the META Group, published a research note titled "3D Data Management" in which he explained how to control data volume, velocity, and variety; a decade later, these "3Vs" had become the generally accepted three dimensions of big data. In September 2005 Tim O'Reilly published the article "What is Web 2.0?", in which he claims that "data is the next Intel Inside" and "SQL is the new HTML". In March 2007 IDC (International Data Corporation) published a white paper that was the first study to estimate and forecast the amount of digital data created and replicated each year; IDC forecast that between 2006 and 2010 the information added annually to the digital universe would double roughly every 18 months [18]. In January 2008 Bret Swanson and George Gilder published "Estimating the Exaflood" [19], in which they projected that U.S. IP traffic could reach one zettabyte by 2015 and that the U.S. Internet of 2015 would be at least 50 times larger than it was in 2006. In December 2009 the researchers Roger E. Bohn and James E. Short found in their study that Americans consumed information for about 1.3 trillion hours, an average of almost 12 hours a day. In February 2011 Martin Hilbert and Priscila Lopez estimated that between 1986 and 2007 the world's information storage capacity grew at a compound annual rate of 25% per year; they also estimated that 99.2% of all storage capacity was analog in 1986, whereas by 2007, 94% of storage capacity was digital, a complete reversal of roles. From 2013 to the present, businesses have begun to put into practice new in-memory technologies such as SAP HANA to analyze and optimize large volumes of data. Industries are becoming ever more dependent on exploiting data as a business asset to realize economic advantages, with big data leading the charge as perhaps the most important new technology to understand and make use of in order to stay relevant in today's swiftly changing market.
Table 1 Evolution of Big Data
1941 – First confrontation with a high growth rate in the volume of data, known as the "information explosion".
1944 – Fremont Rider recognizes the problem for the first time, predicting that American university libraries were doubling in size every sixteen years.
1949 – Claude Shannon researches high-capacity storage of data items such as punched cards and photographic data; one of the largest items is the Library of Congress, measured at over 100 trillion bits of data.
1961 – Derek Price concludes that the number of new scientific publications has grown exponentially rather than linearly, doubling every fifteen years and growing by a factor of ten every half century.
1967 – B. A. Marron and P. A. D. de Maine publish "Automatic data compression" in Communications of the ACM, stating that the information explosion makes it essential to keep storage requirements for all datasets to a minimum.
1971 – Arthur R. Miller writes "The Assault on Privacy", observing that too many information system managers seem to measure a man by the number of bits of storage capacity his database will occupy.
1975 – The Ministry of Posts and Telecommunications in Japan begins the Information Flow Census, tracking the quantity of information circulating in Japan.
1978 – The census reports that demand for information provided by one-way mass media has become dormant, while demand for information provided by two-way personal telecommunications media has increased significantly.
1981 – The Hungarian Central Statistics Office begins a research project accounting for the country's information industries, including measuring information volume in bits.
1983 – Ithiel de Sola Pool publishes "Tracking the Flow of Information", reviewing growth trends in 17 major communications media from 1960 to 1977 and concluding that the flow of information grew at about 2.9% during that period, driven by broadcast media.
1986 – Hal B. Becker publishes a paper estimating that semiconductor RAM should be storing 1.25 * 10^11 bytes per cubic inch by the year 2000.
1996 – Digital storage becomes more cost-effective than paper for storing data, according to R. Morris and B. J. Truskowski (The Evolution of Storage Systems, 2003).
1997 – Michael Cox and David Ellsworth publish "Application-controlled demand paging for out-of-core visualization" at the IEEE 8th Conference on Visualization, the first article in the ACM digital library to use the term "big data".
1998 (Apr.) – John R. Mashey, Chief Scientist at SGI, presents "Big Data and the Next Wave of Infrastress" at a USENIX meeting.
1998 (Oct.) – K. G. Coffman and Andrew Odlyzko conclude that the growth rate of traffic on the public Internet is much higher than that of other networks.
1999 – Bryson, Kenwright, Haimes, van Liere, Banks and Uselton hold the panel "Automation or interaction: what's best for big data?" at the IEEE 1999 Conference on Visualization.
2000 – Peter Lyman and Hal R. Varian publish "How Much Information?", the first comprehensive study to quantify the total amount of new and unique information created annually; they find that the world produced about 1.5 exabytes of unique information in 1999.
2001 – Doug Laney of the META Group publishes the research note "3D Data Management", explaining how to control data volume, velocity, and variety; a decade later the "3Vs" become the widely accepted dimensions of big data.
2005 – Tim O'Reilly publishes "What is Web 2.0?", claiming that "data is the next Intel Inside" and "SQL is the new HTML".
2007 – IDC publishes the first study to estimate and forecast the amount of digital data created and replicated each year, forecasting that between 2006 and 2010 the information added annually to the digital universe will double roughly every 18 months.
2008 – Bret Swanson and George Gilder publish "Estimating the Exaflood", projecting that U.S. IP traffic could reach one zettabyte by 2015 and that the U.S. Internet of 2015 will be at least 50 times larger than it was in 2006.
2009 – Roger E. Bohn and James E. Short find that Americans consumed information for about 1.3 trillion hours, an average of almost 12 hours a day.
2011 – Martin Hilbert and Priscila Lopez estimate that the world's information storage capacity grew at a compound annual rate of 25% between 1986 and 2007, and that 99.2% of all storage capacity was analog in 1986 whereas 94% was digital in 2007, a complete reversal of roles.
2013–present – Businesses begin adopting in-memory technologies such as SAP HANA to analyze and optimize large volumes of data, with big data leading the charge as perhaps the most important new technology for staying relevant in today's rapidly changing market.
3. CONTRIBUTION IN THE AREA OF BIG DATA
Data velocity has been one of the major forces driving the development of big data technology, as the processing and operational speed of legacy databases and data warehouses has proved too slow to meet many business needs. Data velocity will become even more important as organizations shift their focus from merely storing and managing data to actively using it. This section presents recent developments and progress in the big data era. We subdivide it into five parts, classical big data technology, big data in cloud computing, data engineering and benchmarking approaches, mobile big data technology, and real-time decision making with big data, covering the period between 2011 and 2016.
3.1. Classical Big Data Technology
In classical big data research, the challenges and opportunities for storage, that is, for databases in the presence of big data, have been examined [1]. Virtualization planning and cloud computing methods in IBM data centres have also been introduced [20]. From a platform-architecture perspective, Ferguson reported on advances in accelerating big data analysis [21]. In more recent work, Dittrich contributed optimizations of big data processing performance in Hadoop and MapReduce [22]. Herodotos Herodotou, a researcher at Duke University, proposed Starfish, a self-tuning system for big data analytics [23]. Yongqiang He, a member of the Facebook infrastructure team, proposed RCFile, a fast and space-efficient data placement structure for MapReduce-based warehouses [24]. H. Zou designed a framework for flexible data analytics that also improves I/O performance [25]. N. Rahman, a researcher at the University of New Haven, US, built a hybrid data centre architecture for big data that reduces delay by approximately 39% and also cuts the cost of cooling systems from 49.57% to 27% [26].
3.2. Big Data in Cloud Computing
Notable progress on big data has also been reported in the era of cloud computing. Agrawal discussed the current state and future opportunities of cloud computing and big data [27]. In his 2013 thesis, Lakew introduced resource management and allocation techniques for multi-cluster clouds [28]. Jinquan Dai of Intel Asia-Pacific Research and Development Ltd presented HiTune, an analyzer for Hadoop that performs dataflow-based performance analysis for big data clouds [29]. Sifei Lu presented a framework for cloud-based large-scale data analytics and visualization, together with case studies on climate data at different scales [30]. A data-centric approach to reducing cooling energy costs for cloud-based big data analytics was presented by T. Kaushik in 2012 [31]. Y. Zhao introduced the idea of AaaS (Analytics as a Service) on cloud computing and focused on scheduling cloud resources for Big Data Analytics Applications (BDAA) to satisfy quality-of-service requirements [32]. R. Buyya showed that the most cost-effective way of performing big data analytics is to apply machine learning on cloud computing [33].
3.3. Data Engineering and Benchmarking Approaches
Alongside these methodologies, several interesting data engineering and benchmarking achievements have been reported for big data. Wanling Gao presented a big data benchmarking project based on the analysis of search engine workloads, together with its benchmark methodology and a tool for generating scalable big data [34]. Juha K. Laurila, a researcher at Nokia Research Centre, presented a mobile data gathering challenge introduced by Nokia, which marks an important step towards mobile big data computing [35]. L. Wang presented a big data benchmark suite that covers broad application scenarios, diverse datasets, data sources, algorithms, and software stacks, and that effectively measures and evaluates big data systems and architectures [36]. R. Han surveyed the state of the art in big data benchmarking in both academia and industry and provided a foundation for building a successful benchmark [37]; he also gave a brief introduction to the data generation and workload implementation techniques used to characterize big data benchmarks.
3.4. Mobile Big Data Technology
Mobile computing is becoming an ever more popular and important counterpart to big data and the traditional Internet. Shekhar, a faculty member at the University of Minnesota, described spatial big data challenges at the intersection of mobility and cloud computing [38]. In June 2011, Chittaranjan presented research investigating large-scale smartphone data for personality studies [39]. From a big data application perspective, Zaslavsky introduced challenges in the Internet of Things and presented an interesting application of Sensing as a Service and big data [40]. A. Alsheikh presented an overview of deep learning for mobile big data analytics and proposed a scalable framework built on Apache Spark [41].
3.5. Real Time Decision Making with Big Data
Early big data development focused on storage. Organizations now focus on the speed of data rather than merely the amount of data being managed. In 2014, raw data was typically stored in its natural form in object-based storage repositories. In 2015, organizations moved from batch to real-time processing to improve their ability to make real-time decisions. It is no longer only about storing large amounts of data and supporting bigger queries and reports; the big trend in 2015 was continuous access to, and processing of, data and records in real time, to maintain constant awareness and take instant decisions. Many tools exist for processing real-time data. Hortonworks DataFlow [42], built on Apache NiFi, provides a way to capture and curate data in motion; it is designed primarily for streams of information coming from Internet of Things devices such as sensors, GPS units, and other machines, and even from social networks. Apache Storm, Apache Flink, Apache Spark, Apache Kafka, SAP HANA, and many other tools also target real-time data processing.
4. BIG DATA OPPORTUNITIES AND CHALLENGES
The ever-increasing surge of data in every field leads to various challenges in big data, such as data acquisition, storage, administration, and analysis. Earlier data management and reasoning systems can process only structured data, as they are based on relational database management systems, and so cannot exploit the volume and diversity of big data. The research community has proposed many solutions, one of which is the cloud, which satisfies the needs of big data by providing infrastructure in a cost-effective manner with easy scaling up and down. For storing large and complex data effectively and permanently, distributed file systems and NoSQL databases are the finest options. Because the data is large and diverse in nature, there are many challenges that are important to consider, such as data representation, data redundancy, data analysis, data confidentiality, data integration and interoperability, data visualization, and data management.
4.1. In Enterprise
At present, big data primarily comes from, and is used in, enterprises. By using big data, an enterprise can enhance its productivity and competitiveness in many respects. In marketing, associative analysis of data can enhance productivity by predicting consumer behaviour and discovering new selling plans. In business planning, by comparing large amounts of data, an enterprise can optimize its goods and service prices and increase its productivity and competence in a competitive world. In operations, an enterprise can improve operational efficiency and customer satisfaction, accurately estimate the resources and production capacity required, and reduce labour costs. In finance, big data improves the total revenue of an enterprise by analysing customer sales data and predicting which commodities are bought frequently or together. In the supply chain, an enterprise can control budgets and improve the services offered through inventory optimization, logistics optimization, and supplier coordination; it can also help to lessen the gap between supply and demand.
4.2. IoT based Big Data
The IoT is an important source of big data. It includes a wide variety of devices and objects such as smart cars, mobile phones, and GPS units. Logistics enterprises have experienced IoT big data most strongly, because at present nearly all vehicles and other mechanical devices are equipped with sensors, GPS, and wireless adapters that generate data rapidly, so that one can track an object's position and prevent many abnormal circumstances such as engine failures. In a smart city, the data generated can be used to support decision making about managing water resources and electricity grids, reducing traffic jams, and improving public welfare.
4.3. Social Network
Online social network services allow individuals to communicate with each other via messages over an information network. Big data from social networks essentially comes from instant messages, blogs, shared spaces, and so on. The analysis of social network data, carried out with computational analytical methods drawing on informatics, mathematics, management, and sociology, helps us understand relations in human society. Social network data analysis can be performed along various dimensions, such as content-based and structure-based analysis. In content-based applications, language and text are the two most important forms of representation; by analysing them, user preferences, interests, demands, and emotions are revealed. In structure-based applications, the social relations, interests, and hobbies of users are gathered as relations among users and represented as a clustered structure in which users are the nodes. Applications of big data from online social networks may help to understand individual behaviour and uncover the laws of social and economic activity, by providing early warning of abnormal activity or abnormal use of electronic services and goods, and by real-time monitoring of events and usage.
4.4. Medical and Health Care
At present, medical data is growing swiftly all over the world. Because the data is diverse and complex in nature, it has to be efficiently stored and effectively processed, analyzed, and queried. Big data has great potential to handle medical data and thereby improve healthcare. Using big data from the medical industry, one can predict the recovery of patients, the treatment of a particular disease, the average percentage of patients with the same syndrome, and so on. In 2007 Microsoft launched HealthVault, a medical application with which one can manage an individual's medical records or information from family medical devices.
4.5. Science
Big data has huge applications in science and research. Starting in 2000, the Sloan Digital Sky Survey (SDSS) gathered astronomical data at a rate of about 200 GB per night, collecting about 140 terabytes in total, more than all the data gathered in the previous history of astronomy [43]. Google's DNAStack, part of Google Genomics, collects and organizes DNA samples of genetic data from around the world to identify diseases and other medical conditions, and allows scientists worldwide to use the sample data. The NASA Center for Climate Simulation stores about 32 petabytes of climate data [44].
4.6. Government
The use of big data in government is advantageous, enabling gains in profit, productivity, and efficiency. In 2012, big data analysis played a major role in Barack Obama's successful re-election campaign [45]. In India, big data analysis helped the NDA win the 2014 Indian general election. Although big data has applications in all sectors of government, easy and timely capture and analysis of data is essential for different government agencies to meet and improve their mission requirements. Government organizations are starting to deploy big data tools and technologies to analyze massive amounts of data in order to prevent fraud, abuse, and waste.
4.7. Private Sector
Big data has been used in retail, banking, real estate, and other private-sector industries. Walmart, an American multinational retail corporation, handles more than 1 million customer transactions every hour, which in turn create more than 2.5 petabytes of data. Windermere Real Estate, one of the largest real estate companies in the Pacific Northwest with 300 offices and 7,000 agents, uses anonymized GPS signals from roughly 100 million drivers to help new home buyers determine their typical drive times to and from work at different times of the day. The FICO Falcon fraud detection system protects credit and debit cards, as well as users' bank accounts, worldwide, and handles about 2.1 billion active accounts.
Figure 2 Adoption of big data in different areas (enterprise, IoT-based, social network, medical and healthcare, science, government, and private sector)
4.8. Data Volume
The amount of data is one of the biggest challenges of big data and gives rise to the other important challenges. With big data, you should be able to scale very swiftly and flexibly whenever needed; it is not actually the volume itself that creates the problem but the need to scale whenever the data generation rate is notably high. The scaling problem of big data gives rise to further challenges, such as how to represent huge amounts of data so that analytics produce meaningful results. The data life cycle management challenge also stems from data volume: which data should be stored and which data should be discarded. Big data therefore comes with a great deal of nuisance that has to be considered carefully in order to take advantage of big data technology.
4.9. Data Diversity
Because data is diverse in nature, coming from different sources, rigid schemas are not beneficial; instead we need a more flexible and efficient design to handle our data. The technology must be flexible enough to handle data swiftly and effectively, so that we can perform transactions on the data in real time and run analytics as fast as possible, regardless of the type of data we have. Diversity in data poses a serious challenge to analytics and hence to business decision making, and coordinating data from different sources is another serious problem that degrades performance.
4.10. Data Security
Security is a major and vital challenge of big data technology. Big data contains messy as well as sensitive data, such as credit card information, personal information, and other confidential records, so the data has to be secured against leakage, breaches, and other security threats. Most NoSQL databases provide only a few security mechanisms for maintaining security on a big data platform, so protection against threats is one of the fundamental and critical challenges of big data, although many frameworks have been designed to provide security on big data platforms.
4.11. Performance
As we live in a digital world where even microsecond delays can be costly, big data must be processed at exceedingly high velocity no matter how much data, or what type of data, your database must handle. The data handling techniques of both RDBMS and NoSQL database solutions can place a serious strain on performance.
4.12. Data Management
Managing big data with an RDBMS is costly, time consuming, and often ineffective, while most NoSQL solutions are burdened by complex operation and hard configuration. It is therefore hard to manage data of high volume, variety, velocity, and complexity, and managing such large amounts of data efficiently is the foremost fundamental problem of big data.
4.13. Integration
Because data comes from multiple sources and is analysed to gain useful insights, integration is needed at multiple levels. Consider a scenario in which two organizations with different data management systems want to merge: here, integration plays a vital role. Integration may involve the integration of processes, of databases, or of administrative tasks. Data has to be integrated and optimized to ensure the smooth operation of business applications, and this is hard to achieve in a real-time scenario.
4.14. Interoperability
Interoperability is a challenge that arises from integration; the two are two sides of the same coin. Interoperability depends heavily on integration: if integration is not applied effectively, interoperability cannot be achieved, because two different databases must first be integrated before interoperability between them is possible. It is a major challenge of big data.
4.15. Continuous Availability
When you rely on big data for your business applications, even conventional "high availability" is sometimes not sufficient. Your data can never go down, as it is the most vital asset of your application, yet a certain amount of downtime is inherent in both RDBMS and NoSQL systems.
4.16. Data Visualization
The main purpose of data visualization is to represent knowledge more intuitively and effectively through graphical views of data. To harvest big data correctly and effectively, the next big challenge is figuring out how to display the information in a way that is useful for decision making. Helping organizations find outliers, interpret hidden values and patterns, and understand their fast-changing datasets is where visualization provides real value. Traditional data analytics and reporting tools cannot sufficiently handle big data; the greatest challenges lie in the variety and velocity of big data. Another challenge for visualization-based data discovery tools is IoT-based data, such as mobile user data.
5. BIG DATA TOOLS AND TECHNIQUES
5.1. Big Data Techniques
Big data needs techniques that can process massive amounts of data while at the same time creating business value. In practice, big data techniques are driven by specific application areas. They draw on a number of domains, including data mining, machine learning, social network analysis, statistics, neural networks, sentiment analysis, and genetic algorithms.
5.1. Data Mining
Data mining is a technique for extracting valuable patterns from data. It includes techniques such as association rule mining, cluster analysis, classification, and regression. Data mining on big data is more challenging than traditional data mining. Take cluster analysis as an example: an intuitive way to cluster big data is to extend existing methods so that they can cope with the massive workload. Most extensions generally work on a sample of the big data, and the sample-based results are used to derive a partition for the overall dataset [46], as illustrated in the sketch below.
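The following is a minimal sketch of this sample/batch-based clustering idea, assuming the scikit-learn and NumPy libraries are available; the synthetic dataset, cluster count, and batch size are illustrative only, not part of the survey.

```python
# Sketch: cluster a large dataset by processing small batches, then assign
# every point using the centroids learned from those batches.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Simulate a dataset too large to cluster comfortably in one pass.
rng = np.random.default_rng(seed=42)
big_dataset = rng.normal(size=(1_000_000, 8))

# MiniBatchKMeans fits on small random batches instead of the full data,
# mirroring the "cluster a sample, then partition the rest" strategy.
model = MiniBatchKMeans(n_clusters=5, batch_size=10_000, random_state=0)
model.fit(big_dataset)

# The learned centroids are then used to label the overall dataset.
labels = model.predict(big_dataset)
print(labels[:10])
```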
5.2. Machine Learning
Machine learning is an essential part of artificial intelligence, concerned with designing algorithms that let machines learn from observed data. Its most important feature is the ability to discover knowledge and make decisions automatically [46]. To handle big data efficiently, traditional machine learning methods must be scaled up, and deep learning has become a new research frontier in artificial intelligence [47]. Machine learning algorithms are used, for example, to distinguish spam from non-spam emails, to learn user preferences and activity patterns and make recommendations based on them, to estimate the probability of winning, to determine the best material with which to engage a customer, and so on. A toy spam-filter sketch follows.
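Below is a deliberately tiny spam/non-spam classifier sketch, assuming scikit-learn; the training messages and labels are invented purely for illustration and are not taken from the survey.

```python
# Sketch: learn to separate spam from non-spam using bag-of-words features
# and a Naive Bayes classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

messages = ["win a free prize now", "meeting moved to 3pm",
            "cheap loans click here", "lunch tomorrow?"]
labels = ["spam", "ham", "spam", "ham"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(messages, labels)

# The learned model then labels previously unseen messages.
print(model.predict(["claim your free prize"]))
```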
5.3. Social Network Analysis
Social network analysis is a technique first used by the telecommunications industry and now a key technique in modern sociology. It examines social relationships with the help of network theory, in which nodes represent the individuals in a network and ties represent the relationships between them. Social network analysis includes social system design, human behaviour modelling, social network visualization, social network evolution analysis, and graph query and mining [46]. Today online social media has become so popular that analysing a network of millions or billions of connected people is usually difficult and costly. A tiny example of the node-and-tie model appears below.
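The sketch below illustrates the node-and-tie model on a toy graph, assuming the networkx library; the people and relationships named are invented for illustration.

```python
# Sketch: individuals as nodes, relationships ("ties") as edges.
import networkx as nx

g = nx.Graph()
g.add_edges_from([("alice", "bob"), ("bob", "carol"),
                  ("carol", "alice"), ("carol", "dave")])

# Degree centrality: who is most connected in this tiny network.
print(nx.degree_centrality(g))

# Groups of mutually reachable people (trivial in this small example).
print(list(nx.connected_components(g)))
```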
5.4. Statistics
Statistics is the science of collecting, organizing, and interpreting data. Statistical techniques are used to extract correlations and associations between different objects in the form of numerical descriptions. Traditional statistical algorithms are not well suited to big data; data-driven statistical analysis, a branch of statistics, focuses on the scaling and parallelization of statistical algorithms. A minimal example of a numerical association measure is shown below.
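As a minimal illustration of extracting a numerical association between two variables, the sketch below computes a correlation coefficient with NumPy; the data is synthetic and purely illustrative.

```python
# Sketch: a simple statistical association between two synthetic series.
import numpy as np

rng = np.random.default_rng(0)
ad_spend = rng.uniform(10, 100, size=1000)
sales = 3.0 * ad_spend + rng.normal(0, 20, size=1000)

# Pearson correlation coefficient between the two variables.
corr = np.corrcoef(ad_spend, sales)[0, 1]
print(f"correlation between ad spend and sales: {corr:.3f}")
```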
5.5. Neural Networks
Neural networks are another important part of artificial intelligence. They are a mature technique with a wide range of applications, such as pattern recognition, speech recognition, image analysis, time series prediction, and computer numerical control. A neural network is used to approximate functions that may depend on a large number of inputs. Generally it consists of an interconnected network of neurons that exchange messages with each other. The connections carry numerical values, or weights, which are adjusted on the basis of experience, making the network robust to inputs and capable of learning. The toy example below illustrates this weight-adjustment idea.
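The following is a toy sketch of the idea that connection weights are adjusted from experience; it trains a single artificial neuron (a perceptron) on synthetic data and is not a production neural network.

```python
# Sketch: a single neuron whose weights are repeatedly nudged by its errors.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))                 # input signals
y = (X @ np.array([1.5, -2.0, 0.5]) > 0) * 1  # hidden rule the neuron must learn

w = np.zeros(3)   # connection weights, initially zero
lr = 0.1          # learning rate

for _ in range(20):                     # repeated exposure to the data
    for xi, target in zip(X, y):
        pred = 1 if xi @ w > 0 else 0
        w += lr * (target - pred) * xi  # adjust weights based on the error

print("learned weights:", w)
```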
5.6. Sentiment Analysis
Sentiment analysis extracts subjective information from source material. It generally estimates the polarity of a document, sentence, or entity, that is, whether the expressed review or opinion is positive, negative, or neutral. The more advanced "beyond polarity" sentiment classification also classifies opinions into emotional states such as happy, angry, and sad. A deliberately simple lexicon-based sketch follows.
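The sketch below uses a tiny hand-made word list to estimate polarity; real sentiment systems use learned models and far larger lexicons, so this is only an illustration of the positive/negative/neutral idea.

```python
# Sketch: lexicon-based polarity scoring of short texts.
POSITIVE = {"good", "great", "love", "happy", "excellent"}
NEGATIVE = {"bad", "terrible", "hate", "angry", "sad"}

def polarity(text: str) -> str:
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(polarity("I love this product, it is great"))   # positive
print(polarity("terrible service, I hate waiting"))   # negative
```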
5.7. Genetic Algorithm
Genetic algorithms belong to the class of evolutionary algorithms, which produce solutions to optimization and search problems using techniques inspired by natural evolution, such as inheritance, crossover, mutation, and selection, thereby imitating the process of natural selection. A compact sketch of these steps is given below.
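The following compact sketch shows selection, crossover, and mutation on a toy problem (maximizing the number of 1s in a bit string); population size, string length, and mutation rate are illustrative parameters.

```python
# Sketch: a minimal genetic algorithm with selection, crossover and mutation.
import random

def fitness(bits):
    return sum(bits)  # toy goal: maximize the number of 1s

def evolve(pop_size=20, length=16, generations=50, mutation_rate=0.05):
    pop = [[random.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]          # selection: keep the fitter half
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, length)   # crossover point
            child = a[:cut] + b[cut:]
            child = [bit ^ 1 if random.random() < mutation_rate else bit
                     for bit in child]          # mutation
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

print(evolve())  # the best individual found
```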
BIG DATA TOOLS FOR BATCH PROCESSING
5.8. Hadoop and MapReduce
Hadoop is the most widely accepted and used tool in big data applications. It is open source, batch-oriented, data- and I/O-intensive, general-purpose software that provides reliable, efficient, distributed computing. Hadoop provides a framework for the distributed storage and processing of large amounts of data across clusters of computers, and it is designed to scale up from a single server to thousands of machines, each offering local storage and computation. The Hadoop project includes four modules, as follows:
Hadoop Common
It includes common functionalities that support other Hadoop modules.
Hadoop Distributed File System (HDFS)
HDFS provides high-throughput access to high-volume data. It stores data in chunks across many nodes in the cluster and replicates the data across nodes for durability. It has a master-slave architecture in which one node in the cluster is the NameNode and the others are DataNodes. The NameNode, the master process running on a single node, directs user access to files in HDFS; the DataNodes, the slaves running on all other nodes, are responsible for block creation, replication, and deletion.
Hadoop YARN
YARN stands for Yet Another Resource Negotiator. It is a framework that provides job scheduling and cluster resource management. The full functionality of YARN is available in the upgraded version of Hadoop (Hadoop 2.0).
Hadoop MapReduce
Hadoop MapReduce provides parallel processing of very large data sets. Hadoop is used in applications such as spam filtering, social recommendation, and network search. At present the biggest Hadoop clusters include about 4,000 nodes, and the number of nodes can be increased to 10,000 with the newer Hadoop 2.7.1 release. In November 2012, Facebook stated that its Hadoop cluster could process 100 petabytes of data, growing by about 0.5 petabytes per day. Hadoop thus provides a cost-effective framework and a most convenient service for acquiring and processing large amounts of data. MapReduce comprises two logical functions, the mapper and the reducer, and Hadoop handles distributing the Map and Reduce tasks across the cluster. The Map function takes data in the form of key-value pairs and outputs a set of key-value pairs; the Reduce function takes a key and an associated array of values and outputs a set of key-value pairs. The sketch below shows this structure for a simple word count.
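The following is a simplified, single-process sketch of the mapper/reducer structure for word counting; in a real Hadoop Streaming job the mapper and reducer would be separate scripts reading from standard input, and the sample lines here are invented for illustration.

```python
# Sketch: word count in the map -> shuffle/sort -> reduce style.
from itertools import groupby

def mapper(lines):
    # Map: emit a (word, 1) pair for every word in the input.
    for line in lines:
        for word in line.strip().split():
            yield word.lower(), 1

def reducer(pairs):
    # Reduce: sum the counts associated with each key (word).
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    sample = ["big data is big", "data needs big tools"]
    for word, total in reducer(mapper(sample)):
        print(word, total)
```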
5.9. Apache Mahout
Apache Mahout is a project governed by the Apache Software Foundation to produce scalable, business-friendly machine learning algorithms for constructing intelligent and efficient applications [48]. Machine learning is an essential part of artificial intelligence and is used to improve a computer's output based on past experience. There are several approaches to machine learning, the two most basic and important being supervised and unsupervised learning, both of which Mahout supports. Collaborative filtering, clustering, and categorization are the three main machine learning tasks that Mahout currently implements, and they are heavily used in real-world applications [48]. The stable version of Apache Mahout at the time of writing is 0.11.1, released in November 2015, which introduces a new math environment called Mahout Samsara with additional features such as support for the Spark and H2O environments, a distributed algebraic optimizer, an R-like Scala DSL API, and easy integration with compatible libraries such as MLlib [49].
5.10. Dryad
Dryad is a Microsoft Research project [50] providing a general-purpose distributed execution engine for data-parallel applications. The project examined a programming model for writing parallel and distributed programs that scale from a very small cluster to a large one. An application written for Dryad is represented as a directed acyclic graph that defines the flow of data in the application; the vertices of the graph signify the operations to be executed on the data. The Dryad runtime parallelizes the dataflow graph by distributing the computational vertices across the nodes of a cluster. The flow of data between computational vertices is implemented through communication channels between them, and the scheduling of vertices onto the available hardware is handled by the Dryad runtime without intervention from the developer or administrator. However, in 2011, on the Windows HPC team blog, Microsoft announced with a minor update to the latest Dryad release that it would be the final preview; Microsoft dropped its Dryad big data processing tool and focused its development on a Windows implementation of Hadoop [51].
BIG DATA TOOLS FOR STREAM PROCESSING
5.11. Storm
Apache Storm [52] is a free and open source distributed real-time computation system for processing streams of data. It is simple, can be used with any programming language, and is specifically designed for real-time processing, in contrast to Hadoop, which targets batch processing. It is scalable, fault tolerant, and reliable, provides great performance, and is easy to set up and operate. Storm is fast, processing roughly a million tuples per second per node, and it integrates with existing databases and queuing systems. In a Storm cluster, jobs run as topologies and, unlike Hadoop jobs, never finish on their own until you kill them. Because Storm jobs are based on topologies, the user must create a different topology for each real-time computation task; a topology is a graph that describes how the computation is to be performed in real time. Like Hadoop, Storm uses a master-slave architecture in which one node is the master and the others are worker nodes. The master node runs a process called Nimbus, which is responsible for distributing code among the nodes in the cluster and assigning tasks to machines. Each worker node runs a process called the Supervisor, which listens for jobs assigned to it. Each node in a topology encloses processing logic, and the links between nodes signify how data should be passed between them. The whole computational topology is partitioned and distributed among a number of worker nodes. A further process, ZooKeeper, plays an important role: it coordinates between Nimbus and the Supervisors by recording their states on local disk.
5.12. Apache Kafka
Apache Kafka [53] is a distributed, high-throughput, partitioned commit-log service written in Scala, providing a high-performance messaging system with a unique design. It was originally developed at LinkedIn and open sourced in early 2011. It was built to provide a unified, high-performance, low-latency platform for handling real-time data feeds, and it serves as a tool for managing streaming and operational data with in-memory analytical techniques, thereby supporting real-time decision making. Kafka is a publish-subscribe messaging system with four main characteristics: messages are persisted and replicated on disk and are therefore durable; throughput is high, so it is fast; processing is distributed; and data can be loaded into Hadoop in parallel. A minimal publish/subscribe sketch follows.
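The sketch below shows the publish-subscribe pattern using the kafka-python client; it assumes a broker running at localhost:9092, and the topic name "events" and the messages are illustrative, not part of the survey.

```python
# Sketch: publish a few messages to a topic, then consume them.
from kafka import KafkaProducer, KafkaConsumer

# Producer side: publish messages to the "events" topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
for i in range(3):
    producer.send("events", value=f"event-{i}".encode("utf-8"))
producer.flush()

# Consumer side: subscribe to the same topic and read messages as they arrive.
consumer = KafkaConsumer("events",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest",
                         consumer_timeout_ms=5000)
for message in consumer:
    print(message.value.decode("utf-8"))
```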
5.13. SAP Hana
SAP Hana [54] is an in-memory data analytics platform which can be deployed on premises or in the cloud. It provides real-time analytics such as operational reporting, data warehousing, and predictive and text analysis for real-time applications such as sense-and-response applications, planning and optimization, and core process accelerators. The SAP Hana database is the core of this real-time platform and differs from other database systems: owing to its hybrid structure for processing both transactional and analytical data fully in memory, SAP Hana combines the best of both worlds.
5.14. Apache Spark
Apache Spark [55] is an open source cluster computing framework for large datasets, originally developed at the University of California, Berkeley AMPLab and later donated to the Apache Software Foundation. It provides an interface centred on the resilient distributed dataset (RDD) for programming an entire cluster with implicit data parallelism and fault tolerance. The availability of RDDs eases the implementation of both iterative algorithms and interactive data analysis. Apache Spark can run programs up to 100x faster than Hadoop MapReduce when data is held in memory. Spark supports a rich set of libraries including Spark SQL, Spark Streaming, MLlib for machine learning, and GraphX.
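The following sketch illustrates the RDD programming model with a word count written against the Spark Java API; the HDFS input and output paths are hypothetical placeholders. Each transformation is recorded in the RDD lineage, which is what allows Spark to recompute lost partitions and thereby provide fault tolerance.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class WordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("WordCount");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Transformations are lazy; each one adds a step to the RDD lineage.
            JavaRDD<String> lines = sc.textFile("hdfs:///data/input.txt");  // hypothetical path
            JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey(Integer::sum);
            // The action below triggers execution of the whole lineage.
            counts.saveAsTextFile("hdfs:///data/output");  // hypothetical path
        }
    }
}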
5.15. Apache Flink
Apache Flink [56] is an open source distributed framework for batch and stream data processing. The core of Flink is a distributed streaming dataflow engine. It essentially aims to bridge the gap between MapReduce-like systems and shared-nothing parallel databases, so Flink executes arbitrary dataflow programs in a parallel and pipelined manner. Flink's runtime natively supports the execution of iterative algorithms. Flink programs, written in Java or Scala, are automatically compiled and optimized into dataflow programs. Flink does not provide any storage medium of its own; data must be stored in distributed storage systems such as HDFS or HBase. For streaming data, Flink can consume input from message queues such as Kafka.
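A minimal sketch of a Flink streaming dataflow in Java is given below, assuming the DataStream API and a hypothetical socket source on localhost:9999; it maintains a running word count over the incoming stream.

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class StreamingWordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<Tuple2<String, Integer>> counts = env
            .socketTextStream("localhost", 9999)  // hypothetical text source
            .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                @Override
                public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                    // Emit one (word, 1) tuple per word in the incoming line.
                    for (String word : line.split(" ")) {
                        out.collect(new Tuple2<>(word, 1));
                    }
                }
            })
            .keyBy(value -> value.f0)  // partition the stream by word
            .sum(1);                   // running count per word

        counts.print();
        env.execute("Streaming Word Count");
    }
}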
BIG DATA TOOLS FOR INTERACTIVE ANALYSIS
5.16. Dremel
Dremel [57] is a distributed interactive querying system developed at Google in 2010. It is a scalable, interactive, ad-hoc query system for the analysis of read-only nested data, capable of running aggregation queries over trillion-row tables in seconds. At Google the system has thousands of users and scales to thousands of CPUs and petabytes of data.
5.17. Apache Drill
Apache Drill [58] is an open source SQL query system for big data exploration. It provides high-performance interactive analysis of semi-structured and rapidly evolving data. It is similar to Google's Dremel and is particularly designed to make good use of nested data; Drill can be viewed as an open-sourced version of Dremel. It can scale to 10,000 servers or more and can process petabytes of data in seconds. Low-latency SQL queries, dynamic queries on self-describing data, nested data support, and easy integration with Apache Hive are some of Drill's key features.
6. UNDERLYING TECHNOLOGIES
6.1. Cloud Computing
Cloud computing is strongly related to big data. The main intention of cloud computing is to provide high computing and storage resources under condensed administration so that big data applications can use them efficiently [3]. The maturity of cloud computing technology provides solutions, such as storage and processing, for big data; conversely, the evolution of big data also speeds up the growth of cloud computing. The parallel computing and distributed storage capabilities of cloud computing help big data applications to store, analyze, and efficiently manage large amounts of data. Although the two technologies are closely related, the main difference between them is that cloud computing transforms the information technology architecture while big data affects industry decision making. The cloud provides three types of delivery models and three types of deployment models, as follows:
6.1.1. Delivery Model
Cloud computing provides three types of delivery model through which it delivers services to consumers.
Infrastructure as a Service (IaaS)
This service model of cloud computing provides computing resources such as virtual machines, storage, networking services, servers, and application programming interfaces on a pay-per-use basis and lets users place their workloads on the cloud.
Platform as a Service (PaaS)
Platform as a Service provides a cloud-based environment with the essential requisites to support the complete development and delivery of web-based applications, without the expenditure and complexity of buying and administering the required hardware, software, provisioning, and hosting. In brief, it provides an application development environment for developers.
Software as a Service (SaaS)
Software as a Service provides software applications on remote computers in the cloud, used by consumers who connect to the cloud through an Internet link or simply a web browser.
6.1.2. Deployment Model
There are three forms in which a cloud environment can be set up, differing in size, ownership, and access.
Public Cloud
A public cloud provides a publicly accessible cloud environment operated by a third-party cloud provider. In this type of deployment, all assets and resources are managed, administered, and maintained by the cloud provider.
Private Cloud
A private cloud provides a cloud environment owned and operated by a single organization. A private cloud enables the organization to use cloud computing as a means of centrally accessing and managing its IT resources. Internal or outsourced staff actually manage and administer the cloud environment. Risk and security challenges are lower than in the public cloud because a single body is responsible for management and maintenance.
Hybrid Cloud
The hybrid cloud model is a combination of the public and private cloud models. When a cloud consumer deploys sensitive data services on a private cloud and other, less sensitive data services on a public cloud, the result of this combination is a hybrid deployment model.
6.2. Internet of Things (IoT)
The Internet of Things is the network of devices or objects embedded with sensors and network connectivity, which enables these objects to collect and exchange data via the Internet. The big data collected from these devices and machines is different in nature: it is heterogeneous, highly varied, noisy, redundant, and largely unstructured. As predicted
by HP, “Although IoT is not currently the prevailing part of big data, by 2030 the number of networked sensors will reach one trillion and IoT data will then be the most dominant part of big data”. The Internet of Things lets devices and objects in the physical world be sensed, collect and exchange data, and be controlled remotely, enabling more efficient integration between computer-based systems and the physical world and resulting in improved economic benefit. In the IoT, each device or object is identifiable through its embedded sensor or actuator and is able to interoperate with the existing Internet infrastructure. As of 2013, the Internet of Things had evolved from the convergence of multiple technologies such as embedded systems, wireless sensor networks, control systems, and automation. The notion of smart devices able to sense data appeared as early as 1982 with a modified Coke machine at Carnegie Mellon University, which was the first appliance connected to the Internet [59]. Big data and the Internet of Things are two sides of the same coin: the number of physical objects connected to the Internet is estimated to rise from about 13 billion today to 50 billion by 2020, and because these connected devices generate large amounts of data, the IoT's Internet-connected objects are closely tied to big data.
Figure 3 Data generation sources in the IoT (e.g. GPS, smart cars, microwaves, mobile computing, lighting/heating/refrigerators, alarms, windows, sensors, mobile phones, PDAs)
6.3. NoSQL Database
NoSQL refers to non-relational databases, which provide the storage and retrieval of data other than relational data. These databases became popular and have been in use since the early twenty-first century, when the need was elicited by Web 2.0 companies such as Google, Facebook, and Amazon. The most dominant characteristics of these databases are that they can be scaled out simply by horizontally adding clusters of computers, their simplicity of design, and, last but not least, greater control over high availability. NoSQL databases are increasingly used in big data and real-time applications. NoSQL is now often read as “Not only SQL” to highlight the fact that such systems may also support SQL-like query languages. There is an assortment of NoSQL database types, such as key-value, column-oriented, document-oriented, graph, and multi-model. Examples of NoSQL databases currently used in science and technology are Cassandra, CouchDB, HBase, MarkLogic, MongoDB, BigTable, Dynamo, and so on.
Key-value Database
A key-value database is the simplest type of NoSQL database from an API perspective: data is stored as key-value pairs. These databases store data in a schema-less manner and are therefore considered less complex, simple, and easy to handle. In a key-value data store, all data is stored as an indexed key with an associated value, hence the name. Since key-value data stores use primary-key access, they usually offer great performance and can easily be scaled up or down. Some exemplars of key-value NoSQL databases are Cassandra, BerkeleyDB, DynamoDB, Redis, Riak, CouchDB, Voldemort, and so on.
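As an illustration of how small the API surface of a key-value store is, the sketch below uses Redis through the Jedis Java client; the server address and the key/value shown are purely illustrative.

import redis.clients.jedis.Jedis;

public class KeyValueExample {
    public static void main(String[] args) {
        // Assumes a Redis server reachable at localhost:6379 (hypothetical).
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // Data is addressed purely by key; there is no schema to define.
            jedis.set("user:42:name", "Alice");
            String name = jedis.get("user:42:name");
            System.out.println(name);  // prints "Alice"
        }
    }
}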
Column Oriented
A column-oriented database stores data in columns instead of rows; such databases are designed to store data tables as sections of columns rather than rows. A column-oriented database stores data as column families: a row has many columns associated with a row key, and column families group data that is similar in nature or accessed together. Each column family can be considered a container of rows, where a key identifies the row and the row has multiple columns associated with it. Some examples of column-oriented databases are HBase, BigTable, HyperTable, and so on.
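A minimal sketch of writing and reading a single cell through the HBase Java client is shown below; the table name ("users"), column family ("info"), and row key are hypothetical, and a reachable HBase instance with that table already created is assumed.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ColumnStoreExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("users"))) {
            // A row is addressed by its row key; each cell lives inside a column family.
            Put put = new Put(Bytes.toBytes("row-1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Read the cell back by row key, column family, and column qualifier.
            Result result = table.get(new Get(Bytes.toBytes("row-1")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}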
Document Oriented
Document-oriented databases are built around the notion of document-type data, in which data is encapsulated and encoded in documents in some standard format such as XML, JSON, or BSON. The database stores data in the form of documents and retrieves a document via a document key which is associated with each document and uniquely identifies it. A document-oriented database is an extension of the key-value store in which the document is more complex: it contains data, and each document is assigned a key which is used to retrieve it. These databases are used for storing, retrieving, and managing document-oriented information, which is mostly semi-structured in nature. Examples of such databases are MongoDB, CouchDB, MarkLogic, HyperDex, and OrientDB.
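The following sketch illustrates the document model using the MongoDB Java driver; the connection URI, database name ("shop"), and collection name ("orders") are hypothetical. Each inserted document is a self-describing record that is later retrieved by querying on one of its fields.

import java.util.Arrays;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import org.bson.Document;

public class DocumentStoreExample {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> orders =
                client.getDatabase("shop").getCollection("orders");

            // Each document is a schema-less, self-describing record (stored as BSON).
            orders.insertOne(new Document("orderId", 1001)
                    .append("customer", "Alice")
                    .append("items", Arrays.asList("book", "pen")));

            // Documents are retrieved by querying on their fields.
            Document found = orders.find(Filters.eq("orderId", 1001)).first();
            System.out.println(found.toJson());
        }
    }
}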
Graph Databases
Graph databases are designed for data whose relations are more efficiently represented as a graph. Such data includes network topologies, social relations, road maps, and so on. Graph databases allow you to store entities and the relations between them: entities can be considered as nodes, which have properties, and the edges between them can be considered as the relations between the entities. Edges have directional significance, and nodes are structured by their relationships, which lets you find interesting patterns. The organization of the graph lets the data be stored once and interpreted in different ways based on its relationships. Examples of such databases are Neo4j, FlockDB, etc.
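As an illustration, the sketch below creates two nodes and a relationship between them and then traverses that relationship, using the Neo4j Java (Bolt) driver and Cypher; the connection details, credentials, and the data itself are hypothetical.

import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Session;

public class GraphExample {
    public static void main(String[] args) {
        try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                                                  AuthTokens.basic("neo4j", "password"));
             Session session = driver.session()) {
            // Nodes (entities with properties) connected by a directed relationship (edge).
            session.run("CREATE (a:Person {name:'Alice'})-[:FRIEND_OF]->(b:Person {name:'Bob'})");

            // Traversing relationships answers questions like "who are Alice's friends?".
            session.run("MATCH (:Person {name:'Alice'})-[:FRIEND_OF]->(f) RETURN f.name")
                   .forEachRemaining(r -> System.out.println(r.get("f.name").asString()));
        }
    }
}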
Figure 4 Evaluation of NoSQL database types (key-value, column-oriented, document-oriented, graph) against relational databases in terms of performance, scalability, flexibility, and complexity
7. CONCLUSION
We have entered a period of data deluge in which data is everything and is present everywhere. Big data and its analysis influence the way individuals think and do business; big data is the next revolution of the information technology age. While big data opens up new opportunities to gain fine-grained new insights into business, it also brings several difficulties that have to be considered in order to achieve real-time benefits. This paper has defined the concept and characteristics of big data and its challenges, with a focus on the features of big data, and has discussed those challenges together with the related technologies. There is no doubt that big data technology is still developing, since existing big data techniques remain very limited in their ability to solve big data problems completely. From hardware to software, we still need more sophisticated storage and I/O techniques to truly solve the big data problems.
REFERENCES
[1] S. Madden, From Databases to Big Data, Massachusetts Institute of Technology, IEEE Internet Computing, 2012.
[2] O. Tene, and J. Polonetsky, Privacy in the Age of Big Data: A Time for Big Decisions, Symposium Issue, 64 Stan. L. Rev. Online 63, Feb. 2012.
[3] M. Chen, S. Mao, and Y. Liu, Big Data: A Survey, Springer Science+Business Media, New York, 2014.
[4] A. Jacobs, The Pathologies of Big Data, ACM Queue, 6 Jul. 2009.
[5] D. Laney, 3D Data Management: Controlling Data Volume, Velocity and Variety, Gartner, 6 Feb. 2001.
[6] S. F. Wamba, S. Akter, D. A. Edwards, G. Chopin, and D. Gnanzou, How 'big data' can make big impact: Findings from a systematic review and a longitudinal case study, International Journal of Production Economics, vol. 165, pp. 234-246, 2015.
[7] H. Ozkose, E. S. Ari, and C. Gencer, Yesterday, Today and Tomorrow of Big Data, Procedia - Social and Behavioural Sciences, World Conference on Technology, Innovation and Entrepreneurship, vol. 195, pp. 1042-1050, 2015.
[8] G. Press, How Data Became Big, Smart Data Collective, Jun. 2012.
[9] B. A. Marron, and P. A. D. de Maine, Automatic Data Compression, Communications of the ACM, vol. 10, no. 11, pp. 711-715, Nov. 1967.
[10] A. R. Miller, and C. R. Ashman, The Assault on Privacy, 20 DePaul L. Rev. 1062, 1971.
[11] T. Akiyama, The continued growth of text information: from an analysis of information flow censuses taken during the past twenty years, Keio Communication Review, no. 25, 2003.
[12] I. Dienes, A Meta-Study of 26 "How Much Information" Studies: Sine Qua Nons and Solutions, Hungarian Central Statistical Office (HCSO) 1979-1997, International Journal of Communication, vol. 6, pp. 874-906, 2012.
[13] I. de S. Pool, Tracking the Flow of Information, Science, vol. 221, no. 4611, pp. 609-613, 12 Aug. 1983, DOI: 10.1126/science.221.4611.609.
[14] R. J. T. Morris, The evolution of storage systems, IBM Systems Journal, vol. 42, no. 2, pp. 205-217, ISSN: 0018-8670.
[15] M. Cox and D. Ellsworth, Application-Controlled Demand Paging for Out-of-Core Visualization, Report NAS-97-010, Jul. 1997.
[16] K. Coffman, and A. Odlyzko, The work of the encyclopedia in the age of electronic reproduction, vol. 3, no. 10-5, Oct. 1998.
[17] D. Kenwright, Automation or interaction: what's best for big data?, NASA Ames Research Center, Visualization '99 Proceedings, 24-29 Oct. 1999, pp. 491-495, ISSN: 1070-2385.
[18] IDC, The Expanding Digital Universe: A Forecast of Worldwide Information Growth Through 2010, An IDC White Paper, Mar. 2007. (http://www.emc.com/collateral/analystreports/expandingdigital-idc-white-paper.pdf)
[19] B. Swanson, and G. Gilder, Estimating the Exaflood: The Impact of Video and Rich Media on the Internet - A "zettabyte" by 2015?, Discovery Institute, Seattle, Washington, Jan. 2008.
[20] M. Friedman, M. Girola, M. Lewis, and A. M. Tarenzio, IBM Data Centre Networking: Planning for Virtualization and Cloud Computing, IBM Redbooks.
[21] M. Ferguson, Architecting a Big Data Platform for Analytics, Intelligent Business Strategies, whitepaper prepared for IBM, Oct. 2012.
[22] J. Dittrich, Efficient Big Data Processing in Hadoop MapReduce, Information Systems Group, Saarland University (http://infosys.cs.uni-saarland.de), Proceedings of the VLDB Endowment, vol. 5, no. 12, 2014.
[23] H. Herodotou, H. Lim, G. Luo, N. Borisov, L. Dong, F. B. Cetin, and S. Babu, Starfish: A Self-tuning System for Big Data Analytics, Department of Computer Science, Duke University, CIDR, vol. 11, 2011.
[24] Y. He, RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems, ICDE IEEE Conference, 2011.
[25] H. Zou, Y. Yu, W. Tang, and H.-W. M. Chen, FlexAnalytics: A Flexible Data Analytics Framework for Big Data Applications with I/O Performance Improvement, Special Issue on Scalable Computing for Big Data, Big Data Research, Elsevier, vol. 1, pp. 4-13, Aug. 2014.
[26] M. N. Rahman, and A. Esmailpour, A Hybrid Data Center Architecture for Big Data, Big Data Research, Elsevier, Feb. 2016.
[27] D. Agrawal, Big Data and Cloud Computing: Current State and Future Opportunities, EDBT 2011, Uppsala, Sweden, ACM 978-1-4503-0528-0/11/0003, Mar. 22-24, 2011.
[28] E. B. Lakew, Managing Resource Usage and Allocation in Multi-Cluster Clouds, Umea University, Licentiate Thesis, May 2013; published in IEEE Computer Society, 2012, ISSN 0348-0542.
[29] J. Dai, Data-Flow Based Performance Analysis for Big Data Cloud, Intel Asia-Pacific Research and Development Ltd, 2011.
[30] S. Lu, A Framework for Cloud-Based Large-Scale Data Analytics and Visualization: A Case Study on Climate Data, IEEE Cloud Computing Technology and Science (CloudCom), Nov. 29 - Dec. 1, 2011, pp. 618-622, 2011.
[31] R. T. Kaushik, A Data-Centric Cooling Energy Costs Reduction Approach for Big Data Analytics Cloud, SC12, Nov. 10-16, 2012, Salt Lake City, Utah, USA, 978-1-4673-0806-9/12, IEEE, 2012.
[32] Y. Zhao, R. N. Calheiros, G. Gange, K. Ramamohanarao, and R. Buyya, SLA-Based Resource Scheduling for Big Data Analytics as a Service in Cloud Computing Environments, 44th International Conference on Parallel Processing, IEEE, ISSN: 0190-3918, pp. 510-519, 2015.
[33] C. Wu, R. Buyya, and K. Ramamohanarao, Big Data Analytics = Machine Learning + Cloud Computing, arXiv preprint arXiv:1601.03115, arxiv.org, 2016.
[34] W. Gao, BigDataBench: a Big Data Benchmark Suite from Web Search Engines, arXiv preprint arXiv:1307.0320, 2013.
[35] J. K. Laurila, The Mobile Data Challenge: Big Data for Mobile Computing Research, Nokia Research Centre, Lausanne, Switzerland, Workshop on the Nokia Mobile Data Challenge in conjunction with the 10th International Conference on Pervasive Computing, 2012.
[36] L. Wang, J. Zhan, C. Luo, Y. Zhu, Q. Yang, Y. He, W. Gao, Z. Jia, Y. Shi, S. Zhang, C. Zheng, G. Lu, K. Zhan, X. Li, and B. Qiu, BigDataBench: a Big Data Benchmark Suite from Internet Services, arXiv:1401.1406v2, 22 Feb. 2014.
[37] R. Han, Z. Jia, W. Gao, X. Tian, and L. Wang, Benchmarking Big Data Systems: State-of-the-Art and Future Directions, arXiv:1506.01494v1, 4 Jun. 2015.
[38] S. Shekhar, Spatial Big-Data Challenges Intersecting Mobility and Cloud Computing, ACM International Workshop on Data Engineering for Wireless and Mobile Access, ACM, 2012.
[39] G. Chittaranjan, Mining Large-Scale Smartphone Data for Personality Studies, Personal and Ubiquitous Computing, vol. 17, no. 3, pp. 433-450, 2013.
[40] A. Zaslavsky, Sensing as a Service and Big Data, arXiv preprint arXiv:1301.0159, 2013.
[41] M. A. Alsheikh, D. Niyato, S. Lin, H.-P. Tan, and Z. Han, Mobile Big Data Analytics Using Deep Learning and Apache Spark, arXiv:1602.07031v1, arXiv.org, Feb. 2016.
[42] Hortonworks, Hortonworks DataFlow, Hortonworks.com/hdf/.
[43] The Economist, Data, data everywhere, 25 Feb. 2010.
[44] W. Phil, Supercomputing the Climate: NASA's Big Data Mission, CSC World, Computer Sciences Corporation, 2013.
[45] L. Andrew, The real story of how big data analytics helped Obama win, InfoWorld, 31 May 2014.
[46] C. L. P. Chen, and C.-Y. Zhang, Data-intensive applications, challenges, techniques and technologies: A survey on Big Data, Information Sciences, vol. 275, pp. 314-347, 2014.
[47] I. Arel, D. C. Rose, and T. P. Karnowski, Deep machine learning - a new frontier in artificial intelligence research, IEEE Computational Intelligence Magazine, vol. 5, no. 4, pp. 13-18, 2010.
[48] G. Ingersoll, Introducing Apache Mahout, Member of Technical Staff, Lucid Imagination, IBM Corporation, Sep. 2009.
[49] D. Lyubimov and A. Palumbo, Apache Mahout: Beyond MapReduce, February 2016.
[50] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks, Microsoft Research, Silicon Valley, 2007.
[51] M. J. Foley, Microsoft drops Dryad; puts its big-data bets on Hadoop, All About Microsoft, Nov. 16, 2011.
[52] Apache Storm, www.storm.apache.org/index.htm (Retrieved on Jun. 20, 2020).
[53] A. Auradkar, C. Botev, S. Das, D. DeMaagd, A. Feinberg, P. Ganti, B. G. L. Gao, K. Gopalakrishna, B. Harris, J. Koshy, K. Krawez, J. Kreps, S. Lu, S. Nagaraj, N. Narkhede, S. Pachev, I. Perisic, L. Qiao, T. Quiggle, J. Rao, B. Schulman, A. Sebastian, O. Seeliger, A. Silberstein, B. Shkolnik, C. Soman, R. Sumbaly, K. Surlaker, S. Topiwala, C. Tran, B. Varadarajan, J. Westerman, Z. White, D. Zhang, and J. Zhang, Data infrastructure at LinkedIn, 2012 IEEE 28th International Conference on Data Engineering (ICDE), pp. 1370-1381, 2012.
[54] SAP Community Network, What is SAP HANA?, saphana.com, Scn.sap.com/docs/DOC-60338, Sep. 2012.
[55] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, Spark: Cluster Computing with Working Sets, USENIX Workshop on Hot Topics in Cloud Computing (HotCloud).
[56] Apache Flink, Apache Flink: Scalable Batch and Stream Data Processing, flink.apache.org.
[57] S. Melnik, A. Gubarev, J. J. Long, G. Romer, S. Shivakumar, M. Tolton, and T. Vassilakis, Dremel: Interactive Analysis of Web-Scale Datasets, Google, Inc., Proceedings of the 36th International Conference on Very Large Data Bases, pp. 330-339, 2010.
[58] M. Hausenblas and J. Nadeau, Apache Drill: Interactive Ad-Hoc Analysis at Scale, MapR Technologies, Mary Ann Liebert, Inc., vol. 1, no. 2, Jun. 2013.
[59] Carnegie Mellon University, The "Only" Coke Machine on the Internet, 10 Nov. 2014.