Big Data Analysis: Concepts, Tools and Applications
Poonam
ABSTRACT
Nowadays, data is growing exponentially, leading to an explosion of data. With the advancement of technology, a variety of enormous data is being generated at an extremely fast pace in various sectors; this is what is termed big data. We create about 2.5 quintillion bytes of data every day, and this data comes from various sources and in different formats. Apache Hadoop, Apache Spark, MongoDB and NoSQL databases are tools used to handle big data. Analyzing big data has therefore become crucial and inevitable. Big data analytics is being adopted across the globe in order to gain numerous benefits from the data being produced; it examines large and varied types of data to uncover hidden patterns, correlations and other insights. This paper gives a brief summary of the types, tools and applications of big data analytics.
Keywords – Big data, Big Data Analytics, Big Data Tools, Hadoop, HDFS, MapReduce, Big Data Analytics
Applications
I. INTRODUCTION
Due to the use of IoT devices and social media sites, we are generating a large amount of data that is huge in size and not present in a structured manner. Because of the data explosion caused by digital and social media, data is rapidly being produced in very large chunks. Google, Facebook, Netflix, LinkedIn, Twitter and other such platforms clearly qualify as big data technology centers, so it has become challenging for enterprises to store and process this data using conventional methods. These are the biggest factors in the evolution of big data. Big data is a term for collections of datasets so large and complex that they become difficult to process using traditional database systems. Big data is a combination of structured, semi-structured and unstructured data collected by organizations that can be mined for information. It is the massive amount of data that cannot be stored, processed and analyzed using traditional tools. Enterprises must implement modern business intelligence tools to effectively capture, store and process such large amounts of data in real time. Big data in its raw form is not meaningful to us, so we must derive meaningful insights from it in order to benefit from it. This is done by analyzing the data, a process known as big data analytics.

Big data analytics is the process of collecting, organizing and analyzing large sets of data. It is the often complex process of examining big data to uncover information. The big data analytics lifecycle generally involves identifying, procuring, preparing and analyzing large amounts of raw, unstructured data to extract meaningful information that can serve as an input for identifying patterns, enriching existing enterprise data and performing large-scale searches. In short, it is the process of extracting meaningful information from big data. Big data analytics has quickly drawn the attention of the IT industry due to its applications in a majority of areas such as healthcare, business, social media, education and banking [1]. It helps organizations make quicker and better decisions. Big data analytics technologies and techniques provide a means to analyze data sets and draw out new information that can help organizations make informed business decisions. Different kinds of organizations use data analytics tools and techniques in different ways. Regardless of where big data is generated, the real challenge is to bring value to it. With the availability of advanced big data analysis technologies such as NoSQL databases, BigQuery, MapReduce and Hadoop, better perceptions and understandings can be achieved, which improves business policies and the decision-making process [2].

Big data is characterized by 10 parameters, the "10 Vs" [3]. These are shown in Fig. 1.
… a cluster in which all nodes are identical in terms of their software architecture. All the nodes are symmetric, and there is no need for a master node. This feature allows linear scalability.

5) CouchDB: - Apache CouchDB is an open-source, document-oriented NoSQL database. CouchDB is also a clustered database that allows you to run a single logical database server on any number of servers or VMs. A CouchDB cluster improves on the single-node setup with higher capacity and high availability without changing any APIs.
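Purely as an illustrative sketch (the paper gives no code), the snippet below stores a document in CouchDB through its standard HTTP API using Python's requests library; the server URL, database name and credentials are assumptions for the example.

```python
import requests

BASE = "http://localhost:5984"   # assumed local CouchDB server
AUTH = ("admin", "password")     # placeholder credentials

# Create a database; CouchDB exposes each database as a top-level URL path.
requests.put(f"{BASE}/demo_db", auth=AUTH)

# Insert a JSON document; CouchDB assigns an id and revision automatically.
doc = {"type": "reading", "sensor": "s1", "value": 42}
resp = requests.post(f"{BASE}/demo_db", json=doc, auth=AUTH)
print(resp.json())               # e.g. {'ok': True, 'id': '...', 'rev': '...'}
```

Because the interface is plain HTTP and JSON, the same calls work whether they reach a single node or a cluster, which is the point made above about not changing any APIs.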
D. Data Cleaning & Validation Tool: - Data
C. Data Filtering and extraction Tools: - Data Cleansing is the act of detecting and correcting
filtering and extraction tools are used to create corrupt or inaccurate records from a record set, table
structured output from unstructured data gathered in or databases. Data cleaning tools are very helpful
various stages. Some of these are as follows: because they help in minimizing the processing time.
The goal of data cleansing is not just to clean up the
1) Import.io:- It is the one of the best and most data in database but also brings consistency to
reliable web scarping software on internet. If you different sets of data. They also reduce the
want to scrape the contents from different web pages computational speed of data analytics tools. Various
and have short of time then you can use this tool. validation rules are used to confirm the necessity and
This tool allows you to perform multiple data relevance of data extracted for analysis. Sometimes it
scraping tasks at a time. The most interactive feature may be difficult to apply validation constraints due to
of import.io are web crawling, secure login and data complexity of data.
extraction. You can import the contents to Google
sheets, Excel and plot. 1) Open Refine: - Open Refine which is formally
known as Google Refine is a free open source tool. It
2) Octoparse: - Octoparse is a cloud-based web is a powerful tool for working with messy data,
crawler that helps you easily extract any web data cleaning it and transforms it from one format to
without coding [10]. Octoparse is an ultimate tool for another. It is extremely powerful for exploring,
data extraction which allows you turn the whole cleaning and linking data. It is a sophisticated tool for
internet into a structured format. It provides an easy working on big data and performs analytics. Open
user friendly interface which can easily deal with any Refine define explore data feature that explore large
type of websites. It is powerful tool to deal with dataset with very ease. The clean and transform
dynamic websites and interact with many sites in feature enables to clean big data and transform it
various ways. from one form to another. It also provides reconcile
3) ParseHub: - ParseHub is the web browser and match data feature that extends the dataset with
extension that turns your dynamic websites into several web services. Open Refine always keeps your
APIs. It also converts poorly structured websites into data private on your own computer until you want to
APIs without writing a code. Parsehub is supported share or collaborate.
in various systems such as Windows, Mac OS X, and 2) DataCleaner: - DataCleaner is a data quality
Linux. It works with any interactive pages and easily analysis application and a solution platform. It has
searches through forms, opens dropdowns, logins to strong data profiling engine. DataCleaner is a tool
websites, clicks on maps and handles sites with that is integrated with Hadoop. Data transformation,
infinite scroll, tabs, and pop-ups, etc. validation and reporting are its main features. It is a
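The tools above are commercial point-and-click products. Purely as an illustration of the underlying idea (turning unstructured HTML into structured records), the sketch below uses only the Python standard library; the URL and the table layout it expects are invented for the example and are not taken from any of the tools described.

```python
from html.parser import HTMLParser
from urllib.request import urlopen

class TableExtractor(HTMLParser):
    """Collects the text of every <td> cell, grouped into rows."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row:
            self.rows.append(self._row)

    def handle_data(self, data):
        if self._in_cell and data.strip():
            self._row.append(data.strip())

# Hypothetical page containing an HTML table of interest.
html = urlopen("https://example.com/prices.html").read().decode("utf-8")
parser = TableExtractor()
parser.feed(html)
for row in parser.rows:      # each row is now a structured list of cell values
    print(row)
```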
D. Data Cleaning & Validation Tools: - Data cleansing is the act of detecting and correcting corrupt or inaccurate records in a record set, table or database. Data cleaning tools are very helpful because they help to minimize processing time. The goal of data cleansing is not just to clean up the data in a database but also to bring consistency to different sets of data. Cleaning also reduces the computational load on data analytics tools. Various validation rules are used to confirm the necessity and relevance of the data extracted for analysis; sometimes it may be difficult to apply validation constraints due to the complexity of the data. A small generic sketch of cleaning and validation follows the tools below.

1) OpenRefine: - OpenRefine, formerly known as Google Refine, is a free, open-source tool. It is a powerful tool for working with messy data, cleaning it and transforming it from one format to another. It is extremely powerful for exploring, cleaning and linking data, and it is a sophisticated tool for working on big data and performing analytics. Its explore-data feature lets you explore large datasets with ease, the clean-and-transform feature enables you to clean big data and transform it from one form to another, and the reconcile-and-match feature extends a dataset using several web services. OpenRefine always keeps your data private on your own computer until you want to share or collaborate.

2) DataCleaner: - DataCleaner is a data quality analysis application and a solution platform. It has a strong data profiling engine and is integrated with Hadoop. Data transformation, validation and reporting are its main features. There is a profiling engine at its core to profile the data, and this can be extended with data cleansing, transformation, deduplication, matching, merging and enrichment. It profiles and analyses a database within minutes, discovers patterns with the Pattern Finder, finds the frequency of data using the Value Distribution profile, filters contact details, detects duplicates using fuzzy logic, merges the duplicate values, and so on.
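As a generic illustration of the cleaning and validation steps described in this section (not the actual workflow of OpenRefine or DataCleaner), the sketch below uses the pandas library; the column names and validation rules are made up for the example.

```python
import pandas as pd

# Hypothetical raw records with typical quality problems.
raw = pd.DataFrame({
    "customer_id": [101, 101, 102, 103, None],
    "age":         [34, 34, -5, 47, 29],
    "email":       ["a@x.com", "a@x.com", "bad-email", "c@y.com", "d@z.com"],
})

clean = raw.drop_duplicates()                  # remove exact duplicate records
clean = clean.dropna(subset=["customer_id"])   # drop rows missing a mandatory key

# Simple validation rules: plausible age and a well-formed e-mail address.
valid_age = clean["age"].between(0, 120)
valid_email = clean["email"].str.contains(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", regex=True)

print(pd.DataFrame({"valid_age": valid_age, "valid_email": valid_email}))  # rule report
print(clean[valid_age & valid_email])          # cleaned, validated records
```

Real cleaning tools add profiling, fuzzy duplicate detection and enrichment on top of such basic rules, but the overall sequence (deduplicate, handle missing values, validate) is the same.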
3) MapReduce: - MapReduce is a powerful paradigm for parallel computation. Hadoop uses MapReduce to execute jobs on files in HDFS and intelligently distributes the computation over the cluster. During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate servers in the cluster. The MapReduce algorithm contains two important tasks, namely Map and Reduce, and a MapReduce program executes in three stages: the map stage, the shuffle stage and the reduce stage (a minimal word-count sketch follows the stage descriptions below).
• Map stage − The map or mapper's job is to process the input data. Generally the input data is in the form of a file or directory and is stored in the Hadoop Distributed File System (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data.

• Reduce stage − This stage is the combination of the Shuffle stage and the Reduce stage. The Reducer's job is to process the data that comes from the mapper. After processing, it produces a new set of output, which is stored in HDFS.
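A minimal word-count sketch of these stages is shown below. It is written in plain Python and simulates the shuffle locally with a sort; on a real cluster the mapper and reducer would typically be separate scripts launched through Hadoop Streaming, with the framework performing the shuffle. The input handling and file name are assumptions, not taken from the paper.

```python
import sys
from itertools import groupby

def mapper(lines):
    """Map stage: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reducer(pairs):
    """Reduce stage: sum the counts of each word; pairs arrive grouped by key."""
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    # Usage: echo "big data needs big tools" | python wordcount.py
    lines = sys.stdin.read().splitlines()
    shuffled = sorted(mapper(lines))     # shuffle stage: sort/group by key
    for word, total in reducer(shuffled):
        print(f"{word}\t{total}")
```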
… algorithms, and thanks to add-ons, you can apply a lot of techniques.

5) Talend: - Talend is a very good tool for quick data integration, and it makes development short and easy. Talend provides a unified approach that combines rapid data integration, transformation and mapping with automated quality checks. It is well suited for all kinds of data migration between various systems.

IV. APPLICATIONS OF BIG DATA ANALYTICS

Big data analytics is indeed a revolution in the field of Information Technology. The use of data analytics by companies is increasing every year. There are various fields where big data analytics is actively used; some of these are explained below. Big data analytics helps organizations work with their data efficiently and use that data to identify new opportunities. Different techniques and algorithms can be applied to make predictions from the data.

… derived from it in order to make the right decisions. By analyzing big data we can run personalized marketing campaigns, which can result in better and higher sales. Multiple business strategies can be applied for the future success of the company, leading to smarter business moves, more efficient operations and higher profits.

C. Education: - The next big field where big data analytics is used is education. The use of big data has also increased in the educational sector, where it opens new options for research and analysis. In the field of education, new courses have to be developed depending on market requirements, and the market requirements need to be analyzed with respect to the scope of a course. Hence big data analytics is used to analyze market requirements and to develop new courses accordingly.

D. Healthcare: - There are a number of uses of big data analytics in the field of healthcare. One of them is to predict patient health issues: with the help of a patient's health history, big data analytics is used to predict how likely they are to have particular health issues in the future.
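Purely as an illustration of this kind of prediction (the paper gives no model or data), a hedged sketch using scikit-learn's logistic regression on made-up patient-history features:

```python
from sklearn.linear_model import LogisticRegression

# Hypothetical patient-history features: [age, bmi, smoker (0/1), prior_admissions]
X = [[45, 31.0, 1, 2],
     [23, 22.5, 0, 0],
     [61, 28.4, 1, 4],
     [35, 24.1, 0, 1],
     [58, 33.2, 0, 3]]
# Made-up labels: 1 = developed the condition later, 0 = did not.
y = [1, 0, 1, 0, 1]

model = LogisticRegression().fit(X, y)

# Estimated probability that a new patient develops the condition.
new_patient = [[50, 30.0, 1, 1]]
print(model.predict_proba(new_patient)[0][1])
```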
E. Media & Entertainment: - In the field of media & 1. M. Chen, S. Mao, and Y. Liu, “Big data: a
entertainment big data analytics is used to understand survey”, Mobile Networks and Applications,
the demand of shows, movies, songs and so on to vol. 19, No. 2, pp. 171–209, 2014.
deliver a personalized recommendation lists to its 2. Shweta Sinha, “Big Data Analysis:
users. Concepts, Challenges And Opportunities”,
International Journal of Innovative Research
F. Banking: - There are various uses of big data in Computer Science & Technology
analytics in banking sectors. One of its uses is risk (IJIRCST) ISSN: 2347-5552, Volume-8,
Management. By using big data analytics there are Issue-3, May 2020.
many advantages. Big data analytics is used for risk 3. https://tdwi.org/articles/2017/02/08/10-vs-
management. Risk management is an important of-big-data.aspx
concept in any organization especially in the field of 4. Erl, T., Khattak, W. and Buhler, P., 2016.
banking. Risk management analyzes a series of Big data fundamentals: concepts, drivers &
measures which helps the organization to prevent any techniques. Prentice Hall Press
sort of unauthorized activities. In addition to risk 5. Bihani, P. and Patil, S.T., 2014, “A
management it also used to analyze customer income comparative study of data analysis
and expenditures. It helps the bank to predict if a techniques”, International journal of
particular customer is going to choose for various emerging trends & technology in computer
bank offers like loans, credit card schemes and so on. science, 3(2), pp.95-101.
This way the bank is able to identify the right 6. J.Nageswara Rao, M.Ramesh, “A Review
customer who is interested in its offers. on Data Mining & Big Data, Machine
G. Telecommunication: - Big data analytics is used Learning Techniques” , International Journal
in the field of telecommunication in order to gain of Recent Technology and Engineering
profit. Big data analytics can be used to analyze (IJRTE) ISSN: 2277-3878, Volume-7 Issue-
network traffic and call data records. It can also 6S2, April 2019, pp 914-916.
improve service quality and customer experience in 7. http://www.opinioncrawl.com/aboutOpinion
the field of telecommunication. Crawl.htm
H. Government: - Big data analytics has been used 8. Ritu Ratra, Preeti Gulia, “Big Data Tools
widely in the field of government in all over the and Techniques: A Roadmap for Predictive