
International Journal of Computer Science Trends and Technology (IJCST) – Volume 9 Issue 1, Jan-Feb 2021

RESEARCH ARTICLE OPEN ACCESS

Big Data Analysis: Concepts, Tools and Applications


Poonam
Assistant Professor
A.S. College, Khanna

ABSTRACT
Nowadays data is growing exponentially, leading to an explosion of data. With the advancement of technology, an enormous variety of data is being generated at extremely high speed in various sectors; this is what is termed big data. About 2.5 quintillion bytes of data are created every day, and this data comes from various sources and in different formats. Apache Hadoop, Apache Spark, MongoDB and NoSQL databases are tools to handle big data. Analyzing big data has therefore become crucial and inevitable. Big data analytics is being adopted throughout the globe in order to gain numerous benefits from the data being produced. It examines large volumes of different types of data to uncover hidden patterns, correlations and other insights. This paper presents a brief summary of the types, tools and applications of big data analytics.
Keywords – Big Data, Big Data Analytics, Big Data Tools, Hadoop, HDFS, MapReduce, Big Data Analytics Applications

I. INTRODUCTION
Due to the use of IoT devices and social media sites we are generating large amounts of data that are huge in size and not structured. Owing to this data explosion caused by digital and social media, data is being produced rapidly in very large chunks. Google, Facebook, Netflix, LinkedIn, Twitter and other social media platforms clearly qualify as big data technology centers, so it has become challenging for enterprises to store and process this data using conventional methods. These are the biggest factors in the evolution of big data. Big data is a term for collections of datasets so large and complex that they become difficult to process using traditional database systems. Big data is a combination of structured, semi-structured and unstructured data collected by organizations that can be mined for information. It is the massive amount of data that cannot be stored, processed and analyzed using traditional tools. Enterprises must implement modern business intelligence tools to effectively capture, store and process such large amounts of data in real time. Big data in its raw form is not meaningful to us, so we must derive meaningful insights from it in order to benefit from it. This is done by analyzing the data, which is known as big data analytics.

II. BIG DATA ANALYTICS

Big data analytics is the process of collecting, organizing and analyzing large sets of data. It is the often complex process of examining big data to uncover information. The big data analytics lifecycle generally involves identifying, procuring, preparing and analyzing large amounts of raw, unstructured data to extract meaningful information that can serve as an input for identifying patterns, enriching existing enterprise data and performing large-scale searches. Big data analytics has quickly drawn the attention of the IT industry due to its applications in the majority of areas such as healthcare, business firms, social media, education and banking [1]. It helps in quicker and better decision making in organizations. Big data analytics technologies and techniques provide a means to analyze data sets and draw out new information which can help organizations make informed business decisions. Different kinds of organizations use data analytics tools and techniques in different ways. Regardless of what big data is generated from, the real challenge is to bring value to it. With the availability of advanced big data analysis technologies such as NoSQL databases, BigQuery, MapReduce and Hadoop, perceptions and understandings can be better achieved, enabling improvements in business policies and the decision-making process [2].

A. Characteristics of Big Data

Big data is characterized by 10 parameters [3]. These are shown in fig 1.
ISSN: 2347-8578 www.ijcstjournal.org Page 110



Fig 1: 10 Vs of Big Data

1. Volume: - Volume represents the enormous amount of data that is produced. Today data is generated from various sources in different formats, and the volume of data grows exponentially. By 2020, data was expected to rise to 44 zettabytes. New big data tools use distributed systems so that we can store and analyze data across databases that are dotted around anywhere in the world.

2. Velocity: - Velocity refers to the speed at which data is generated, collected and analyzed. This is mainly due to IoT devices, mobile data, social media, etc. Technology now allows us to analyze data while it is being generated, without ever putting it into databases.

3. Variety: - Variety refers to the nature of data: structured, semi-structured and unstructured. In the past the focus was only on structured data, but in fact about 80% of data is unstructured, in the form of text, video, images, etc. With big data technology we can now analyze and bring together data of different types.

4. Veracity: - Veracity refers to the assurance of quality/integrity/credibility/accuracy of the data. Since the data is collected from multiple sources, we need to check it for accuracy before using it for business insights.

5. Value: - Value is the most important term in the context of big data. We know the data is huge; it has to be converted into a form in which it can be analyzed, otherwise it is useless.

6. Validity: - Validity refers to how accurate and correct the data is for its intended use. The benefit from big data analytics is only as good as its underlying data, so good data governance practices are needed to ensure consistent data quality, common definitions, and metadata.

7. Variability: - In the context of big data, variability refers to a few different things. One is the number of inconsistencies in the data, which need to be found by anomaly and outlier detection methods for any meaningful analytics to occur.

8. Volatility: - Due to the velocity and volume of big data, the volatility of data needs to be considered carefully. Proper rules should be established for data currency and availability so that information can be retrieved rapidly when required. With big data, the costs and complexity of storage and retrieval processes are magnified.

9. Vulnerability: - Big data brings new issues regarding security. In any case, with big data, a data breach is a major violation.

10. Visualization: - Visualization is a challenging new characteristic of big data, i.e. how to visualize big data using current visualization tools. Because of the limitations of in-memory technology, low response times, limited functionality and poor scalability, existing big data visualization tools face technical challenges.

B. Types of Big Data Analytics

There are four general categories of analytics, distinguished by the results they produce: descriptive analytics, diagnostic analytics, predictive analytics and prescriptive analytics. Different kinds of organizations use data analytics tools and techniques in different ways. The value and complexity relations between the different analysis types are shown in fig 2.

Fig 2: Value and complexity relation with analysis type [4]
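As a minimal illustration of the difference between the descriptive and predictive categories above, the following Python sketch summarizes a toy monthly sales series (descriptive) and extrapolates its least-squares trend one month ahead (predictive). The figures are invented purely for illustration.

```python
# Toy monthly sales figures (invented data for illustration).
sales = [120, 135, 150, 160, 178, 190]

# Descriptive analytics: summarize what has happened.
total = sum(sales)
average = total / len(sales)

# Predictive analytics: fit a least-squares line to the history
# and extrapolate one month ahead.
n = len(sales)
xs = range(n)
x_mean = sum(xs) / n
y_mean = average
slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, sales)) \
    / sum((x - x_mean) ** 2 for x in xs)
intercept = y_mean - slope * x_mean
forecast = slope * n + intercept

print(f"total={total}, average={average:.1f}, next-month forecast={forecast:.1f}")
```

Diagnostic and prescriptive analytics would then ask, respectively, why the trend looks this way and what action to take about it.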


1) Descriptive Analytics: - Descriptive analytics answers the question of what has happened. It uses data aggregation and data mining techniques to provide insight into the past, and then answers what is happening now based on incoming data. It summarizes the data into a form that is understandable by humans. Google Analytics is a good example of descriptive analysis. This type of analytics helps in creating reports on a company's revenue, profit, sales and so on.

2) Diagnostic Analytics: - Diagnostic analytics is used to determine why something happened in the past. It is characterized by techniques such as data discovery, data mining and correlation, and it takes a deeper look at the data. It is helpful in finding out what kinds of factors and events contributed to particular outcomes.

3) Predictive Analytics: - Predictive analytics uses statistical techniques to understand the future and predict future outcomes. It looks into historical and present data to make predictions about the future, and so provides companies with actionable insight based on data. For example, through sensors and other machine-generated data it can identify when a malfunction is likely to occur.

4) Prescriptive Analytics: - Prescriptive analytics uses optimization and simulation algorithms to advise on possible outcomes and answers the question of what we should do. Basically, it allows users to prescribe a number of different possible actions and then guides them towards a solution. Thus, for prescriptive analytics, organizations optimize their business process models based on the feedback provided by predictive analytic models [5]. The result of the analytics depends upon the characteristics of the data gathered.

III. BIG DATA ANALYTICS TOOLS

The term "big data" can be applied to a dataset which grows at a very intense rate and becomes difficult to store and process. Big data analytics is where advanced techniques are applied to big data sets. There is a variety of tools used for analytics of big data. These tools can be categorized into the different stages of the big data lifecycle, as shown in Table I, based on their usage and implementation [6].

TABLE I
TYPES OF BIG DATA ANALYSIS TOOLS

     Data Collection    Data Storage            Data Filtering and    Data Cleaning and
     Tools              Tools                   Extraction Tools      Validation Tools
  1  Semantria          Apache HBase            Import.io             OpenRefine
                        (Hadoop database)
  2  Opinion Crawl      Oracle NoSQL Database   OctoParse             DataCleaner
  3  Trackur            MongoDB                 ParseHub              MapReduce
  4  OpenText           Apache Cassandra        Mozenda               RapidMiner
  5  SAS Sentiment      CouchDB                 Content Grabber       Talend
     Analysis

A) Data Collection Tools: - Data collection tools play an important role in the big data lifecycle. Some of the most important tools for data collection are Semantria, Opinion Crawl, OpenText and Trackur.

1) Semantria: - Semantria is a cloud-based text and sentiment analysis tool offered by Lexalytics. This tool is designed to help businesses collect tweets, texts and other comments from their clients and analyze them to acquire highly valuable and actionable insights. The main benefits of Semantria are its tools and features that allow users to gain reliable and actionable insights, its customizable features, and its MS Excel compatibility.

2) Opinion Crawl: - Opinion Crawl is an online web sentiment tool for current events, companies, products and people. Users can enter a topic and get an ad hoc sentiment assessment related to that topic, including a pie chart showing current real-time sentiment. This allows users to check which issues drive sentiment in a positive or negative way [7].

3) Trackur: - Trackur is a tool used to collect information. It applies automated sentiment analysis to the specific keywords that users are monitoring, after which decisions can be made. The sentiment of the related document may be positive, negative or neutral. The Trackur algorithm can be used to observe social sites and outline news, collecting information through trend detection and automated sentiment analysis.
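As a rough sketch of what the automated sentiment analysis in these tools does (real products use far richer statistical models; the word lists and comments below are invented for illustration):

```python
# Minimal keyword-based sentiment scoring. POSITIVE/NEGATIVE are
# hypothetical word lists, not any tool's real lexicon.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "poor", "hate", "terrible", "slow"}

def score(comment: str) -> str:
    words = comment.lower().split()
    # Net count of positive minus negative keywords decides the label.
    balance = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if balance > 0:
        return "positive"
    if balance < 0:
        return "negative"
    return "neutral"

comments = ["Great product, I love it", "Terrible and slow service", "It arrived today"]
print([score(c) for c in comments])
```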


4) OpenText: - The OpenText Sentiment Analysis module is a specialized classification engine used to find various subjective patterns and to evaluate the expressions of sentiment present in text. The analysis works at the topic level, the sentence level and the document level. Its prime function is to recognize whether parts of the text are factual or subjective [8].

5) SAS Sentiment Analysis: - SAS is also a sentiment analysis tool that automatically extracts sentiments in real time. It performs this task with the help of various statistical modeling techniques.
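Collection tools like these typically summarize many individual judgments into an overall picture, such as the sentiment pie chart Opinion Crawl presents. A toy aggregation (labels invented for illustration) might look like:

```python
from collections import Counter

# Hypothetical sentiment labels for one topic, as a collection
# tool might produce them; invented for illustration.
labels = ["positive", "positive", "negative", "neutral",
          "positive", "negative", "positive", "neutral"]

counts = Counter(labels)
total = len(labels)

# Percentage share per label - the numbers behind a sentiment pie chart.
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}
print(shares)
```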
B) Data Storage Tools: - One of the most important challenges of big data is how to store it. A good storage tool provides a place to store, query and analyze big data. Some of these tools are as follows:

1) Apache Hadoop: - Hadoop is an open-source framework written in Java that provides cross-platform support. Apache Hadoop is used to develop data processing applications which are executed in a distributed computing environment: it provides distributed storage and computation across clusters of computers. Hadoop runs applications using MapReduce, where data is processed in parallel, and can perform complete statistical analysis on huge amounts of data. There are two core services which Hadoop provides:

• Hadoop MapReduce: - MapReduce is a computational model and software framework for writing applications which run on Hadoop. These MapReduce programs are capable of processing massive data in parallel on large clusters of computation nodes [9].

• HDFS: - HDFS is a main part of Hadoop, as it provides a reliable means of managing big data, and it is closely related to MapReduce. When HDFS takes in data, it breaks the information down into separate blocks and distributes them to different nodes in a cluster. It employs a NameNode and DataNode architecture to implement a distributed file system that provides high-performance access to data on highly scalable Hadoop clusters.

2) NoSQL: - NoSQL is an alternative to traditional databases which does not require any kind of fixed table schema like SQL. The original intention of NoSQL was modern web-scale databases. NoSQL databases can also be referred to as structured storage, of which relational databases are a subset. Compared to relational databases, NoSQL databases can provide superior performance: they scale out easily, have a shared-nothing architecture capable of running on a large number of nodes, and provide non-locking concurrency mechanisms. There are four different types of NoSQL databases – key-value stores, column stores, document-based stores and graph-based stores.

3) MongoDB: - MongoDB is a document database that provides high performance, high availability and easy scalability. It is a cross-platform document-oriented database system classified as a NoSQL database. Indexing in MongoDB is done using a document key structure. It bridges the gap between key-value stores and traditional RDBMS systems, provides flexibility during the initial phases of development and design, and supports online real-time applications.

4) Cassandra: - Apache Cassandra is a leading NoSQL distributed data management system that drives many of today's modern business applications by offering continuous availability, high scalability and strong performance and security. Cassandra handles large amounts of data with its distributed structure. The main goal of Cassandra is to handle big data workloads across multiple nodes without any single point of failure. Cassandra has multiple nodes in a


cluster which are identical in terms of their software architecture. All the nodes are symmetric and do not need a master node; this feature allows linear scalability.

5) CouchDB: - Apache CouchDB is an open-source document-oriented NoSQL database. CouchDB is also a clustered database that allows you to run a single logical database server on any number of servers or VMs. A CouchDB cluster improves on the single-node setup with higher capacity and high availability without changing any APIs.

C. Data Filtering and Extraction Tools: - Data filtering and extraction tools are used to create structured output from unstructured data gathered in various stages. Some of these are as follows:

1) Import.io: - Import.io is one of the best and most reliable web scraping tools on the internet. If you want to scrape the contents of different web pages and are short of time, you can use this tool. It allows you to perform multiple data scraping tasks at a time. The most interactive features of Import.io are web crawling, secure login and data extraction. You can export the contents to Google Sheets, Excel and Plot.

2) Octoparse: - Octoparse is a cloud-based web crawler that helps you easily extract any web data without coding [10]. Octoparse is an ultimate tool for data extraction which lets you turn whole websites into a structured format. It provides an easy, user-friendly interface that can deal with any type of website, and it is a powerful tool for handling dynamic websites and interacting with sites in various ways.

3) ParseHub: - ParseHub is a web browser extension that turns dynamic websites into APIs. It also converts poorly structured websites into APIs without writing code. ParseHub is supported on various systems such as Windows, Mac OS X and Linux. It works with any interactive page and easily searches through forms, opens dropdowns, logs in to websites, clicks on maps and handles sites with infinite scroll, tabs, pop-ups, etc.

4) Mozenda: - Mozenda is a powerful and advanced data scraping and web extraction tool, best known for its user-friendly interface. Mozenda is suitable for programmers, webmasters, journalists, scholars and enterprises. You can easily scrape, manage and store your data without compromising on quality. Mozenda has different interactive options and features to ease your work, and it takes the hassle out of publishing data: you just have to highlight the contents and Mozenda will publish them to your site automatically.

5) Content Grabber: - Content Grabber is a good choice if you want to extract your data through web scraping and web automation. This tool ensures the provision of scalable and readable data, fixes minor errors in your data, and represents the next evolution in data scraping technology. The software can handle travel portals and news websites easily.

D. Data Cleaning and Validation Tools: - Data cleansing is the act of detecting and correcting corrupt or inaccurate records in a record set, table or database. Data cleaning tools are very helpful because they help in minimizing processing time. The goal of data cleansing is not just to clean up the data in a database but also to bring consistency to different sets of data, which also reduces the computation time needed by data analytics tools. Various validation rules are used to confirm the necessity and relevance of the data extracted for analysis. Sometimes it may be difficult to apply validation constraints due to the complexity of the data.

1) OpenRefine: - OpenRefine, formerly known as Google Refine, is a free open-source tool. It is a powerful tool for working with messy data: cleaning it and transforming it from one format to another. It is extremely powerful for exploring, cleaning and linking data, and is a sophisticated tool for working with big data and performing analytics. OpenRefine's explore-data feature lets you explore large datasets with ease; its clean-and-transform feature cleans big data and transforms it from one form to another; and its reconcile-and-match feature extends a dataset with several web services. OpenRefine always keeps your data private on your own computer until you want to share or collaborate.

2) DataCleaner: - DataCleaner is a data quality analysis application and solution platform with a strong data profiling engine, and it integrates with Hadoop. Data transformation, validation and reporting are its main features. There is a profiling engine at its core, which can be extended with data cleansing, transformation, deduplication, matching, merging and enrichment. It profiles and analyses a database within minutes, discovers patterns with the Pattern Finder, finds the frequency of data using the Value Distribution profile, filters contact details, detects duplicates using fuzzy logic, merges duplicate values, etc.
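As a rough sketch of the kind of cleansing pass these tools automate at scale (the records and normalization rules below are invented for illustration), one can trim stray whitespace, normalize case, and drop duplicate records:

```python
# Toy data-cleansing pass: trim whitespace, normalize case, and
# deduplicate - a tiny slice of what tools like OpenRefine and
# DataCleaner do. The records are invented for illustration.
raw_records = [
    "  Alice Smith ",
    "alice smith",
    "Bob JONES",
    "Bob Jones",
    "Carol White",
]

def normalize(record: str) -> str:
    # Collapse inner whitespace, strip the ends, use title case.
    return " ".join(record.split()).title()

seen = set()
cleaned = []
for record in raw_records:
    canonical = normalize(record)
    if canonical not in seen:  # deduplicate on the normalized form
        seen.add(canonical)
        cleaned.append(canonical)

print(cleaned)
```

Real tools extend this idea with fuzzy matching, so that near-duplicates which no simple normalization catches are still merged.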


3) MapReduce: - MapReduce is a powerful paradigm for parallel computation. Hadoop uses MapReduce to execute jobs on files in HDFS and intelligently distributes the computation over the cluster. During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate servers in the cluster. The MapReduce algorithm contains two important tasks, namely Map and Reduce, and a MapReduce program executes in three stages: the map stage, the shuffle stage and the reduce stage.

• Map stage − The map or mapper's job is to process the input data. Generally the input data is in the form of a file or directory and is stored in the Hadoop file system (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data.

• Reduce stage − This stage is the combination of the shuffle stage and the reduce stage. The reducer's job is to process the data that comes from the mapper. After processing, it produces a new set of output, which is stored in HDFS.

4) RapidMiner: - With RapidMiner Studio, accessing, loading and analyzing any type of data is possible. The data can be both structured and unstructured, such as text, still images and media. RapidMiner is very good for performing fast ETL processes and working on your data as you want, no matter what the source is. You will save a lot of time once you learn how to use it. You can create mining analyses with several algorithms and, thanks to add-ons, you can apply a lot of techniques.

5) Talend: - Talend is a very good tool for quick data integration, and it makes development short and easy. Talend provides a unified approach that combines rapid data integration, transformation and mapping with automated quality checks. It is well suited for all kinds of data migration between various systems.

IV. APPLICATIONS OF BIG DATA ANALYTICS

Big data analytics is indeed a revolution in the field of information technology, and the use of data analytics by companies is increasing every year. There are various sectors where big data analytics is actively used; some of these are explained below. Big data analytics helps organizations work with their data efficiently and use it to identify new opportunities, and different techniques and algorithms can be applied to make predictions from the data.

Applications of Big Data Analytics

A. Ecommerce: - The first sector for big data analytics is ecommerce. Nearly 45% of the world is online, creating a lot of data every day. Big data can be used smartly in the field of ecommerce by predicting customer trends, forecasting demand, adjusting prices and so on. Online retailers have the opportunity to provide a better shopping experience and generate higher sales if big data analytics is used properly.

B. Marketing: - Big data on its own does not lead to a strong marketing strategy. Meaningful insights need to be


derived from it in order to make the right decisions. By analyzing big data we can run personalized marketing campaigns, which result in better and higher sales. Multiple business strategies can be applied for the future success of the company, leading to smarter business moves, more efficient operations and higher profits.

C. Education: - The next big field where big data analytics is used is education. The use of big data has also increased in the educational sector, opening new options for research and analysis. In the field of education, new courses are developed depending on market requirements: the market requirements need to be analyzed with respect to the scope of a course, and according to that scope new courses are developed. Hence big data analytics is used to analyze market requirements and to develop new courses.

D. Healthcare: - There are a number of uses of big data analytics in the field of healthcare. One of them is predicting patient health issues: with the help of a patient's health history, big data analytics is used to predict how likely they are to have particular health issues in the future.

E. Media & Entertainment: - In the field of media and entertainment, big data analytics is used to understand the demand for shows, movies, songs and so on, in order to deliver personalized recommendation lists to users.

F. Banking: - There are various uses of big data analytics in the banking sector, and one of them is risk management. Risk management is an important concept in any organization, especially in banking: it analyzes a series of measures which help the organization prevent any sort of unauthorized activity. In addition to risk management, big data analytics is also used to analyze customer income and expenditure. This helps the bank predict whether a particular customer is likely to take up various bank offers such as loans or credit card schemes, so that the bank can identify the right customers who are interested in its offers.

G. Telecommunication: - Big data analytics is used in the field of telecommunication in order to gain profit. It can be used to analyze network traffic and call data records, and it can also improve service quality and customer experience.

H. Government: - Big data analytics has been used widely in the field of government all over the world. In the field of law enforcement, big data analytics can be used to analyze all the available data in order to understand crime patterns. Intelligence services can use predictive analytics to forecast crimes that could be committed. Police departments have been able to reduce crime rates using big data analytics: with the help of data, police can identify whom to target, where to go and how to investigate crime, and big data analytics helps them discover patterns of crime in emerging areas.

V. CONCLUSIONS

In the present scenario there are millions of sources which generate data very rapidly, and these data sources are present across the world. All that data together makes up big data. We must derive meaningful insights from it in order to benefit from big data; this is done by analyzing it, which is known as big data analytics. This paper has presented the concepts of big data analytics, its types and its various applications. Big data analytics is being adopted by various organizations, helping them make quicker and better decisions.

REFERENCES

1. M. Chen, S. Mao, and Y. Liu, "Big data: a survey", Mobile Networks and Applications, vol. 19, no. 2, pp. 171–209, 2014.
2. Shweta Sinha, "Big Data Analysis: Concepts, Challenges And Opportunities", International Journal of Innovative Research in Computer Science & Technology (IJIRCST), ISSN: 2347-5552, Volume-8, Issue-3, May 2020.
3. https://tdwi.org/articles/2017/02/08/10-vs-of-big-data.aspx
4. Erl, T., Khattak, W. and Buhler, P., 2016. Big Data Fundamentals: Concepts, Drivers & Techniques. Prentice Hall Press.
5. Bihani, P. and Patil, S.T., 2014, "A comparative study of data analysis techniques", International Journal of Emerging Trends & Technology in Computer Science, 3(2), pp. 95-101.
6. J. Nageswara Rao, M. Ramesh, "A Review on Data Mining & Big Data, Machine Learning Techniques", International Journal of Recent Technology and Engineering (IJRTE), ISSN: 2277-3878, Volume-7, Issue-6S2, April 2019, pp. 914-916.
7. http://www.opinioncrawl.com/aboutOpinionCrawl.htm
8. Ritu Ratra, Preeti Gulia, "Big Data Tools and Techniques: A Roadmap for Predictive

Analytics", International Journal of Engineering and Advanced Technology (IJEAT), ISSN: 2249-8958, Volume-9, Issue-2, December 2019.
9. Dr. Urmila R. Pol, "Big Data Analysis Using Hadoop Mapreduce", American Journal of Engineering Research, e-ISSN: 2320-0847, p-ISSN: 2320-0936, Volume-5, Issue-6, pp. 146-151.
10. Online source, [Available] https://www.octoparse.com/blog/yes-there-is-such-thing-as-a-free-web-scraper/, 2018.
