Module I-1
Big Data – Definition, Characteristic Features – Big Data Applications - Big Data vs
Traditional Data - Risks of Big Data - Structure of Big Data - Challenges of Conventional
Systems– Evolution of Analytic Scalability - Evolution of Analytic Processes, Tools and
methods - Analysis vs Reporting - Modern Data Analytic Tools.
Big Data
Definition: "Big data is high-volume, high-velocity, and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making." Big Data refers to complex and large data sets that have to be processed and analysed to uncover valuable information that can benefit businesses and organizations.
It refers to a massive amount of data that keeps growing exponentially with time.
Working with it involves data mining, data storage, data analysis, data sharing, and data visualization.
The term is all-comprehensive, covering the data and data frameworks, along with the tools and techniques used to process and analyse the data.
The main characteristics of Big Data are often described as the Vs: Volume, Variety, and Velocity, with Veracity and Value frequently added.
1. Volume :
Volume refers to the vast amount of data generated daily from many sources, such as business processes, machines, social media platforms, networks, human interactions, and so on.
Example: Facebook generates approximately a billion messages per day, records over 4.5 billion "Like" button clicks, and receives more than 350 million new posts each day.
Big data technologies are designed to handle such large amounts of data.
2. Variety :
Big Data can be structured, unstructured, or semi-structured, collected from many different sources.
In the past, data was collected mainly from databases and spreadsheets; these days it arrives in an array of forms, e.g. PDFs, emails, audio, social media posts, photos, videos, etc.
3. Velocity :
Velocity refers to the speed at which data is generated, often in real time.
Of the three Vs, velocity often poses the greatest processing challenge.
It covers the speed at which incoming data sets arrive, their rate of change, and bursts of activity.
A primary requirement of Big Data systems is to make this rapidly arriving data available just as rapidly.
Examples of data generated with high velocity are Twitter messages and Facebook posts.
4. Veracity :
Veracity refers to the quality and trustworthiness of the data being analyzed, and to the ability to handle and manage data of uncertain quality efficiently.
Example: Facebook posts with hashtags.
5. Value :
Value is an essential characteristic of big data: it is not raw data as such that matters, but the valuable and reliable data that we store, process, and analyse.
Applications of Big Data
1. Tracking Customer Spending Habits and Shopping Behavior: Big retail stores (such as Amazon, Walmart, Big Bazaar, etc.) keep data on customers' spending habits (which products customers spend on, which brands they prefer, how frequently they spend), their shopping behavior, and their most-liked products (so that those products can be kept in stock). Based on which products are searched for or sold most, the production or procurement rate of those products is set.
The banking sector uses customers' spending-behavior data to offer a particular customer a discount or cashback on a product they like when it is bought with the bank's credit or debit card. In this way, the right offer can be sent to the right person at the right time.
2. Recommendation: By tracking customer spending habits and shopping behavior, big retail stores provide recommendations to customers. E-commerce sites like Amazon, Walmart, and Flipkart do product recommendation: they track which products a customer searches for and, based on that data, recommend similar products to that customer.
As an example, suppose a customer searches for bed covers on Amazon. Amazon then has data suggesting the customer may be interested in buying a bed cover, so the next time that customer visits a Google page, advertisements for various bed covers appear. Thus, an advertisement for the right product can be sent to the right customer.
YouTube likewise recommends videos based on the types of videos a user has previously liked and watched, and shows advertisements relevant to the content of the video currently playing. For example, if someone is watching a tutorial video on Big Data, an advertisement for another big data course may be shown during that video.
3. Smart Traffic System: Data about traffic conditions on different roads is collected through cameras placed beside roads and at the entry and exit points of the city, and from GPS devices placed in vehicles (Ola and Uber cabs, etc.). All this data is analyzed, and routes that are jam-free, less congested, or faster are recommended. In this way a smart traffic system can be built in a city through big data analysis. A further benefit is that fuel consumption can be reduced.
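As an illustration of the route-recommendation step, here is a minimal sketch using Dijkstra's shortest-path algorithm over estimated travel times; the road network and the times (in minutes) are invented for illustration:

```python
import heapq

def fastest_route(graph, start, end):
    """Dijkstra's shortest path over edge weights given as travel times (minutes)."""
    queue = [(0, start, [start])]  # (total_time, node, path_so_far)
    seen = set()
    while queue:
        time, node, path = heapq.heappop(queue)
        if node == end:
            return time, path
        if node in seen:
            continue
        seen.add(node)
        for neighbour, t in graph.get(node, {}).items():
            if neighbour not in seen:
                heapq.heappush(queue, (time + t, neighbour, path + [neighbour]))
    return None

# Hypothetical road junctions with travel times estimated from camera/GPS data
roads = {
    "A": {"B": 5, "C": 12},
    "B": {"C": 4, "D": 10},
    "C": {"D": 3},
}
print(fastest_route(roads, "A", "D"))  # → (12, ['A', 'B', 'C', 'D'])
```

A real system would continuously update the edge weights from live sensor feeds, but the recommendation step itself reduces to a shortest-path search like this one.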
4. Secure Air Traffic System: Sensors are present at various places on an aircraft (such as the propellers). These sensors capture data like flight speed, moisture, temperature, and other environmental conditions. Based on analysis of this data, environmental parameters within the aircraft are set and adjusted.
By analyzing the machine-generated data from flights, it can also be estimated how long a machine can operate flawlessly and when it should be replaced or repaired.
5. Self-Driving Cars: Big data analysis helps drive a car without human intervention. Cameras and sensors placed at various spots on the car gather data such as the size of surrounding vehicles and obstacles and the distance to them. This data is analyzed, and calculations such as how far to steer, what the speed should be, and when to stop are carried out. These calculations trigger the car's actions automatically.
6. Virtual Personal Assistant Tools: Big data analysis helps virtual personal assistant tools (like Siri on Apple devices, Cortana on Windows, Google Assistant on Android) answer the various questions asked by users. Such a tool tracks the user's location, local time, season, and other data related to the question asked; analyzing all of this data, it provides an answer.
As an example, suppose a user asks "Do I need to take an umbrella?" The tool collects data such as the user's location and the season and weather conditions at that location, analyzes this data to determine whether there is a chance of rain, and then provides the answer.
7. IoT:
Manufacturing companies install IoT sensors in machines to collect operational data. By analyzing this data, it can be predicted how long a machine will work without problems and when it will require repair, so that the company can act before the machine develops serious issues or breaks down entirely. The cost of replacing the whole machine can thus be saved.
In the healthcare field, big data is making a significant contribution. Using big data tools, data on patient experience is collected and used by doctors to give better treatment. IoT devices can sense the symptoms of a probable oncoming disease in the human body and help prevent it through early treatment. IoT sensors placed near a patient or newborn baby constantly track various health conditions such as heart rate and blood pressure. Whenever a parameter crosses the safe limit, an alarm is sent to a doctor so that steps can be taken remotely, very quickly.
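The alarm step described above reduces to a threshold check on each incoming reading. A minimal sketch; the parameter names and safe ranges below are hypothetical, chosen only for illustration:

```python
# Hypothetical safe ranges for monitored vitals (illustrative values only)
SAFE_RANGES = {
    "heart_rate": (60, 100),      # beats per minute
    "blood_pressure": (90, 140),  # systolic, mmHg
}

def check_vitals(reading):
    """Return alert messages for any parameter outside its safe range."""
    alerts = []
    for param, value in reading.items():
        low, high = SAFE_RANGES[param]
        if not low <= value <= high:
            alerts.append(f"ALERT: {param} = {value} outside safe range {low}-{high}")
    return alerts

# A sensor reading where blood pressure has crossed the safe limit
print(check_vitals({"heart_rate": 72, "blood_pressure": 150}))
```

In a deployed system the alert would be pushed to the doctor's device rather than printed, but the decision logic is this simple comparison against per-parameter limits.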
8. Education Sector: Organizations conducting online educational courses utilize big data to find candidates interested in those courses. If someone searches for YouTube tutorial videos on a subject, online and offline course providers for that subject send that person advertisements about their courses.
9. Energy Sector: Smart electric meters read the power consumed every 15 minutes and send the readings to a server, where the data is analyzed to estimate the times of day when the power load across the city is lowest. Based on this, manufacturing units and householders are advised to run their heavy machines during those low-load hours (often at night) to enjoy a lower electricity bill.
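A minimal sketch of that off-peak estimation, assuming readings arrive as (hour-of-day, kWh) pairs; the sample values are invented for illustration:

```python
from collections import defaultdict

# Hypothetical 15-minute smart-meter readings: (hour_of_day, kWh consumed)
readings = [(0, 0.4), (0, 0.3), (9, 1.8), (9, 2.1), (14, 1.2), (23, 0.5)]

def off_peak_hour(readings):
    """Average the load per hour and return the hour with the lowest demand."""
    load = defaultdict(list)
    for hour, kwh in readings:
        load[hour].append(kwh)
    return min(load, key=lambda h: sum(load[h]) / len(load[h]))

print(off_peak_hour(readings))  # → 0 (midnight has the lowest average load)
```

At city scale the same aggregation would run over millions of meter readings in a distributed system, but the logic is identical: group by hour, average, take the minimum.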
10. Media and Entertainment Sector: Media and entertainment service providers like Netflix, Amazon Prime, and Spotify analyze data collected from their users. Data such as which videos or music users watch or listen to most and how long users spend on the site is collected and analyzed to set the next business strategy.
BIG DATA VS TRADITIONAL DATA
"NoSQL (Not Only SQL) is a type of database that does not primarily rely upon a schema-based structure and does not use SQL for data processing." NoSQL and schema-less databases are used when the primary concern is to store a huge amount of data rather than to maintain relationships between elements. Apart from other benefits, the finest thing about schema-less databases is that they make data migration very easy. MongoDB is a very popular and widely used NoSQL database these days.
The traditional approach works on structured data that has a basic layout and a provided structure.
Figure 1.7: Static Data
The structured approach designs the database as per the requirements, in tuples and columns. Working on live incoming data, which can be input from an ever-changing scenario, cannot be handled by the traditional approach. The big data approach, by contrast, is iterative.
Big data analytics works on unstructured data, where no specific pattern of the data is defined and the data is not organized in rows and columns. The live flow of data is captured and analysis is done on it. Efficiency increases when the data to be analyzed is large.
Traditional Data:
- Volume ranges from Gigabytes to Terabytes.
- Generated per hour, per day, or less frequently.
- The data source is centralized and managed in centralized form.
- The data model is strict, schema-based, and static.
- Data sources include ERP transaction data, CRM transaction data, financial data, organizational data, web transaction data, etc.

Big Data:
- Volume ranges from Petabytes to Zettabytes or Exabytes.
- Generated more frequently, mainly per second.
- The data source is distributed and managed in distributed form.
- The data model is flat schema-based and dynamic.
- Data sources include social media, device data, sensor data, video, images, audio, etc.
RISKS OF BIG DATA
1. Data Security
This risk is obvious and often uppermost in our minds when we are considering the logistics of
data collection and analysis. Data theft is a rampant and growing area of crime and attacks are
getting bigger and more damaging. The bigger your data, the bigger the target it presents to
criminals.
2. Data Privacy
Closely related to the issue of security is privacy. This ocean of big data is also intrusive, so risk-mitigation strategies are essential for protecting privacy. You need to be sure that the sensitive information you are storing and collecting isn't going to be divulged through damaging misuse by yourself or by the people to whom you have delegated responsibility for it.
On the one hand, big data brings tremendous benefits not only to individuals but also to development, energy conservation, and personalized marketing. On the other hand, it introduces new privacy and civil-liberties concerns, including high-tech profiling, automated decision-making, discrimination, and algorithmic inaccuracies or opacities that strain existing protections.
3. Costs
Data collection, aggregation, storage, analysis, mapping, and reporting costs a lot of money.
These costs can be mitigated by careful budgeting, but getting it wrong at that point can lead to spiralling costs, potentially negating any value added to your bottom line by your data-driven initiative. A well-developed strategy will clearly set out what you intend to achieve and the benefits that can be gained, so they can be balanced against the resources allocated to the project.
4. Time to Deployment
The amount of time required to deploy a big data solution can vary significantly depending on the type of implementation. It's worth considering that an in-house solution can take over six months to deploy, since it requires building out internal infrastructure.
5. Scalability
The ability to scale a project up or down is crucial. Organizations underestimate how quickly
their data can and will grow, or fail to take into account varying usage levels. Cloud-based
systems will offer more scalability, allowing businesses to increase or decrease usage as
required.
6. Enhanced Transparency
Big data analysis is prone to errors, inaccuracies, and bias. Consequently, organizations should
provide more transparency into their automated processing operations and decision-making
processes, including eligibility factors and marketing profiles.
7. Bad Data
Collecting irrelevant, out-of-date, or erroneous data is a real risk: the big data revolution has encouraged a "collect everything and analyse it later" approach. If you are not analysing the right, up-to-date data, the insights you draw from it will be of little value or even misleading.
8. Accessibility
Big data is only useful if it is accessible to the people who can actually learn something from it.
STRUCTURE OF BIG DATA
The structure of big data can be thought of as a collection of data values, the relationships between them, and the operations or functions that can be applied to that data.
These days, lots of resources (social media platforms being the number one) have become
available to companies from where they can capture massive amounts of data. Now, this
captured data is used by enterprises to develop a better understanding and closer relationships
with their target customers. It’s important to understand that every new customer action
essentially creates a more complete picture of the customer, helping organizations achieve a
more detailed understanding of their ideal customers. Therefore, it can be easily imagined why
companies across the globe are striving to leverage big data. Put simply, big data comes with
the potential that can redefine a business, and organizations, which succeed in analysing big
data effectively, stand a huge chance to become global leaders in the business domain.
Structures of big data
Big data structures can be divided into three categories – structured, unstructured, and semi-
structured.
1- Structured data
It’s the data which follows a pre-defined format and thus, is straightforward to analyze. It
conforms to a tabular format together with relationships between different rows and columns.
You can think of SQL databases as a common example. Structured data depends on a model defining how the data is stored, processed, and accessed. It's considered the most "traditional" type of data storage.
2- Unstructured data
This type of big data comes in an unknown form, cannot be stored in traditional ways, and cannot be analysed unless it is transformed into a structured format. Multimedia content such as audio, video, and images are examples of unstructured data. It's important to understand that, these days, unstructured data is growing faster than other types of big data.
3- Semi-structured data
It's a type of big data that doesn't conform to the formal structure of data models, but it comes with organizational tags or other markers that help to separate semantic elements and to enforce hierarchies of fields and records within the data. You can think of JSON documents or XML files as this type of big data. The reason this category exists is that semi-structured data is significantly easier to analyse than unstructured data: a significant number of big data solutions and tools can read and process XML files or JSON documents, reducing the complexity of the analysis process.
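A short example of how such tools handle semi-structured data. The record below is a hypothetical JSON document: tags name the semantic elements, but the fields vary from element to element, so the code reads optional fields defensively:

```python
import json

# Hypothetical semi-structured record: each post has a "text" tag,
# but "likes" and "tags" are present only in some posts (no fixed schema).
raw = ('{"user": "alice", "posts": ['
       '{"text": "hello", "likes": 3}, '
       '{"text": "big data", "tags": ["analytics"]}]}')

record = json.loads(raw)  # parse into nested Python structures
for post in record["posts"]:
    # .get() with a default handles fields that may be absent
    print(post["text"], post.get("likes", 0), post.get("tags", []))
```

This defensive access pattern is what makes semi-structured data tractable: the markers give just enough structure to navigate, without requiring a rigid schema up front.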
CHALLENGES OF CONVENTIONAL SYSTEMS
1. Data Challenges –
For analytic scalability, the data first has to be pulled together into a separate analytics environment before analysis can begin.
Analysts perform merge operations on data sets that contain rows and columns. The columns represent information about the customers, such as name, spending level, or status. In a merge or join, two or more data sets are combined, typically so that specific rows of one data set or table are combined with specific rows of another.
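The merge/join just described can be sketched in plain Python (in practice an MPP database or a library such as pandas would do this at scale); the customer records below are invented for illustration:

```python
# Two hypothetical data sets sharing a "customer_id" key
customers = [
    {"customer_id": 1, "name": "Asha"},
    {"customer_id": 2, "name": "Ravi"},
]
spending = [
    {"customer_id": 1, "spending_level": "high"},
    {"customer_id": 3, "spending_level": "low"},
]

def inner_join(left, right, key):
    """Combine rows of two data sets wherever the join key matches (an inner join)."""
    index = {row[key]: row for row in right}          # index the right side by key
    return [{**l, **index[l[key]]} for l in left if l[key] in index]

print(inner_join(customers, spending, "customer_id"))
# → [{'customer_id': 1, 'name': 'Asha', 'spending_level': 'high'}]
```

Only customer 1 appears in both inputs, so only that row survives; this is exactly the "specific rows combined with specific rows" behavior described above.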
A Massively Parallel Processing (MPP) system is the most mature, proven, and widely deployed mechanism for storing and analysing large amounts of data.
An MPP database breaks the data into independent pieces managed by independent storage and central processing unit (CPU) resources.
This can be achieved with the use of analytical sandboxes to provide analytic
professionals with a scalable environment to build advanced analytics processes.
One of the uses of MPP database system is to facilitate the building and deployment
of advanced analytic processes.
If used appropriately, an analytic sandbox can be one of the primary drivers of value
in the world of big data.
Analytical sandbox :-
An analytic sandbox provides a set of resources with which in-depth analysis can be
done to answer critical business questions.
There will be data created within the sandbox that is segregated from the production
database.
Sandbox users will also be allowed to load data of their own for brief time periods as
part of a project, even if that data is not part of the official enterprise data model.
Tools and methods
TOOLS
Big Data Analytics is the process of collecting large chunks of structured/unstructured data,
segregating and analyzing it and discovering the patterns and other useful business insights
from it.
These days, organizations are realising the value they get out of big data analytics and hence
they are deploying big data tools and processes to bring more efficiency in their work
environment.
Many big data tools and processes are being utilised by companies these days in the
processes of discovering insights and supporting decision making.
Big data processing is a set of techniques or programming models for accessing large-scale data to extract useful information for supporting and making decisions.
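One such programming model is map-reduce-style processing: chunks of a large data set are processed independently (map), and the partial results are then combined (reduce). A minimal word-count sketch, with the input chunks invented for illustration:

```python
from collections import Counter

# Simulated "split" of a large data set into independent chunks
chunks = [
    "big data needs big tools",
    "data tools extract insights",
]

# Map phase: each chunk is counted independently (in a real system, in parallel
# across many machines)
mapped = [Counter(chunk.split()) for chunk in chunks]

# Reduce phase: partial counts are merged into one final result
total = sum(mapped, Counter())
print(dict(total))
```

The key property is that the map step never needs to see the whole data set, which is what lets frameworks distribute the work across a cluster.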
METHODS
Reporting :
Once data is collected, it will be organized using tools such as graphs and tables.
The process of organizing this data is called reporting.
Reporting translates raw data into information.
Reporting helps companies to monitor their online business and be alerted when data
falls outside of expected ranges.
Good reporting should raise questions about the business from its end users.
Analysis :
Analytics is the process of taking the organized data and analyzing it.
This helps users to gain valuable insights on how businesses can improve their
performance.
Analysis transforms data and information into insights.
The goal of the analysis is to answer questions by interpreting the data at a deeper
level and providing actionable recommendations.
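The distinction can be made concrete with a small sketch (the sales figures are hypothetical): reporting summarizes the raw data, while analysis interprets that summary to produce an actionable insight:

```python
# Hypothetical daily sales figures
daily_sales = {"Mon": 120, "Tue": 95, "Wed": 40, "Thu": 110, "Fri": 130}

# Reporting: organize raw data into a summary (what happened)
total = sum(daily_sales.values())
print(f"Weekly total: {total}")

# Analysis: interpret the report to reach an insight (why, and what to do)
average = total / len(daily_sales)
outliers = [day for day, v in daily_sales.items() if v < 0.5 * average]
print(f"Days well below average, worth investigating: {outliers}")
```

The report alone tells the business its weekly total; the analysis step flags Wednesday as an anomaly that raises a question, which is exactly the reporting-versus-analysis split described above.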
Modern data analytic tools
Data Analytics tools are types of application software that retrieve data from one or
more systems and combine it in a repository, such as a data warehouse, to be reviewed
and analysed.
Most organizations use more than one analytics tool including spreadsheets with
statistical functions, statistical software packages, data mining tools, and predictive
modelling tools.
Together, these Data Analytics Tools give the organization a complete overview of the
company to provide key insights and understanding of the market/business so smarter
decisions may be made.
Data analytics tools not only report the results of the data but also explain why the
results occurred to help identify weaknesses, fix potential problem areas, alert decision-
makers to unforeseen events and even forecast future results based on decisions the
company might make.
Below is a list of some data analytics tools:
R Programming (Leading Analytics Tool in the industry)
Python
Excel
SAS
Apache Spark
Splunk
RapidMiner
Tableau Public
KNIME