Module I-1


MODULE I INTRODUCTION TO BIG DATA 9

Big Data – Definition, Characteristic Features – Big Data Applications - Big Data vs
Traditional Data - Risks of Big Data - Structure of Big Data - Challenges of Conventional
Systems– Evolution of Analytic Scalability - Evolution of Analytic Processes, Tools and
methods - Analysis vs Reporting - Modern Data Analytic Tools.

Big Data

Definition: “Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.” Big Data refers to complex and large data sets that have to be processed and analysed
to uncover valuable information that can benefit businesses and organizations.

What is Big Data?

 It refers to a massive amount of data that keeps growing exponentially with time.
 It is so voluminous that it cannot be processed or analysed using conventional data processing techniques.
 Working with it involves data mining, data storage, data analysis, data sharing, and data visualization.
 The term is all-comprehensive, covering the data itself and data frameworks, along with the tools and techniques used to process and analyse the data.

Characteristics of Big Data

Big data can be described by the following characteristics:

 Volume
 Variety
 Velocity

5 Vs of Big Data

1. Volume :
Big Data involves vast volumes of data generated daily from many sources, such as business processes, machines, social media platforms, networks, human interactions, and so on.
Example: Facebook generates approximately a billion messages a day, the “Like” button is pressed around 4.5 billion times, and more than 350 million new posts are uploaded each day.
Big data technologies can handle such large amounts of data.
2. Variety :
Big Data can be structured, unstructured, or semi-structured, and is collected from many different sources.
In the past, data was collected only from databases and spreadsheets, but today it arrives in an array of forms: PDFs, emails, audio, social media posts, photos, videos, etc.
3. Velocity :
Velocity refers to the speed at which data is generated, often in real time.
It covers the speed of incoming data streams, their rate of change, and bursts of activity.
A primary requirement of Big Data systems is to make this rapidly arriving data available on demand.
Examples of data generated with high velocity: Twitter messages or Facebook posts.

4. Veracity :
Veracity refers to the quality, accuracy, and trustworthiness of the data being analyzed.
It is about being able to handle and manage data of uncertain quality efficiently.
Example: Facebook posts with hashtags.
5. Value :
Value is an essential characteristic of big data.
It is not raw data as such that matters; it is the valuable and reliable data that we store, process, and analyse that delivers value.
Applications of Big Data

1. Tracking Customer Spending Habits and Shopping Behavior: Big retail stores (like
Amazon, Walmart, Big Bazaar, etc.) keep data on customers’ spending habits (which products a customer spends on, which brands they prefer, how frequently they spend), shopping behavior, and customers’ most-liked products (so that those products can be kept in the store). Based on which products are being searched for or sold most, the production/collection rate of those products is set.
The banking sector uses customers’ spending-behavior data to offer a particular customer a discount or cashback on a product they like when they buy it with the bank’s credit or debit card. In this way, the right offer can be sent to the right person at the right time.
2. Recommendation: By tracking customer spending habits and shopping behavior, big retail
stores provide recommendations to customers. E-commerce sites like Amazon, Walmart, and
Flipkart do product recommendation: they track which products a customer searches for and, based on that data, recommend similar products to that customer.
As an example, suppose a customer searches for bed covers on Amazon. Amazon now has data suggesting that the customer may be interested in buying a bed cover, so the next time that customer visits a Google page, advertisements for various bed covers will be shown. Thus, an advertisement for the right product can be sent to the right customer.
YouTube likewise recommends videos based on the types of videos a user has previously liked or watched, and shows relevant advertisements based on the content of the video being watched. For example, if someone is watching a Big Data tutorial video, advertisements for other big data courses will be shown during that video.
3. Smart Traffic System: Data about traffic conditions on different roads is collected through cameras placed beside roads and at the entry and exit points of the city, and from GPS devices placed in vehicles (Ola and Uber cabs, etc.). All this data is analyzed, and jam-free or less congested, faster routes are recommended. In this way a smart traffic system can be built in the city through big data analysis. An added benefit is that fuel consumption can be reduced.
4. Secure Air Traffic System: Sensors are present at various parts of an aircraft (such as the propellers). These sensors capture data like flight speed, moisture, temperature, and other environmental conditions. Based on analysis of this data, environmental parameters within the aircraft are set and adjusted.
By analyzing the aircraft’s machine-generated data, it can be estimated how long a machine can operate flawlessly and when it needs to be replaced or repaired.
5. Auto Driving Car: Big data analysis helps drive a car without human intervention. Cameras and sensors placed at various spots on the car gather data such as the size of surrounding vehicles and obstacles and the distance to them. This data is analyzed, and various calculations are carried out, such as the angle to turn, what the speed should be, and when to stop. These calculations allow the car to act automatically.
6. Virtual Personal Assistant Tool: Big data analysis helps virtual personal assistant tools (like Siri on Apple devices, Cortana on Windows, and Google Assistant on Android) answer the various questions asked by users. The tool tracks the user’s location, local time, season, and other data related to the question asked; analyzing all this data, it provides an answer.
As an example, suppose a user asks “Do I need to take an umbrella?”. The tool collects data such as the user’s location and the season and weather conditions at that location, analyzes it to determine whether there is a chance of rain, and then provides the answer.
7. IoT:
 Manufacturing companies install IoT sensors in machines to collect operational data. By analyzing this data, it can be predicted how long a machine will work without problems and when it will require repair, so the company can act before the machine develops serious issues or fails completely. Thus, the cost of replacing the whole machine can be saved.
 In the healthcare field, big data makes a significant contribution. Using big data tools, data on patient experience is collected and used by doctors to give better treatment. IoT devices can sense symptoms of a probable coming disease in the human body and help prevent it through early treatment. IoT sensors placed near a patient or a newborn baby constantly track health conditions such as heart rate and blood pressure. Whenever any parameter crosses the safe limit, an alarm is sent to a doctor, who can then intervene remotely very quickly.
8. Education Sector: Organizations that run online educational courses use big data to find candidates interested in those courses. If someone searches YouTube for tutorial videos on a subject, then online or offline course providers for that subject send that person online advertisements about their courses.
9. Energy Sector: A smart electric meter reads the consumed power every 15 minutes and sends the readings to a server, where the data is analyzed to estimate the times of day when the power load is lowest across the city. Based on this, manufacturing units and households are advised to run their heavy machines during low-load periods, such as at night, to enjoy a lower electricity bill.
10. Media and Entertainment Sector: Media and entertainment service providers like Netflix, Amazon Prime, and Spotify analyze data collected from their users. Data such as which videos or music users watch or listen to most and how long users spend on the site is collected and analyzed to set the next business strategy.
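The smart-meter analysis in the energy-sector example (item 9) can be sketched in a few lines of Python. This is a toy illustration with made-up readings, not a production system: given (hour, kWh) pairs from 15-minute reads, it finds the hour of day with the lowest average load.

```python
from collections import defaultdict

def lowest_load_hour(readings):
    """readings: (hour_of_day, kwh) pairs from 15-minute smart-meter reads.
    Returns the hour of day with the lowest average consumption."""
    by_hour = defaultdict(list)
    for hour, kwh in readings:
        by_hour[hour].append(kwh)
    return min(by_hour, key=lambda h: sum(by_hour[h]) / len(by_hour[h]))

# Made-up reads: daytime hours carry a heavy load, 2 a.m. is light.
readings = [(9, 5.0), (9, 4.8), (14, 6.1), (14, 5.9), (2, 1.2), (2, 1.0)]
print(lowest_load_hour(readings))  # 2
```

A real deployment would aggregate millions of such reads across a city before recommending off-peak hours.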

TRADITIONAL VS BIG DATA

1. Schema less and Column oriented Databases (No Sql)


We have been using table- and row-based relational databases for years; these databases are fine for online transactions and quick updates. When unstructured data and very large volumes of data come into the picture, we need databases without a hard-coded schema. A number of databases fit into this category; they can store unstructured, semi-structured, or even fully structured data.

Apart from other benefits, the finest thing about schema-less databases is that they make data migration very easy. MongoDB is a very popular and widely used NoSQL database these days. NoSQL and schema-less databases are used when the primary concern is to store a huge amount of data rather than to maintain relationships between elements. "NoSQL (Not Only SQL) is a type of database that does not primarily rely on a schema-based structure and does not use SQL for data processing."
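To make the "schema-less" idea concrete, here is a small Python sketch of the document model that stores like MongoDB use. It uses only the standard library (plain dicts and json), not MongoDB itself, and the record fields are invented for illustration.

```python
import json

# Schema-less "documents": each record may carry different fields,
# unlike rows in a relational table with a fixed schema.
orders = [
    {"_id": 1, "item": "book", "price": 12.5},
    {"_id": 2, "item": "phone", "price": 299.0, "warranty_months": 24},
    {"_id": 3, "item": "song", "format": "mp3"},  # no price field at all
]

# Queries must tolerate missing fields ("schema on read").
priced = [d for d in orders if "price" in d]
print(len(priced))            # 2
print(json.dumps(orders[2]))  # documents serialize naturally to JSON
```

Because no schema is declared up front, adding a new field to future documents requires no migration of the existing ones, which is the property the text highlights.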

Figure 1: Big Data

The traditional approach works on structured data that has a basic layout and a provided structure.

Figure 1.7: Static Data

The structured approach designs the database in tuples and columns as per the requirements. Working on live incoming data, which can be input from an ever-changing scenario, cannot be handled by the traditional approach. The Big data approach, in contrast, is iterative.

Figure 1.8: Streaming Data

Big data analytics works on unstructured data, where no specific pattern of the data is defined and the data is not organized in rows and columns. The live flow of data is captured and the analysis is done on it. Efficiency increases when the data to be analyzed is large.

Traditional Data vs Big Data

Traditional Data | Big Data
--- | ---
Generated at the enterprise level. | Generated outside the enterprise level.
Volume ranges from gigabytes to terabytes. | Volume ranges from petabytes to zettabytes or exabytes.
Database systems deal with structured data. | Systems deal with structured, semi-structured, and unstructured data.
Generated per hour, per day, or less often. | Generated far more frequently, often per second.
Source is centralized and managed in a centralized form. | Source is distributed and managed in a distributed form.
Data integration is very easy. | Data integration is very difficult.
A normal system configuration can process it. | A high-end system configuration is required to process it.
The size of the data is very small. | The size is much larger than traditional data.
Ordinary database tools can perform any database operation. | Special kinds of database tools are required to perform schema-based operations.
Normal functions can manipulate the data. | Special kinds of functions are needed to manipulate the data.
The data model is strict-schema based and static. | The data model is flat-schema based and dynamic.
Data is stable, with known inter-relationships. | Data is not stable, and relationships are unknown.
Volume is manageable. | Volume is huge and becomes unmanageable.
Easy to manage and manipulate. | Difficult to manage and manipulate.
Sources include ERP transaction data, CRM transaction data, financial data, organizational data, web transaction data, etc. | Sources include social media, device data, sensor data, video, images, audio, etc.

RISKS OF BIG DATA

1. Data Security

This risk is obvious and often uppermost in our minds when we are considering the logistics of data collection and analysis. Data theft is a rampant and growing area of crime, and attacks are getting bigger and more damaging. The bigger your data, the bigger the target it presents to criminals.

2. Data Privacy

Closely related to the issue of security is privacy. This ocean of Big Data is also intrusive. Risk mitigation strategies are essential for protecting privacy. You need to be sure that the sensitive information you are storing and collecting isn’t going to be divulged through damaging misuse by yourself or by the people to whom you have delegated the responsibility for analyzing and reporting on it.

On the one hand, big data unleashes tremendous benefits not only to individuals but also to communities and society at large, including breakthroughs in health research, sustainable development, energy conservation, and personalized marketing. On the other hand, big data introduces new privacy and civil liberties concerns, including high-tech profiling, automated decision-making, discrimination, and algorithmic inaccuracies or opacities that strain traditional legal protections.

3. Costs

Data collection, aggregation, storage, analysis, mapping, and reporting cost a lot of money. These costs can be mitigated by careful budgeting, but getting it wrong at that point can lead to spiralling costs, potentially negating any value added to your bottom line by your data-driven initiative. A well-developed strategy will clearly set out what you intend to achieve and the benefits to be gained, so they can be balanced against the resources allocated to the project.

4. Time to Deployment

The amount of time required to deploy a big data solution can vary significantly depending on the type of implementation. It is worth considering that an in-house solution can take over six months to build, depending on the requirements, whereas a cloud-based solution requires no internal infrastructure.

5. Scalability

The ability to scale a project up or down is crucial. Organizations often underestimate how quickly their data can and will grow, or fail to take into account varying usage levels. Cloud-based systems offer more scalability, allowing businesses to increase or decrease usage as required.

6. Enhanced Transparency

Big data analysis is prone to errors, inaccuracies, and bias. Consequently, organizations should provide more transparency into their automated processing operations and decision-making processes, including eligibility factors and marketing profiles.
7. Bad Data

Bad data means collecting irrelevant, out-of-date, or erroneous data. The big data revolution has led to a “collect everything and analyse it later” approach. If you are not analysing the right, up-to-date data, you won’t be drawing the right conclusions or providing value.

8. Accessibility

Big data is only useful if it is accessible to the people who can actually learn something from the data and implement it in everyday business practices.

Structure of big data

Big data is a collection of massive and complex datasets that are difficult to store and process using traditional database management tools and traditional data processing applications. The key challenges include capturing, storing, managing, analysing, and visualizing that data.

The structure of big data can be considered a collection of data values, the relationships between them, and the operations or functions that can be applied to that data. These days, many resources (social media platforms being the number one) have become available to companies from which they can capture massive amounts of data. This captured data is used by enterprises to develop a better understanding of and closer relationships with their target customers. Every new customer action essentially creates a more complete picture of the customer, helping organizations achieve a more detailed understanding of their ideal customers. It is therefore easy to see why companies across the globe are striving to leverage big data. Put simply, big data comes with the potential to redefine a business, and organizations that succeed in analysing big data effectively stand a huge chance of becoming global leaders in their business domain.
Structures of big data

Big data structures can be divided into three categories – structured, unstructured, and semi-
structured.
1- Structured data

This is data which follows a pre-defined format and is thus straightforward to analyze. It conforms to a tabular format, with relationships between different rows and columns. SQL databases are a common example. Structured data depends on a schema that defines how the data is stored, processed, and accessed. It is considered the most “traditional” type of data storage.
2- Unstructured data

This type of big data has no pre-defined form; it cannot be stored in traditional ways and cannot be analysed unless it is transformed into a structured format. Multimedia content such as audio, video, and images are examples of unstructured data. Notably, unstructured data is these days growing faster than the other types of big data.
3- Semi-structured data

This is a type of big data that does not conform to a formal data-model structure, but comes with organizational tags or other markers that help separate semantic elements and enforce hierarchies of fields and records within the data. JSON documents or XML files are examples of this type of big data. This category exists because semi-structured data is significantly easier to analyse than unstructured data: a significant number of big data solutions and tools can read and process XML files or JSON documents, reducing the complexity of the analysis process.
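The three categories can be seen side by side in a short Python sketch, using only the standard library. The record contents are invented examples; the point is that structured data fits fixed columns, while semi-structured data carries its own tags and hierarchy, which parsers can exploit.

```python
import csv, io, json
import xml.etree.ElementTree as ET

# Structured: tabular rows with a fixed set of columns.
table = list(csv.DictReader(io.StringIO("id,name\n1,alice\n2,bob\n")))
print(table[0]["name"])  # alice

# Semi-structured (JSON): no fixed table, but tags mark semantic elements.
doc = json.loads('{"user": "alice", "posts": [{"id": 1, "tags": ["big-data"]}]}')
print(doc["posts"][0]["tags"][0])  # big-data

# Semi-structured (XML): markers enforce a hierarchy of fields and records.
root = ET.fromstring('<user name="alice"><post id="1"/></user>')
print(root.attrib["name"])  # alice
```

Unstructured data (raw audio, video, images) has no such markers, which is exactly why it must be transformed before this kind of direct field access becomes possible.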
CHALLENGES OF CONVENTIONAL SYSTEMS

 Big data is the storage and analysis of large data sets.
 These are complex data sets that can be both structured and unstructured.
 They are so large that it is not possible to work on them with traditional analytical tools.
 One of the major challenges of conventional systems is the uncertainty of the data management landscape.
 Big data is continuously expanding, and new companies and technologies are being developed every day.
 A big challenge for companies is to find out which technology works best for them without introducing new risks and problems.
 These days, organizations are realising the value they get out of big data analytics and hence are deploying big data tools and processes to bring more efficiency to their work environment.

1. Data Challenges –
 Volume, Velocity, Variety & Veracity
 Data discovery and comprehensiveness
 Scalability
 Storage issues
2. Process Challenges –
 Capturing data
 Aligning data from different sources
 Transforming data into a form suitable for analysis
 Modelling data (mathematical models, simulation)
 Understanding output, visualizing results, and display issues on mobile devices
3. Management Challenges -
• Security
• Privacy
• Governance
• Ethical issues
4. Traditional / RDBMS
 Designed to handle well-structured data
 Traditional storage vendor solutions are very expensive
 Shared block-level storage is too slow
 Data is read in 8K or 16K block sizes
 Schema-on-write requires data to be validated before it can be written to disk
 Software licenses are too expensive
 Getting data from disk and loading it into memory requires an application
Evolution of analytic scalability

For analytic scalability, we have to pull the data together into a separate analytics environment and then start performing analysis.

Analysts do merge operations on data sets which contain rows and columns. The columns represent information about the customers, such as name, spending level, or status. In a merge or join, two or more data sets are combined; they are typically merged/joined so that specific rows of one data set or table are combined with specific rows of another.

Analysts also do data preparation. Data preparation is made up of joins, aggregations, derivations, and transformations. In this process, they pull data from various sources and merge it all together to create the variables required for analysis.
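The data-preparation steps above can be sketched in plain Python with invented customer records: an aggregation over an orders table, followed by a join back onto customers and a derived spending_level variable.

```python
customers = [
    {"id": 1, "name": "Asha", "status": "gold"},
    {"id": 2, "name": "Ravi", "status": "silver"},
]
orders = [
    {"id": 1, "amount": 120.0},
    {"id": 1, "amount": 80.0},
    {"id": 2, "amount": 40.0},
]

# Aggregation: total spend per customer id.
spend = {}
for o in orders:
    spend[o["id"]] = spend.get(o["id"], 0.0) + o["amount"]

# Join + derivation: combine specific rows of each data set on the id,
# deriving a new spending_level variable for analysis.
prepared = [
    {**c,
     "total": spend.get(c["id"], 0.0),
     "spending_level": "high" if spend.get(c["id"], 0.0) > 100 else "low"}
    for c in customers
]
print(prepared[0]["total"], prepared[0]["spending_level"])  # 200.0 high
```

In practice the same joins and aggregations would be expressed in SQL or a dataframe library and pushed down into the scalable environment rather than run row by row in application code.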

A Massively Parallel Processing (MPP) system is the most mature, proven, and widely deployed mechanism for storing and analysing large amounts of data.

An MPP database breaks the data into independent pieces managed by independent storage and central processing unit (CPU) resources.

MPP systems build in redundancy to make recovery easy.
MPP systems have resource management tools:

 Manage the CPU and disk space
 Query optimizer
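The MPP idea of breaking data into independent pieces, each handled by its own resources, can be sketched in miniature. This is a single-machine analogy using threads, not a real MPP database; the hash partitioning and per-piece partial results are the point.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(rows, n):
    """Hash-partition rows into n independent pieces, as an MPP database
    spreads data across independent storage and CPU resources."""
    pieces = [[] for _ in range(n)]
    for row in rows:
        pieces[hash(row["id"]) % n].append(row)
    return pieces

def piece_total(piece):
    # Each piece is processed by its own worker, independently of the others.
    return sum(r["amount"] for r in piece)

rows = [{"id": i, "amount": 10.0} for i in range(100)]
pieces = partition(rows, 4)
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(piece_total, pieces))
print(sum(partials))  # 1000.0 -- partial results combine into the full answer
```

A real MPP system adds what the sketch omits: redundant copies of each piece for easy recovery, and a query optimizer that decides how work is distributed across the pieces.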
Evolution of analytic process :-

 With an increased level of scalability, analytic processes need to be updated to take advantage of it.
 This can be achieved with the use of analytical sandboxes, which provide analytic professionals with a scalable environment to build advanced analytics processes.
 One use of an MPP database system is to facilitate the building and deployment of advanced analytic processes.
 An analytic sandbox is the mechanism for utilizing an enterprise data warehouse for this purpose.
 If used appropriately, an analytic sandbox can be one of the primary drivers of value in the world of big data.

Analytical sandbox :-

 An analytic sandbox provides a set of resources with which in-depth analysis can be done to answer critical business questions.
 An analytic sandbox is ideal for data exploration, development of analytical processes, proofs of concept, and prototyping.
 Once things progress into ongoing, user-managed processes or production processes, the sandbox should no longer be involved.
 A sandbox is going to be leveraged by a fairly small set of users.
 Data created within the sandbox is segregated from the production database.
 Sandbox users will also be allowed to load data of their own for brief periods as part of a project, even if that data is not part of the official enterprise data model.
Tools and methods

TOOLS

Big Data Analytics is the process of collecting large chunks of structured/unstructured data, segregating and analyzing it, and discovering patterns and other useful business insights from it.
These days, organizations are realising the value they get out of big data analytics and hence are deploying big data tools and processes to bring more efficiency to their work environment.
Many big data tools and processes are being utilised by companies in the process of discovering insights and supporting decision making.
Big data processing is a set of techniques or programming models for accessing large-scale data to extract useful information for supporting and informing decisions.

METHODS

 Until the advent of computers, it was not feasible to run:
   - many iterations of a model,
   - highly advanced methods, or
   - models over large datasets.
 Ensemble methods are built using multiple techniques and go beyond what any individual performer can achieve.
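A tiny, self-contained illustration of the ensemble idea: three deliberately weak rules for flagging a message as spam, combined by majority vote. The rules and messages are invented; a real ensemble would combine trained models (for example, bagged decision trees), but the voting mechanics are the same.

```python
from collections import Counter

# Three weak "models": crude rules that each vote spam (1) or not spam (0).
def rule_caps(msg):   return 1 if msg.isupper() else 0
def rule_word(msg):   return 1 if "winner" in msg.lower() else 0
def rule_short(msg):  return 1 if len(msg) < 20 else 0

def ensemble_predict(msg, models):
    """Majority vote: the ensemble goes beyond any individual rule."""
    votes = [m(msg) for m in models]
    return Counter(votes).most_common(1)[0][0]

models = [rule_caps, rule_word, rule_short]
print(ensemble_predict("WINNER! CLAIM NOW", models))               # 1
print(ensemble_predict("see you at the meeting tomorrow", models)) # 0
```

Each rule alone misclassifies many messages; combining several imperfect techniques is what lets the ensemble outperform any individual one.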


Analysis vs Reporting
Reporting :

 Once data is collected, it will be organized using tools such as graphs and tables.
 The process of organizing this data is called reporting.
 Reporting translates raw data into information.
 Reporting helps companies to monitor their online business and be alerted when data
falls outside of expected ranges.
 Good reporting should raise questions about the business from its end users.
Analysis :
 Analytics is the process of taking the organized data and analyzing it.
 This helps users gain valuable insights into how businesses can improve their performance.
 Analysis transforms data and information into insights.
 The goal of analysis is to answer questions by interpreting the data at a deeper level and providing actionable recommendations.
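The distinction can be made concrete with a few lines of Python over invented daily-visit counts: reporting organizes the numbers and raises an alert when a value falls outside the expected range; analysis then digs into the anomaly as a starting point for asking why.

```python
from statistics import mean

daily_visits = {"Mon": 120, "Tue": 115, "Wed": 118, "Thu": 40, "Fri": 122}

# Reporting: organize the raw data and flag values outside the expected range.
avg = mean(daily_visits.values())
alerts = [day for day, v in daily_visits.items() if v < 0.5 * avg]
print(alerts)  # ['Thu'] -- good reporting raises the question "what happened?"

# Analysis: interpret the data at a deeper level -- here, size the drop
# as the first step toward an actionable recommendation.
drop_pct = (avg - daily_visits["Thu"]) / avg * 100
print(round(drop_pct))  # 61
```

The report stops at "Thursday is anomalous"; analysis quantifies the 61% drop and would go on to correlate it with causes (an outage, a holiday) and recommend action.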
Modern data analytic tools

 Data Analytics tools are types of application software that retrieve data from one or more systems and combine it in a repository, such as a data warehouse, to be reviewed and analysed.
 Most organizations use more than one analytics tool, including spreadsheets with statistical functions, statistical software packages, data mining tools, and predictive modelling tools.
 Together, these Data Analytics Tools give the organization a complete overview of the company, providing key insights and understanding of the market/business so that smarter decisions may be made.
 Data analytics tools not only report the results of the data but also explain why the results occurred, helping to identify weaknesses, fix potential problem areas, alert decision-makers to unforeseen events, and even forecast future results based on decisions the company might make.
 Below is a list of some data analytics tools:
 R Programming (a leading analytics tool in the industry)
 Python
 Excel
 SAS
 Apache Spark
 Splunk
 RapidMiner
 Tableau Public
 KNIME
