Introduction To Big Data Analytics

The key takeaways are the 4 V's of big data (volume, velocity, variety and veracity), examples of sources generating big data, and potential applications of big data analytics.

The characteristics of big data are the 4 V's - volume (amount of data), velocity (speed of data generation), variety (different data types), and veracity (uncertainty of data).

Some examples of applications of big data analytics mentioned are fraud detection in stock markets, sentiment analysis on social media, location-aware recommendations, and healthcare systems.

Big Data Analytics

Veningston .K
Associate Professor
Department of CSE
Madanapalle Institute of Technology & Science
Contents
1 Explosion in Quantity of Data
2 Big Data Characteristics
3 Importance of Big Data
4 Usage Example in Big Data
5 Challenges in Big Data
6 Big Data vs. Hadoop
7 Data Analytics Architecture
8 Hadoop Key Characteristics
9 MapReduce Architecture
10 Potential Applications


What is big data?
A massive volume of both structured and
unstructured data that is so large that it's
difficult to process with traditional database
and software techniques.

Explosion in Quantity of Data

Airbus A380
Each engine generates 10 TB of data every 30 minutes; 640 TB per flight
Stock exchange generates 1 TB of new trade data every day
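
The per-flight figure can be checked with simple arithmetic. The sketch below assumes all four engines report at the quoted rate and that the flight lasts 8 hours; the flight duration is an assumption chosen because it reproduces the 640 TB figure, not a value stated on the slide.

```python
# Back-of-the-envelope check of the per-flight data volume.
TB_PER_ENGINE_PER_HALF_HOUR = 10   # from the slide
ENGINES = 4                        # the A380 has four engines
FLIGHT_HOURS = 8                   # assumption: not stated on the slide

def data_per_flight_tb(engines: int, flight_hours: float) -> float:
    """Total engine data for one flight, in terabytes."""
    half_hours = flight_hours * 2
    return engines * TB_PER_ENGINE_PER_HALF_HOUR * half_hours

total_tb = data_per_flight_tb(ENGINES, FLIGHT_HOURS)
print(f"{total_tb:.0f} TB per flight")
```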

Explosion in Quantity of Data
Science
Databases from astronomy, genomics, environmental data, transportation data
Humanities and Social Sciences
Scanned books, historical documents, social interaction data, new technology like GPS
Business & Commerce
Corporate sales, stock market transactions, census, airline traffic
Entertainment
Internet images, movies, MP3 files
Medicine
MRI & CT scans, patient health records

Explosion in Quantity of Data
http://newstex.com/

The Data Explosion in 2014 Minute by Minute


In 2012, Google received over 2 million search
queries per minute
Today, Google receives over 4 million search
queries per minute from the 2.4 billion strong
global internet population

Explosion in Quantity of Data
http://newstex.com/

Every minute
Facebook users share nearly 2.5 million pieces of
content
Twitter users tweet nearly 300,000 times
Instagram users post nearly 220,000 new photos
YouTube users upload 72 hours of new video
content
Apple users download nearly 50,000 apps
Email users send over 200 million messages
Amazon generates over $80,000 in online sales

Big Data Characteristics
Volume: data at rest
Amount of data
Velocity: data in motion
Speed at which data is collected, acquired, generated, or processed
Variety: data in many forms
Different data types such as audio, video, and image data (mostly unstructured data)
Veracity: data in doubt
Sparse, inconsistent, and missing data
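
As an illustration, the four characteristics can each be turned into a simple measurement on a batch of records. This is a minimal sketch: the record layout, the sample values, and the `profile_batch` helper are all hypothetical.

```python
from collections import Counter

# Hypothetical batch: (arrival time in seconds, kind of payload, payload or None if missing)
batch = [
    (0.0, "text", "order placed"),
    (0.5, "image", b"\x89PNG..."),
    (1.0, "text", None),              # missing value
    (1.5, "audio", b"RIFF..."),
    (2.0, "text", "order shipped"),
]

def size_in_bytes(payload) -> int:
    """Size of a payload; text is measured after UTF-8 encoding."""
    if payload is None:
        return 0
    return len(payload.encode("utf-8")) if isinstance(payload, str) else len(payload)

def profile_batch(records):
    """Summarise a batch along volume, velocity, variety, and veracity."""
    times = [t for t, _, _ in records]
    span = max(times) - min(times)
    missing = sum(1 for _, _, p in records if p is None)
    return {
        "volume_bytes": sum(size_in_bytes(p) for _, _, p in records),
        "velocity_records_per_sec": len(records) / span if span else float("inf"),
        "variety": dict(Counter(kind for _, kind, _ in records)),
        "veracity_missing_ratio": missing / len(records),
    }

profile = profile_batch(batch)
print(profile)
```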


5 Vs of Big Data
Volume, Veracity, Velocity,
Variety, and Value

Banking/Marketing/IT:
Volume, Velocity, and Value

Healthcare/Life Sciences:
Veracity, Variety, and Value

Types of Data
Structured
Fields/ Tables/ Columns/ RDBMS/Spreadsheet
Semi-structured
Markers/Tags to separate elements
XML/HTML
Unstructured
No fields/attributes
Free-form text (e-mail body, notes, articles)
Audio, video, and image
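
The three categories differ in how much structure a program can rely on when reading the data. The sketch below uses only the Python standard library; the sample record (name, city, note) is hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Structured: fixed columns, as in an RDBMS table or spreadsheet.
structured = io.StringIO("id,name,city\n1,Asha,Chennai\n")
row = next(csv.DictReader(structured))

# Semi-structured: tags mark the elements, but fields may vary per record.
semi = ET.fromstring("<patient id='1'><name>Asha</name><note>follow-up</note></patient>")
name_from_xml = semi.findtext("name")

# Unstructured: free-form text with no fields; any structure must be inferred.
unstructured = "Asha from Chennai called about a follow-up appointment."
mentions_name = "Asha" in unstructured

print(row["city"], name_from_xml, mentions_name)
```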

Comprehensive List of Big Data Statistics
http://wikibon.org/

Big Data in Today's Business and Technology Environment

2.7 zettabytes of data exist in the digital universe today
Facebook stores, accesses, and analyzes 30+ petabytes of user-generated data
Walmart handles more than 1 million customer transactions every hour, which are imported into databases estimated to contain more than 2.5 petabytes of data
More than 5 billion people are calling, texting, tweeting, and browsing on mobile phones worldwide
Decoding the human genome originally took 10 years to process; now it can be achieved in one week

Byte (B)
Kilobyte (KB)
Megabyte (MB)
Gigabyte (GB)
Terabyte (TB)
Petabyte (PB)
Exabyte (EB)
Zettabyte (ZB)
Yottabyte (YB)
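
To relate the units to each other, each step up the list can be treated as a factor of 1000. This is a sketch under that assumption (decimal units; binary units would use 1024), and the helper names `to_bytes` and `humanize` are hypothetical.

```python
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def to_bytes(value: float, unit: str) -> float:
    """Convert a value in the given unit to bytes (decimal: 1 KB = 1000 B)."""
    return value * 1000 ** UNITS.index(unit)

def humanize(num_bytes: float) -> str:
    """Express a byte count in the largest unit that keeps the value at 1 or above."""
    value = float(num_bytes)
    for unit in UNITS:
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:g} {unit}"
        value /= 1000

print(humanize(to_bytes(2.5, "PB")))
```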
Comprehensive List of Big Data Statistics
http://wikibon.org/

The Rapid Growth of Unstructured Data


YouTube users upload 72 hours of new video every minute of
the day
571 new websites are created every minute of the day
Brands and organizations on Facebook receive 34,722 Likes
every minute of the day
100 terabytes of data uploaded daily to Facebook
Data production will be 44 times greater in 2020 than it was in
2009
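
The 44-fold figure implies a yearly growth rate if growth is assumed to compound steadily over the 11 years; that steady-compounding assumption is ours, not the source's.

```python
# Implied compound annual growth if data production in 2020 is 44x that of 2009.
GROWTH_FACTOR = 44          # from the slide
YEARS = 2020 - 2009         # 11 years

annual_multiplier = GROWTH_FACTOR ** (1 / YEARS)
annual_growth_pct = (annual_multiplier - 1) * 100
print(f"about {annual_growth_pct:.0f}% per year")
```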

Comprehensive List of Big Data Statistics
http://wikibon.org/

The Market Challenge with Big Data


Big data is a top business priority and drives enormous
opportunity for business improvement
Customer Churn analysis (the cost of retaining an existing
customer is far less than acquiring a new one)
Government administration could save more than €100 billion
($149 billion) in operational efficiency improvements alone by
using big data
Operational Efficiency: Ratio between the input to run a
business operation and the output gained from the business

Comprehensive List of Big Data Statistics
http://wikibon.org/

Big Data & Real Business Issues


What data to collect?
Poor data can cost businesses 20%-35% of their operating revenue.

Data scientist: "Just give me the data and I'll work out what it is we'll need."
Response: "Well, if you can tell me just exactly what you need, we'll get it for you."
Data scientist: "I'm not going to know what I need until I see it all."
Response: "You really want all the data?"
Data scientist: "Yes, ideally we'd have all the data in its most basic form."
Response: "We've got that on tape drive somewhere."

Big Data is a Hot Topic of Research because Technology Makes it Possible to Analyze All Available Data
Cost-effectively manage and analyze all available data in its native form (unstructured, structured, streaming)

ERP: Business management software that a company can use to collect, store, manage, and interpret data from many business activities, including:
Product planning, cost
Manufacturing or service delivery
Marketing and sales
Inventory management
Shipping and payment

CRM: sales, marketing, customer service, and technical support

Data sources: Website, Social Media, Billing, ERP, Network Switches, CRM, RFID
Common Big data Customer Scenarios

BIG DATA vs. HADOOP

Understand and navigate federated big data sources: Federated Discovery and Navigation
Manage & store huge volume of any data: Hadoop File System, MapReduce
Structure and control data: Data Warehousing
Manage streaming data: Stream Computing
Analyze unstructured data: Text Analytics Engine
Integrate and govern all data sources: ETL, Integration, Data Quality, Security, Lifecycle Management
A Holistic View of a Big Data System
[Diagram: Real-Time Streams; Real-Time Processing (S4, Storm); Analytics; ETL; Real-Time Structured Database (HBase, GemFire, Cassandra); Big SQL (Greenplum, AsterData, etc.); Batch Processing; Unstructured Data (HDFS)]

Limitations of Existing Data Analytics Architecture

Solution: A Combined Storage Compute Layer

Why DFS?
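
One common argument for a distributed file system is read throughput: spreading a file over many machines lets them scan it in parallel. The sketch below is illustrative only; the file size, the number of I/O channels per machine, and the per-channel speed are assumptions, not figures recovered from the slide.

```python
FILE_SIZE_MB = 1_000_000      # assumption: a 1 TB file, in decimal megabytes
CHANNELS_PER_MACHINE = 4      # assumption
MB_PER_SEC_PER_CHANNEL = 100  # assumption

def read_minutes(machines: int) -> float:
    """Minutes to scan the file when it is spread evenly over the machines."""
    throughput = machines * CHANNELS_PER_MACHINE * MB_PER_SEC_PER_CHANNEL
    return FILE_SIZE_MB / throughput / 60

single = read_minutes(1)
cluster = read_minutes(10)
print(f"1 machine: {single:.1f} min, 10 machines: {cluster:.1f} min")
```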

What is Hadoop?
Apache Hadoop is a framework that allows for
the distributed processing of large data sets
across clusters of commodity computers using
a simple programming model

It is an open-source data management framework with
scale-out storage and distributed processing

Scalability: Scale-up or Scale-out
Vertical Scaling (Scale-up): Generally refers to adding more processors and RAM, or buying a more expensive and robust server.
Pros
Lower power consumption than running multiple servers
Lower cooling costs than scaling horizontally
Generally less challenging to implement
Lower licensing costs
Sometimes uses less network hardware than scaling horizontally (a separate topic)
Cons
PRICE, PRICE, PRICE
Greater risk of a hardware failure causing a bigger outage
Generally severe vendor lock-in and limited upgradeability in the future

Horizontal Scaling (Scale-out): Generally refers to adding more servers with fewer processors and less RAM. This is usually cheaper overall and can, in principle, scale without limit (although there are usually limits imposed by software or other attributes of an environment's infrastructure).
Pros
Much cheaper than scaling vertically
Easier to implement fault tolerance
Easy to upgrade
Cons
More licensing fees
Bigger footprint in the data center
Higher utility costs (electricity and cooling)
Possible need for more networking equipment (switches/routers)
Hadoop
Open-source software framework from Apache
Inspired by
Google MapReduce
GFS (Google File System)

HDFS
Map/Reduce

Hadoop Distribution
Microsoft
IBM
Cloudera
Apache
MapR
Hortonworks

Hadoop Key Characteristics

Hadoop enables...
Scalable
New nodes can be added as needed
Cost effective
Hadoop brings massively parallel computing to commodity
servers.
Sizeable decrease in the cost per terabyte of storage
Flexible
Hadoop is schema-less, and can absorb any type of data,
structured or not, from any number of sources.
Fault tolerant
When you lose a node, the system redirects work to another
location of the data and continues processing
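
The fault-tolerance point can be sketched as a lookup over replica locations: when one node is lost, the work goes to another node that holds a copy of the same data. The block-to-node map and the `pick_node` helper are hypothetical and stand in for what the framework does internally.

```python
# Hypothetical block-to-node map: each block is stored on several nodes.
block_locations = {
    "block-1": ["node-a", "node-b", "node-c"],
    "block-2": ["node-b", "node-c", "node-d"],
}

def pick_node(block_id, live_nodes):
    """Return the first live node that holds a copy of the block."""
    for node in block_locations[block_id]:
        if node in live_nodes:
            return node
    raise RuntimeError(f"no live replica for {block_id}")

live = {"node-a", "node-b", "node-c", "node-d"}
before = pick_node("block-1", live)   # served from node-a
live.discard("node-a")                # simulate losing a node
after = pick_node("block-1", live)    # work is redirected to another copy
print(before, "->", after)
```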

RDBMS vs. Hadoop

Hadoop Ecosystem

Hadoop 2.x Core Components

Main Components of HDFS

NameNode Metadata

File Blocks
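
The slide's figure is not available, so the following is only a sketch of the general idea: HDFS stores a file as fixed-size blocks, and each block is replicated. It assumes the Hadoop 2.x defaults of a 128 MB block size and a replication factor of 3; the 500 MB file is a hypothetical example.

```python
BLOCK_SIZE_MB = 128   # HDFS default block size in Hadoop 2.x
REPLICATION = 3       # HDFS default replication factor

def split_into_blocks(file_size_mb: int):
    """Return the sizes (in MB) of the blocks a file would occupy."""
    full, rest = divmod(file_size_mb, BLOCK_SIZE_MB)
    # The last block only takes up as much space as the remaining data.
    return [BLOCK_SIZE_MB] * full + ([rest] if rest else [])

blocks = split_into_blocks(500)            # hypothetical 500 MB file
raw_storage_mb = sum(blocks) * REPLICATION
print(blocks, raw_storage_mb)
```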

HDFS Architecture

Anatomy of a File Read

Anatomy of a File Write

Replication and Rack Awareness
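
The slide's figure is not available. As a rough sketch of the idea, the default HDFS policy for three replicas places the first on the writer's node, the second on a node in a different rack, and the third on another node in that same remote rack. The topology and the `place_replicas` helper below are hypothetical, and the choice of nodes is made deterministic for clarity, whereas HDFS chooses among eligible nodes at random.

```python
# Hypothetical cluster topology: rack name -> nodes in that rack.
topology = {
    "rack1": ["n1", "n2", "n3"],
    "rack2": ["n4", "n5", "n6"],
}

def place_replicas(writer_node: str):
    """Choose three nodes following the default HDFS placement idea."""
    local_rack = next(r for r, nodes in topology.items() if writer_node in nodes)
    remote_rack = next(r for r in topology if r != local_rack)
    first = writer_node                  # replica 1: the writer's own node
    second = topology[remote_rack][0]    # replica 2: a node on a different rack
    third = topology[remote_rack][1]     # replica 3: another node on that remote rack
    return [first, second, third]

placement = place_replicas("n2")
print(placement)
```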

Hadoop 2.x Cluster Architecture

Hadoop 2.x Cluster Architecture - Federation
[Figure labels: R data, Finance data, Marketing data]

Hadoop 2.x High Availability

Data sync

Hadoop 2.x Resource Management

YARN: Moving beyond MapReduce

Hadoop Cluster - Facebook
Facebook uses Hadoop to store copies of internal log and dimension data sources and uses it as a source for reporting/analytics and machine learning.
2 major clusters:
1100-machine cluster with 8800 cores & about 12 PB raw storage
300-machine cluster with 2400 cores & about 3 PB raw storage
Each node has 8 cores & 12 TB of storage
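
The cluster totals follow from the per-node figures. The sketch below multiplies them out, assuming every node is identical and using decimal units (1 PB = 1000 TB).

```python
CORES_PER_NODE = 8   # from the slide
TB_PER_NODE = 12     # from the slide

def cluster_totals(machines: int):
    """Return (total cores, raw storage in PB) for a cluster of identical nodes."""
    cores = machines * CORES_PER_NODE
    raw_pb = machines * TB_PER_NODE / 1000   # decimal units: 1 PB = 1000 TB
    return cores, raw_pb

for machines in (1100, 300):
    cores, raw_pb = cluster_totals(machines)
    print(f"{machines} machines: {cores} cores, {raw_pb:.1f} PB raw")
```

The core counts match the quoted 8800 and 2400 exactly. The storage totals come out somewhat above the quoted "about 12 PB" and "about 3 PB", which may mean the 12 TB per node is a nominal maximum.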
Hadoop 2.x Configuration files

Data Loading Techniques & Data Analysis

MapReduce Way

Why MapReduce?
Two Advantages:
Taking processing to the data
Processing data in parallel
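
Both advantages show up in a word count, the usual first MapReduce example. This is a single-machine sketch in plain Python, not Hadoop's API: the input splits are hypothetical, and a thread pool stands in for the cluster nodes that would each map their own local split.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical input splits; in Hadoop each split would live on a different node.
splits = [
    "big data is big",
    "data in motion",
    "data at rest",
]

def map_phase(split: str):
    """Map: emit (word, 1) for every word in one split."""
    return [(word, 1) for word in split.split()]

def reduce_phase(pairs):
    """Reduce: sum the counts for each word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

# Each split is mapped independently, so the map calls can run in parallel.
with ThreadPoolExecutor() as pool:
    mapped = list(pool.map(map_phase, splits))

# Shuffle: bring all intermediate pairs together before reducing.
all_pairs = [pair for part in mapped for pair in part]
word_counts = reduce_phase(all_pairs)
print(word_counts)
```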

Solving the Problem with MapReduce

Hadoop 2.x MapReduce Architecture

Hadoop 2.x MapReduce Components

Application Workflow

MapReduce Paradigm

Unifying the Big Data Platform using
Virtualization
Goals
Make it fast and easy to provision new data Clusters
on Demand
Allow Mixing of Workloads
Leverage virtual machines to provide isolation (esp.
for Multi-tenant)

Unifying the Big Data Platform using
Virtualization
Leveraging Virtualization
Elastic scale
Use high availability to protect key services, e.g., Hadoop's NameNode
Resource controls and sharing: re-use underutilized memory and CPU

Use Local Disk Where It's Needed

Storage type     Cost per gigabyte   $1M gets: capacity   $1M gets: IOPS   $1M gets: throughput
SAN Storage      $2 - $10            0.5 petabytes        200,000          1 GB/sec
NAS Filers       $1 - $5             1 petabyte           400,000          2 GB/sec
Local Storage    $0.05               20 petabytes         10,000,000       800 GB/sec
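
The capacity figures can be reproduced from the prices. The sketch assumes the lowest quoted price per gigabyte for each option and decimal units (1 PB = 1,000,000 GB).

```python
BUDGET_USD = 1_000_000

# Lowest quoted price per gigabyte for each option (from the slide).
price_per_gb = {"SAN Storage": 2.00, "NAS Filers": 1.00, "Local Storage": 0.05}

def capacity_pb(budget: float, usd_per_gb: float) -> float:
    """Capacity in petabytes that a budget buys (1 PB = 1,000,000 GB)."""
    return budget / usd_per_gb / 1_000_000

capacities = {name: capacity_pb(BUDGET_USD, p) for name, p in price_per_gb.items()}
for name, pb in capacities.items():
    print(f"{name}: {pb:g} PB")
```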

Text Analytics for Large Unstructured Information

Digital Data (1960)
Data Warehouse (1980)
Data Mining (1990)
Big Data Analytics (1960)

Descriptive analysis (What happened?)
Diagnostic analysis (Why did it happen?)
Predictive analysis (What is going to happen?)
Prescriptive analysis (What shall we do?)

INFORM ANALYSIS ACT

Past vs. Future

Text Mining
Discover useful and previously unknown
gems of information from a large collection of text
Patterns
Trends
Associations
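
As one illustration of finding associations, terms that often appear in the same document can be counted as pairs. The sketch below is a simple co-occurrence count, not a full text-mining pipeline; the documents and the threshold of two are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical mini-collection of documents.
documents = [
    "fever cough fatigue",
    "fever cough headache",
    "cough fatigue",
]

pair_counts = Counter()
for doc in documents:
    terms = sorted(set(doc.split()))             # unique terms, stable order
    pair_counts.update(combinations(terms, 2))   # every pair seen together

# Pairs that co-occur in at least two documents are candidate associations.
associations = {pair: n for pair, n in pair_counts.items() if n >= 2}
print(associations)
```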

Search vs. Discover

                     Search (goal oriented)    Discover (opportunistic)
Structured data      Data Retrieval            Data Mining
Unstructured data    Information Retrieval     Text Mining

Analytic Stack

Hadoop, Hbase,

HDFS

Hardware servers

Live Datasets
PubMed medical literature abstracts
Twitter API
Yahoo API
Yelp reviews data
Foursquare
...
Implementation of Big Data
Platforms for Large-scale Data Analysis
Parallel DBMS technologies
Proposed in late eighties
Matured over the last two decades
Multi-billion dollar industry: Proprietary DBMS Engines intended as
Data Warehousing solutions for very large enterprises
MapReduce
Pioneered by Google
Popularized by Yahoo! (Hadoop)

Implementation of Big Data
MapReduce
Overview:
Data-parallel programming model
An associated parallel and distributed implementation for commodity clusters
Pioneered by Google
Processes 20 PB of data per day
Popularized by open-source Hadoop
Used by Yahoo!, Facebook, Amazon, and the list is growing

Parallel DBMS technologies
Popularly used for more than two decades
Relational Data Model
Indexing
Familiar SQL interface
Advanced query optimization

Implementation of Big Data
MapReduce vs. Parallel DBMS

Feature | Parallel DBMS | MapReduce
Schema Support | Yes | Not out of the box
Indexing | Yes | Not out of the box
Programming Model | Declarative (SQL) | Imperative (C/C++, Java, ...); extensions through Pig and Hive
Optimizations (Compression, Query Optimization) | Yes | Not out of the box
Flexibility | Not out of the box | Yes
Fault Tolerance | Coarse grained techniques | Yes
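
The programming-model difference can be seen by computing the same per-key total both ways. This is a sketch using Python's built-in `sqlite3` module as a stand-in for a SQL engine and a plain loop as a stand-in for hand-written map and reduce steps; the sales records are hypothetical.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sales records: (region, amount).
sales = [("north", 10), ("south", 5), ("north", 7)]

# Declarative: say what result is wanted and let the engine decide how.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", sales)
declarative = dict(
    conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
)
conn.close()

# Imperative (MapReduce style): spell out the map and reduce steps by hand.
imperative = defaultdict(int)
for region, amount in sales:          # map: emit (region, amount)
    imperative[region] += amount      # reduce: sum per key

print(declarative == dict(imperative))
```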
Applications for Big Data Analytics
Smarter Healthcare
Multi-channel sales
Finance
Log Analysis
Homeland Security
Traffic Control
Telecom
Search Quality
Manufacturing
Trading Analytics
Fraud and Risk
Retail: Churn analysis

Big Data Driven by Real-World Benefit

Fraud detection in stock markets
Twitter trend analysis
Google trend analysis
Location-aware recommendations
Sentiment analysis
Health care systems
...


Health care
Health care management for cancer survivors
Parental stress
Health status of survivors
Physical stress factors
Psychological stress factors
Psychosocial factors
Impact of family
Health care demands

Potential Applications
Healthcare Recommendation system
Patient-driven health social network
To find health related resources such
as clinical trials, physician question
and answers, emotional support, etc.
Doctor recommender system
Patient Secured ratings

Recommendation data Secure computation

Potential Applications
Personalized health education system
User modeling and document modeling
Similarity matching
Personalized resources to users
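
One way to read this pipeline is as bag-of-words matching: model the user and each document as word counts, then rank documents by cosine similarity to the user. The slide does not name a similarity measure, so cosine similarity is an assumption here, and the user profile and documents are hypothetical.

```python
import math
from collections import Counter

def to_vector(text: str) -> Counter:
    """Model a user profile or a document as a bag of words."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical user profile and document collection.
user = to_vector("diabetes diet exercise")
docs = {
    "doc1": to_vector("diet and exercise tips for diabetes"),
    "doc2": to_vector("managing stress after surgery"),
}

# Rank documents by similarity to the user model.
ranked = sorted(docs, key=lambda d: cosine(user, docs[d]), reverse=True)
print(ranked)
```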


Nursing care plan recommender system
Recommends all the required items to nurses to create effective, comprehensive care plans for their patients
Other Aspects of Big Data
Provocations for Big Data
1- Bigger Data are not always Better data

2- Not all Data are equivalent

3- Just because it is accessible doesn't make it ethical

Can we Avoid Big Data?

YES
YES
YES

How Can we Avoid Big Data?
Pay cash for everything!
Never go online!
Don't use a telephone!
Don't use smart cards!
Don't fill any prescriptions!
Never leave your house!

Summary
Big Data is Unavoidable
Greater Opportunities in
Financial Services
Retail
Manufacturing
Healthcare
Web/Social/Mobile
Government

Thank you all
