BDM Review Session
V Trimester
Session 1-5 - Overview
Course Objective
• Explain the basic concepts behind deriving value from Big Data and
its importance to Businesses
• Unit II: Big Data and the Business Case, Building the Big Data
Team
• Grading: A (9 marks), B (7 marks), C (5 marks) and D (3 marks)
TEAM Presentation Topics
More Clarity on 4th and 5th V
• Veracity
• SNR -Signal to Noise Ratio
• Example
• Data acquired in a controlled manner (e.g. online customer registration) usually contains less
noise than data acquired via uncontrolled sources such as blog postings.
• Value
• Usefulness of data, which may last 20 years or only 20 minutes
• Example: a stock quote delayed by 20 minutes has no value
DIKW Framework
• Data
• Information
• Knowledge
• Wisdom
Harnessing Big Data
Who’s Generating Big Data
Mobile devices
(tracking all objects all the time)
• Progress and innovation are no longer hindered by the ability to collect data
• They are limited instead by the ability to manage, analyze, summarize, visualize, and discover
knowledge from the collected data in a timely manner and in a scalable fashion
Big Data Definition (1)
“Big Data” is data whose scale, diversity, and complexity require new
architecture, techniques, algorithms, and analytics to manage it and
extract value and hidden knowledge from it…
Big Data Definition (2)
• Many firms owe their sole existence to their capability to
generate insights that only Big Data can deliver.
Big Data Analysis
• BDA enables data-driven decision making with
scientific backing, so that decisions can be based on
factual data and not simply on past experience or
intuition alone.
• Example 1
• The number of ice-creams sold is related to daily
temperature.
• Example 2
• Tyre sales data and road construction works.
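Example 1 can be checked with a simple correlation measure. The sketch below is only illustrative: the temperature and sales figures are made up, and `pearson` is a hand-written helper rather than a function from any particular tool used in the course.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: daily temperature (deg C) and ice-creams sold that day.
temperature = [18, 21, 24, 27, 30, 33]
ice_creams_sold = [110, 135, 160, 200, 240, 270]

r = pearson(temperature, ice_creams_sold)
print(f"correlation = {r:.2f}")
```

A value close to +1 suggests the two series move together; it does not by itself show that temperature causes the sales.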
Big Data Analytics
• Process of collecting, organizing and analysing large
sets of data (called Big Data) to discover patterns and
other useful information
• Four main categories
S.No.  Analytics Type  Value      Complexity
1      Descriptive     Hindsight  Very Low
2      Diagnostic      Insights   Low
3      Predictive      Insights   High
4      Prescriptive    Foresight  Very High
1. BDA - Descriptive Analytics
• It answers questions about events that have already
occurred.
• Example 1
• What was the sales value over the last 12 months?
• Example 2
• What is the monthly commission earned by each sales
agent?
• Descriptive analytics is carried out via ad-hoc
reporting; the reports are static in nature and display
historical data (ERM, ERP, OLTP).
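Example 2 amounts to a historical aggregation. A minimal sketch, assuming made-up sales records and a flat 5% commission rate (both hypothetical, not from the course material):

```python
from collections import defaultdict

# Hypothetical sales records: (agent, month, sale_amount).
sales = [
    ("Asha", "2022-01", 1000.0),
    ("Asha", "2022-01", 500.0),
    ("Ravi", "2022-01", 2000.0),
    ("Asha", "2022-02", 800.0),
]
COMMISSION_RATE = 0.05  # assumed flat 5% commission

def monthly_commission(records, rate):
    """Sum commission per (agent, month); a purely historical summary."""
    totals = defaultdict(float)
    for agent, month, amount in records:
        totals[(agent, month)] += amount * rate
    return dict(totals)

report = monthly_commission(sales, COMMISSION_RATE)
for (agent, month), commission in sorted(report.items()):
    print(agent, month, round(commission, 2))
```

The report only describes what already happened; it says nothing about why, which is where diagnostic analytics comes in.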
2. BDA - Diagnostic Analytics
• It aims to determine the cause of a phenomenon
that occurred in the past, using questions that focus on
the reason behind the event.
• Example 1
• Why were Q2 sales lower than Q1 sales?
• Example 2
• Why has there been an increase in patient re-admission
rates over the past 3 months?
• Drill-down and roll-up analysis
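Roll-up and drill-down are the same aggregation at different levels of detail. The sketch below uses hypothetical sales rows and a hypothetical `aggregate` helper to show how Example 1 might be investigated: roll up to quarters to see the drop, then drill down by region and product to locate it.

```python
from collections import defaultdict

# Hypothetical sales rows: (quarter, region, product, amount).
rows = [
    ("Q1", "North", "A", 120), ("Q1", "North", "B", 80),
    ("Q1", "South", "A", 100), ("Q1", "South", "B", 90),
    ("Q2", "North", "A", 115), ("Q2", "North", "B", 85),
    ("Q2", "South", "A", 60), ("Q2", "South", "B", 88),
]

def aggregate(data, key_indexes):
    """Sum amounts grouped by the chosen columns (fewer columns = roll-up)."""
    totals = defaultdict(int)
    for row in data:
        totals[tuple(row[i] for i in key_indexes)] += row[3]
    return dict(totals)

by_quarter = aggregate(rows, [0])        # roll-up: quarter only
by_region = aggregate(rows, [0, 1])      # drill-down: quarter + region
by_product = aggregate(rows, [0, 1, 2])  # drill-down further: + product

print(by_quarter)
print(by_region)
```

In this made-up data the quarterly drop traces back to product A in the South region, which is the kind of answer a diagnostic question is after.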
3. BDA - Predictive Analytics
• An attempt to determine the outcome of an event that might
occur in the future.
• The strength of the associations found in past data forms the
basis of models that are used to generate future predictions
based upon past events.
• Models used for PA have implicit dependencies on the
conditions under which the past events occurred.
• If these underlying conditions change, then the models that
make predictions need to be updated.
• Example 1 – If a customer has purchased products A and B,
what are the chances that they will also purchase C?
• Example 2 - Who is likely to cancel the product that was
ordered through e-commerce portal?
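Example 1 can be estimated directly from past transactions as a conditional frequency. This is a minimal sketch with hypothetical baskets; the `confidence` helper is an assumption for illustration, and a real predictive model would typically be richer than a single ratio.

```python
# Hypothetical purchase baskets, one set of products per customer.
baskets = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "B", "C"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "C", "D"},
]

def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) from past transactions."""
    matching = [t for t in transactions if antecedent <= t]
    if not matching:
        return None  # no past evidence for this combination
    hits = sum(1 for t in matching if consequent in t)
    return hits / len(matching)

chance = confidence(baskets, {"A", "B"}, "C")
print(f"Estimated chance of buying C after A and B: {chance:.2f}")
```

The estimate depends entirely on the conditions under which the past baskets were recorded, which is why such models need updating when those conditions change.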
4. BDA - Prescriptive Analytics
• Prescriptive analytics builds upon the results of predictive
analytics by prescribing actions that should be taken.
• It provides results that can be reasoned about because they
embed elements of situational understanding.
• This kind of analytics can be used to gain an advantage or
mitigate a risk.
• Example 1 – Among these drugs which one provides the
best results?
• Example 2 – When is the best time to trade a particular
stock?
• This approach shifts from explanatory to advisory and can
include the simulation of various scenarios
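The idea of simulating scenarios and recommending an action can be sketched as follows. Everything here is assumed for illustration: the linear demand curve stands in for a predictive model, and the unit cost and candidate prices are made up.

```python
# Hypothetical predictive model: expected units sold at a given price.
def predicted_demand(price):
    return max(0.0, 500 - 4 * price)  # assumed linear demand curve

UNIT_COST = 40.0  # assumed cost per unit

def simulate_profit(price):
    """Scenario outcome: expected profit if we sell at this price."""
    return (price - UNIT_COST) * predicted_demand(price)

def prescribe(candidate_prices):
    """Evaluate every scenario and recommend the most profitable action."""
    return max(candidate_prices, key=simulate_profit)

scenarios = [50, 60, 70, 80, 90, 100, 110]
best_price = prescribe(scenarios)
print("Recommended price:", best_price)
print("Expected profit:", simulate_profit(best_price))
```

The prediction (demand at a price) feeds the prescription (which price to choose), which is the shift from explanatory to advisory described above.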
Big Data Ecosystem
Big Data in Management
Unit II
Content
• 1. Origins of Big Data Analytics
• 2. Different Types of Data Sources
• 3. Skill Sets Needed
• 4. How to integrate Big Data into a corporate culture
• 5. Public and Private sources of data
• 6. Storage, Processing Power, Platforms
• 7. Security, Compliance and Auditing
• 8. Short-term and Long-term Changes
• 9. Best Practices for data analysis
• 10. Data Pipeline, Value creation
Big Data – Introduction (1)
• Data
• Can be anything from “something recorded” to “everything under the sun”
• Recording and preserving that data has always been the challenge, and
technology has limited the ability to capture and preserve data
What’s in the Big Data Toolkit?
(Figure columns: Statistical Methods, Tools, User Experience Research)
• Statistical methods shown:
• Multilevel modeling
• Missing data imputations
• Classification and clustering
• Survival analysis
• Pattern recognition
• Principal component and factor analysis
• AB testing
• Machine learning
• Forecasting
• Propensity score matching
• Logistic, multinomial and multiple linear regression techniques
• Network analysis
[Figure: CRM, OLAP, Data Warehouse, Ad hoc Querying, Legacy, Third Party, Modelling, Apps]
Sources of Big Data
• Data Storage
• Archives
• Media
• Sensor Data
• Docs
• Public Web
• Social Media
Hadoop Ecosystem
• Tools: FLUME, OOZIE, MAHOUT, …
• Core Components: MapReduce Programming, HDFS
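The MapReduce programming model can be imitated in a single Python process to show its three steps (map, shuffle, reduce). This is only a sketch of the idea using word count: it does not use the Hadoop API, and the function names and input strings are hypothetical.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)

# Hypothetical input splits; on a real cluster these would be HDFS blocks.
splits = ["big data big value", "data drives value"]

mapped = [pair for split in splits for pair in map_phase(split)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

On a cluster, the map calls run in parallel on the nodes that hold each block, and the framework performs the shuffle between the two phases.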
• Health Care
• Electronic medical records and images
• Public health monitoring
• Epidemiological research programs
• Government
• Digitizing public records like census information, energy usage, budgets,
Freedom of Information Act documents, electoral data, law
enforcement reporting
4. Big Data Resources – Various Industries (2)
• Entertainment media
• Digital recording, production and delivery
• Collecting rich content and user viewing behaviors
• Life Sciences
• Low-cost gene sequencing (less than $1000) can generate
tens of terabytes of information that is analyzed for genetic
variations and potential treatment effectiveness.
4. Big Data Resources – Various Industries (3)
• Video Surveillance
• Recording systems that organizations want to analyze for
behavioral patterns, security and service enhancement
[Figure: CRM, OLAP, Ad hoc Querying, Legacy, MapReduce, Third Party, Modeling, Apps]
NoSQL
• Easy scalability
• Fast in-place updates
• Replication
• Storage capacity
- is inexpensive and constantly dropping in price
• The Evolution of Big Data; Big Data: The Modern Era; Today,
Tomorrow, and the Next Day; Changing Algorithms; Best
Practices for Big Data Analytics; Start Small with Big Data;
Thinking Big; Avoiding Worst Practices; Baby Steps; The
Value of Anomalies; Expediency versus Accuracy;
In-Memory Processing
Security, Compliance,
Auditing and Protection
Chapter 7
• Activity logs – to prevent the logs from being exposed, the best method
may be to delete them after their usefulness ends.
• All of the data are unique to the moment, and if they are lost, they are
impossible to recreate.
• 4 Goals
• Control Access by process, not job function
• Secure the data at rest
• Protect the cryptographic keys and store them separately from the data
• Create trusted applications and stacks to protect data from rogue users.
• https://www.youtube.com/watch?v=2Gc9wj56ibc
• Patient, Doctors, Specialists, Diagnosis, Insurance Premium
03/07/2022 V. Senthil, BDM 114
BDM adoption and planning considerations (1)
• Organization prerequisites
- outdated, invalid or poorly identified data will
result in low quality results
• Data procurement
- external data sources (Govt data source,
commercial data markets)
• Privacy and Security
• Provenance
- used for auditing purposes
… have created a multibillion-dollar industry that has added value to collected data.
• GOOGLE SERVICES
• Language Translator
• There is a great public fear about the inappropriate use of personal data, particularly
through the linking of data from multiple sources.
• Differential Privacy
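One common way to illustrate differential privacy is the Laplace mechanism for a counting query: noise scaled to 1/epsilon is added to the true count so that any single person's record has little effect on the published answer. The sketch below is an assumption-laden illustration (hypothetical ages, hypothetical function names), not a production-grade privacy implementation.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def private_count(records, predicate, epsilon, rng):
    """Noisy count: a counting query has sensitivity 1, so scale = 1/epsilon."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1 / epsilon, rng)

# Hypothetical patient ages; the query asks how many are over 60.
ages = [34, 67, 71, 45, 62, 58, 80, 29]
rng = random.Random(0)  # seeded only to make the sketch repeatable
noisy = private_count(ages, lambda age: age > 60, epsilon=0.5, rng=rng)
print(f"noisy count = {noisy:.2f} (true count is 4)")
```

A smaller epsilon means more noise and stronger privacy; a larger epsilon gives answers closer to the true count.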
BDM Technical Concepts
Map Reduce, Oozie, Hive, PIG, Cassandra, SPARK
HIVE architecture (figure): Command Line Interface, Web Interface, Meta Data, Driver (Compiler, Optimizer, Executor)
Big Data – Technical Concepts(6)
• PIG
• A high level scripting language/platform for data manipulation, that is used
with Hadoop & MapReduce.
• Pig offers greater procedural control over data flows, and thus excels at solving
problems such as ETL that require great control over data flows.
• Pig Latin – provides a rich set of data types and functions, and operators to
perform various operations on the data.
Big Data – Technical Concepts(7)
• CASSANDRA
• More recent and popular scalable open source non-relational database that
offers continuous uptime, simplicity and easy data distribution across multiple
data centres and cloud.
• SPARK
• It offers built-in libraries for Machine Learning, Graph processing, Stream Processing
and SQL to deliver seamless, superfast data processing along with high programmer
productivity.
Data Ownership & Privacy
Europe - GDPR
• GDPR – General Data Protection Regulation
• https://www.youtube.com/watch?v=UhXRT_QM_uE
• The SimpleKMeans algorithm uses the Euclidean distance measure to compute distances between
instances and clusters.
• Dataset : bank-data.csv
• Cluster
• Choose : SimpleKMeans / EM / HierarchicalClusterer
• Classes to clusters evaluation : Variable selection (NOM)
• Check the number of clusters and Incorrectly clustered instances
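To make the role of the Euclidean distance concrete, here is a plain k-means sketch in Python. It is not the SimpleKMeans implementation itself (which has its own handling of attributes and options); the customer rows are hypothetical and are not taken from bank-data.csv.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two numeric points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def k_means(points, centroids, iterations=10):
    """Plain k-means: assign to nearest centroid, then recompute the means."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
            if cluster else centroid  # keep an empty cluster's centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical (age, income in thousands) rows for six customers.
customers = [(25, 20), (27, 22), (30, 25), (55, 60), (58, 65), (60, 62)]
centroids, clusters = k_means(customers, centroids=[(25, 20), (60, 62)])
print(centroids)
print(clusters)
```

Because the distance mixes all attributes, attributes on very different scales are usually normalized first; this sketch skips that step for brevity.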
Example 7 – Air Traffic Passenger Statistics
• TRY IT
• Data set - Air Traffic Passenger Statistics.csv
• Activity Period
• Operating Airline
• Operating Airline IATA Code
• Published Airline
• Published Airline IATA Code
• GEO Summary
• GEO Region
• Activity Type Code
• Price Category Code
• Terminal
• Boarding Area
• Passenger Count
• Adjusted Activity Type Code
• Adjusted Passenger Count
• Year
• Month
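As a starting point for the TRY IT exercise, passenger counts can be totalled per airline with Python's `csv` module. The sketch below runs on a few made-up rows that reuse a subset of the column names listed above; the airline names and numbers are hypothetical, and with the real file the `io.StringIO` object would be replaced by an opened file.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows using a subset of the column names listed above.
sample_csv = io.StringIO(
    "Activity Period,Operating Airline,GEO Region,Passenger Count\n"
    "200507,Airline X,US,1000\n"
    "200507,Airline Y,Asia,400\n"
    "200508,Airline X,US,1200\n"
)

def passengers_by(reader, column):
    """Total the Passenger Count column grouped by another column."""
    totals = defaultdict(int)
    for row in reader:
        totals[row[column]] += int(row["Passenger Count"])
    return dict(totals)

totals = passengers_by(csv.DictReader(sample_csv), "Operating Airline")
print(totals)
```

Passing a different column name, such as "GEO Region", gives the same summary along another dimension.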
KNIME
KNIME WORKBENCH
KNIME – Example 1 – Row Filter
Example 2 - Churn Prediction Model: Training
• Data set : TELCO
• Predict the customers who are going to quit the contract.
• Building a basic model for churn prediction with KNIME
• Churn Prediction - Training: Reader node, Number to String, Color Manager
• Churn Prediction - Evaluation: Partitioning node, Decision Tree Learner, Decision Tree
Predictor, Scorer node
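The partition, learn, predict and score steps of the workflow can be imitated in a few lines of Python. To stay short, the sketch swaps in a decision stump (a one-level decision tree on a single feature) for KNIME's Decision Tree Learner; the data is made up and is not the TELCO data set.

```python
import random

# Hypothetical customers: (monthly_service_calls, churned).
data = [(0, False), (1, False), (1, False), (2, False), (2, False),
        (3, False), (4, True), (5, True), (5, True), (6, True),
        (7, True), (8, True)]

def partition(rows, train_fraction, seed):
    """Partitioning: shuffle, then split into training and test sets."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

def learn_stump(train):
    """Learner: pick the threshold with the fewest training errors."""
    def errors(t):
        return sum((x > t) != y for x, y in train)
    return min(sorted({x for x, _ in train}), key=errors)

def predict(threshold, x):
    """Predictor: a customer above the threshold is predicted to churn."""
    return x > threshold

def score(threshold, test):
    """Scorer: fraction of test rows predicted correctly."""
    return sum(predict(threshold, x) == y for x, y in test) / len(test)

train, test = partition(data, 0.75, seed=1)
threshold = learn_stump(train)
print("threshold:", threshold, "accuracy:", score(threshold, test))
```

Scoring on rows that were held out of training is the point of the partitioning step: it estimates how the model would behave on customers it has not seen.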
KNIME - Churn Prediction Model
Churn Prediction : Deployment
• Deployment checks the chance of churn for new customers.
KNIME LDA Example 3 – Amazon Data
KNIME Example 4 – Wine Quality