BDM Review Session


Big Data in Management

V Trimester
Session 1-5 - Overview
Course Objective
• Explain the basic concepts behind deriving value from Big Data and its importance to businesses

• Develop the skill sets needed to successfully extract intelligence and value out of data sets

• Explain how to integrate Big Data into a corporate culture

• Discuss best practices for data analysis


Course Content
• Unit I: Big Data Concepts, Why Big Data Matters

• Unit II: Big Data and the Business Case, Building the Big Data
Team

• Unit III: Big Data Sources

• Unit IV: Security, Compliance, Auditing, and Protection;

• Unit V: Practical: Weka and KNIME


Evaluation Pattern

Evaluation parameter                                 Marks

Quiz                                                    10
Team Presentation (Evaluation: Individual & Team)       10
Class Participation (assignments, interactions)         10
Mid-term Test                                           30
Total Internal                                          60
End term (exam for 100 marks)                           40
Total marks                                            100

03/07/2022 V. Senthil, E-Commerce 4


BDM Course Delivery Pattern

Course Delivery                              Number of Sessions

Regular Classroom Sessions - Theory          12
BDM Lab Sessions                             6
Team Presentation - Case Discussion          6
*6 students per team on the specified topic & date



Rules for Team Presentation - Allocation
• Team Size – 6 members / team
• 15 teams (MBA 8 + PGDM 7) = 90 students
• Total duration is 20 mins / team (6 × 3 = 18 mins + 2 mins buffer)
• Total presentation – 10 marks; every team member's contribution is a must
• Presentation is on an HBR case study (case selection?)

• Grading:
A (9 marks),
B (7 marks),
C (5 marks) and
D (3 marks)
TEAM Presentation Topics

Date of Presentation    Presentation Teams

23/11/2021              Teams 1, 2, 3
27/11/2021              Teams 4, 5, 6
29/11/2021              Teams 7, 8, Reserve 1
03/01/2022              Teams 9, 10, 11
04/01/2022              Teams 12, 13, 14
06/01/2022              Team 15, Reserve 2

• Evaluation is based on both Team and Individual presentation


Case Titles
• Groups I
1. Koho Financial Inc.: Facing a New Banking Era
2. 111, Inc.: Envisioning the Future of Health Care
3. Podium Data
4. Verisk
5. DOW Chemical
6. Supply Chain Collaboration at JD.com
7. Bairong
8. THE YES
• Groups II
9. Predicting Consumer Tastes with Big Data at Gap (9-517-115)
10. Marcpoint: Strategizing with Big Data
11. Bossard AG: Enabling Industry 4.0 Logistics, Worldwide
12. UCB: Data is the New Drug
13. RBC: Social Network Analysis
14. Shanghai Pharmaceuticals: Seeking A Prescription For Digital Transformation
15. An Introductory Note on Big Data and Data Analytics for Accountants and Auditors
MBA & PGDM Students Presentation Details
Team No.  Topic    Presenting Students' Registration Nos.                         Date
1 Case 1 2013044, 2013015, 2013060, 2013066, 2013070, 2013081 23/11
2 Case 2 2013038, 2013039, 2013049, 2013068, 2013072, 2013077 23/11
3 Case 3 2011005, 2011011, 2011040, 2011073, 2011085, 2011115 23/11
4 Case 4 2011006, 2011017, 2011018, 2011023, 2011045, 2011090 27/11
5 Case 5 2011004, 2011013, 2011055, 2011060, 2011065, 2011105 27/11
6 Case 6 2013008, 2013009, 2013013, 2013089, 2013112, 2013113 27/11
7 Case 7 2011002, 2011003, 2011009, 2011024, 2011067, 2011116 29/11
8 Case 8 2013020, 2013053, 2013067, 2013087, 2013090, 2011122 29/11, R1
9 Case 9 2011012, 2011026, 2011074, 2011091, 2011104, 2011121 03/01
10 Case 10 2011027, 2011057, 2011066, 2011086, 2011087, 2011098 03/01
11 Case 11 2011031, 2011075, 2011078, 2011092, 2011099, 2011108 03/01
12 Case 12 2011007, 2011010, 2011030, 2011041, 2011117, 2011118 04/01
13 Case 13 2013019, 2013025, 2013027, 2013040, 2013055, 2013057 04/01
14 Case 14 2013028, 2013030, 2013042, 2013045, 2013046, 2013050 04/01
15 Case 15 2013001, 2013035, 2013075, 2013085, 2013099, 2013109 06/01, R2



Bits and bytes
• Byte (8 bits)
• Kilobyte (1000 bytes)
• Megabyte (1 000 000 bytes)
• Gigabyte (1 000 000 000 bytes)
• Terabyte (1 000 000 000 000 bytes)
• Petabyte (1 000 000 000 000 000 bytes)
• Exabyte (1 000 000 000 000 000 000 bytes)
• Zettabyte (1 000 000 000 000 000 000 000 bytes)
• Yottabyte (1 000 000 000 000 000 000 000 000 bytes)
• Xenottabyte (10^27 bytes) – informal name; the SI prefix adopted in 2022 is ronna- (ronnabyte)
• Shilentnobyte (10^30 bytes) – informal name; the SI prefix adopted in 2022 is quetta- (quettabyte)
• Domegemegrottebyte (10^33 bytes) – informal name with no standard SI equivalent
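The ladder above is just repeated multiplication by 1,000 (decimal, SI-style; operating systems and storage vendors sometimes use binary 1,024-based units instead). A small sketch in Python:

```python
# Decimal (SI-style) byte units: each step up the ladder is x1000.
UNITS = ["byte", "kilobyte", "megabyte", "gigabyte", "terabyte",
         "petabyte", "exabyte", "zettabyte", "yottabyte"]

def bytes_in(unit: str) -> int:
    """Number of bytes in one `unit`, using decimal 1000^n steps."""
    return 1000 ** UNITS.index(unit)

def human_readable(n_bytes: int) -> str:
    """Render a byte count with the largest unit whose value is >= 1."""
    for power in range(len(UNITS) - 1, -1, -1):
        if n_bytes >= 1000 ** power:
            return f"{n_bytes / 1000 ** power:g} {UNITS[power]}"
    return f"{n_bytes} byte"
```

For example, `human_readable(1_500_000)` renders 1.5 megabytes.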
Some Make it 4V’s

More Clarity on 4th and 5th V
• Veracity
• SNR -Signal to Noise Ratio
• Example
• Data acquired in a controlled manner (e.g., online customer registration) usually has less noise than data acquired via uncontrolled sources such as blog postings.
• Value
• Data may stay useful for 20 years or for only 20 minutes
• Example – a stock price delayed by 20 minutes has no value for live trading
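Veracity can be made concrete with a toy signal-to-noise calculation. The data and the mean-over-standard-deviation convention below are illustrative assumptions (other SNR definitions, e.g. power ratios, exist):

```python
import statistics

def snr(values):
    """Signal-to-noise ratio as mean / standard deviation -- one common
    convention, used here purely for illustration."""
    return statistics.mean(values) / statistics.stdev(values)

# Controlled acquisition: form-validated ages from a registration page.
registered_ages = [34, 35, 33, 36, 34, 35]

# Uncontrolled acquisition: ages scraped from free-text blog posts, with junk.
scraped_ages = [34, 0, 33, 250, 34, 35]
```

The controlled sample yields a far higher SNR than the scraped one, which is the slide's point about veracity.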
DIKW Framework
• Data
• Information
• Knowledge
• Wisdom
Harnessing Big Data

• OLTP: Online Transaction Processing (DBMSs)


• OLAP: Online Analytical Processing (Data Warehousing)
• RTAP: Real-Time Analytics Processing (Big Data Architecture & technology)

Who’s Generating Big Data

• Mobile devices (tracking all objects all the time)
• Social media and networks (all of us are generating data)
• Scientific instruments (collecting all sorts of data)
• Sensor technology and networks (measuring all kinds of data)

• Progress and innovation are no longer hindered by the ability to collect data
• The challenge is to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion

Big Data Definition (1)

“Big Data” is data whose scale, diversity, and complexity require new
architecture, techniques, algorithms, and analytics to manage it and
extract value and hidden knowledge from it…

Big Data Definition (2)
• Many firms' sole existence is based upon their capability to generate insights that only Big Data can deliver.

Big Data Analysis
• BDA enables data-driven decision making with
scientific backing, so that decisions are based on
factual data and not simply on past experience or
intuition alone.

• Example 1
• The number of ice-creams sold is related to daily
temperature.

• Example 2
• Tyre sales data and road construction works.

Big Data Analytics
• Process of collecting, organizing, analysing large
sets of data (called Big Data) to discover patterns and
other useful information
• Four main categories
S.No.  Analytics Type   Value       Complexity
1      Descriptive      Hindsight   Very Low
2      Diagnostic       Insights    Low
3      Predictive       Insights    High
4      Prescriptive     Foresight   Very High
1. BDA - Descriptive Analytics
• It answers questions about events that have already
occurred.
• Example 1
• What was the sales value of the last 12 months?
• Example 2
• What is the monthly commission earned by each sales
agent?
• Descriptive analytics is carried out via ad-hoc
reporting; it is static in nature and displays historical data
(ERM, ERP, OLTP)
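Both example questions reduce to simple aggregation over historical records. A minimal sketch with hypothetical sales data (the 5% commission rate, agent names, and amounts are invented for illustration):

```python
from collections import defaultdict

# Hypothetical monthly sales records: (month, agent, sale_value).
sales = [
    ("Jan", "asha", 1200), ("Jan", "ravi", 900),
    ("Feb", "asha", 1500), ("Feb", "ravi", 1100),
    ("Mar", "asha", 1000), ("Mar", "ravi", 1300),
]

# Example 1: what was the total sales value over the period?
total_sales = sum(value for _, _, value in sales)

# Example 2: what commission (assumed 5%) did each agent earn per month?
commission = defaultdict(float)
for month, agent, value in sales:
    commission[(month, agent)] += 0.05 * value
```

Static reports of this kind only describe what already happened; no model of the future is involved.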
2. BDA - Diagnostic Analytics
• It aims to determine the cause of a phenomenon
that occurred in the past, using questions that focus on
the reason behind the event.
• Example 1
• Why were Q2 sales lower than Q1 sales?
• Example 2
• Why has there been an increase in patient re-admission rates over the past
3 months?
• Drill-down, roll-up analysis
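Roll-up and drill-down amount to grouping the same data at two levels of detail. A sketch with invented transaction data: the roll-up shows *that* Q2 fell, the drill-down suggests *where*:

```python
from collections import Counter

# Hypothetical per-transaction sales with a region attribute, Q1 vs Q2.
transactions = [
    ("Q1", "north", 500), ("Q1", "south", 700), ("Q1", "north", 600),
    ("Q2", "north", 450), ("Q2", "south", 300), ("Q2", "north", 500),
]

# Roll-up: total per quarter confirms that Q2 < Q1.
by_quarter = Counter()
for quarter, region, value in transactions:
    by_quarter[quarter] += value

# Drill-down: totals per (quarter, region) localize the drop (here: south).
by_region = Counter()
for quarter, region, value in transactions:
    by_region[(quarter, region)] += value
```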
3. BDA - Predictive Analytics
• An attempt to determine the outcome of an event that might
occur in the future.
• The strengths of past associations form the basis of models that
are used to generate future predictions based upon past
events.
• Models used for PA have implicit dependencies on the
conditions under which the past events occurred.
• If these underlying conditions change, then the models that
make predictions need to be updated.
• Example 1 – If a customer has purchased products A and B,
what are the chances that they will also purchase C?
• Example 2 - Who is likely to cancel the product that was
ordered through e-commerce portal?
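Example 1 can be approximated by estimating a conditional probability from past baskets. This is only a stand-in for a real predictive model (association-rule mining or a classifier), and the basket data are invented:

```python
# Hypothetical purchase baskets; the "model" is the empirical
# conditional probability P(buys C | bought A and B).
baskets = [
    {"A", "B", "C"}, {"A", "B"}, {"A", "B", "C"},
    {"A", "C"}, {"B", "C"}, {"A", "B", "C"},
]

def p_buys(target, given, history):
    """Estimate P(target in basket | given is a subset of basket)."""
    matching = [b for b in history if given <= b]
    if not matching:
        return 0.0
    return sum(target in b for b in matching) / len(matching)

chance_of_c = p_buys("C", {"A", "B"}, baskets)
```

Note the slide's caveat applies directly: if purchasing behaviour shifts, the history this estimate depends on goes stale and the "model" must be refreshed.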
4. BDA - Prescriptive Analytics
• Prescriptive Analytics builds upon the results of predictive
analytics by prescribing actions that should be taken.
• Provides results that can be reasoned about because they
embed elements of situational understanding.
• This kind of analytics can be used to gain an advantage or
mitigate a risk.
• Example 1 – Among these drugs which one provides the
best results?
• Example 2 – When is the best time to trade a particular
stock?
• This approach shifts from explanatory to advisory and can
include the simulation of various scenarios
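The "simulation of various scenarios" can be sketched as Monte Carlo: simulate each candidate action under an assumed model of the situation and prescribe the best one. The drug names and success rates are invented; real prescriptive systems use far richer situational models:

```python
import random

# Hypothetical: choose which of three drugs to prescribe by simulating
# trial outcomes under assumed per-drug success rates.
assumed_success_rate = {"drug_x": 0.55, "drug_y": 0.70, "drug_z": 0.40}

def best_action(rates, n_sims=10_000, seed=42):
    """Simulate n_sims patients per option; prescribe the option with the
    highest simulated success count (seeded, so the run is repeatable)."""
    rng = random.Random(seed)
    scores = {
        drug: sum(rng.random() < p for _ in range(n_sims))
        for drug, p in rates.items()
    }
    return max(scores, key=scores.get)
```

This is the shift from explanatory to advisory: the output is an action, not a description or a forecast.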
Big Data Ecosystem

Big Data in Management
Unit II
Content
• 1. Origins of Big Data Analytics
• 2. Different Types of Data Sources
• 3. Skill Sets Needed
• 4. How to integrate Big Data into a corporate culture
• 5. Public and Private sources of data
• 6. Storage, Processing Power, Platforms
• 7. Security, Compliance and Auditing
• 8. Short - term and Long-term Changes
• 9. Best Practices for data analysis
• 10. Data Pipeline, Value creation
Big Data – Introduction (1)
• Data
• Can be anything from “something recorded” to “everything under the sun”

• Recording and preserving that data has always been the challenge, and
technology has limited ability to capture and preserve data

• Compliance Regulations, Backup Strategy


Big Data – Introduction (2)
• Big data is often described as extremely large data sets that have grown beyond the
ability to manage and analyse them with traditional data processing tools.

• The primary difficulties are the acquisition, storage, searching, sharing, analytics, and visualization of data
Big Data – Introduction (3)
• The Arrival of Analytics
• Researchers started to incorporate related data sets, unstructured
data, archival data, and real-time data into the process, which in turn
gave birth to what we now call Big Data.
Uses (indicative list) of Big Data in
different areas
• National Oceanic and Atmospheric Administration
(NOAA) – uses Big Data approaches to aid in climate,
ecosystem, and weather research
• NASA – uses Big Data for aeronautical research
• Pharmaceutical companies – use BD for drug testing
• Energy companies – use BD for geophysical analysis
• Media – uses BD for text analysis and web mining
• Walt Disney – uses BD to study customer behaviour
across all its theme parks
Obstacles ???
• Purity of the data, Analytical Knowledge, Understanding of
Statistics and Others

• Gathering the data is usually half the battle in the analytics game.


• Data Sources are from multimedia, social media, instant messaging,
e-mail, POS, etc.,
• Bio-Informatics
- Individual human genome sequencing – $1,000
- Personalized medicine
Data and Data Analytics are Getting more
Complex
• Behavioural Analytics is a process that determines patterns of
behaviour from human-to-human and human-to-system
interaction data.

• The behavioural pattern can provide insight into which series of
actions led to an event (e.g., a customer buying or switching a product)
Few concepts in BDM
• Like the 5 Vs, Big Data also has 4 Ms:
Make Me More Money (or)
Make Me More Efficient
BD Business Model Maturity Index
• The index provides a roadmap to help organizations accelerate the
integration of data and analytics into their business models
• 5 Phases
• 1. Business Monitoring – DWH, BI; also called Business
Performance Management
• 2. Business Insights – coupling internal data (e.g., consumer comments,
e-mail conversations, technician notes) with external data (social media,
weather, traffic, data.gov)
• 3. Business Optimization – prescriptive analytics to optimize key business
processes (e.g., recommendations, scores, rules); it provides
opportunities for organizations to push analytics insights to their
customers in order to influence customer behaviour
• 4. Data Monetization – organizations seek to create new sources of
revenue by selling data or insights into new markets
• 5. Business Metamorphosis – a major shift in the organization's core
business model (e.g., from selling products to selling "Business as a Service")
Business Analyst vs Data Scientist
Area           BI Analyst                    Big Data Scientist
Focus          Reports, KPIs, trends         Patterns, correlations, models
Process        Static, comparative           Exploratory, experimentation
Data Sources   Pre-planned, added slowly     On the fly, as needed
Transform      Up front, carefully planned   In-database, on-demand, enrichment
Data Quality   Single version of truth       "Good enough", probabilities
Data Model     Schema on load                Schema on query
Analysis       Retrospective, descriptive    Predictive, prescriptive
Data Science Algorithms
• 1. Fundamental Exploratory Analytics algorithms
• Trend Analysis, Boxplots, Geography (Spatial) Analysis, Pairs plot,
Time series decomposition
• 2. Advanced Analytics Algorithms
• Cluster Analysis, Normal Curve equivalent analysis, Assumption
Analysis, Graph Analysis, Text Mining,
• Sentiment Analysis, Traverse pattern analysis, Decision Tree
Big Data Sources
Step 1 : Identify the Usability - CONSIDERATIONS
• Structure of the data
- structured, unstructured, semi-structured, table based, proprietary
• Source of the data
- internal, external, private, public
• Value of the data
- generic, unique, specialized
• Quality of the data
- verified, static, streaming
• Storage of the data
- remotely accessed, shared, dedicated platforms, portable
• Relationship of the data
- superset, subset, correlated
Step 2: Import the data into an appropriate platform
• Data have to be transformed into something accessible, queryable, and relatable.
What’s in the Big Data Toolkit?
Statistical Methods | Tools | User Experience Research

Statistical methods include: sentiment analysis, time series analysis, data mining, multilevel modeling, missing-data imputation, classification and clustering, survival analysis, pattern recognition, principal component and factor analysis, A/B testing, machine learning, forecasting, propensity score matching, logistic/multinomial/multiple linear regression techniques, and network analysis.
What’s in the Big Data Toolkit?
Statistical Methods | Tools | User Experience Research

Languages: Python, R, SQL, Javascript, NodeJS, + many others
Libraries: SciPy, Pandas, Scikit-learn, GPText, OpenNLP, Mahout
Data Engineering: profiling, ETL, job notices, APIs, optimized data pipelines, optimized data storage/access
Visualization: D3.js, Gephi, R, Leaflet, PowerBI, ggplot2, shiny
• Google Data Centre -
https://www.youtube.com/watch?v=XZmGGAbHqa0
UNIT II
Module II- Building the Big Data Team
UNIT II - Concepts
• Big Data Nuances
• Big Data Matters ?
• Data Analysis getting Complex
• Twitter – large producer of unstructured data
• Big Data Team, Team Challenges & Analytics Team Functions
• 5 critical skills for Big Data Team
• SQL vs NOSQL
Big Data Nuances
• Dealing with unstructured and semi structured data
• XML databases for storage and access
• Tools and Technologies continue to evolve
• Provide results within an acceptable time frame
• Storage & Protection
Big Data Matters ?
• Yes, Big Data analytics has proven its value.

• The results are real and measurable

• Servicing the customers better

• Examples – Amazon, Facebook, Google, LinkedIn


Can Big Data benefit SMBs?
• “Pay as you go, consume what you need”
• Computing power, Storage & Platform for Analytics
• Have access to scores of publicly available data
• Product review analysis
• Obstacles for SMBs are purity of data, analytical skills
including statistical knowledge
More data …. More value
• Exponential growth in data
• Analyzing large data sets
• Innovation, finding its place in the market, Identify the
next world trend
• Leverage data in an intelligent way
• To build accurate models, we need large volumes of
data from behavioral interactions.
Data Analysis getting Complex
• In earlier days, people thought that the bulk of the value
resided in analyzing structured data and that little value
could be obtained from analyzing unstructured data

• The perception changed with clickstream analysis and search engine predictions

• Unstructured data brings complexity to analytics


Twitter – large producer of unstructured
data
• The success of Twitter depends on how well it can
leverage the data generated by its users

• More than 774 million accounts across 200 countries

• Storm – software for parsing live data streams


• Performs real time analytics
• Identify trends and emerging topics
• how widely web addresses are shared by Twitter users
• http://storm.apache.org/
Big Data Team
• Determining the various skills needed for the team
• The Data Scientist
• Expert in analyzing data
• Helps business gain a competitive advantage
• Team leader for the project
• Skills needed
• Must possess a combination of analytic, machine learning, data
mining, and statistical skills
• Experience with algorithms and coding
• Ability to translate results in a way easily understood by others
Big Data Team Challenge
• Finding and hiring talented workers with analytical
skills
• Organizing the team – members from IT and BI team
• Placing Big Data team under IT Department?
• Depends on the domain on which the analytics team
is working
• customer churn – marketing department
• Financial risk forecasting – finance department
Analytics Team Functions
• Build the team around core competencies
• The tasks of the team can be broken down into
1. Locate the data,
2. Normalize the data
3. Analyze the data
• Most important chore
– analyzing
• Design and select best suited algorithms
• Access data
• Gathers results
• Present the information
5 critical skills for Big Data Team
1. Data mining
2. Data visualization
3. Data analysis
4. Data manipulation
5. Data discovery
A Typical data warehouse environment
Sources: ERP, CRM, Legacy, Third-Party Apps
  → Data Warehouse (OLAP) →
Consumers: Reporting / Dashboarding, Ad-hoc Querying, Modelling


Big Data Sources

Sources of Big Data: data storage, archives, media, sensor data, docs, machine log data, business apps, the public web, social media


BD Environment - Hadoop
EcoSystem



BD Environment Why Hadoop ?
• Its capability to handle massive amounts of data, in
different categories – fairly quickly.
• Hadoop Components

Hadoop Ecosystem
• Higher-level projects: FLUME, OOZIE, MAHOUT, …
• Data access: HIVE, PIG, SQOOP, HBASE

Core Components
• MapReduce (programming model)
• HDFS (storage)
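The MapReduce programming model can be imitated in a few lines of single-process Python — map, then shuffle (group by key), then reduce — leaving out everything Hadoop actually adds (distribution across a cluster, partitioning, sorting, replication, fault tolerance):

```python
from collections import defaultdict
from itertools import chain

# Toy word count in MapReduce style, all in one process.

def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reducer: sum the counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["big data big value", "big team"]
word_counts = reduce_phase(shuffle(chain.from_iterable(map(map_phase, lines))))
```

Because each mapper works on one line independently and each reducer on one key independently, the same logic parallelizes across a cluster — which is the point of the model.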


Hadoop 1.0 VS 2.0
Hadoop 1.0:
• MapReduce (cluster resource management & data processing)
• HDFS (redundant, reliable storage)

Hadoop 2.0:
• MapReduce (data processing) and other engines (data processing)
• YARN (cluster resource management)
• HDFS (redundant, reliable storage)



7. SQL vs NoSQL
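One way to see the contrast: the same lookup in a relational store (schema declared up front, queried with SQL) versus the simplest NoSQL model, a key-value store (schema-less values fetched by key). The table, keys, and records below are illustrative:

```python
import sqlite3

# --- SQL: fixed schema, declarative query language ---
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'asha'), (2, 'ravi')")
sql_name = conn.execute(
    "SELECT name FROM users WHERE id = ?", (1,)
).fetchone()[0]

# --- Key-value (NoSQL): schema-less, each value can have any shape ---
kv_store = {
    "user:1": {"name": "asha", "tags": ["admin"]},  # extra field, no ALTER TABLE
    "user:2": {"name": "ravi"},
}
kv_name = kv_store["user:1"]["name"]
```

The trade-off sketched here is the one the slide's comparison is about: SQL buys integrity constraints and rich ad-hoc queries; key-value buys flexible, per-record structure and easy horizontal scaling.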



TSM Storage
• Backup storage : QNAP
• Mail Storage : MS Office 365
• Video Storage : CCTV 94 Cameras – 21 days 24x7
• Log Storage : Forti Analyzer

Storage Policy & IT DATA Policy


TSM Level – Capacity, Duration, Retrieval
Country Level – Capacity, Duration, Retrieval
IBM Big Data
• https://www.youtube.com/watch?v=u5jWC89xBzI
UNIT III
Session 12 - 14
Session 12 - Big Data
Sources
Content
• 1. Identify usable data
• 2. Hunting for Data
• 3. Setting the Goal
• 4. Big Data Resources – Various Industries
• 5. Big Data Sources
• 6. BD Environment – Hadoop, NoSQL, HDFS
1. Identify usable data
• Selection process considerations
• Structure of the data (structured, unstructured, semi
structured, table based)
• Source of the data(internal, external, private, public)
• Value of the data (generic, unique, specialized)
• Quality of the data (verified, static, streaming)
• Storage of the data(remotely accessed, shared, dedicated
platforms, portable)
• Relationship of the data(superset, subset, correlated)
2. Hunting for Data
• First determine the questions to be answered
• Market trends?
• Predict web traffic?
• Gauge customer satisfaction?
3. Setting the Goal (1)
• Goal setting and listing out the objectives
• Example Goal - increase sales
• Spans several departments such as Marketing, pricing,
customer relations, advertising
• Start with internal structured data
• Sales logs, inventory movement, registered transactions,
customer information, pricing, supplier interactions
• Next unstructured data
• Call center and support logs, customer feedback, emails,
surveys, data gathered by sensors
3. Setting the Goal (2)
• external data
• Customer sentiments, geopolitical issues, government
entities, research companies, social networking sites…
• News archives, traffic pattern information, weather
information,
• Richness of data is important for prediction
4. Big Data Resources – Various Industries (1)
• Transportation, logistics, retail, utilities and telecommunications
• Sensor data from GPS transceivers, RFID, cell phones
• these data are used to optimize operations
• drive operational BI to realize immediate business opportunities

• Health Care
• Electronic medical records and images
• Public health monitoring
• epidemiological research programs

• Government
• Digitizing public records like census information, energy usage, budgets,
Freedom of Information Act documents, electoral data, law
enforcement reporting
4. Big Data Resources – Various Industries (2)
• Entertainment media
• Digital recording, production and delivery
• Collecting rich content and user viewing behaviors

• Life Sciences
• Low-cost gene sequencing (less than $1,000) can generate
tens of terabytes of information that is analyzed for genetic
variations and potential treatment effectiveness.
4. Big Data Resources – Various Industries (3)
• Video Surveillance
• Recording systems that organizations want to analyze for
behavioral patterns, security and service enhancement

• Social Media sites


• Likes, dislikes, opinions, locations, comments (positive,
negative, neutral),
5. Big Data Sources (1)
• Financial transactions
• Global trading environments
• Trading occurring more frequently and at faster levels
because of competition among companies
• Smart instrumentation
• Use of smart meters in energy grid systems
• Shifts meter reading from monthly to every 15 minutes
• Can be used as an indicator of household size at any given
moment
5. Big Data Sources (2)
• Mobile telephony
• Data generated from PDAs, smartphones
• Geographic location, caller, receiver, call duration, text
messages, browsing history, social network posts
• Mining data from the Web
• Data Gathering Tools
• Extractiv, Mozenda, Google Refine, 80Legs
• Tools for Analysis
• Grep, Turk, BigSheets
• Visualization
• Tableau Public, OpenHeatMap, Gephi
6. BD Environment - A Typical Hadoop
environment
Sources: ERP, CRM, Legacy, Third-Party Apps
  → Hadoop (MapReduce) →
Consumers: Reporting / Dashboarding, Ad-hoc Querying, Modeling



Session 13
Big Data Environment
1. Schema less databases
2. Yarn
3. Why Hadoop ? & MongoDB ?
6. BD Environment - NoSQL
• NoSQL Usage
• Log Analysis
• Social Networking feeds
• Time Based Data (Not easily analysed in a RDBMS)

NoSQL

• Key/Value (the big hash table): Amazon S3 (Dynamo), Scalaris
• Schema-less: Cassandra (column-based), CouchDB (document-based), Neo4j (graph-based), HBase (column-based)



6. BD Environment - Popular Schema Less
Databases
Key-Value Store   Column-Oriented Store   Document Store   Graph Data Store
Riak              Cassandra               MongoDB          InfiniteGraph
Redis             HBase                   CouchDB          Neo4j
Membase           HyperTable              RavenDB          AllegroGraph
Use of NoSQL in Industry
• Shopping carts, web user data analysis (Amazon, LinkedIn)
• Analysing huge volumes of web user actions (Facebook, Twitter, eBay, Netflix)
• Real-time analytics, sensor feeds, logging, document archive management
• Network modelling, recommendation, upsell/cross-sell (Walmart)



6. BD Environment - YARN
• Yet Another Resource Negotiator
- any application capable of dividing itself into parallel tasks is
supported by YARN.



Why MongoDB?
• Cross Platform, Open Source, Non-Relational, Distributed
NoSQL, Document Oriented Data Store, JSON, BSON
and CRUD
MongoDB features: auto-sharding, full index support, document-oriented, high performance, rich query language, easy scalability, fast in-place updates, replication



Session 14
The Nuts & Bolts of BIG DATA
The Nuts & Bolts of BIG DATA
• The Storage Dilemma
- Capacity, Security, Latency, Access, Flexibility, Persistence, Cost, Building a Platform, Bringing Structure to Unstructured Data, Processing Power, Choosing Among In-house, Outsourced, or Hybrid Approaches
The Storage Dilemma

• Storage capacity
- is inexpensive and constantly dropping in price
- Businesses have been compelled to save more data
- Businesses have access to more types of data, such as internet interactions, social networking activity, automated sensors, mobile devices, VoIP, and video elements
• Capacity
- The Clustered architecture of scale-out storage solutions
features nodes of storage capacity with embedded processing
power and connectivity that can grow seamlessly
• Security
- Many types of data carry security standards that are driven by
compliance laws and regulations.
- The data may be financial, medical, or government
intelligence and may be part of an analytics set yet still be protected
• Latency
- Virtualization of server resources, used to expand compute
resources without the purchase of new hardware.
- Solid-state storage devices, which can be implemented as
server-based cache to all-flash-based scalable storage systems.
• Access
- As data sets increase, more people are in the data-sharing
loop.
- Global file systems allow multiple users on multiple hosts
to access files from many different back-end storage systems
in multiple locations.
• Flexibility
- need to account for data migration challenges
• Persistence
- BD applications often involve regulatory compliance
requirements, which dictate that data must be saved for
years or decades
• Cost
- more efficiency with less expensive components
- Storage deduplication – the ability to reduce capacity
consumption even by a few percent provides a significant ROI as
data sets grow
- Thin provisioning – allocates only the minimum space required by
each user at any given time
- Snapshots – streamline access to stored data and can speed up
data recovery; two types: copy-on-write (low capacity) snapshots
and split-mirror snapshots
- Disk cloning – copying the contents of a hard drive (e.g., to an
image file)
- Magnetic tape – the most economical storage medium
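Storage deduplication can be sketched as content-addressed chunk storage: identical chunks are kept once, keyed by their hash, and files keep only lists of chunk digests. The tiny chunk size and data are illustrative; real systems use much larger, often variable-size chunks:

```python
import hashlib

class DedupStore:
    """Toy deduplicating store: each unique chunk is stored exactly once."""

    def __init__(self):
        self.chunks = {}   # SHA-256 digest -> chunk bytes (stored once)
        self.files = {}    # filename -> ordered list of chunk digests

    def put(self, name, data, chunk_size=4):
        digests = []
        for i in range(0, len(data), chunk_size):
            chunk = data[i:i + chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)  # no-op if already stored
            digests.append(digest)
        self.files[name] = digests

    def get(self, name):
        """Reassemble a file from its chunk digests."""
        return b"".join(self.chunks[d] for d in self.files[name])

store = DedupStore()
store.put("a.log", b"AAAABBBBAAAA")  # the "AAAA" chunk repeats
store.put("b.log", b"AAAACCCC")      # shares "AAAA" with a.log
```

Five logical chunks are written, but only three unique chunks are physically stored — the capacity saving the slide's ROI point refers to.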
UNIT IV
BDM – Security, Auditing, Compliances
Best Practices for BDA
Unit IV
• Security, Compliance, Auditing, and Protection; Pragmatic
Steps to Securing Big Data; Classifying Data; Protecting Big
Data Analytics; Big Data and Compliance; The Intellectual
Property Challenge;

• The Evolution of Big Data; Big Data: The Modern Era; Today,
Tomorrow, and the Next Day; Changing Algorithms; Best
Practices for Big Data Analytics; Start Small with Big Data;
Thinking Big; Avoiding Worst Practices; Baby Steps; The
Value of Anomalies; Expediency versus Accuracy. In-
Memory Processing

Security, Compliance,
Auditing and Protection
Chapter 7



How can the data be protected ?
• Proper security entails more than just keeping the bad guys out; it
also means backing up and protecting data from corruption

• Access – who, what, when and where of data access.


• Availability – controlling where the data are stored
• Performance – higher levels of encryption, complex security
methodologies
• Liability – sensitivity of data, legal requirements connected to the
data, privacy issues and Intellectual Property concerns.



Pragmatic steps to securing BD

• Get rid of data that are no longer needed

• Activity logs – prevent the logs from being exposed, the best method
may be to delete them after their usefulness ends.



Protecting BDA
• Data from devices that monitor physical elements (e.g. traffic,
movement, soil pH, rain, wind) on a frequent schedule, surveillance
cameras, or any other type of data that are accumulated frequently
and in real time.

• All of the data are unique to the moment, and if they are lost, they are
impossible to recreate.

• De-duplication offers little benefit for such unique, moment-in-time data – which greatly increases the capacity requirements of backup subsystems and slows down security scanning
Big Data and Compliance



Big Data and Compliance (1)
• New data types and methodologies are still expected to meet the
legislative requirements placed on businesses by compliance laws.

• There will be no excuses accepted and no passes given if a new
data methodology breaks the law.

• Payment Card Industry (PCI) and personal health information compliance; HIPAA (Health Insurance Portability and Accountability Act)



Big Data and Compliance (2)
• Companies may also move BD to the cloud for disaster recovery,
replication, load balancing, storage, and other purposes.

• Perimeter based security – Firewalls, Data Level Security

• 4 Goals
• Control Access by process, not job function
• Secure the data at rest
• Protect the cryptographic keys and store them separately from the data
• Create trusted applications and stacks to protect data from rogue users.

• Social media applications that are collecting tons of unregulated yet
potentially sensitive data may not yet be a compliance concern.



Intellectual Property



The Intellectual Property Challenge (1)
• IP refers to creations of the human mind, such as inventions,
literary and artistic works, and symbols, names, images, and designs
used in commerce.

• With BD consolidating all sorts of private, public, corporate, and
government data into a large data store, there are bound to be pieces
of IP in the mix: from simple elements, such as photographs, to more
complex elements, such as patent applications or engineering
diagrams.



The Intellectual Property Challenge (2)
• Basic Rules
• Understand what IP is and know what you have to
protect.
• Prioritize protection
• Label – confidential information should be labelled
appropriately.
• Lock it up – rooms that store sensitive data should be
locked.
• Educate Employees
• Know your tools
• Use a holistic approach
• Use a counter-intelligence mind set



Big Data Security
• https://www.youtube.com/watch?v=8B2r4J7OPqc

• https://www.youtube.com/watch?v=2Gc9wj56ibc



The Evolution of Big Data
Chapter 8



The Evolution of Big Data
• 1880 US census
• First BD platform
• Hollerith Tabulating Machine – punched cards having 80 variables
• Analysis took six weeks instead of the previous seven years.

• The Manhattan Project


• Big Science
• 1950 - US space program
• 1960 – International Geophysical Year & Long term Ecological Research
Network.
• Lessons learned from Big Science Data Projects – Weather prediction,
physics research (supercollider data analytics), astronomy images (planet
detection), medical research (drug interaction), and others.



Big Data: The Modern Era
• Big Science, Big Data, Big Business
• Discover new opportunities
• Measure efficiencies, or
• Uncover relationships among what was thought to be
unrelated data sets.
• New approaches to education and workforce development
• Nurturance of new types of collaborations – multi-disciplinary
teams and communities enabled by new data-access policies
to make advances in the grand challenges of today's
computation- and data-intensive world.



Today, Tomorrow, and the Next Day
• UPS, T-Mobile, Google, Oracle, HP, SAP, Watson, IBM
• SAP’s HANA, Hadoop, Moore’s Law, GUI
• Human VS Machine
• Big Medicine, Customized Medicine
• WOM, e-WOM
• Machine Learning & Deep Learning Algorithms
• Self-Learning Algorithms



Best Practices of BDA
Chapter 9



Avoiding Worst Practices
• Thinking “If we build it, they will come”
• Assuming that the software will have all of the answers
• Not understanding that you need to think differently
• Forgetting all of the lessons of the past
• Not having the requisite business and analytical expertise
• Treating the project like a science experiment
• Promising and trying to do too much

Best & Baby Steps
• Decide what data to include and what to leave out
• Build effective business rules and then work through the complexity
they create
• Translate business rules into relevant analytics in a collaborative
fashion
• Have a maintenance plan
• Keep your users in mind – all of them
• Anomaly Detection
• Expediency vs. Accuracy
In-Memory Processing (1)
• The latency of disk-based clusters and WAN connections makes it difficult to obtain instantaneous results from Business Intelligence solutions.

• International Data Corporation (IDC)

• Using disk-based processing meant that complex calculations involving multiple data sets, or algorithmic search processing, could not happen in real time.
In-Memory Processing (2)
• Today's businesses demand faster results that enable quicker decisions, along with tools that help organizations access, analyse, govern and share information – all of which adds value to Big Data.

• The use of in-memory technology brings that expediency to analytics, ultimately increasing the value.

• Enterprises are able to shift from after-event analysis (reactive) to real-time decision making (proactive), and then create business models that are predictive rather than response-based.

Reference Book insights

BDM –
More insights from Reference Books(1)
• PI, KPI, CSF – Critical Success Factor

• BPMS – Business Process Management Systems

• Patient
  - Doctors
  - Specialists
  - Diagnosis
  - Insurance Premium
BDM
adoption and planning considerations (1)
• Organization prerequisites
- outdated, invalid or poorly identified data will
result in low quality results
• Data procurement
- external data sources (Govt data source,
commercial data markets)
• Privacy and Security
• Provenance
- used for auditing purposes

BDM
adoption and planning considerations (2)
• Distinct Performance Challenge – large datasets with complex search algorithms can lead to long query times.
• Network Bandwidth (assuming 100% throughput)
  Source, at t0 – one petabyte of data
  Network – 1 Gbps (gigabit per second)
  Destination, at t1 – the same petabyte of data
  1 PB = 10^3 TB = 10^6 GB = 10^9 MB = 10^12 KB = 10^15 bytes = 8 × 10^15 bits
  1 Gbps = 10^9 bps
  t1 – t0 = 8 × 10^15 bits ÷ 10^9 bps = 8 × 10^6 seconds
          = 8 × 10^6 / 3600 ≈ 2222 hours ≈ 93 days
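The arithmetic above can be checked in a few lines of Python (decimal units and a hypothetical, perfectly utilised link assumed, as on the slide):

```python
# Sketch: how long moving 1 PB over a 1 Gbps link takes at 100% throughput.
# Decimal units assumed (1 PB = 10**15 bytes), matching the slide's arithmetic.

def transfer_time_hours(size_bytes: float, link_bps: float) -> float:
    """Hours needed to push size_bytes through a link of link_bps."""
    bits = size_bytes * 8
    seconds = bits / link_bps
    return seconds / 3600

PETABYTE = 10**15   # bytes
GIGABIT = 10**9     # bits per second

hours = transfer_time_hours(PETABYTE, GIGABIT)
print(f"{hours:.0f} hours ≈ {hours / 24:.1f} days")   # → 2222 hours ≈ 92.6 days
```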
BD Ethics
• BD Ethical issues
1. Value Destruction
2. Diminished rights
3. Disrespectfulness
4. Unempathetic Behaviour
5. Intentional Discrimination

Assignment 1
Take any one website: Accuweather.com or worldwidewebsize.com

Answer the following questions


1. Is it a Big Data-relevant site? How?
2. What are the business benefits (any three)?
3. What are the benefits to society (any three)?

Note - Maximum of three paras and one or two pages

Technology Evolution | 100,000 BC - 2020
• https://www.youtube.com/watch?v=IJM3yuIDDPQ

UNIT V
Bringing it all Together
Bring it all Together
• The path to Big Data
• Primary data management principles – including physical and logical data independence, declarative querying (SQL, PL/SQL) and cost-based optimization – have created a multibillion-dollar industry that has added value to collected data.

The realities of Thinking Big Data
• Today organizations and individuals are awash in a flood of data. Applications
and computer-based tools are collecting information on an unprecedented
scale.

• GOOGLE SERVICES

Demo NLP on Big Data
• NLP Tools – mettl.com

• Language Translator

The Big Data Pipeline in Depth
• Gathering data is akin to sensing and observing the world
around us, from the heart rate of a hospital patient to the
contents of an air sample to the number of web page
queries to scientific experiments that can easily produce
petabytes of data.
• Challenges
• Filtering, Meta Data, Extracting, Cleaning
• Honesty – those who are reporting the data may choose to hide
or falsify information
• Example 1 – Patients may choose to hide risky behaviour
• Example 2 – Potential borrowers filling out loan applications may inflate
income or hide expenses.
• BD is often noisy, dynamic, heterogeneous, interrelated and
untrustworthy

BD Privacy
• Data privacy is a huge concern. For electronic health records there are strict laws governing what can and cannot be done.

• There is a great public fear about the inappropriate use of personal data, particularly
through the linking of data from multiple sources.

• Location Based Services – where is my office, where is my residence

• Health Issues, Religious preferences

• Differential Privacy
BDM Technical Concepts
Map Reduce, Oozie, Hive, PIG, Cassandra, SPARK

Big Data – Technical Concepts(1)
• Four Ms of Big Data - Make me more money & Make
me more efficient
• Data Lake – allows organisations to store, manage and analyse massive amounts of data at a cost that can be 20 to 50 times cheaper than traditional data warehousing
• This is possible because of the agile underlying Hadoop/HDFS architecture that typically supports the data lake

Big Data – Technical Concepts(2)
• Map Reduce Overview
• A parallel programming framework for speeding up large-scale data processing for certain types of tasks
• It achieves this with minimal movement of data across the distributed file systems of Hadoop clusters, delivering near-real-time results
• MapReduce speeds up computation by reading and processing small chunks of a file on different computers in parallel (Map)
• The large number of partial results must then be combined to produce a composite result (Reduce)
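The Map → shuffle → Reduce flow described above can be sketched on a single machine in a few lines of Python (a toy illustration of the idea, not the Hadoop API):

```python
# A toy, single-process sketch of MapReduce (illustration only, not Hadoop):
# map each chunk independently, shuffle pairs by key, then reduce per key.
from collections import defaultdict
from itertools import chain

def map_phase(chunk: str):
    # Map: emit (word, 1) pairs for one small chunk of the input file.
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    # Shuffle: group every mapper's output values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: combine the partial counts into one composite result.
    return {key: sum(values) for key, values in groups.items()}

chunks = ["big data big science", "big business"]   # stand-ins for file blocks
mapped = chain.from_iterable(map_phase(c) for c in chunks)
counts = reduce_phase(shuffle(mapped))
print(counts)   # → {'big': 3, 'data': 1, 'science': 1, 'business': 1}
```

In a real cluster each `map_phase` call would run on the worker node holding that block of the file, which is what keeps data movement minimal.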

Big Data – Technical
Concepts(3)
• Map Reduce Jobs Execution
• Data is represented as key–value pairs.
• The master node runs the "JobTracker", which tracks the progress of MapReduce jobs from beginning to completion.
• Each worker node hosts a fragment of the data.
• When there is more than one job in a MapReduce workflow, the jobs must be executed in the right order.
• That ordering is easy for a linear chain of jobs, but harder for a more complex directed acyclic graph (DAG) of jobs.

Big Data – Technical Concepts(4)
• Oozie
• Apache Oozie is a system for running workflows of
dependent jobs.
• Oozie consists of two main parts
• 1. a Workflow Engine – that stores and runs workflows composed
of different types of Hadoop jobs (MapReduce, Pig, Hive …)
• 2. a Coordinator engine – that runs workflow jobs based on
predefined schedules and data availability
• Oozie has been designed to scale, and it can manage the timely execution of thousands of workflows in a Hadoop cluster.

Big Data – Technical
Concepts(5)
• Hive
• A declarative SQL-like language for processing queries.
• Data analysts mainly use it on the server side for generating reports.
• It is best used for producing reports from structured data.

[Hive architecture: Command Line Interface / Web Interface → Driver (Compiler, Optimizer, Executor) → Hive Metadata]
Big Data – Technical Concepts(6)
• PIG
• A high-level scripting language/platform for data manipulation, used with Hadoop & MapReduce.

• Pig offers greater procedural control over data flows, and thus excels at solving problems such as ETL that require fine control over data flows.

• Pig is also used for ad-hoc processing and quick prototyping.

• Pig Latin – provides a rich set of data types, functions and operators for performing various operations on the data.
Big Data – Technical Concepts(7)
• CASSANDRA

• A more recent and popular scalable open-source non-relational database that offers continuous uptime, simplicity, and easy data distribution across multiple data centres and clouds.

• It is a hybrid between a key-value and a column-oriented database.

Big Data – Technical Concepts(8)
• SPARK
• An integrated, fast, in-memory, general-purpose engine for large-scale data processing. It is ideal for iterative and interactive processing tasks on large data sets and streams.

• It achieves 10–100× the performance of Hadoop MapReduce by operating on an in-memory data construct called the RDD (Resilient Distributed Dataset), which helps avoid the latencies involved in disk reads and writes.

• It offers built-in libraries for machine learning, graph processing, stream processing and SQL, delivering seamless, superfast data processing along with high programmer productivity.
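One reason RDDs can stay in memory is lazy evaluation: transformations only describe a pipeline, and nothing runs until an action asks for results. Python generators give a rough analogue of this behaviour (a toy sketch for intuition only, not the Spark API):

```python
# Illustration only (not the Spark API): generators mimic RDD laziness.
data = range(1, 6)

# "Transformations" build a pipeline description; nothing is computed yet.
squared = (x * x for x in data)
evens = (x for x in squared if x % 2 == 0)

# The "action" finally drives the whole pipeline in one pass, in memory.
result = list(evens)
print(result)   # → [4, 16]
```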
Data Ownership & Privacy
Europe - GDPR
• GDPR – General Data Protection Regulation

• The first comprehensive law of its kind on data protection and privacy.

• Data must be stored and retained within the geographical boundaries of the country where the data originated.
• Data owners (those whom the data is about) have the right to be forgotten, i.e. data about them must be deleted after a certain time if they so demand.
• There are constraints on what kinds of analysis can be performed on data and how the insights may be used.
• Data privacy violations are penalised significantly, with fines of up to 4% of the violating organization's global revenue.
Personal Data Protection Bill - India
• Personal Data Protection Bill

Google Services
• https://about.google/intl/en/products/?tab=rh

Big Data Applications - Recent
• Amazon- https://www.youtube.com/watch?v=S4RL6prqtGQ

• Big Data for Smarter Customer Experiences - https://www.youtube.com/watch?v=449twsMTrJI

• Big Data in Marketing- https://www.youtube.com/watch?v=XjmldAL9RQs

• Big Data in Finance - https://www.youtube.com/watch?v=HPvepzVTQgA

• CHIEF DATA OFFICER INTERVIEW


• https://www.youtube.com/watch?v=kk4b0lUJOmQ

• https://www.youtube.com/watch?v=UhXRT_QM_uE

Case Titles
• Volume I
1. Koho Financial Inc.: Facing a New Banking Era
2. 111 INC.: Envisioning the Future of Health Care
3. Podium Data
4. Verisk
5. DOW Chemical
6. Supply Chain Collaboration at JD.com
7. Bairong
8. THE YES
• Volume II
9. Predicting Consumer Tastes with Big Data at Gap (9-517-115)
10. Marcpoint: Strategizing with Big Data
11. Bossard AG: Enabling Industry 4.0 Logistics, Worldwide
12. UCB: Data is the New Drug
13. RBC: Social Network Analysis
14. Shanghai Pharmaceuticals: Seeking A Prescription For Digital Transformation
15. An Introductory Note on Big Data and Data Analytics for Accountants and Auditors
BDM Tool - WEKA
Example 1 & 2 – Decision Tree
Example 3 & 4 – Regression
Example 5 – Logistic Regression
WEKA DEMO
• Popular Data Mining & ML Tool
• 1. Pre-process
• 2. Classification
• 3. Clustering
• 4. Association Rules
• 5. Select Attributes
• 6. Visualization
Classification Algorithms
Simple Decision Tree
Example 1 – Admitted/Not Admitted
• Data set name : admission_data.csv
• Decision Tree Construction using WEKA
• Step 1 – Open File
• Step 2 – Classify
• Choose Trees
• Choose J48
• Right click on <result list file>
• Visualize Tree
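WEKA's J48 is an implementation of the C4.5 decision-tree learner, which chooses each split by information gain (the reduction in entropy). A minimal sketch of that criterion, using made-up admitted/not-admitted labels:

```python
# Sketch of the split criterion behind a C4.5-style tree such as J48.
# The labels below are made up for illustration.
from collections import Counter
from math import log2

def entropy(labels):
    # Shannon entropy of a class-label list, in bits.
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(parent, subsets):
    # Entropy of the parent minus the size-weighted entropy of the children.
    total = len(parent)
    weighted = sum(len(s) / total * entropy(s) for s in subsets)
    return entropy(parent) - weighted

labels = ["yes", "yes", "no", "no"]        # hypothetical admitted outcomes
split = [["yes", "yes"], ["no", "no"]]     # a perfect split on some attribute
print(information_gain(labels, split))     # → 1.0
```

The tree you visualize in step 2 is built by repeatedly taking the attribute with the highest such gain.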
Example 1 – Output
Example 2 - Phishing
• Datasets
• Word Document – Phishing website features
• ARFF – Training dataset
• Classify
• Tree
• J48
• Dependent Variable – (NOM) Result
Example 2 – Output
Example 3 – Regression - TV Rating
• CTRP
• Promotion Amount
• Revenue

• Dependent & Independent Variable ???


Example 3 - Output
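For a single predictor, the regression WEKA fits has a closed form: the least-squares slope and intercept. A sketch of that computation, with made-up promotion/rating numbers (purely illustrative, not the course dataset):

```python
# Ordinary least squares for y = a + b*x with one predictor.
# The promotion/rating numbers below are made up for illustration.

def fit_line(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Slope: covariance of x and y over the variance of x.
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x   # intercept
    return a, b

promo = [10, 20, 30, 40]          # hypothetical promotion amounts
rating = [2.0, 2.9, 4.1, 5.0]     # hypothetical CTRP ratings
a, b = fit_line(promo, rating)
print(f"rating ≈ {a:.2f} + {b:.3f} × promo")   # → rating ≈ 0.95 + 0.102 × promo
```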
Example 4 – Regression - Ecom
• Family members
• Income
• spent

• Dependent & Independent Variable ???


Example 4 - Output
Example 5 – Logistic Regression –
Admission Data
• admitted
• GRE
• GPA
• CollegeRank

• Dependent & Independent Variable ???


Example 5 - Output
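Logistic regression models the probability of the binary outcome (admitted / not admitted) by passing a linear combination of GRE, GPA and CollegeRank through the sigmoid function. A sketch with hypothetical coefficients (not values fitted to the course dataset):

```python
# Logistic regression scores a linear combination of the predictors through
# the sigmoid. All coefficients below are hypothetical, not fitted values.
from math import exp

def sigmoid(z: float) -> float:
    return 1 / (1 + exp(-z))

def admit_probability(gre, gpa, rank, coef=(-3.0, 0.002, 0.8, -0.5)):
    b0, b_gre, b_gpa, b_rank = coef   # (intercept, GRE, GPA, CollegeRank)
    return sigmoid(b0 + b_gre * gre + b_gpa * gpa + b_rank * rank)

p = admit_probability(gre=700, gpa=3.7, rank=2)
print(f"P(admitted) = {p:.2f}")   # → P(admitted) = 0.59
```

Classifying then just means thresholding this probability, typically at 0.5.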
BDM Tool – WEKA
K-Means clustering
Example 6 - Bank Data Region based Clustering
Example 7 – Air Traffic Passenger Statistics
K-Means Introduction (1)
K-Means Introduction (2)
K-Means Introduction (3)
K-Means Introduction (4)
K-Means Introduction (5)
K-Means Introduction (6)
K-Means Introduction (7)
K-Means Introduction (8)
K-Means Introduction (9)
Example 6 : K-Means Clustering
• http://facweb.cs.depaul.edu/mobasher/classes/ect584/weka/k-means.html

• The WEKA SimpleKMeans algorithm automatically handles a mixture of categorical and numerical attributes.

• The SimpleKMeans algorithm uses the Euclidean distance measure to compute distances between instances and clusters.

• Dataset : bank-data.csv
• Cluster
• Choose : SimpleKMeans / EM / HierarchicalClusterer
• Classes to clusters evaluation : Variable selection (NOM)
• Check the number of clusters and Incorrectly clustered instances
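Behind SimpleKMeans is the standard k-means loop: assign each instance to its nearest centroid by Euclidean distance, recompute the centroids, and repeat until nothing moves. A bare-bones, numeric-only sketch (WEKA itself also handles categorical attributes):

```python
# A bare-bones, numeric-only k-means loop on toy data:
# assign by Euclidean distance, recompute centroids, stop when stable.
from math import dist   # Euclidean distance, Python 3.8+

def kmeans(points, centroids, max_iter=100):
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for each point.
        labels = [min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for i in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                new_centroids.append(tuple(sum(c) / len(members)
                                           for c in zip(*members)))
            else:
                new_centroids.append(centroids[i])   # keep an empty cluster put
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids

pts = [(1, 1), (1.5, 2), (8, 8), (9, 9)]   # toy 2-D instances
labels, centers = kmeans(pts, [(0, 0), (10, 10)])
print(labels)   # → [0, 0, 1, 1]
```

"Classes to clusters evaluation" then compares these cluster labels against a chosen nominal attribute to count incorrectly clustered instances.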
Example 7 – Air Traffic Passenger
Statistics
• TRY IT
• Data set - Air Traffic Passenger Statistics.csv

• Activity Period
• Operating Airline
• Operating Airline IATA Code
• Published Airline
• Published Airline IATA Code
• GEO Summary
• GEO Region
• Activity Type Code
• Price Category Code
• Terminal
• Boarding Area
• Passenger Count
• Adjusted Activity Type Code
• Adjusted Passenger Count
• Year
• Month
KNIME
KNIME WORKBENCH
KNIME – Example 1 – Row Filter
Example 2 - Churn Prediction Model :
Training
• Data set : TELCO
• predict the customers who are going to quit the contract.
• Building a basic Model for Churn Prediction with KNIME
• -Churn Prediction - Training
Reader node, Number to String, Color Manager
• -Churn Prediction - Evaluation
Partitioning node, Decision Tree Learner, Decision Tree
Predictor, Scorer Node
KNIME - Churn Prediction Model
Churn Prediction : Deployment
• Deployment applies the trained model to new customers to estimate their chances of churn.
KNIME LDA Example 3 – Amazon Data
KNIME
Examples 4
Example 4 – Wine Quality
Wine Quality

• Observe the confusion matrix. What do you infer?

• What is the accuracy?
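Accuracy is the share of correctly classified instances: the diagonal of the confusion matrix divided by its total. A small hand computation with hypothetical counts (not the wine-quality results):

```python
# Hand computation of accuracy for a 2-class confusion matrix.
# The counts below are hypothetical, for illustration only.
matrix = [[50, 10],    # actual class 1: 50 predicted correctly, 10 missed
          [5, 100]]    # actual class 2: 5 false alarms, 100 correct

correct = sum(matrix[i][i] for i in range(len(matrix)))   # the diagonal
total = sum(sum(row) for row in matrix)
accuracy = correct / total
print(f"accuracy = {accuracy:.3f}")   # → accuracy = 0.909
```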
