What is big data?
•No standard definition!
•Wikipedia:
• Big data is a field that treats ways to analyze,
systematically extract information from, or
otherwise deal with data sets that are too large or
complex to be dealt with by traditional data-
processing application software.
•Amazon:
• Big data can be described in terms of data
management challenges that – due to increasing
volume, velocity and variety of data – cannot be
solved with traditional databases.
Big Data Definitions Have Evolved Rapidly
•3 V’s
• In a research report by Doug Laney in 2001
• Volume, Velocity and Variety
•4 V’s
• In Hadoop – big data tutorial, 2006
• Veracity
•5 V’s
• Around 2014
• Value
•7 V’s, 8 V’s, 10 V’s, 17 V’s, 42 V’s, …
Major Characteristics of Big Data
[Figure: Big Data surrounded by the seven V's: Volume, Variety, Veracity, Velocity, Variability, Value, Visibility]
Volume (Scale)
•Quantity of data being created from all
sources
•The fundamental characteristic of big data
Volume
Source: https://www.nodegraph.se/how-much-data-is-on-the-internet/
Volume – Why Challenging?
[Figure labels: Model, RAM, Disk, Data; Cost]
Variety (Diversity)
•Different Types
• Relational data (tables/transactions)
• Text data (books, reports)
• Semi-structured data (JSON, XML)
• Graph data (social network, RDF)
• Image/video data (Instagram, YouTube)
•Different sources
• Movie reviews from IMDb and Rotten Tomatoes
• Product reviews from different provider websites
• Personal information from different social apps
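As a small illustration of variety in data types, the sketch below reads the same kind of record from a relational-style row, a JSON document and an XML document, and brings all three into one common shape. The sample values, the column names and the helper functions are hypothetical and only serve the example.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical sample data: one movie review in three different formats.
COLUMNS = ("title", "rating")              # column names of a relational table
relational_row = ("Inception", 10)         # one row of that table
json_text = '{"title": "Inception", "rating": 9}'
xml_text = "<review><title>Inception</title><rating>8</rating></review>"


def from_row(row):
    # Relational data: pair each value with its column name.
    return dict(zip(COLUMNS, row))


def from_json(text):
    # Semi-structured data: parse JSON and pick the fields we need.
    obj = json.loads(text)
    return {"title": obj["title"], "rating": int(obj["rating"])}


def from_xml(text):
    # Semi-structured data: parse XML and read the child elements.
    root = ET.fromstring(text)
    return {"title": root.findtext("title"), "rating": int(root.findtext("rating"))}


# All three sources now share one representation.
records = [from_row(relational_row), from_json(json_text), from_xml(xml_text)]
```

Each format needs its own parser before the records can be compared, which is a small-scale version of the integration effort that variety causes.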
Variety
•A single application can generate or
collect multiple types of data
• Email
• Webpage
Variety - A Single View to the Customer
[Figure: a single view of our customer, combining social media, banking and finance, gaming, entertainment, known history and purchase data]
Variety – Why Challenging?
•Data integration
• Heterogeneous
• Traditional data integration relies on schema mapping;
its difficulty and time complexity are directly related to
the level of heterogeneity and to the data sources involved
• Record linkage across varied data
• Needs to identify whether two records refer to the same entity
• How can different types of data/information
from different sources be used together?
•Data curation
• Organization and integration of data collected
from various sources
• Long tail of data variety
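A minimal sketch of record linkage, assuming that two records are treated as the same entity when their normalized names are similar enough. The normalization rule, the `same_entity` helper and the 0.85 threshold are assumptions made for illustration, not a method given in the slides; real systems compare several fields and use more careful matching.

```python
from difflib import SequenceMatcher


def normalize(name):
    # Lowercase and drop punctuation so formatting differences do not matter.
    return "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace()).strip()


def similarity(a, b):
    # Ratio in [0, 1]; 1.0 means the normalized strings are identical.
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def same_entity(a, b, threshold=0.85):
    # Hypothetical decision rule: link the records if they are similar enough.
    return similarity(a, b) >= threshold


# Two spellings of one title from different websites, and an unrelated title.
linked = same_entity("The Godfather (1972)", "the godfather 1972")
typo_linked = same_entity("Jon Smith", "John Smith")
unrelated = same_entity("The Godfather", "Finding Nemo")
```

The threshold trades missed links against false links, which is one reason record linkage gets harder as the number of heterogeneous sources grows.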
The Long Tail of Data Variety and Data Curation
Source: Curry, E., & Freitas, A. (2014). Coping with the long tail of data variety.
Velocity (Speed)
•Data is generated fast, so it needs to be
• stored fast
• processed fast
• analysed fast
•Every second
• 8,991 Tweets sent
• 994 Instagram photos uploaded
• 4,683 Skype calls
• 93,508 GB of Internet traffic
• 83,165 Google searches
• 2,915,385 Emails sent
Source: http://www.internetlivestats.com/one-second/
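To make the per-second figures easier to grasp, the short calculation below scales them to one day. It only uses the numbers quoted on this slide; the conversion of traffic to petabytes assumes decimal units (1 PB = 1,000,000 GB).

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Per-second figures quoted on the slide.
per_second = {
    "tweets": 8_991,
    "instagram_photos": 994,
    "skype_calls": 4_683,
    "google_searches": 83_165,
    "emails": 2_915_385,
}

# Scale each rate from one second to one day.
per_day = {name: rate * SECONDS_PER_DAY for name, rate in per_second.items()}

# Internet traffic: 93,508 GB per second, converted to PB per day
# (assumption: decimal units, 1 PB = 1,000,000 GB).
traffic_pb_per_day = 93_508 * SECONDS_PER_DAY / 1_000_000

for name, count in per_day.items():
    print(f"{name}: {count:,} per day")
print(f"traffic: {traffic_pb_per_day:,.0f} PB per day")
```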
Velocity
•Reasons for growth
• Users:
• 16 million in 1995 to 3.4 billion in 2016
• IoT:
• sensor devices, surveillance cameras
• Cloud computing:
• $26.4 billion in 2012 to $260.5 billion in 2020
• Websites:
• 156 million in 2008 to 1.5 billion in 2019
• Scientific data:
• weather data, seismic data
Velocity
•Data now streams into the server continuously and in
real time, and the result is only useful if the delay
is very short.
Velocity – Why Challenging?
•Batch processing
• Collect data → Clean data → Feed in chunks → Wait → Act
•Transmission
• Transferring data becomes a prominent issue in big
data
• Balancing latency/bandwidth and cost
• Reliability of data transmission
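The contrast between batch processing and low-delay processing can be sketched as follows. The batch function only produces a result once a full chunk has been collected, while the running version updates its answer with every arriving value. The function and class names, and the sensor readings, are hypothetical.

```python
def batch_means(readings, chunk_size):
    """Batch style: wait until a chunk is complete, then act on it."""
    results = []
    for start in range(0, len(readings), chunk_size):
        chunk = readings[start:start + chunk_size]
        results.append(sum(chunk) / len(chunk))
    return results


class RunningMean:
    """Streaming style: keep an up-to-date mean without storing all values."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        # Incremental update: the new mean depends only on the old mean,
        # the count and the incoming value.
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean


readings = [10, 20, 30, 40]            # hypothetical sensor readings

batch_result = batch_means(readings, chunk_size=2)

stream = RunningMean()
stream_result = [stream.update(v) for v in readings]
```

With the batch version, the first answer is only available after two readings have arrived; the streaming version has an answer after every reading, at the cost of maintaining state between arrivals.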
Veracity (Quality)
•Data = quantity + quality
• Some argue that veracity is the most important V
in big data
• The fourth V in big data
Source: IBM
Veracity – Where the Uncertainties Come From
Veracity – Why Challenging?
•Errors arise easily
• Due to the other V's
•Errors are difficult to control
• Identifying errors
• Handling errors
  • Correcting them
  • Eliminating their effects
Variability
•Variety: same entity, different data
•Variability: same data, different meaning
Variability
•The meaning of data changes all the time
• This is a great experience!
• Great, it totally ruined my day!
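The two sentences above show why variability is hard for simple methods. The sketch below uses a deliberately naive, hypothetical keyword rule that labels a sentence positive whenever it contains a positive word. It gives both sentences the same label, although the second one is clearly a complaint.

```python
# Hypothetical, deliberately simplistic word list.
POSITIVE_WORDS = {"great", "good", "excellent"}


def naive_sentiment(sentence):
    # Split into words, strip punctuation and ignore case.
    words = {w.strip(".,!?").lower() for w in sentence.split()}
    return "positive" if words & POSITIVE_WORDS else "neutral"


first = naive_sentiment("This is a great experience!")
second = naive_sentiment("Great, it totally ruined my day!")

print(first, second)  # both labels are the same, so the sarcasm is missed
```

The word "great" is the same data in both sentences, but its meaning depends on context, which a fixed word list cannot capture.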
Visibility
•Visualization is the most straightforward way
to view data
• Benefits of data visualization
Visibility – Why Challenging?
•Choose the most suitable way to present data
• Characteristics of data
• Purpose of presentation
Value
•Big data is meaningless if it does not provide
value toward some meaningful goal
•Value from other Vs
• Volume
• Variety
• Velocity
•…
•Value from applications of big data
Summary of 7 V’s in Big Data
•Fundamental V’s
• Volume
• Variety
• Velocity
•Characteristics/difficulties
• Veracity
• Variability
•Tools
• Visibility
•Objective
• Value
Source: google.com
Big Data in Retail
• Retailer:
• Adjust the price
• Improve shopping experience
• Supplier:
• Adjust the supply chain/stock range
Big Data in Entertainment
•Predict audience interests
•Understand the customer churn
•Suggest related videos
•Target advertisements
Big Data in National Security
•Integrate shared
information
•Entity recognition and
tracking
•Monitor, predict and
prevent terrorist attacks
Big Data in Science
•Physics
• The Large Hadron Collider at CERN collects 5
trillion bits of data every second
•Chemistry
• Extract information from patents
• Predict the property of compounds
•Biology
• One UK project alone will sequence 100,000 human
genomes, producing more than 20 petabytes of
data
• Also helps greatly in the medical domain
Big Data in Healthcare
•Diagnostics
• Data mining and analysis
•Preventative medicine
• Disease prevention or risk
assessment
•Population health
• Disease trend
• Pandemics
Introduction to Big Data Management
• Big data management
• Acquisition
• Storage
• Preparation
• Visualization
• Big data analytics
• Analysis
• Prediction
• Decision making
• Data science
Example
[Figure labels: index, data type, query type, accuracy]
Data Acquisition
•Data in relational databases
• Structured data
• Access by SQL
•Data in text files and Excel spreadsheets
• Unstructured or structured data
• Access by scripting languages (e.g., Python, Perl)
•Data from websites
• Semi-structured data (e.g., XML) and unstructured
data (e.g., images)
• Access via
  • WebSocket services
  • REST
  • Crawlers
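The access paths listed above can be sketched with Python's standard library. The example uses an in-memory SQLite database for SQL access, a CSV text parsed by a script, and a JSON string standing in for the body of a REST response. The table, the sample values and the payload are hypothetical, and no network call is made.

```python
import csv
import io
import json
import sqlite3

# Relational database: structured data, accessed with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (item TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [("pen", 1.5), ("book", 12.0)])
sql_total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
conn.close()

# Text file: parsed by a script. io.StringIO stands in for a file on disk.
csv_file = io.StringIO("item,amount\npen,1.5\nbook,12.0\n")
rows = list(csv.DictReader(csv_file))
csv_total = sum(float(row["amount"]) for row in rows)

# Website: a JSON string standing in for the body of a REST response.
response_body = '{"sales": [{"item": "pen", "amount": 1.5}, {"item": "book", "amount": 12.0}]}'
payload = json.loads(response_body)
web_total = sum(entry["amount"] for entry in payload["sales"])
```

The same information arrives through three different access mechanisms, each needing its own handling before the data can be stored or analysed together.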
Data Acquisition
•Scientific data
• E.g., physics experiments, genome data
• Structured, semi-structured, unstructured
• Access by specially designed software
•Graph data
• E.g., knowledge graphs, social networks
• Access by specially designed programs
• Difficult to handle (e.g., graph isomorphism problem)
•…
Hybrid in Real Applications
•Applications usually need to acquire data from
multiple sources
•E.g., COVID-19 Map from JHU
• WHO, CDC, …
• Structured data (tables)
• Media reports and Social media (e.g., DXY)
• Unstructured text data
• Acquire data from website
• Extract information from text/tables
Data Storage
•Big data storage is challenging
• Data volumes are massive
• Reliably storing petabytes (PBs) of data is challenging
• All kinds of failures: disk, hardware and network
failures
• The probability of failure simply increases with the
number of machines …
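The last point can be made concrete with a short calculation. Assuming, for illustration only, that each machine fails independently with the same probability p during some period, the chance that at least one of n machines fails is 1 - (1 - p)^n. The value p = 0.01 below is a made-up figure.

```python
def prob_any_failure(p_single, n_machines):
    # Assumption: machines fail independently with the same probability.
    # P(at least one failure) = 1 - P(no machine fails).
    return 1 - (1 - p_single) ** n_machines


p = 0.01  # hypothetical per-machine failure probability for the period

for n in (1, 10, 100, 1000):
    print(f"{n:>5} machines: {prob_any_failure(p, n):.3f}")
```

Even with a small per-machine probability, a failure somewhere in the cluster becomes close to certain as the cluster grows, which is why big data storage has to be designed around failures rather than hoping to avoid them.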
Data Preparation
•Two-step data preparation process
• Data exploration
  • Understand your data
• Data pre-processing
  • Data cleansing (veracity)
  • Data integration (variety)
Data Exploration
•Explore
• Trends
• Correlations
• Outliers
• Statistics
• Mean, Mode, Median, Standard deviation, Range
•Visualization also helps data exploration
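A minimal sketch of the summary statistics named above, using Python's `statistics` module on a small made-up sample. The values are hypothetical and include one unusually large number so that the effect of an outlier is visible.

```python
import statistics

# Hypothetical sample with one unusually large value.
values = [2, 4, 4, 5, 7, 9, 40]

summary = {
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "mode": statistics.mode(values),
    "stdev": statistics.stdev(values),   # sample standard deviation
    "range": max(values) - min(values),
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Here the mean is pulled well above the median by the single large value, which is the kind of hint that exploration is meant to surface before any modelling starts.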
Data Cleansing
•Dirty data types
• Missing values/records
• Invalid data
• Inconsistency
• Duplicates
• Outliers
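The sketch below shows how some of these dirty data types might be detected in a small set of records. The records, the field names and the valid age range of 0 to 120 are hypothetical, and rejecting a record is only one possible way to handle it; correction or imputation are alternatives.

```python
# Hypothetical raw records with typical problems.
raw = [
    {"id": 1, "age": "34"},
    {"id": 2, "age": ""},         # missing value
    {"id": 3, "age": "-5"},       # invalid value (out of range)
    {"id": 1, "age": "34"},       # duplicate record
    {"id": 4, "age": "thirty"},   # invalid format
]


def clean(records, min_age=0, max_age=120):
    """Return (kept, rejected); rejected holds (record, reason) pairs."""
    seen, kept, rejected = set(), [], []
    for rec in records:
        key = (rec["id"], rec["age"])
        if key in seen:
            rejected.append((rec, "duplicate"))
            continue
        seen.add(key)
        if rec["age"] == "":
            rejected.append((rec, "missing"))
            continue
        try:
            age = int(rec["age"])
        except ValueError:
            rejected.append((rec, "invalid"))
            continue
        if not min_age <= age <= max_age:
            rejected.append((rec, "invalid"))
            continue
        kept.append({"id": rec["id"], "age": age})
    return kept, rejected


kept, rejected = clean(raw)
reasons = [reason for _, reason in rejected]
```

Keeping the rejected records together with a reason makes it possible to review them later instead of silently losing data.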
Data Integration
•Merge data from multiple, complex and
heterogeneous sources
•Provide a unified view of the data
•Mature field in traditional databases
•Schema mapping
• Variety
•Record linkage
• Identify whether two records refer to the same entity
• Variety, velocity
•Data fusion
• Resolving conflicts
• Veracity
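Schema mapping and data fusion can be sketched together on a toy example. Three hypothetical sources describe the same movie with different field names; each record is first mapped onto one target schema, and conflicting values are then resolved by majority vote. The source names, the field names and the voting rule are assumptions for illustration; majority voting is only one simple fusion strategy.

```python
from collections import Counter

# Hypothetical mapping from each source's field names to the target schema.
SCHEMA_MAP = {
    "site_a": {"movie": "title", "yr": "year"},
    "site_b": {"name": "title", "release_year": "year"},
    "site_c": {"title": "title", "year": "year"},
}


def to_target(source, record):
    # Schema mapping: rename known fields, drop fields with no mapping.
    mapping = SCHEMA_MAP[source]
    return {mapping[field]: value for field, value in record.items() if field in mapping}


def fuse(records):
    # Data fusion: for each field, keep the value most sources agree on.
    fields = {field for rec in records for field in rec}
    fused = {}
    for field in fields:
        votes = Counter(rec[field] for rec in records if field in rec)
        fused[field] = votes.most_common(1)[0][0]
    return fused


# The same movie as reported by three sources; site_c has a wrong year.
mapped = [
    to_target("site_a", {"movie": "Alien", "yr": 1979}),
    to_target("site_b", {"name": "Alien", "release_year": 1979}),
    to_target("site_c", {"title": "Alien", "year": 1997}),
]
fused = fuse(mapped)
```

In this sketch the three records are already known to describe the same entity; deciding that is the job of record linkage, which would run between the mapping and the fusion step.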
Data Curation
•Data curation includes all the processes needed
for principled and controlled data creation,
maintenance, and management, together with the
capacity to add value to data.
• Analogy to an art curator…
• make decisions regarding what data to collect,
• oversee data care and documentation (metadata)
• conduct research based on the collection
• data-driven decision making
• ensure proper packaging of data for reuse
• share that data with the public
•…