Lecture 3: Business Intelligence: OLAP, Data Warehouse, and Column Store
Lecture 3: Business Intelligence: OLAP, Data Warehouse, and Column Store
Lecture 3: Business Intelligence: OLAP, Data Warehouse, and Column Store
1
Why we still study OLAP/Data
Warehouse in Big Data?
• Understand the Big Data history
– How does the requirement of (big) data analytics/business
intelligence evolve over the time?
– What are the architecture and implementation techniques
being developed? Will they still be useful in Big Data?
– Understand their limitation and what factors have
changed from 90’s to now?
• NoSQL is not only SQL
• Hive/Impala aims to provide OLAP/BI for Big Data
using Hadoop
2
Highlights
• OLAP
– Multi-relational Data model
– Operators
– SQL
• Data warehouse (architecture, issues,
optimizations)
• Join Processing
• Column Stores (Optimized for OLAP workload)
3
Let’s get back to the root in 70’s:
Relational Database
Basic Structure
• Formally, given sets D1, D2, …. Dn a relation r is a subset of
D1 x D2 x … x Dn
Thus, a relation is a set of n-tuples (a1, a2, …, an) where each ai Di
• Example:
customer_name = {Jones, Smith, Curry, Lindsay}
customer_street = {Main, North, Park}
customer_city = {Harrison, Rye, Pittsfield}
Then r = { (Jones, Main, Harrison),
(Smith, North, Rye),
(Curry, North, Rye),
(Lindsay, Park, Pittsfield) }
is a relation over
customer_name , customer_street, customer_city
Relation Schema
11
In early 90’s:
OLAP & Data Warehouse
Database Workloads
• OLTP (online transaction processing)
– Typical applications: e-commerce, banking, airline reservations
– User facing: real-time, low latency, highly-concurrent
– Tasks: relatively small set of “standard” transactional queries
– Data access pattern: random reads, updates, writes (involving
relatively small amounts of data)
• OLAP (online analytical processing)
– Typical applications: business intelligence, data mining
– Back-end processing: batch workloads, less concurrency
– Tasks: complex analytical queries, often ad hoc
– Data access pattern: table scans, large amounts of data involved
per query
OLTP
14
OLAP
• Of increasing importance are On-Line
Application Processing (OLAP) queries.
– Few, but complex queries --- may run for hours.
– Queries do not depend on having an absolutely
up-to-date database.
15
OLAP Examples
1. Amazon analyzes purchases by its customers
to come up with an individual screen with
products of likely interest to the customer.
2. Analysts at Wal-Mart look for items with
increasing sales in some region.
16
One Database or Two?
• Downsides of co-existing OLTP and OLAP
workloads
– Poor memory management
– Conflicting data access patterns
– Variable latency
• Solution: separate databases
– User-facing OLTP database for high-volume
transactions
– Data warehouse for OLAP workloads
– How do we connect the two?
OLTP/OLAP Architecture
ETL
(Extract, Transform, and Load)
OLTP OLAP
OLTP/OLAP Integration
• OLTP database for user-facing transactions
– Retain records of all activity
– Periodic ETL (e.g., nightly)
• Extract-Transform-Load (ETL)
– Extract records from source
– Transform: clean data, check integrity, aggregate, etc.
– Load into OLAP database
• OLAP database for data warehousing
– Business intelligence: reporting, ad hoc queries, data
mining, etc.
– Feedback to improve OLTP services
The Data Warehouse
• The most common form of data integration.
– Copy sources into a single DB (warehouse) and try
to keep it up-to-date.
– Usual method: periodic reconstruction of the
warehouse, perhaps overnight.
– Frequently essential for analytic queries.
20
Warehouse Architecture
Client Client
Metadata Warehouse
Integration
21
Star Schemas
22
Example: Star Schema
• Suppose we want to record in a warehouse
information about every beer sale: the bar,
the brand of beer, the drinker who bought the
beer, the day, the time, and the price charged.
• The fact table is a relation:
Sales(bar, beer, drinker, day, time, price)
23
Example, Continued
• The dimension tables include information
about the bar, beer, and drinker
“dimensions”:
Bars(bar, addr, license)
Beers(beer, manf)
Drinkers(drinker, addr, phone)
24
Visualization – Star Schema
Dimension Table (Bars) Dimension Table (Drinkers)
25
Dimensions and Dependent
Attributes
• Two classes of fact-table attributes:
1. Dimension attributes : the key of a dimension
table.
2. Dependent attributes : a value determined by
the dimension attributes of the tuple.
26
Warehouse Models & Operators
• Data Models
– relations
– stars & snowflakes
– cubes
• Operators
– slice & dice
– roll-up, drill down
– pivoting
– other
27
Star
product prodId name price store storeId city
p1 bolt 10 c1 nyc
p2 nut 5 c2 sfo
c3 la
28
Star Schema
sale
orderId
date customer
product
custId custId
prodId
prodId name
name
storeId address
price
qty city
amt
store
storeId
city
29
Terms
• Fact table
• Dimension tables
• Measures sale
orderId
date customer
product
custId custId
prodId
prodId name
name
storeId address
price
qty city
amt
store
storeId
city
30
Dimension Hierarchies
sType
store
city region
sType tId size location
t1 small downtown
store storeId cityId tId mgr t2 large suburbs
s5 sfo t1 joe
s7 sfo t2 fred city cityId pop regId
s9 la t1 nancy sfo 1M north
la 5M south
snowflake schema
constellations region regId name
north cold region
south warm region
31
Aggregates
• Add up amounts for day 1
• In SQL: SELECT sum(amt) FROM SALE
WHERE date = 1
sale prodId storeId date amt
p1 c1 1 12
p2 c1 1 11
p1 c3 1 50
p2 c2 1 8
81
p1 c1 2 44
p1 c2 2 4
32
Aggregates
• Add up amounts by day
• In SQL: SELECT date, sum(amt) FROM SALE
GROUP BY date
sale prodId storeId date amt
p1 c1 1 12
p2 c1 1 11 ans date sum
p1 c3 1 50 1 81
p2 c2 1 8 2 48
p1 c1 2 44
p1 c2 2 4
33
Another Example
• Add up amounts by day, product
• In SQL: SELECT date, sum(amt) FROM SALE
GROUP BY date, prodId
sale prodId storeId date amt
p1 c1 1 12 sale prodId date amt
p2 c1 1 11
p1 1 62
p1 c3 1 50
p2 1 19
p2 c2 1 8
p1 c1 2 44 p1 2 48
p1 c2 2 4
rollup
drill-down
34
ROLAP vs. MOLAP
• ROLAP:
Relational On-Line Analytical Processing
• MOLAP:
Multi-Dimensional On-Line Analytical
Processing
35
Cube
dimensions = 2
36
3-D Cube
dimensions = 3
37
Multidimensional Data
• Sales volume as a function of product,
month, and region Dimensions: Product, Location, Time
Hierarchical summarization paths
on
gi
Office Day
Month
A Sample Data Cube
Total annual sales
Date of TV in U.S.A.
1Qtr 2Qtr 3Qtr 4Qtr sum
t
uc
TV
od
PC U.S.A
Pr
VCR
Country
sum
Canada
Mexico
sum
Cuboids Corresponding to the
Cube
all
0-D(apex) cuboid
product date country
1-D cuboids
3-D(base) cuboid
product, date, country
Cube Aggregation
Example: computing sums
c1 c2 c3
day 2 ...
p1 44 4
p2 c1 c2 c3
day 1
p1 12 50
p2 11 8
c1 c2 c3
sum 67 12 50
c1 c2 c3
p1 56 4 50
p2 11 8
129
sum
rollup p1 110
p2 19
drill-down
41
Cube Operators
c1 c2 c3
day 2 ...
p1 44 4
p2 c1 c2 c3
day 1
p1 12 50
p2 11 8 sale(c1,*,*)
c1 c2 c3
sum 67 12 50
c1 c2 c3
p1 56 4 50
p2 11 8
129
sum
sale(c2,p2,*) p1 110
p2 19 sale(*,*,*)
42
Extended Cube
* c1 c2 c3 *
p1 56 4 50 110
p2 11 8 19
day 2 c1* c267 c312 * 50 129
p1 44 4 48
p2
c1 c2 c3 *
day 1
p1 *
12 44 4
50 62 48 sale(*,p2,*)
p2 11 8 19
* 23 8 50 81
43
Aggregation Using Hierarchies
c1 c2 c3
day 2
p1 44 4
customer
p2 c1 c2 c3
day 1
p1 12 50 region
p2 11 8
country
region A region B
p1 56 54
p2 11 8
(customer c1 in Region A;
customers c2, c3 in Region B)
44
Pivoting
Fact table view: Multi-dimensional cube:
sale prodId storeId date amt
p1 c1 1 12
p2 c1 1 11 c1 c2 c3
p1 c3 1 50 day 2
p1 44 4
p2 c2 1 8 p2 c1 c2 c3
p1 c1 2 44 day 1
p1 12 50
p1 c2 2 4 p2 11 8
c1 c2 c3
p1 56 4 50
p2 11 8
45
CUBE Operator (SQL-99)
Chevy Sales Cross Tab
Chevy 1990 1991 1992 Total (ALL)
black 50 85 154 289
white 40 115 199 354
Total 90 200 353 1286
(ALL)
49
Query & Analysis Tools
• Query Building
• Report Writers (comparisons, growth, graphs,…)
• Spreadsheet Systems
• Web Interfaces
• Data Mining
50
Other Operations
• Time functions
– e.g., time average
• Computed Attributes
– e.g., commission = sales * rate
• Text Queries
– e.g., find documents with words X AND B
– e.g., rank documents by frequency of
words X, Y, Z
51
Data Warehouse Implementation
Implementing a Warehouse
• Monitoring: Sending data from sources
• Integrating: Loading, cleansing,...
• Processing: Query processing, indexing, ...
• Managing: Metadata, Design, ...
53
Multi-Tiered Architecture
Monitor
Metadata & OLAP Server
other
source Integrator
s Analysis
Operational Extract Query
DBs Transform Data Serve Reports
Load
Refresh
Warehouse Data mining
Data Marts
55
Data Cleaning
• Migration (e.g., yen dollars)
• Scrubbing: use domain-specific knowledge (e.g., social
security numbers)
• Fusion (e.g., mail list, customer merging)
billing DB customer1(Joe)
merged_customer(Joe)
service DB customer2(Joe)
57
OLAP Implementation
Derived Data
• Derived Warehouse Data
– indexes
– aggregates
– materialized views (next slide)
• When to update derived data?
• Incremental vs. refresh
59
What to Materialize?
• Store in warehouse results useful for common
queries
• Example: total sales
c1 c2 c3
day 2 p1 44 4 ...
p2 c1 c2 c3
day 1 p1 12 50
p2 11 8
c1 c2 c3
p1 67 12 50
c1 c2 c3
p1 56 4 50
p2 11 8
129
c1
materialize p1 110
p2 19
60
Materialization Factors
• Type/frequency of queries
• Query response time
• Storage cost
• Update cost
61
Cube Aggregates Lattice
129
all
c1 c2 c3
p1 67 12 50
city product date
use greedy
day 2
c1 c2 c3
city, product, date algorithm to
day 1
p1
p2 c1
44
c2
4
c3 decide what
to materialize
p1 12 50
p2 11 8
62
Dimension Hierarchies
all
city
63
Dimension Hierarchies
all
state
city, product, date
state, date
state, product
64
Interesting Hierarchy
time day week month quarter year
all 1 1 1 1 2000
2 1 1 1 2000
3 1 1 1 2000
4 1 1 1 2000
years 5 1 1 1 2000
6 1 1 1 2000
7 1 1 1 2000
weeks 8 2 1 1 2000
quarters
months conceptual
dimension table
days
65
Indexing OLAP Data: Bitmap Index
• Index on a particular column
• Each value in the column has a bit vector: bit-op is fast
• The length of the bit vector: # of records in the base table
• The i-th bit is set if the i-th row of the base table has the value for the
indexed column
• not suitable for high cardinality domains
i. B-1
Disk B main memory buffers Disk
Partitions
of R & S Join Result
Hash table for partition
Read in a partition hash Ri (k < B-1 pages)
of R, hash it using fn
85
Row Store vs Column Store
Column Store:
IBM 60.25 10,000 1/15/2006
SELECT account.account_number,
sum (usage.toll_airtime),
sum (usage.toll_price)
FROM usage, toll, source, account
WHERE usage.toll_id = toll.toll_id
AND usage.source_id = source.source_id
AND usage.account_id = account.account_id
AND toll.type_ind in (‘AE’. ‘AA’)
AND usage.toll_price > 0
AND source.type != ‘CIBER’
AND toll.rating_method = ‘IS’
AND usage.invoice_date = 20051013
GROUP BY account.account_number
(+) Easy to add/modify a record (+) Only need to read in relevant data
(-) Might read in unnecessary data (-) Tuple writes require multiple accesses
92
Read store: Column Encoding/Compression
• Use compression schemes and indices
– Null Suppression
– Dictionary encoding
– Run Length encoding
– Bit-Vector encoding
94
Write Store
• Same structure, but explicitly use
(segment, key) to identify records
– Easier to maintain the mapping
– Only concerns the inserted records
• Tuple mover
– Copies batch of records to RS
• Delete record
– Mark it on RS
– Purged by tuple mover
How to solve read/write
conflict
• Situation: one transaction updates
the record X, while another
transaction reads X.
• Use snapshot isolation
Query Execution - Operators
97
Query Execution - Operators
• Decompress: Converts compressed column to
uncompressed representation
• Mask(Bitstring B, Projection Cs) => emit only those
values whose corresponding bits are 1
• Concat: Combines one or more projections sorted in
the same order into a single projection
• Permute: Permutes a projection according to the
ordering defined by a join index
• Bitstring operators: Band – Bitwise AND, Bor –
Bitwise OR, Bnot – complement
98
Benefits in query processing
• Selection – has more indices to use
• Projection – some “projections”
already defined
• Join – some projections are
materialized joins
• Aggregations – works on required
columns only
Evaluation
• Use TPC-H – decision support queries
• Storage
Query performance
Query performance
• Row store uses materialized views
Summary: the performance gain
• Column representation – avoids reads of unused
attributes
• Storing overlapping projections – multiple orderings of
a column, more choices for query optimization
• Compression of data – more orderings of a column in
the same amount of space
• Query operators operate on compressed representation
Google’s Dremel:
Interactive Analysis of Web-Scale Datasets
104
Dremel system
• Trillion-record, multi-terabyte datasets at interactive
speed
– Scales to thousands of nodes
– Fault and straggler tolerant execution
• Nested data model
– Complex datasets; normalization is prohibitive
– Columnar storage and processing
• Tree architecture (as in web search)
• Interoperates with Google's data mgmt tools
– In situ data access (e.g., GFS, Bigtable)
– MapReduce pipelines
105
Widely used inside Google
• Analysis of crawled web Tablet migrations in
documents managed Bigtable instances
• Tracking install data for Results of tests run on
applications on Android Google's distributed build
Market system
• Crash reporting for Google Disk I/O statistics for
products hundreds of thousands of
• OCR results from Google disks
Books Resource monitoring for
• Spam analysis jobs run in Google's data
• Debugging of map tiles on centers
Google Maps Symbols and dependencies
in Google's codebase
106
Records vs. columns
DocId: 10
Links
r 1
Forward: 20
Name A
Language
Code: 'en-us'
* . . .*
Country: 'us' B E
Url: 'http://A' *
Name C D
Url: 'http://B' r1
r1
r1 r1
r2 r2
r2 Read less, r2
... cheaper
decompression
Challenge: preserve structure, reconstruct from a subset of fields
107
Nested data model
http://code.google.com/apis/protocolbuffers DocId: 10
Links
r 1
Forward: 20
Forward: 40
Forward: 60
multiplicity: Name
message Document { Language
required int64 DocId; [1,1] Code: 'en-us'
Country: 'us'
optional group Links { Language
repeated int64 Backward; [0,*] Code: 'en'
Url: 'http://A'
repeated int64 Forward; Name
} Url: 'http://B'
repeated group Name { Name
Language
repeated group Language { Code: 'en-gb'
required string Code; Country: 'gb'
optional string Country; [0,1]
}
DocId: 20 r 2
Links
optional string Url; Backward: 10
Backward: 30
} Forward: 80
} Name
Url: 'http://C'
108
Column-striped representation
DocId Name.Url Links.Forward Links.Backward
value r d value r d value r d value r d
10 0 0 http://A 0 2 20 0 2 NULL 0 1
20 0 0 http://B 1 2 40 1 2 10 0 2
NULL 1 1 60 1 2 30 1 2
http://C 0 2 80 0 2
Name.Language.Code Name.Language.Country
value r d value r d
en-us 0 2 us 0 3
en 2 2 NULL 2 2
NULL 1 1 NULL 1 1
en-gb 1 2 gb 1 3
NULL 0 1 NULL 0 1
109
Repetition and
definition levels DocId: 10
Links
Forward: 20
r 1
Forward: 40
r=1 r=2 (non-repeating) Forward: 60
Name
Name.Language.Code
epeated
Language
o rd ( r= 0) h as r
value r d rec Code: 'en-us'
Country: 'us'
en-us 0 2 repeated
Language (r=2) has
Language
Code: 'en'
en 2 2 Url: 'http://A'
NULL 1 1 Name
Url: 'http://B'
en-gb 1 2 Name
Language
NULL 0 1 Code: 'en-gb'
Country: 'gb'
111
SQL dialect for nested data
SELECT DocId AS Id,
COUNT(Name.Language.Code) WITHIN Name AS Cnt,
Name.Url + ',' + Name.Language.Code AS Str
FROM t
WHERE REGEXP(Name.Url, '^http') AND DocId < 20;
112
Serving tree
[Dean WSDM'09] client • Parallelizes scheduling
root server
and aggregation
• Fault tolerance
intermediate • Stragglers
servers
...
• Designed for "small"
... results (<1M records)
leaf servers
(with local ...
storage) histogram of
response times
storage layer (e.g., GFS)
113
Example: count()
SELECT A, COUNT(B) FROM T SELECT A, SUM(c)
0 GROUP BY A FROM (R11 UNION ALL R110)
T = {/gfs/1, /gfs/2, …, /gfs/100000} GROUP BY A
R11 R12
SELECT A, COUNT(B) AS c SELECT A, COUNT(B) AS c
1 FROM T11 GROUP BY A FROM T12 GROUP BY A ...
T11 = {/gfs/1, …, /gfs/10000} T12 = {/gfs/10001, …, /gfs/20000}
...
SELECT A, COUNT(B) AS c
3 FROM T31 GROUP BY A ...
T31 = {/gfs/1}
Data access ops
114
Experiments
• 1 PB of real data
(uncompressed, non-replicated)
• 100K-800K tablets per table
• Experiments run during business hours
115
Interactive speed
Monthly query
workload
percentage of queries of one 3000-node
Dremel instance
execution
time (sec)
BigQuery
2. Process Import to tables
Your
Apps
117
List of Column Databases
• Vertica/C-Store
• SybaseIQ
• MonetDB
• LucidDB
• HANA
• Google’s Dremel
• Parcell-> Redshit (Another Cloud-DB Service)
Take-home messages
• OLAP
– Multi-relational Data model
– Operators
– SQL
• Data warehouse (architecture, issues,
optimizations)
• Join Processing
• Column Stores (Optimized for OLAP workload)
119