Data Warehouse Concepts: Quách Đình Hoàng Hoangqd@hcmute - Edu.vn
Data Warehouse Concepts: Quách Đình Hoàng Hoangqd@hcmute - Edu.vn
Data Warehouse Concepts: Quách Đình Hoàng Hoangqd@hcmute - Edu.vn
CONCEPTS
2
Motivation for data warehouse
• Organizations face increasingly complex challenges
to achieve operational goals so need analysis tools for
decision support
• Traditional operational or transactional databases do
not satisfy the requirements for data analysis
– Designed/optimized to support daily business operations;
primary concern: concurrent access and recovery
techniques to guarantee data consistency
– Contain detailed data, do not include historical data, and
perform poorly for complex queries that involve many
tables or aggregate large volumes of data
3
Motivation for data warehouse
• To analyze the behavior of an organization, data from
several operational systems must be integrated
– Difficult to accomplish due to many differences in data
definition and content
• Data warehouse address requirements of decision-
making
– Populated from operational databases and external data
sources
– Integrated and transformed data
– Optimized for reporting and periodic integration
4
Data Warehouse
• A data warehouse is a copy of transaction data
specifically structured for query and analysis.
Ralph Kimball
6
Data Warehouse in Knowledge
Discovery (KDD) Process
Pattern Evaluation
Data Mining
Task-relevant Data
Data Cleaning
Data Integration
Databases 7
Data Warehouse in Business Intelligence
Increasing potential
to support
business decisions End User
Decision
Making
Data Exploration
Statistical Summary, Querying, and Reporting
9
OLTP vs. OLAP
Transaction
processing (OLTP)
• Primary data from
transactions
• Daily operations and
short term decisions
Business intelligence
processing (OLAP)
• Transformed secondary
data
• Medium and long-term
decisions
10
OLTP vs. OLAP
Characteristic Operational Data Warehouse (OLAP)
Database (OLTP)
Currency Current Historical
Details level Individual Individual and summary
Orientation Process Subject
Records per request Few Thousands
Normalization level Mostly normalized Normalization relaxed
Update level Highly volatile Mostly refreshed (non
volatile)
Data model Relational Relational (star schemas)
and multidimensional (data
cubes)
…
11
OLTP vs. OLAP Schema
Comparison
Operational database Data warehouse
Manages Store
Item StoreId
ItemId StoreManager
Employee ItemName StoreStreet
ItemUnitPrice StoreCity
EmpNo
ItemBrand StoreSales StoreState
EmpFirstName StoreZip
ItemCategory
EmpLastName StoreNation
... DivId
ItemSales Sales
DivName
SalesNo
DivManager
SalesUnits
SalesDollar
Takes Customer SalesCost
TimeDim
CustId TimeNo
Product CustName TimeSales TimeDay
Customer CustPhone
Order ProdNo TimeMonth
CustNo CustStreet CustSales TimeQuarter
ProdName
OrdNo CustCity TimeYear
CustFirstName Places Contains ProdQOH
OrdDate CustState TimeDayOfWeek
CustLastName ... CustZip TimeFiscalYear
... ...
CustNation
Qty
12
Multidimensional Model
• DWs and OLAP use a multidimensional view of data
• Represented as a data cube or an hypercube
– Dimensions: Perspectives for analyzing data
– Cells (facts): Contain measures, values that are to be
analyzed
Milan 24 18 28 14
Rome 33 25 23 25
Nice 12 20 24 33
Paris
Q1 21 10 18 35 measure
Time (Quarter)
values
Q2 27 14 11 30
dimensions
Q3 26 12 35 32
Q4 14 20 47 31
games DVDs
books CDs
13
Product (Category)
Hierarchies
• Data granularity: level of detail at which measures are
represented for each dimension of the cube
• Data analyzed at different granularities (abstraction
levels)
• Hierarchies relate low-level (detailed) concepts to
higher-level (general concepts)
– Example: Store – City – Region/Province – Country
• Given two related levels in a hierarchy, lower level is
called child, higher level is called parent
• Instances of these levels are called members
14
Hierarchies
• Example
– Hierarchies of
the Product,
Time, and
Customer
dimensions
15
Hierarchies
• Members of the hierarchy Store City
Region/Province Country
Store dimension All
16
Measure Aggregation and Summarizability
• Measures are aggregated when using hierarchies for
visualizing data at different abstraction levels
• Summarizability refers to the correct aggregation of
cube measures along dimension hierarchies
• Summarizability conditions:
– Disjointness of instances: the grouping of instances in a
level with respect to the parent in the next level must result
in disjoint sets
– Completeness: all instances are included in the hierarchy
and each instance is related to one parent in the next level
– Correct use of aggregation functions (“measure type”
condition): Type of measures determine the kind of
aggregation functions that can be applied. 17
Measure Classification
• Each measure is associated to an aggregation function
that combines several measure values into a single one
• Measures can be classified according to the way they
can be aggregated:
– Additive measures (flow or rate measures): Can be meaningfully
summarized using addition along all dimensions
• E.g., sales amount can be summarized when the hierarchies in Store,
Time, and Product dimensions are traversed
– Semi-additive measures (stock or level measures): Can be meaningfully
summarized using addition along some (not all) dimensions
• E.g., inventory quantities, can be aggregated in the Store dimension,
but cannot be aggregated in the Time dimension
– Non-additive measures (value-per-unit measures): Cannot be
meaningfully summarized using addition along any dimension
• E.g., item price, cost per unit, exchange rate
18
Another Measure Classification
• Distributive measures: defined by an aggregation
function that can be computed in a distributed way
– The result of applying the distributive function to the whole
data set is the same as the result of applying it (not
necessarily the same) to the n aggregated values of n subsets.
– E.g., sum, min, max, and count (distinct count is not)
• Algebraic measures: defined by an aggregation
function that can be expressed as a scalar function of
distributive ones
– E.g., average (can be computed by sum and count)
• Holistic measures: Cannot be computed from other
sub-aggregates.
– E.g., median, mode, and rank
19
OLAP Operations: Roll up
• Transforms detailed measures into
summarized ones when one moves up in a
hierarchy
Milan 24 18 28 14
Rome 33 25 23 25 Italy 57 43 51 39
Nice 12 20 24 33 France
Paris Q1 33 30 42 68
Time (Quarter)
Q1 21 10 18 35
Roll-up to the Country level
Time (Quarter)
Q2 27 14 11 30
Q2 27 14 11 30
Q3 26 12 35 32
Q3 26 12 35 32
Q4 14 20 47 31
Q4 14 20 47 31
games DVDs
games DVDs books CDs
books CDs Product (Category)
Product (Category)
20
OLAP Operations: Drill down
• Opposite to the roll-up operation, i.e., it moves from a
more general level to a detailed level in a hierarchy
Milan 24 18 28 14 Milan 8 6 9 5
Rome 33 25 23 25 Rome 10 8 11 8
Nice Nice 4 7 8 10
12 20 24 33
Paris Paris
Jan 7 2 6 13
Q1 21 10 18 35 Drill-down to the
Time (Quarter)
Month level
Time (Quarter)
Q2 27 14 11 30 Feb 8 4 8 12
Q3 26 12 35 32 Mar 6 4 4 10
Paris 21 27 26 14
Store (City)
Q2 27 14 11 30
Pivot Nice 12 14 11 13
Q3 26 12 35 32
Rome 33 28 35 32
Q4 14 20 47 31
Milan 24 23 25 18
games DVDs
Q1 Q2 Q3 Q4
books CDs
Product (Category) Time (Quarter)
22
OLAP Operations: Slice
• Performs a selection on one dimension of a
cube, resulting in a subcube
Milan 24 18 28 14
Rome 33 25 23 25 Q1 21 10 18 35
Time (Quarter)
Nice 12 20 24 33
Paris Q2 27 14 11 30
Q1 21 10 18 35
Time (Quarter)
Q3 26 12 35 32
Q2 27 14 11 30 Slice on Store.City = ‘Paris’ Q4 14 20 47 31
Q3 26 12 35 32 games DVDs
Q4 14 20 47 31 books CDs
Product (Category)
games DVDs
books CDs
Product (Category)
23
OLAP Operations: Dice
• Defines a selection on two or more
dimensions, thus again defining a subcube
Milan 24 18 28 14
Rome 33 25 23 25
Nice 12 20 24 33 Nice 12 20 24 33
Paris Paris
(Quarter)
Q1 21 10 18 35 Q1 21 10 18 35
Dice on Store.Country = ‘France’
Time (Quarter)
Time
Q2 27 14 11 30 and Time.Quarter= ‘Q1’ or ‘Q2’
Q2 27 14 11 30
Q3 26 12 35 32 games DVDs
books CDs
Q4 14 20 47 31
Product (Category)
games DVDs
books CDs
Product (Category)
24
OLAP Operations – Summary
Operation Purpose Description
Slice Focus attention on a Replace a dimension with a
subset of dimensions single member value or with a
summary of its measure values
Dice Focus attention on a Replace a dimension with a
subset of member values subset of members
Drill-down Obtain more detail about Navigate from a more general
a dimension level to a more specific level
Roll-up Summarize details about Navigate from a more specific
a dimension level to a more general level
Pivot Present data in a Rearrange the dimensions in a
different order data cube
25
Data Warehouse Architectures
Top Down
• Enterprise data warehouse
• Higher integration levels
• Logically centralized
• Larger project scope
Bottom Up
• Independent data marts
• Lower integration levels
• Logically decentralized
• Smaller project scope
26
Bottom-up Architecture
User
Data mart tier
departments
Operational
database
Transformation
process
Data mart
Operational
database
External
data source
Data mart
27
Top-Down Architecture
Data warehouse User
Data mart tier
server departments
Operational
database Staging Extraction
Area
process
Transformation
process
Detailed and
summarized data
EDM
External
data source Data warehouse
Data mart
28
General Architecture
Internal Data staging Metadata
sources
OLAP tools
Enterprise Reporting
ETL OLAP tools
Operational data
process warehouse server
databases
Statistical
tools
External
sources Data marts
Data mining
tools
35