Data Warehouse Concepts: Quách Đình Hoàng Hoangqd@hcmute - Edu.vn

Download as pptx, pdf, or txt
Download as pptx, pdf, or txt
You are on page 1of 35

DATA WAREHOUSE

CONCEPTS

Quách Đình Hoàng


[email protected]
Outline
• Motivation for data warehouse
• Data warehouse definition
• OLTP vs. OLAP
• Multidimensional Model
• OLAP Operations
• Data warehouse architectures

2
Motivation for data warehouse
• Organizations face increasingly complex challenges
to achieve operational goals so need analysis tools for
decision support
• Traditional operational or transactional databases do
not satisfy the requirements for data analysis
– Designed/optimized to support daily business operations;
primary concern: concurrent access and recovery
techniques to guarantee data consistency
– Contain detailed data, do not include historical data, and
perform poorly for complex queries that involve many
tables or aggregate large volumes of data
3
Motivation for data warehouse
• To analyze the behavior of an organization, data from
several operational systems must be integrated
– Difficult to accomplish due to many differences in data
definition and content
• Data warehouse address requirements of decision-
making
– Populated from operational databases and external data
sources
– Integrated and transformed data
– Optimized for reporting and periodic integration

4
Data Warehouse
• A data warehouse is a copy of transaction data
specifically structured for query and analysis.
Ralph Kimball

• A data warehouse is a collection of


– subject-oriented,
– integrated,
– time-variant, and
– nonvolatile
data to support data management decisions
Bill Inmon
5
Data Warehouse
• Subject-oriented: Organized to analyze a particular
subject area, for example, "sales“.
• Integrated: many transformations to unify source
data from independent data sources (units of measure,
data formats, naming conventions)
• Time-variant: historical data, snapshots of business
processes captured at different points in time
• Nonvolatile: new data are appended periodically,
existing data is not changed

6
Data Warehouse in Knowledge
Discovery (KDD) Process
Pattern Evaluation

Data Mining

Task-relevant Data

Data Warehouse Selection

Data Cleaning

Data Integration

Databases 7
Data Warehouse in Business Intelligence
Increasing potential
to support
business decisions End User
Decision
Making

Data Presentation Business


Analyst
Visualization Techniques
Data Mining Data
Information Discovery Analyst

Data Exploration
Statistical Summary, Querying, and Reporting

Data Preprocessing/Integration, Data Warehouses


DBA
Data Sources
Paper, Files, Web documents, Scientific experiments, Database Systems
8
OLTP vs. OLAP
• Operational databases (online transaction
processing systems or OLTP), are not suitable for
data analysis
– Contain detailed data, do not include historical data,
perform poorly for complex queries due to normalization
• Online analytical processing (OLAP): Allows
decision-making users to perform interactive analysis
of data

9
OLTP vs. OLAP
Transaction
processing (OLTP)
• Primary data from
transactions
• Daily operations and
short term decisions

Business intelligence
processing (OLAP)
• Transformed secondary
data
• Medium and long-term
decisions

10
OLTP vs. OLAP
Characteristic Operational Data Warehouse (OLAP)
Database (OLTP)
Currency Current Historical
Details level Individual Individual and summary
Orientation Process Subject
Records per request Few Thousands
Normalization level Mostly normalized Normalization relaxed
Update level Highly volatile Mostly refreshed (non
volatile)
Data model Relational Relational (star schemas)
and multidimensional (data
cubes)

11
OLTP vs. OLAP Schema
Comparison
Operational database Data warehouse
Manages Store
Item StoreId
ItemId StoreManager
Employee ItemName StoreStreet
ItemUnitPrice StoreCity
EmpNo
ItemBrand StoreSales StoreState
EmpFirstName StoreZip
ItemCategory
EmpLastName StoreNation
... DivId
ItemSales Sales
DivName
SalesNo
DivManager
SalesUnits
SalesDollar
Takes Customer SalesCost
TimeDim
CustId TimeNo
Product CustName TimeSales TimeDay
Customer CustPhone
Order ProdNo TimeMonth
CustNo CustStreet CustSales TimeQuarter
ProdName
OrdNo CustCity TimeYear
CustFirstName Places Contains ProdQOH
OrdDate CustState TimeDayOfWeek
CustLastName ... CustZip TimeFiscalYear
... ...
CustNation
Qty

12
Multidimensional Model
• DWs and OLAP use a multidimensional view of data
• Represented as a data cube or an hypercube
– Dimensions: Perspectives for analyzing data
– Cells (facts): Contain measures, values that are to be
analyzed
Milan 24 18 28 14
Rome 33 25 23 25
Nice 12 20 24 33
Paris
Q1 21 10 18 35 measure
Time (Quarter)

values
Q2 27 14 11 30
dimensions
Q3 26 12 35 32

Q4 14 20 47 31
games DVDs
books CDs
13
Product (Category)
Hierarchies
• Data granularity: level of detail at which measures are
represented for each dimension of the cube
• Data analyzed at different granularities (abstraction
levels)
• Hierarchies relate low-level (detailed) concepts to
higher-level (general concepts)
– Example: Store – City – Region/Province – Country
• Given two related levels in a hierarchy, lower level is
called child, higher level is called parent
• Instances of these levels are called members
14
Hierarchies
• Example
– Hierarchies of
the Product,
Time, and
Customer
dimensions

15
Hierarchies
• Members of the hierarchy Store  City 
Region/Province  Country
Store dimension All

Country level France ... Italy

Region/ Ile-de-France Provence-Alpes- Lazio Lombardy


Province level Côte d'Azur ... ...

City level Paris Nice Rome Milan

Store level Store 1 Store 2 Store 3 ... Store 10 Store 11 Store 12

16
Measure Aggregation and Summarizability
• Measures are aggregated when using hierarchies for
visualizing data at different abstraction levels
• Summarizability refers to the correct aggregation of
cube measures along dimension hierarchies
• Summarizability conditions:
– Disjointness of instances: the grouping of instances in a
level with respect to the parent in the next level must result
in disjoint sets
– Completeness: all instances are included in the hierarchy
and each instance is related to one parent in the next level
– Correct use of aggregation functions (“measure type”
condition): Type of measures determine the kind of
aggregation functions that can be applied. 17
Measure Classification
• Each measure is associated to an aggregation function
that combines several measure values into a single one
• Measures can be classified according to the way they
can be aggregated:
– Additive measures (flow or rate measures): Can be meaningfully
summarized using addition along all dimensions
• E.g., sales amount can be summarized when the hierarchies in Store,
Time, and Product dimensions are traversed
– Semi-additive measures (stock or level measures): Can be meaningfully
summarized using addition along some (not all) dimensions
• E.g., inventory quantities, can be aggregated in the Store dimension,
but cannot be aggregated in the Time dimension
– Non-additive measures (value-per-unit measures): Cannot be
meaningfully summarized using addition along any dimension
• E.g., item price, cost per unit, exchange rate
18
Another Measure Classification
• Distributive measures: defined by an aggregation
function that can be computed in a distributed way
– The result of applying the distributive function to the whole
data set is the same as the result of applying it (not
necessarily the same) to the n aggregated values of n subsets.
– E.g., sum, min, max, and count (distinct count is not)
• Algebraic measures: defined by an aggregation
function that can be expressed as a scalar function of
distributive ones
– E.g., average (can be computed by sum and count)
• Holistic measures: Cannot be computed from other
sub-aggregates.
– E.g., median, mode, and rank
19
OLAP Operations: Roll up
• Transforms detailed measures into
summarized ones when one moves up in a
hierarchy
Milan 24 18 28 14
Rome 33 25 23 25 Italy 57 43 51 39
Nice 12 20 24 33 France
Paris Q1 33 30 42 68

Time (Quarter)
Q1 21 10 18 35
Roll-up to the Country level
Time (Quarter)

Q2 27 14 11 30
Q2 27 14 11 30
Q3 26 12 35 32
Q3 26 12 35 32
Q4 14 20 47 31
Q4 14 20 47 31
games DVDs
games DVDs books CDs
books CDs Product (Category)
Product (Category)

20
OLAP Operations: Drill down
• Opposite to the roll-up operation, i.e., it moves from a
more general level to a detailed level in a hierarchy
Milan 24 18 28 14 Milan 8 6 9 5
Rome 33 25 23 25 Rome 10 8 11 8
Nice Nice 4 7 8 10
12 20 24 33
Paris Paris
Jan 7 2 6 13
Q1 21 10 18 35 Drill-down to the
Time (Quarter)

Month level

Time (Quarter)
Q2 27 14 11 30 Feb 8 4 8 12

Q3 26 12 35 32 Mar 6 4 4 10

Q4 14 20 47 31 ... ... ... ... ...

games DVDs Dec 4 4 16 7


books CDs games DVDs
Product (Category) books CDs
Product (Category)
21
OLAP Operations: Pivot or Rotate
• Rotates the axes of a cube to provide an
alternative presentation of the data
Milan 24 18 28 14
Rome 33 25 23 25 DVDs 35 30 32 31
Nice 12 20 24 33 CDs 18 11 35 47
games 10 14 12 20
Paris
books
Q1 21 10 18 35
Time (Quarter)

Paris 21 27 26 14

Store (City)
Q2 27 14 11 30
Pivot Nice 12 14 11 13
Q3 26 12 35 32
Rome 33 28 35 32
Q4 14 20 47 31
Milan 24 23 25 18
games DVDs
Q1 Q2 Q3 Q4
books CDs
Product (Category) Time (Quarter)

22
OLAP Operations: Slice
• Performs a selection on one dimension of a
cube, resulting in a subcube
Milan 24 18 28 14
Rome 33 25 23 25 Q1 21 10 18 35

Time (Quarter)
Nice 12 20 24 33
Paris Q2 27 14 11 30
Q1 21 10 18 35
Time (Quarter)

Q3 26 12 35 32
Q2 27 14 11 30 Slice on Store.City = ‘Paris’ Q4 14 20 47 31
Q3 26 12 35 32 games DVDs
Q4 14 20 47 31 books CDs
Product (Category)
games DVDs
books CDs
Product (Category)

23
OLAP Operations: Dice
• Defines a selection on two or more
dimensions, thus again defining a subcube
Milan 24 18 28 14
Rome 33 25 23 25
Nice 12 20 24 33 Nice 12 20 24 33
Paris Paris

(Quarter)
Q1 21 10 18 35 Q1 21 10 18 35
Dice on Store.Country = ‘France’
Time (Quarter)

Time
Q2 27 14 11 30 and Time.Quarter= ‘Q1’ or ‘Q2’
Q2 27 14 11 30
Q3 26 12 35 32 games DVDs
books CDs
Q4 14 20 47 31
Product (Category)
games DVDs
books CDs
Product (Category)

24
OLAP Operations – Summary
Operation Purpose Description
Slice Focus attention on a Replace a dimension with a
subset of dimensions single member value or with a
summary of its measure values
Dice Focus attention on a Replace a dimension with a
subset of member values subset of members
Drill-down Obtain more detail about Navigate from a more general
a dimension level to a more specific level
Roll-up Summarize details about Navigate from a more specific
a dimension level to a more general level
Pivot Present data in a Rearrange the dimensions in a
different order data cube

25
Data Warehouse Architectures
Top Down
• Enterprise data warehouse
• Higher integration levels
• Logically centralized
• Larger project scope

Bottom Up
• Independent data marts
• Lower integration levels
• Logically decentralized
• Smaller project scope

26
Bottom-up Architecture
User
Data mart tier
departments

Operational
database
Transformation
process

Data mart

Operational
database

External
data source

Data mart

27
Top-Down Architecture
Data warehouse User
Data mart tier
server departments

Operational
database Staging Extraction
Area
process
Transformation
process

Operational Data mart


database

Detailed and
summarized data
EDM
External
data source Data warehouse
Data mart

28
General Architecture
Internal Data staging Metadata
sources
OLAP tools

Enterprise Reporting
ETL OLAP tools
Operational data
process warehouse server
databases

Statistical
tools

External
sources Data marts
Data mining
tools

Data Back-end Data warehouse OLAP Front-end 29


sources tier tier tier tier
General Architecture
• Data sources
– Operational databases
– Other internal or external sources of information (e.g. files)
• Back-end tier
– Extraction-Transformation-Loading (ETL) tools for manipulating data
from sources
– Data staging area: Intermediate database where manipulation is done
• OLAP tier
– OLAP Server: Supports multidimensional data and operations
• Front-end tier: Deals with data analysis and visualization
– Composed of OLAP tools, reporting tools, statistical tools, data-mining
tools, …
30
Extraction-Transformation-Loading
• Extraction: Gathers data from multiple heterogeneous data
sources
– May be operational databases or files in various formats
– May be internal or external to the organization
– Uses APIs such as ODBC, JDBC, … for achieving interoperability
• Transformation: Modifies data to conform to the data
warehouse format
– Cleaning: Removes errors, inconsistencies, format transformation
– Integration: Reconciles data from different sources
– Aggregation: Summarizes data according to the granularity (level of
detail) of the DW
• Loading: Feeds the DW with transformed data
– Also includes refreshing the data warehouse at a specified frequency
31
Data Warehouse Tier
• Enterprise data warehouse: Centralized DW that encompasses
all areas in an organization
• Data mart: Specialized DW targeted to a particular functional
area or user group
– Their data can be derived from the enterprise DW or collected from
data sources
• Metadata repository: Describes the content of the DW
– Business metadata: Meaning (semantics) of data, organization rules,
policies, constraints, …
– Technical metadata: How data is structured/stored in the computer
• Data sources, data warehouse, and data marts: logical and physical
schemas, security information, monitoring information …
• ETL process: Data lineage (trace to sources), rules, defaults, refresh and
purging rules, algorithms for summarization, … 32
OLAP Tier
• OLAP servers that provides multidimensional view
from DWs and data marts
– Can be ROLAP, MOLAP, or HOLAP
• Most database products provide OLAP extensions
and related tools for manipulating cubes
• However, no standardized language for querying data
cubes
– Oracle uses Java and query language OLAP DML
– SQL Server uses .NET and query language MDX
• XMLA (XML for Analysis) aims at providing a
common language for exchanging multidimensional
data 33
Front-End Tier
• OLAP tools: Allow interactive exploration and
manipulation of the warehouse data
– Facilitate formulation of ad hoc queries (no prior
knowledge of them)
• Reporting tools: Enable production, delivery and
management of reports (paper and web-based)
– Use predefined queries
• Statistical tools: Used to analyze and visualize the
cube data using statistical methods
• Data-mining tools: Allow users to analyze data to
discover patterns, trends, enable predictions
34
Summary
• Motivation for data warehouse
• Data warehouse definition
• OLTP vs. OLAP
• Multidimensional Model
• OLAP Operations
• Data warehouse architectures

35

You might also like