Data Warehousing
Bibliographic reference:
M. Golfarelli, S. Rizzi. Data Warehouse
Design: Modern Principles and
Methodologies. McGraw-Hill, 2009.
2
Enterprise Databases
Until the mid-1980s, enterprise databases stored only
operational data, i.e., data created by business
operations involved in daily management processes
3
Evolution of Information Systems
The role of information systems has changed radically since the early seventies, when they first came into common use.
Indeed, from tools aimed only at improving the efficiency of business processes, they have become central elements of organizations, able to completely change the structure of those processes.
Twofold role of
computer science
5
Business intelligence
To allow managers to perform powerful and flexible analyses, an appropriate hardware and software infrastructure must be defined, consisting of:
Dedicated Hardware
Network Infrastructures
DBMS
Back-end software
Front-end software
6
From data to Information
Information is an important enterprise asset, needed to effectively plan and control business activities.
Data is the raw material that information systems transform into information, just as production systems transform semi-finished products.
data ≠ information
7
From data to Information
Every enterprise must have quick, comprehensive access to
the information required by decision-making processes. This
strategic information is extracted mainly from the huge amount
of operational data stored in enterprise databases by means of
a progressive selection and aggregation process
[Figure: information pyramid – operational data at the base (high volume), selected information and reports in the middle, strategic knowledge at the top (high value)]
Decision Support Systems
Decision Support Systems (DSSs) started to become popular in the eighties.
A DSS is a set of expandable, interactive IT techniques and tools designed for processing and analyzing data and for supporting managers in decision making.
Role of DSSs:
  In the past          In the future
  Describe the past    Anticipate the future
  Reduce costs         Increase profits
  Describe problems    Suggest changes
A typical scenario….
.. is that of a large company, with numerous branches, whose
managers want to evaluate the contribution of each branch to
the overall business performance of the company
11
DATA WAREHOUSE: an information repository that integrates and reorganizes data collected from sources of various kinds and makes them available for analysis and evaluations aimed at planning and decision-making.
Data Warehousing: a collection of methods, techniques, and tools used to support knowledge workers (senior managers, directors, managers, and analysts) in conducting data analyses.
OLTP and OLAP
Mixing "analytical" and "transactional" queries leads to inevitable delays that leave users of both categories dissatisfied.
13
Some Areas where DW technologies are
normally adopted
14
Companies' complaints
We have heaps of data, but we cannot access it!
How can people playing the same role achieve
substantially different results?
We want to select, group, and manipulate data in
every possible way!
Show me just what matters!
Everyone knows that some data is wrong!
15
Data Warehousing Characteristics
16
Data Warehouse
Data warehouses are placed right in the middle of this process and act as repositories for data, ensuring that the requirements listed above can be fulfilled.
17
...subject oriented
[Figure: operational applications (ticket reservation, admission management, analysis results) contrasted with analysis subjects (e.g., doctors, region)]
In operational databases the emphasis is on applications; in a data warehouse the emphasis is on subjects.
...integrated and consistent
[Figure: operational data and external data → extraction → loading → DW access]
Summarizing
23
Architectures: requirements
Separation: Analytical and transactional processing should be
kept apart as much as possible.
Scalability: Hardware and software architectures should be
easy to upgrade as the data volume (which has to be
managed and processed) and the number of users’
requirements (which have to be met) progressively increase.
Extensibility: The architecture should be able to host new
applications and technologies without redesigning the whole
system.
Security: Monitoring accesses is essential because of the
strategic data stored in data warehouses.
Administerability: Data warehouse management should not be
overly difficult.
24
Single-Layer Architecture
Its goal is to minimize the amount of data stored. The only physically available layer is the source layer (operational data); in this case, the data warehouse is virtual.
* The name highlights a separation between physically available sources and data warehouses, but in fact it consists of four subsequent data flow stages.
Two-Layer Architecture
The data marts populated from a primary data warehouse are
often called dependent. They are very useful for data warehouse
systems in midsize to large enterprises because
they are used as building blocks while incrementally developing
data warehouses;
they mark out the information required by a specific group of users
to solve queries;
they can deliver better performance because they are smaller than
primary data warehouses.
[Figure: layered architecture – operational sources, ETL tools, data staging layer, reconciled data layer with meta-data, ETL tools (loading), data warehouse layer with data marts, and analysis tools (reporting tools, OLAP tools, what-if analysis tools, data mining tools)]
The reconciled data layer materializes operational data obtained after integrating and cleansing source data. As a result, those data are integrated, consistent, correct, and detailed.
Three-Layer Architecture
The main advantage of the reconciled data
layer is that it creates a common reference
data model for a whole enterprise. At the
same time, it sharply separates the problems
of source data extraction and integration from
those of data warehouse population.
However, reconciled data leads to more
redundancy of operational source data.
30
An additional classification (1)
In the independent data marts architecture, different data marts are separately designed and built in a nonintegrated fashion (it is an extreme case of independent data marts in the two-layer architecture, in the sense that there may be no integration at all among different data marts).
The bus architecture (recommended by Ralph Kimball) is similar to the preceding architecture, but adopts a basic set of conformed dimensions as a common design guideline. This ensures logical integration of data marts and an enterprise-wide view of information.
[Figure: independent data marts with their own meta-data, accessed by reporting, OLAP, what-if analysis, and data mining tools]
An additional classification (2)
The hub-and-spoke architecture pays much attention to scalability and extensibility, and to achieving an enterprise-wide view. Atomic, normalized data is stored in a reconciled layer that feeds a set of data marts containing summarized data in multidimensional form.
The centralized architecture, recommended by Bill Inmon, can be seen as a particular implementation of the hub-and-spoke architecture, where the reconciled layer and the data marts are collapsed into a single physical repository.
[Figure: operational and external data → ETL tools → reconciled data with meta-data → loading → data marts → reporting, OLAP, what-if analysis, and data mining tools]
An additional classification (3)
The federated architecture is adopted in dynamic contexts where preexisting
data warehouses/data marts are to be noninvasively integrated. Each data
warehouse/data mart is either virtually or physically integrated with the others,
leaning on a variety of advanced techniques such as distributed querying,
ontologies, and meta-data interoperability
33
ETL
ETL processes extract, integrate, and clean data from operational sources to eventually feed the data warehouse.
At an abstract level, ETL processes produce a single, high-quality, detailed data source that in turn feeds the DW (reconciliation).
Depending on the architecture, this data source is physical (reconciled data layer) or virtual. In the former case, ETL processes are physically connected to the reconciled layer; in the latter, to the (primary) DW or to the data marts.
Reconciliation takes place on two occasions: when a data warehouse is populated for the first time, and every time the data warehouse is updated.
ETL consists of four phases:
extraction
cleansing (or cleaning)
transformation
loading
Note: Cleansing and transformation are not always considered as separate phases in the literature. Here we take cleansing to be essentially devoted to correcting values, whereas transformation is mainly devoted to correcting formats.
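A minimal SQL sketch of the four phases, assuming hypothetical tables (SRC_SALES as operational source, STG_SALES as staging area, ETL_LOG, DIM_PRODUCT, DIM_STORE, DIM_DATE, FT_SALES) that are not part of the original slides:

-- Extraction: copy new records from the operational source into a staging area
INSERT INTO STG_SALES (saleId, productCode, storeCode, saleDate, quantity, unitPrice)
SELECT saleId, productCode, storeCode, saleDate, quantity, unitPrice
FROM   SRC_SALES
WHERE  saleDate > (SELECT MAX(loadedUpTo) FROM ETL_LOG);

-- Cleansing: correct invalid values (here, negative quantities treated as data-entry errors)
UPDATE STG_SALES
SET    quantity = ABS(quantity)
WHERE  quantity < 0;

-- Transformation and loading: convert formats, derive the receipts measure,
-- and populate the fact table by looking up surrogate keys in the dimension tables
INSERT INTO FT_SALES (keyP, keyS, keyD, quantity, receipts)
SELECT dp.keyP, ds.keyS, dd.keyD, s.quantity, s.quantity * s.unitPrice
FROM   STG_SALES s
JOIN   DIM_PRODUCT dp ON dp.productCode = s.productCode
JOIN   DIM_STORE   ds ON ds.storeCode   = s.storeCode
JOIN   DIM_DATE    dd ON dd.fullDate    = s.saleDate;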
Operational data
After standardization:
  firstName: John
  lastName: White
  address: 10, Downing Street
  ZIPcode: TWA1 2AA
  city: London
  country: United Kingdom

After correction:
  firstName: John
  lastName: White
  address: 10, Downing Street
  ZIPcode: SWA1 2AA
  city: London
  nation: United Kingdom
Loading
[Figure: loading of operational data and external data into the DW]
Multidimensional Model
It is the foundational principle for the representation and querying of data in a DW.
Enterprise-specific facts are represented as cubes where:
each cell contains numerical measures that quantify the fact from various points of view;
each axis represents a dimension of interest for the analysis;
each dimension can be associated with a hierarchy of dimensional attributes used to aggregate the data stored in the cube.
41
Three-dimensional cube modeling sales
[Figure: a cube with axes product, store, and date; the cell at (product = Shiny, store = EverMore, date = 5/4/2008) contains the value 10, meaning that 10 packs of Shiny were sold on 5/4/2008 in the EverMore store]
Hierarchies
Each dimension is the root of a hierarchy of dimensional attributes. For example, products roll up to types and categories (Shiny, Bleachy, Bighty → cleaner → house cleaning; CleanHand, Scent → soap; DrinkMe, Coky → drink → food), and stores (e.g., SmartMart, Profit, EvenMore) roll up to cities (Miami, Orlando), the state Florida, and 'all stores'.
[Figure: individual sale events (e.g., at store 'EvenMore' on date '5/4/2008') are summed (Σ) over product, store, and date up to category = 'food', state = 'Florida', and year = 2008, giving Quantity = 2000]
Every macro fact measure is the sum of its component fact measures.
Aggregation
Daily sales per store:
              EvenMore  EvenMore2  SmartMart
  1/1/2000       –          –          –
  2/1/2000      10         15          5
  3/1/2000      20          –          5
  ..........
  1/1/2001       –          –          –
  2/1/2001      15         10         20
  3/1/2001      20         20         25
  ..........
  1/1/2002       –          –          –
  2/1/2002      20          8         25
  3/1/2002      20         12         20
  ..........

Aggregation by month:
                  EvenMore  EvenMore2  SmartMart
  January 2000      200        180        150
  February 2000     180        150        120
  March 2000        220        180        160
  ..........
  January 2001      350        220        200
  February 2001     300        200        250
  March 2001        310        180        300
  ..........
  January 2002      380        200        220
  February 2002     310        200        250
  March 2002        300        160        280
  ..........

Aggregation by year:
          EvenMore  EvenMore2  SmartMart
  2000      2400       2000       1600
  2001      3200       2300       3000
  2002      3400       2200       3200
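A minimal SQL sketch of this roll-up, assuming a hypothetical daily fact table DAILY_SALES(store, saleDate, quantity) not shown in the slides; the same pattern yields the monthly and yearly tables above:

-- Roll up daily sales to months: group by store and by the month level of the date hierarchy
SELECT store,
       EXTRACT(YEAR  FROM saleDate) AS saleYear,
       EXTRACT(MONTH FROM saleDate) AS saleMonth,
       SUM(quantity)                AS quantity
FROM   DAILY_SALES
GROUP  BY store, EXTRACT(YEAR FROM saleDate), EXTRACT(MONTH FROM saleDate);

-- Roll up further to years: every macro measure is the sum of its component measures
SELECT store,
       EXTRACT(YEAR FROM saleDate) AS saleYear,
       SUM(quantity)               AS quantity
FROM   DAILY_SALES
GROUP  BY store, EXTRACT(YEAR FROM saleDate);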
Accessing Data Warehouses
After cleansing, integrating, and transforming data, you should determine how to get the best out of it in terms of information.
The main approaches for end users to query data warehouses are:
reports: do not require any IT knowledge;
OLAP: requires users to think in a multidimensional way and to know the interface of the graphical tool used;
dashboards: GUIs that display a limited amount of relevant data in a brief and easy-to-read format;
data mining: applies algorithms that identify hidden associations between items and make them visible to the user.
47
Reports
Reports are devoted to users who need to access information structured at predetermined time intervals.

  receipts (K€)     October 2001  September 2001  August 2001
  clothes                80            100             50
  food                   20             40             10
  furniture              50              5             10
  perfumes               25             35             20
  house cleaning         15             20              5

[Figure: bar chart of receipts per category – clothes, food, furniture, perfumes, house cleaning, free time]
OLAP
OLAP is the most popular way to exploit the information in a data warehouse. It gives end users, whose analysis needs are not easy to define beforehand, the opportunity to analyze and explore data interactively on the basis of the multidimensional model.
While users of reporting tools essentially play a passive role, OLAP users are able to actively start a complex analysis session, where each step is the result of the outcome of the preceding steps:
in-depth knowledge of data by users is required;
complex queries have to be allowed;
a design for users not familiar with IT has to be guaranteed.
49
OLAP: session
An OLAP session consists of a navigation path that corresponds to an analysis process for facts according to different viewpoints and different levels of detail. This path is turned into a sequence of queries, which are often not issued directly but expressed differentially with reference to the previous query.
At every step of the session an OLAP operator is applied, transforming the latest query into a new one.
Query results are multidimensional. OLAP tools typically use tables to display data, with multiple headers, colors, and other features to highlight data dimensions.
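A minimal sketch of a short OLAP session over a hypothetical star schema (a SALES fact with a Revenue measure and PRODUCT, DATE_DIM dimension tables; all names are assumptions for illustration); each query is derived from the previous one by applying an operator:

-- Step 1: revenue by product category and month
SELECT p.category, d.month, SUM(s.revenue) AS revenue
FROM   SALES s
JOIN   PRODUCT  p ON p.keyP = s.keyP
JOIN   DATE_DIM d ON d.keyD = s.keyD
GROUP  BY p.category, d.month;

-- Step 2 (roll-up along the time hierarchy): month is replaced by year
SELECT p.category, d.year, SUM(s.revenue) AS revenue
FROM   SALES s
JOIN   PRODUCT  p ON p.keyP = s.keyP
JOIN   DATE_DIM d ON d.keyD = s.keyD
GROUP  BY p.category, d.year;

-- Step 3 (slice-and-dice): the previous query is restricted by a selection predicate
SELECT p.category, d.year, SUM(s.revenue) AS revenue
FROM   SALES s
JOIN   PRODUCT  p ON p.keyP = s.keyP
JOIN   DATE_DIM d ON d.keyD = s.keyD
WHERE  d.year = 2006 AND p.category = 'Electronics'
GROUP  BY p.category, d.year;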
The Virtual-mall example
51
OLAP: operators
roll-up
roll-up
(time hierarchy)
53
OLAP: operators
roll-up
(removing customer
dimension)
54
OLAP: operators
drill-down
drill-down
56
OLAP: operators
drill-down
(adding a dimension)
57
OLAP: operators
slice-and-dice
slice-and-dice
Slicing based on the Year = '2006' predicate
59
OLAP: operators
slice-and-dice
Year =‘2006’ AND
Category =‘Electronics’ AND
Revenue>80 AND
CustomerRegion =‘Northwest’
60
OLAP: operators
pivoting
pivoting
62
OLAP: operators
pivoting
63
OLAP: operators
drill-across
Drilling across the sales cube (Revenue measure) and the promotion cube (Discount measure)
Dashboards
Dashboards can provide a real-time overview of the trends for a specific phenomenon or for many phenomena that are strictly connected with each other.
The term is a visual metaphor: the group of indicators in the GUI is displayed like a car dashboard.
Dashboards are often used by senior managers who need a quick way to view information. However, to conduct and display very complex analyses of phenomena, dashboards must be paired with analysis tools.
Dashboards are nothing but performance indicators behind GUIs. Their effectiveness is due to a careful selection of the relevant measures, while adhering to data warehouse information quality standards.
66
Data mining: association rules
Association rules identify items that tend to occur together, e.g., {shoes} ⇒ {socks}.
Data mining: clustering
Given a population of objects, seen as points in a multidimensional space in which each dimension corresponds to a particular feature, clustering refers to grouping these objects into a reduced number of sets (clusters) that better characterize the population. Clustering identifies similarities between items.
Main applications:
Segmenting customers into
categories
The evaluation of clinical
cases on the basis of
symptoms
Epidemiological analysis
71
Data mining: decision trees
72
Data mining: decision trees
Main Applications:
Assessment of customer risk categories for companies
granting mortgages, loans and insurances
73
Data mining: time series
The goal here is to identify recurring or atypical patterns in complex data sequences.
Applications:
identifying associated growth patterns in shares on stock exchanges;
revealing anomalies in monitoring systems;
studying the relationships between different time series;
identifying companies with similar development models;
analyzing navigation routes on web sites.
[Figure: example time series plotted over time]
74
ROLAP
Stands for Relational OLAP: an implementation based on relational
DBMSs
Motivated by the huge amount of literature written about the
relational model, the broadly available corporate experience with
relational database usage and management, and the top
performance and flexibility standards of relational DBMSs. However,
the relational model does not include the concepts of dimension,
measure, and hierarchy: you must create specific types of schemata
that allow one to represent multidimensional schemata in terms of
relational ones => Star Schema.
Performance problems (costly join operations over large tables)
=>denormalization.
Since ROLAP uses a relational database, it requires more
processing time and/or disk space to perform some of the tasks that
multidimensional databases are designed for. However, ROLAP
supports larger user groups and greater amounts of data.
As is typical of relational databases, some queries are created and
stored in advance (in materialized views)
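A minimal sketch of such a pre-stored aggregate, assuming a hypothetical SALES fact table and DATE_DIM dimension table, and a DBMS that supports CREATE MATERIALIZED VIEW (the exact syntax is vendor-dependent):

-- Materialize monthly revenue per product so that frequent OLAP queries
-- can read the much smaller aggregate instead of the detailed fact table
CREATE MATERIALIZED VIEW MONTHLY_SALES AS
SELECT s.keyP,
       d.year,
       d.month,
       SUM(s.revenue)  AS revenue,
       SUM(s.quantity) AS quantity
FROM   SALES s
JOIN   DATE_DIM d ON d.keyD = s.keyD
GROUP  BY s.keyP, d.year, d.month;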
ROLAP
Data processing may take place within the database system, a
middleware server, or the client.
In a two-tiered architecture, the user submits an SQL query to the
database and receives back the requested data.
In a three-tiered architecture, the user submits a request for
multidimensional analysis and the ROLAP engine converts the request
to SQL for submission to the database. Then, the engine converts the
resulting data from SQL to a multidimensional format before it is
returned to the client for viewing.
[Figure: three-tiered ROLAP architecture – OLAP client (front-end); middleware hosting the ROLAP engine and meta-data; relational server (back-end)]
MOLAP
The big advantage of the MOLAP approach with respect to
the ROLAP one is that multidimensional operations can be
simply and naturally realized, without the need of resorting
to joins. For this reason, MOLAP system performance is
excellent.
However, no standard for the logical model at the basis of the MOLAP approach currently exists, and the various implementations differ considerably from one another: they simply share the usage of optimization techniques specifically designed for sparsity management.
There is no standard query language either: each vendor tends to propose a proprietary one (although Microsoft Multidimensional Expressions is emerging as a de facto standard).
The loading process is often slow.
Some products may handle only up to ten dimensions with limited cardinality.
79
MOLAP: The problem of sparsity
In the multidimensional model, a set of coordinates corresponds to a possible event even if that event did not actually take place.
Typically, the number of events that actually occurred is far smaller than the number of possible events (around 20%).
Keeping track of events that did not occur wastes resources and reduces system performance.
ROLAP: can store only the useful cells (occurred events).
MOLAP: calls for complex techniques to minimize the space required to keep track of events that did not occur.
80
MOLAP: The problem of sparsity
We can partition a cube into "chunks" and classify these as sparse or dense. MOLAP systems try to materialize mainly dense chunks.
83
A top-down approach
Analyze global business needs, plan how to develop a data warehouse, design it, and implement it as a whole.
It promises excellent results because it is based on a global picture of the goal to achieve, and in principle it ensures consistent, well-integrated data warehouses. However:
high cost estimates and long-term implementations discourage company managers from embarking on this kind of project;
analyzing and bringing together all relevant sources is a very difficult task;
it is extremely difficult to forecast the specific needs of every department involved in the project, which can bring the analysis process to a standstill;
since no prototype is going to be delivered in the short term, users cannot verify that the project is useful, so they lose trust and interest in it.
84
Bottom-up approach
In a bottom-up approach, data warehouses are built incrementally and several data marts are created iteratively. Each data mart is based on a set of facts that are linked to a specific company department and that can be of interest to a user subgroup.
It delivers concrete results in the short term, especially if coupled with quick prototyping.
It does not require large financial investments (at least, not from the very beginning of the project).
It allows designers to study only the issues related to the data mart at hand.
It provides immediate feedback to senior management on the effectiveness of the system under construction.
Attention on the project stays steadily high.
However, it yields only a partial view of the domain of interest!
85
Which data mart to realize first?
It should play a very strategic role in the company.
It should be a reference point for the whole data warehouse.
It should exploit consistent data that is already available.
[Figure: data marts DM1–DM5 built incrementally on top of data sources Source 1, Source 2, and Source 3]
[Figure: data mart design phases – the operational source schemata undergo analysis and reconciliation, producing the reconciled schema; conceptual design is refined against the workload; logical design, based on the workload, the data volume, and the chosen logical model, produces the logical schema; ETL procedures and the physical schema complete the design]
Facts
Facts are the concepts on which data mart end users base their decision-making process. Each fact describes a category of events taking place in the enterprise. Facts have to be identified in the requirement analysis phase.
Designers should also have clear ideas about fact dimensions. Focusing on the dimensions of a fact leads to the definition of fact granularity, i.e., the finest level of detail for the data represented in a data mart. Selecting a granularity level for fact representation is the result of a delicate compromise between two opposite needs: the need for maximum usage flexibility, which implies the same granularity level as operational sources, and the need for top performance, which implies a concise synthesis of data.
Every fact also needs a historical interval to be defined, that is, the period of time covered by the events stored in a data mart.
89
Typical facts of different application fields
90
Other requirements
The following requirements also need to be analyzed and specified:
requirements for logical and physical design (e.g., the space available to optimize data mart performance);
requirements for data staging design (frequency of data staging);
requirements for data quality (especially end users' expectations about it);
data warehouse architecture (type of architecture, number of layers, independent or dependent data marts, and reconciled layer materialization);
requirements on the data analysis applications to adopt (on the basis of the types of queries end users expect to issue to their data marts and the analytic reports IT staff members are usually required to draw up);
startup schedule;
training schedule.
Conceptual Modeling for DWs
It is well known that an accurate conceptual design is the fundamental requirement for building a well-documented database that completely meets the requirements.
While it is universally acknowledged that a DW relies (at the logical level) on the multidimensional model, there is no agreement on a methodology for conceptual modeling in DWs.
Although the Entity/Relationship (ER) model is widely used in organizations to document information systems based on the relational model (even though not always in a formal and completely understood way), it cannot be used to model a DW.
92
Conceptual Modeling for DWs
Indeed, the ER model is oriented toward queries that follow associations among data rather than summarize the data.
The basic ER model is not able to accurately highlight the distinctive features of the multidimensional model (e.g., it does not directly allow for easy modeling of hierarchies on dimensions).
It is not very efficient from the notational viewpoint.
UML class diagrams suffer from the same drawbacks.
93
UML-based Conceptual Modeling for DW: an example [Lujan-Mora-2006]
95
DFM: basic constructs
A fact is a concept relevant to decision-making processes. It typically models a set of events taking place within a company (e.g., sales, shipments, purchases, complaints). It is essential that a fact have dynamic properties, i.e., evolve in some way over time.
A measure is a numerical property of a fact and describes a
quantitative fact aspect that is relevant to analysis (e.g., each sale is
measured by the number of units sold, the unit price, and the total
receipts)
A dimension is a fact property with a finite domain and describes an
analysis coordinate of the fact (typical dimensions for the sales fact are
products, stores, and dates)
A fact expresses a many-to-many association between dimensions.
97
DFM: basic constructs
The general term dimensional attributes stands for the dimensions and
other possible attributes, always with discrete values, that describe
them (for example, a product dimension is described by its type, the
category it belongs to, its brand, the department in which it is sold)
A hierarchy is a directed tree whose nodes are dimensional attributes and whose arcs model many-to-one associations between dimensional attributes. It includes a dimension, positioned at the tree's root, and all of the dimensional attributes that describe it.
98
DFM: correspondence with ER diagrams
99
“Naming conventions”
All dimensional attributes in a fact schema must have
different names.
Similar names can be differentiated by qualifying
them with the name of the dimensional attribute that
comes before them in hierarchies.
For instance, BrandCity and storeCity.
Names of dimensional attributes should not explicitly
refer to the fact they belong to
For instance, shippedProduct and shipmentDate have to be
avoided
Attributes with the same meaning in different fact
schemata must have the same name.
100
Events and Aggregations
A primary event is a particular occurrence of a fact, identified by an n-ple made up of one value for each dimension. A value for each measure is associated with each primary event.
In reference to sales, for example, a possible primary event records that 10 packages of Shiny detergent were sold for total sales of $25 on 10/10/2008 in the SmartMart store.
Given a set of dimensional attributes (a pattern), each n-ple of their values identifies a secondary event that aggregates all corresponding primary events. Each secondary event is associated with one value per measure; such a value sums up all the values of the same measure in the corresponding primary events.
Thus, hierarchies define the ways you can aggregate primary events and effectively select them for the decision-making process; while the dimension in which a hierarchy takes root defines its finest aggregation granularity, the other dimensional attributes correspond to a gradually increasing granularity.
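A minimal sketch of how secondary events for the pattern {type, month, storeCity} could be computed from primary events, assuming a hypothetical star schema (SALES, PRODUCT, DATE_DIM, STORE) introduced only for illustration:

-- Each output row is a secondary event: its measure sums the measures
-- of all primary events sharing the same (type, month, storeCity) values
SELECT p.type, d.month, s.storeCity, SUM(f.quantity) AS quantity
FROM   SALES f
JOIN   PRODUCT  p ON p.keyP = f.keyP
JOIN   DATE_DIM d ON d.keyD = f.keyD
JOIN   STORE    s ON s.keyS = f.keyS
GROUP  BY p.type, d.month, s.storeCity;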
Events and Aggregation
[Figure: primary events at the (product, store, date) granularity are combined, via the aggregation operators, into secondary events at the (type, storeCity, month) granularity]
DFM: advanced constructs
A descriptive attribute is functionally determined by a dimensional attribute of a hierarchy and specifies a property of that dimensional attribute. Descriptive attributes are often tied to dimensional attributes by one-to-one associations and do not add useful levels of aggregation. A descriptive attribute can also be directly connected to a fact if it describes primary events, but it is neither possible nor interesting to use it to identify single events or to make calculations.
Note: In an ER representation of such a model, you can model descriptive attributes as attributes of the entities corresponding to dimensional attributes, or of the relationship modeling the fact.
DFM: advanced constructs
Some arcs can be marked as optional.
If the optionality is associated with a dimensional attribute A (such as diet in the figure), this means that A and all of its possible descendants may be undefined for one or more instances of the fact to which the arc is (indirectly) associated.
A coverage can be specified on two or more arcs that exit from the same dimensional attribute A. It can be:
total (T): every value of A is associated with at least one value of one of the child attributes of A;
partial (P): there are values of A for which all the children are undefined;
and
disjoint (D): each value of A is associated with a value of at most one of the children;
overlapping (O): there exist values of A linked to values of two or more children.
106
DFM: advanced constructs
A cross-dimensional
attribute is a descriptive
attribute whose value is
defined by the
combination of two or
more dimensional
attributes. For example,
a product Value Added
Tax (VAT) depends both
on the product category
and on the country
where the product is
sold.
109
DFM: advanced constructs
Multiple arc: allows many-to-many relations (instead of many-to-one relations) between dimensional attributes to be modeled. In the example, many different authors can write many different books.
Multiple arc entering a dimension: a fact is identified by a tuple containing one value for each dimension; in the example, the value for diagnosis is, in fact, a "group of diagnoses".
112
Additivity
In the sales fact schema, the numberOfCustomers measure (estimated through the number of receipts issued for a product on a certain date in a certain store) is non-additive along the product dimension: indeed, a single receipt refers to several products (possibly of different types). In this case it is also considered non-aggregable at all (missing labels in the schema).
116
Logical Models for Data Marts
Conceptual modeling does not depend on the logical model the designer has selected in the architecture design phase, but the topics related to logical modeling clearly do.
As already mentioned, multidimensional structures can be represented using two distinct logical models:
MOLAP (Multidimensional On-Line Analytical Processing) stores data using structures that are intrinsically multidimensional (e.g., multidimensional vectors).
ROLAP (Relational On-Line Analytical Processing) makes use of the well-known relational model to represent multidimensional data.
117
ROLAP: The star schema
In what follows we concentrate on ROLAP systems.
Multidimensional modeling in relational systems is based on the so-called star schema and its variants.
A star schema consists of:
a set of relations DT1, …, DTn called dimension tables, each corresponding to a dimension; every DTi features a primary (typically surrogate) key ki and a set of attributes describing its dimension at different aggregation levels;
a fact table FT referencing all the dimension tables; the FT primary key is the composition of the foreign keys k1, …, kn referencing the dimension tables, and FT also contains one attribute for every measure.
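A minimal DDL sketch of a star schema for the sales example; the specific attribute choices and types are assumptions, since the slides do not fix a concrete schema:

-- Dimension tables: surrogate key plus attributes at different aggregation levels
CREATE TABLE PRODUCT (
  keyP     INTEGER PRIMARY KEY,   -- surrogate key
  product  VARCHAR(50),
  type     VARCHAR(50),
  category VARCHAR(50)
);

CREATE TABLE STORE (
  keyS      INTEGER PRIMARY KEY,
  store     VARCHAR(50),
  storeCity VARCHAR(50),
  state     VARCHAR(50),
  country   VARCHAR(50)
);

CREATE TABLE DATE_DIM (
  keyD     INTEGER PRIMARY KEY,
  fullDate DATE,
  month    VARCHAR(20),
  year     INTEGER
);

-- Fact table: composite primary key made of the foreign keys, plus one column per measure
CREATE TABLE SALES (
  keyP     INTEGER REFERENCES PRODUCT(keyP),
  keyS     INTEGER REFERENCES STORE(keyS),
  keyD     INTEGER REFERENCES DATE_DIM(keyD),
  quantity INTEGER,
  receipts DECIMAL(10,2),
  PRIMARY KEY (keyP, keyS, keyD)
);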
Star schema: structure
119
Star schema: instances
Fact Table
Dimension Table
Dimension Table
Dimension Table
120
Star schema: comments
Dimension tables are completely denormalized (e.g., product → type, type → category):
a single join is sufficient to obtain all the data connected to a given dimension;
denormalization introduces high redundancy in the data (even though the redundancy is confined to the dimension tables, whose size is often negligible).
121
OLAP queries over star schemata
SELECT *
FROM SALES AS FT, PRODUCT AS DT1, STORE AS DT2, DATA AS DT3
WHERE FT.keyP = DT1.keyP AND FT.keyS = DT2.keyS AND
FT.keyD = DT3.keyD
A multidimensional view of data is obtained when you join the fact table to its
dimension tables
122
OLAP queries over star schemata
123
Snowflake schema
In a star schema, dimension tables are not in third normal form: for example, the STORE dimension table includes the transitive functional dependencies store → storeCity and storeCity → state. A snowflake schema (partially) normalizes dimension tables to remove some of these dependencies.
127
Snowflake schema normalization
When you decompose a dimension table to design a snowflake schema, you should check which attributes belong in the new relation: all the attributes that directly or transitively depend on the snowflaking attribute (that is, on the natural key of the new relation) must become part of the new relation.

Before snowflaking, STORE holds the whole hierarchy:
  STORE(keyS, store, storeCity, state, country, salesManager)
  keyS → store, store → storeCity, storeCity → state, state → country, store → salesManager

After snowflaking on storeCity:
  STORE(keyS, store, salesManager, keyC)
  keyS → store, store → keyC, store → salesManager
  CITY(keyC, storeCity, state, country)
  keyC → storeCity, storeCity → state, state → country
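A minimal DDL sketch of the snowflaked STORE dimension, matching the dependencies above (attribute types are assumptions):

-- CITY holds the attributes that depend, directly or transitively, on storeCity
CREATE TABLE CITY (
  keyC      INTEGER PRIMARY KEY,   -- surrogate key of the snowflaked relation
  storeCity VARCHAR(50),
  state     VARCHAR(50),
  country   VARCHAR(50)
);

-- STORE keeps only the attributes that depend on the store itself,
-- plus a foreign key to the snowflaked CITY relation
CREATE TABLE STORE (
  keyS         INTEGER PRIMARY KEY,
  store        VARCHAR(50),
  salesManager VARCHAR(50),
  keyC         INTEGER REFERENCES CITY(keyC)
);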
OLAP queries over snowflake schemata
130
Views
Views can be classified on the basis of the aggregation level (pattern) that characterizes them, for example:
v1 = {product, date, store}
v5 = {quarter, state}
[Figure: multidimensional grid (lattice) of patterns, e.g., {a,b} rolls up to {a',b} and {a,b'}, which roll up to {a'} and {b'}, down to the empty pattern {}]
132
Secondary views and aggregation
Consider the following example, in which the total receipts are calculated on the fly from the quantity sold and the unit price:

  type  product  quantity  unitPrice  receipts
  T1    P1          5         1.00       5.00
  T1    P2          7         1.50      10.50
  T2    P3          9         0.80       7.20
                                 total: 22.70

Aggregating by type with SUM(quantity) and AVG(unitPrice):

  type  quantity  unitPrice  receipts?
  T1       12        1.25      15.00
  T2        9        0.80       7.20
                        total:  22.20

Because of the way the values were aggregated, the correct value for total receipts cannot be obtained from the aggregated view (22.20 instead of 22.70).
To correctly reconstruct information in secondary views it may be necessary to introduce new measures (derived measures): in the example above, the only way to obtain receipts is to store them explicitly in the aggregated view.
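A minimal sketch, assuming a hypothetical detailed fact table SALES_DETAIL(type, product, quantity, unitPrice): the aggregate view stores receipts as a derived measure instead of trying to recompute it from the averaged unit price:

-- Wrong: receipts cannot be reconstructed as SUM(quantity) * AVG(unitPrice).
-- Right: compute receipts at the detailed level and store it in the aggregate view.
CREATE VIEW SALES_BY_TYPE AS
SELECT type,
       SUM(quantity)             AS quantity,
       AVG(unitPrice)            AS unitPrice,
       SUM(quantity * unitPrice) AS receipts   -- derived measure, stored explicitly
FROM   SALES_DETAIL
GROUP  BY type;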
Secondary views and aggregation
Support measures are necessary in the presence of non-distributive aggregate operators*, such as AVG.
[Figure: daily values (e.g., 110 on 31/12/1999) aggregated into the 1999 average of 113.33]
* Distributive operators allow you to calculate aggregate data from partially aggregated data. The SUM, MIN, and MAX operators belong to this category.
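A minimal sketch of the usual workaround, assuming a hypothetical detailed table DAILY_PRICES(year, month, day, unitPrice): the aggregate view stores SUM and COUNT as support measures so that AVG can be re-aggregated correctly at coarser levels:

-- The monthly view keeps the support measures needed to re-aggregate AVG
CREATE VIEW PRICES_BY_MONTH AS
SELECT year, month,
       SUM(unitPrice)   AS sumPrice,     -- support measure
       COUNT(unitPrice) AS countPrice    -- support measure
FROM   DAILY_PRICES
GROUP  BY year, month;

-- The yearly average is derived from the support measures,
-- not by averaging the monthly averages (which would be wrong in general)
SELECT year, SUM(sumPrice) / SUM(countPrice) AS avgPrice
FROM   PRICES_BY_MONTH
GROUP  BY year;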
Relational schemata and views
135
Relational schemata and views
The fact tables that correspond to the group-by sets where one or more dimensions are completely aggregated do not have foreign keys referring to those aggregated dimensions.
Relational schemata and views
A constellation schema optimizes the access to fact tables
because each of them includes only data at a specific
aggregation level.
138
Relational schemata and views
You can choose to replicate dimension tables for views.
Each dimension table will include only the set of attributes that
are valid for the aggregation level at which that dimension table
is created.
139
Relational schemata and views
140
Relational schemata and views
A compromise between the solutions mentioned so far applies
snowflaking at the aggregation levels where aggregate views are
materialized. This solution takes advantage of the optimization
achieved when you sort out aggregate data by aggregation level
without replicating dimension tables.
141
Aggregate navigators
The presence of multiple fact tables that contain the
data needed to solve a given query poses the
problem of determining the view that will guarantee
the minimum cost of execution.
142
Logical Design
The logical design phase includes a set of steps that,
starting from the conceptual schema, make it
possible to define the logical schema of a data mart.
143
Logical design
Main steps in DW logical design are:
144
Star vs Snowflake
There are discordant opinions on snowflaking:
Some authors consider it an embellishment that is in
conflict with the founding philosophy of data warehousing
145
Star vs Snowflake
It can be beneficial when
part of a hierarchy is common to more than one
dimension
148
Descriptive attributes
Descriptive attributes contain information that will not be used to perform aggregations but that is considered useful.
When a descriptive attribute is connected to a dimensional attribute, it has to be included in the dimension table that contains that attribute; if it is connected directly to the fact, it has to be included in the fact table.
A descriptive attribute only makes sense if it is compatible with the level of granularity of the events described in the fact table: if it is connected directly to the fact, it will be omitted in the aggregated views, and if it is connected to a dimensional attribute, it can be included only in the dimension tables that contain the attribute it refers to.
149
Cross-dimensional attributes
From the conceptual point of view, a cross-dimensional attribute b defines a many-to-many relationship between two or more dimensional attributes a1, ..., am.
Its translation into a logical schema requires you to add a new table that includes b and has a1, ..., am as its key attributes.
The choice between surrogate and non-surrogate keys for the new table should be made according to the space occupation they cause.
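A minimal DDL sketch for the VAT example in the slides, with assumed attribute names and non-surrogate keys:

-- The cross-dimensional attribute VAT depends on the combination (category, country)
CREATE TABLE VAT_RATE (
  category VARCHAR(50),
  country  VARCHAR(50),
  VAT      DECIMAL(5,2),
  PRIMARY KEY (category, country)
);

A query then joins this table on both the product category and the country where the product is sold.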
Shared Hierarchies
If a hierarchy occurs several times in the same fact (or in different facts), it is not convenient to introduce redundant copies of the related dimension table.
If two (or more) hierarchies contain exactly the same attributes, it is sufficient to import the key of the shared dimension table twice (or more) into the fact table connected to them.
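A minimal sketch of this case, using a hypothetical shipment fact that references the same DATE_DIM dimension twice, once per role; the specific roles are an assumption for illustration:

-- The date hierarchy is shared: the fact table imports its key twice,
-- once for the order date and once for the shipping date
CREATE TABLE SHIPMENT (
  keyP       INTEGER REFERENCES PRODUCT(keyP),
  keyOrderD  INTEGER REFERENCES DATE_DIM(keyD),
  keyShipD   INTEGER REFERENCES DATE_DIM(keyD),
  shippedQty INTEGER,
  PRIMARY KEY (keyP, keyOrderD, keyShipD)
);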
Shared hierarchies
If the two hierarchies share only some of the attributes, you need to decide between:
I. introducing additional redundancy in the schema by duplicating the hierarchies and replicating the common attributes;
II. snowflaking the schema on the first shared attribute, thus introducing a third table common to both dimension tables.
152
Multiple arcs
Although it is not very common, some hierarchies can also model many-to-many relations.
The most obvious design solution is to insert an additional table (bridge table) that models the multiple arc:
the key of the bridge table is the combination of the attributes connected by the multiple arc;
an (optional) weight attribute makes it possible to assign different relevance (as a percentage) to the participating tuples.
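A minimal sketch of a bridge table for the book/author example, with assumed table and attribute names:

-- Bridge table modeling the many-to-many arc between books and authors;
-- the weight states which fraction of a book's sales is attributed to each author
CREATE TABLE BOOK_AUTHOR (
  keyBook   INTEGER REFERENCES BOOK(keyBook),
  keyAuthor INTEGER REFERENCES AUTHOR(keyAuthor),
  weight    DECIMAL(4,3),   -- e.g., 0.5 for each of two co-authors
  PRIMARY KEY (keyBook, keyAuthor)
);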
Multiple arcs
154
Multiple arcs
It may take up to three joins to retrieve all the information contained in the hierarchy.
155
Multiple arcs
It may take up to three joins to retrieve all the information contained in the hierarchy.
156
Multiple arcs
If you wish to keep using a plain star schema, it is necessary to make the granularity of the fact finer, thus modeling the multiple arc directly in the fact table (push-down).
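A minimal sketch of the push-down alternative, again with assumed names: the author key becomes part of the fact table key, so each book sale event is split into one row per author:

-- Finer-grained fact table: the multiple arc is pushed down into the fact,
-- whose key now also includes the author
CREATE TABLE BOOK_SALES (
  keyBook   INTEGER REFERENCES BOOK(keyBook),
  keyAuthor INTEGER REFERENCES AUTHOR(keyAuthor),
  keyD      INTEGER REFERENCES DATE_DIM(keyD),
  quantity  INTEGER,
  receipts  DECIMAL(10,2),
  PRIMARY KEY (keyBook, keyAuthor, keyD)
);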
Multiple arcs
An instance for the book sales fact table with author push-down
158
Multiple arcs: comparison
From the informational point of view, the two solutions are identical.
162