Lecture2-Conceptual Modelling

Download as ppt, pdf, or txt
Download as ppt, pdf, or txt
You are on page 1of 38

Lecture 2

Designing the Data Warehouse:


Requirements Specification & Conceptual Design

Chapter 4 and Chapter 10 (Sections 1 to 4)


Vaisman and Zimanyi, Data Warehouse Systems, Springer, 2014
Overview of DW Design
• The design of a DW follows the traditional
database design phases:
1. Requirements specification
2. Conceptual design (Dimensional Model)
3. Logical design (star or snowflake schemas)
4. Physical Model
Requirements specification
• Requirements express the organizational goals
and needs to support the decision-making
process. Steps include:
1. Identify users
2. Determine business analysis requirements
a. Define goals
b. Operationalise goals (functional requirements…)
c. Define facts, measures, dimensions and granularity
d. Identify data sources (operational databases…)
3. Document requirements specification
Requirements Specification for
the Northwind Case Study
The NorthWind Case Study
• Northwind company imports and exports specialty foods from around the world
1. Identify Users
• Three groups of users identified:
– Executive: The members of the board of directors
of the Northwind company, who define the
ultimate company goals.
– Management: Managers at departmental levels,
for example, marketing, regional sales, and human
resources.
– Professional: Professional personnel who
implement the indications of the management.
Examples are marketing executive officers.
2. Determine Analysis Needs: Goals
• Goal: Increase the overall company sales by 10%
yearly.
• Subgoals:
1. Increase sales in underperforming regions
2. For customers buying below their potential, increase
their orders (in number of orders and individual
order amount)
3. Increase sales of products selling below the company
expectations
4. Take action on employees performing below their
expected quota
2. Determine Analysis Needs: Operationalise Goals
1. Increase sales in underperforming regions
a) Five best selling (measured as total sales amount) pairs of customer - supplier countries
b) Countries, states, and cities whose customers have the highest total sales amount
c) Five best selling (measured as total sales amount) products by customer country, state, and city

2. For customers buying below their potential, increase their orders


a) Monthly sales by customer compared to the sales for the same customer, in the previous year
b) Total number of orders by customer, time period (for example, year), and product
c) Average unit price per customer

3. Increase sales of products selling below company expectations


a) Monthly sales for each product category for the current year
b) Average discount percentage per product and month
c) Average quantity ordered per product

4. Take action on employees performing below their expected quota


a) Best selling employee per product per year with respect to sales amount
b) Average monthly sales by employee and year
c) Total sales by an employee and his/her subordinates during a certain time period
Dimensions and Measures
Dimensions/ Hierarchies and levels Analysis Scenarios
measures 1a 1b 1c 2a 2b 2c 3a 3b 3c 4a 4b 4c
Employee Supervision
Subordinate → Supervisor
Territories - - - - - - - - -   
Employee ↔ City → State →
Country → Continent
Time Calendar
Day → Month → Quarter → - - -      -   
Semester → Year
Product Categories
- -  -  -     - -
Product → Categories
Customer Geography
Customer → City → State →       - - - - - -
Country → Continent
Supplier Geography
Supplier → City → State → Country  - - - - - - - - - - -
→ Continent
Quantity - - - - - - - -  - - -
Discount - - - - - - -  - - - -
Sales amount     - -  - -   
UnitPrice - - - - -  - - - - - -
2. Determine Analysis Needs: Identify data sources

• Operational Database: MS SQL Server


Northwind database
• Excel file (Time.xls): contains dates 1 July 1996
to 30 June 1998
• Text file (cities.txt): contains list of cities and
corresponding states and countries
• XML file and schema (Territories.xml and
Territories.xsd): contains list of continents,
countries and states
DW Conceptual Design
MultiDim: A Conceptual Multidimensional Model

• Based on the Entity-Relationship (ER) model


• Includes concepts like:
– dimensions
– hierarchies
– facts
– measures
• Supports various kinds of hierarchies existing
in real-world applications
• Can be mapped to star or snowflake relational
structures
MultiDim Model: Notation
• Schema: set of dimensions and facts
• Dimension: level or one or more hierarchies
• Hierarchy: several related levels
• Level: describes real-world concept (similar to entity type in ER model). Has attributes.
• Member: every instance of a level
• Key attribute: Indicates how child members are grouped
• Descriptive attributes: Describe characteristics of members
• Child and parent levels: the lower and higher levels in a parent-child relationship
• Leaf and root levels: first and last levels in a hierarchy
• Cardinality: Minimum/maximum numbers of members in a level related to members in
another level
• Criterion: Expresses different hierarchical structures used for analysis
MultiDim Model: Notation
• Fact: Relates several dimensions (leaf levels). Has attributes (commonly
known as measures).
• Dimensions can be related to fact with one-to-one, one-to-many, or
many-to-many
• Dimension can be related several times to a fact with different roles
Measures
• Each measure in a cube is associated with an aggregation function
that combines several measure values into a single one.
– For example, if we use the Customer hierarchy, when changing the
granularity of the data from City to Country, then the sales values for all
customers in the same country will be aggregated using, e.g., the SUM
operation.
• Summarizability refers to the correct aggregation of cube measures
along dimension hierarchies
• Summarizability conditions:
– Disjointness of instances: the grouping of instances in a level with respect to
their parent in the next level must result in disjoint subsets (e.g., a product
cannot belong to two categories, otherwise its sales will be counted twice)
– Completeness: all instances are included in the hierarchy and each instance
is related to one parent in the next level
– Correctness: refers to the correct use of the aggregation functions
Measures
• According to the way they can be aggregated, measures
can be:
– Additive: can be meaningfully summarized along all

the dimensions, using addition


• The most common type of measures (E.g., Sales Quantity).
– Semiadditive: can be meaningfully summarized using
addition along some dimensions, but not all. Example:
inventory quantities cannot be added along the Time
dimension (but we can use average instead)
– Nonadditive: cannot be meaningfully summarized
using addition across any dimension. Example: item
price (but we can use average instead)
Measures
• Another classification of measures:
– Distributive measures: defined by an aggregation function that
can be computed in a distributed way
• Functions count, sum, minimum, and maximum are
distributive, distinct count is not
• Example: S = {3; 3; 4; 5; 8; 4; 7; 3; 8} partitioned in subsets
{3; 3; 4}, {5; 8; 4}, {7; 3; 8} gives a result of 8, while the
answer over the original set is 5
– Algebraic measures: defined by an aggregation function that
can be expressed as a scalar function of distributive ones;
example: average, computed by dividing the sum by the count
– Holistic measures: cannot be computed from other
subaggregates (e.g., median)
MultiDim Model: Notation (Summary)
Dimension Hierarchies
• Crucial in analytical applications: Enable analysis at various
abstraction levels
• In real-world situations, users must deal with complex hierarchies
of various kinds, but current DW and OLAP systems allow only a
limited set of kinds of hierarchies
• At the conceptual level, focus is to establish sequences of levels
that should be traversed during roll-up and drill-down
• Distinction between the various kinds of hierarchies should also
be made at the instance level
• Cardinalities in parent-child relationships must be considered
• MultiDim includes classification of hierarchies at the schema and
instance level and proposes a graphical notation
Balanced Hierarchies
• At schema level: only one path where all parent-child relationships are many-
to-one and mandatory

• At instance level: members form a balanced tree (all the branches have the same length)
• All parent members have at least one child member, and a child belongs exactly
to one parent
Unbalanced Hierarchies
• At schema level: only one path where all parent-child relationships are many-
to-one, but some are optional (e.g., not all branches have agencies and not all
agencies have ATMs)

• At instance level: members form a unbalanced tree (some parents without


children)
Recursive Hierarchies
• A special case of unbalanced hierarchies
• The same level is linked by the two roles of a parent-child relationship

• Schema level

• Instance level: employees may have no subordinates (descendants)


Generalised Hierarchies
• Sometimes, members of a level are of different types (e.g., customer is either a
person or a company)
• At schema level: multiple exclusive paths sharing at least the leaf level; may
also share other levels
• Two aggregation paths, one for
each type of customer

• At instance level: each member belongs to only one path


Generalised Hierarchies
• This kind of hierarchy is, in general, not summarizable
– The mapping from the splitting level to the parent levels is incomplete
• E.g., not all customers roll up to the Sector level or the Profession level

• Aggregation mechanism should be modified when a splitting and


joining levels are reached in a drill-down and roll-up operations
• It is not necessary that splitting levels are joined:
Ragged (Level-Skipping) Hierarchies
• A special case of generalized hierarchies
• At the schema level: Alternative paths are obtained by skipping one or several
intermediate levels. E.g., Some countries don’t have regions, and some don’t
have neither regions nor states.

• At instance level: Path length from the leaves to the same parent can be
different for different members
Alternative Hierarchies
• At schema level: Multiple nonexclusive hierarchies that share at least the leaf
level and account for the same analysis criterion
• Users must choose only one of the alternative paths for their analysis

• At instance level: Members form graph (child has more than one parent
• Fiscal year, here, starts in Feb
Generalised vs. Alternative Hierarchies
• Both hierarchies
– Share some levels
– Use one analysis criterion
• A child member
– Related to only one path in generalized hierarchies
– Related to all paths in alternative hierarchies and users must choose
one for analysis
Parallel Hierarchies
• Dimension has several hierarchies accounting for different analysis criteria
• Two different types
– Parallel independent hierarchies
– Parallel dependent hierarchies
• Parallel independent hierarchies
– Composed of disjoint hierarchies, i.e., hierarchies that do not share levels
Parallel Hierarchies
• Parallel dependent hierarchies
• Composed of several hierarchies that account for different analysis criteria and
share some levels
Alternative vs. Parallel Hierarchies
• Both hierarchies
– Share some levels
– May include several hierarchies
• Criterion
– Only one for alternative hierarchies
– Several for parallel hierarchies
• Combining hierarchies
– Meaningless for alternative hierarchies
– Useful for parallel hierarchies (e.g., Sales for stores in city A that
belongs to sales district B).
• Reusing aggregated measures for common levels
– Can be done for alternative hierarchies
– Cannot be done for parallel hierarchies
Alternative vs. Parallel Hierarchies
• Aggregated measure for the Month
level can be reused between both
paths
• Traversing the Calendar hierarchy
from a specific day in the Time level
will end up in the same year
independently of which path is used.

• Aggregated measure for State level


cannot be reused between both paths
• Traversing the hierarchies Lives and
Territories from the Employee to the
State level will lead to different states
for employees who live in one state and
work in another.
Nonstrict Hierarchies
• At schema level: At least one many-to-many cardinality

• At instance level: Members form a graph


Nonstrict Hierarchies: Double Counting
• Problem: Double counting of measures when a roll-up operation
reaches a many-to-many relationship.

• Employee with sales equal 100 is assigned to three cities,


therefore, this sale is going to be counted three times instead of
only once.
Possible Solution: Distributing Factor

• Employees may work in several sections. Salary of employee refers to the sum
of the salaries paid in each section.
• Distributing factor determines how measures are divided between several
parent members (e.g., percentage of time spent in each section)
• Distributing factor is not always known
– Percentage of time that an employee works in a section must be added to schema
• Sometimes this distribution is impossible to specify
– E.g., participation of customer in joint account
• Distributing factor can be approximated by considering the total number of
parent members with which the child member is associated
– If an employee works in three sections, 1/3 of the value of the measure
aggregated for each one
MultiDim Conceptual Schema of the
Northwind DW
Conceptual Design of the Northwind DW
• Conceptual schema is based on the queries and on the table summarising requirements
• Source data are organized into orders: must transform orders data into sales facts
during the ETL process
• Measures Quantity, UnitPrice, Discount, SalesAmount obtained directly from the
sources; Freight is produced in the ETL; NetAmount is derived from the data cube
(calculated from the other measures)
• Aggregate functions also specified, following the requirements
• Orders are associated with different time instants: Time dimension participates in the
Sales fact with roles OrderDate, DueDate, and ShippedDate (not indicated in the
requirements table)
• Time dimension contains four aggregation levels.
• Dimension Product and parent level Category follows from query 3(a)
• Many-to-many relationship between Employee and City, discovered analysing the
content of the source database in the requirements phase
• For HR analysis (queries 4(a) to 4(c)), we need to analyse sales by employee
supervisors, a recursive hierarchy Supervision in dimension Employee
• The dimensions Customer and Supplier share the geographic hierarchy starting with
City. This information is incomplete in the operational database → data for State,
Region, Country, and Continent must be obtained from an external source.
Summary
• What you need to take away from this lecture:
1. Be able to translate requirements into a conceptual
multidimensional model
2. Be able to read and create a conceptual
multidimensional model
3. Be able to distinguish between the different types of
hierarchies
• Further Reading: Chapter 4 and Chapter 10
(Sections 1 to 4) of Textbook
Tutorial
• Exercise 1: Consider a time dimension composed of two
alternative hierarchies: (a) day, month, quarter, and year and (b)
day, month, bimester (2 months), and year. Design the conceptual
schema of this dimension and show examples of instances.

• Exercise 2: A telephone provider stores data about:


– Customers: ID, name, address, zip code, city code, city name, state code, state
name, country code, country name.
– Time of call: day, day of week, month number, month name, quarter and year.
– Call programs: name, description and start date.
– Call type: ID, description, connection charge, daytime charge, evening charge and
weekend charge
The company wants to analyse the number of calls, total duration, and amount for
its customers (callers and callees). The analysis should provide a historical (time)
and geographical (location of customer) perspective. Design a conceptual schema
for this scenario.

You might also like