C04 Warehouse Development

Download as doc, pdf, or txt
Download as doc, pdf, or txt
You are on page 1of 12

Data Warehouse Development

A data warehouse is an information system containing archived operational and related


data that is used to make business decisions. They are designed to support ad hoc
queries and what-if decision analysis. The basic idea is to use data to “Identify problems
or discover opportunities” in time to take advantage of the information [ , private
communication]. Users need to query data for themselves to help with strategic and
tactical decisions.

Data warehouses are typically updated in batch mode so that massive normalization is
not an advantage and massive indexes are sustainable. Data warehouse designs follow
a dimensional model rather than a traditional Entity/Relationship model. The
dimensional modeling principle derives from work done by Codd at about the same time
that his work on relational databases was published. Dimensional models maximize
user understanding and ease of retrieval.

The typical dimensional structure is a data mart. Data marts are designed around a
central fact table which contains numeric values for analysis and dimension tables which
keep track of properties that determine categories and groupings. The following
example from the GB Video case illustrates this principle.

GB Video is a video rental club that has the objective of renting videos and dvds to
members at or below the price of commercial video rental stores. The organization has
had a transaction rental system built to support day to day rental activities of the
members. The ER diagram for it is displayed belob.

The chairman of GB Video has come to the team with additional requirements. He
wants a system that will allow him to examine the purchase patterns of members so that
GB can create more attractive packages and better target their video inventories.
Customer Rental Owner of
#Cust No #Rental No Line
F Name Requestor Date #Line No
L Name of Clerk No Due Date
Ads1 Pay Type Return Date
Ads2 CC No OD charge
City Expire Pay type
State CC Approval
Zip
Tel No Holder of
CC No
Expire Name for

Title Video

#Title No #Video No
Name One-day fee
Vendor No Extra days
Cost Weekend

GB Video E-R Diagram


Customer Line
CustID LineID Video
Cust No OD Charge VideoID
F Name OneDayCharge Video No
L Name ExtraDaysCharge
WeekendCharge
DaysReserved
DaysOverdue
Address CustID Title
AddressID AddressID TitleID
Adddress1 RentalId TitleNo
Address2 VideoID Name
City TitleID Cost
State RentalDateID Vendor Name
Zip DueDateID
AreaCode ReturnDateID
Phone

Rental Date
RentalDateID
Rental Due Date
SQLDate
RentalID DueDateIDReturn Date
Day
Rental No SQLDate ReturnDateID
Week SQLDate
Clerk No Day
Quarter Day
Store Week
Holiday
Pay Type Quarter Week
Holiday Quarter
Holiday

Note: It is not clear what the charge algorithm is. What do we charge for overdue
videos? How do you calculate weekend charges?

The diagram above is a data mart derived from the data model used by the GB Video
Rental system. It is not intended to be used in the everyday processing of video rentals,
but is intended for managers to use to examine the rental behavior of customers. The
fact table “Line” is generated from the many-to-many association table with the same
name on the GB Video Rental system. The fact table contains “facts” that describe
properties of each rental line and foreign keys that connect the fact records to records in
associated “dimension” tables. Dimension models can be directly implemented in
relational database engines and queried by SQL statements. More specialized models
are implemented in multidimensional data cubes that can be queried by spreadsheet-like
queries.

The process of converting the E-R diagram for the rental system into a data mart
proceeds as follows.
• Select a many-to-many association table as the central fact table for the mart.
If there are more than one association tables in the E-R diagram then the system will
probably generate more than one potential data mart. The records in the association
table define the grain of the fact table. The designer should explicitly describe what
that grain is in operational terms. In this case the grain of the mart is: “An individual
video rental line describing a video rented to a specific customer on a specific rental
form.”
• Replace all primary keys with artificially generated surrogate keys. Using the
operational primary keys opens up the design to potential problems because of the
differing lengths of time that the operational system and data mart must span. The
primary key in the operational system must remain constant over the life of the
database. Rental numbers could be reused each year without confusion in the
operational database for GB Video. The data mart should contain data from several
years causing a confusion in primary keys. For this reason a good data mart will
replace the operational primary keys. Technically, once the surrogate keys have
been inserted into the tables there is no need to keep the original primary keys in the
table. The primary keys have been retained in this design in case there is a need to
research where some values have come from after the mart is in place.

Surrogate keys are generated in a surrogate key index table for each dimension
table in the final design. This table has three columns: the original primary key, the
automatically generated key, and the permanent surrogate key. The automatically
generated key column allows an automatic numbering function to continue
generating keys as processing continues and rows are added to the dimension
tables. These numbers will not be duplicates. If two different primary key values in
the transaction system refer to the same thing, the third column can be adjusted to
preserve the same surrogate key. If primary key values are reused over time, then
the designer may have to manage them separately.
• Create dimension tables from entities by promoting foreign keys into the fact
table. For example, the relationship “Requestor of” connecting records in the
Customer table to individual Rental occurrences would be implemented by a foreign
key in the Rental table. Moving that foreign key from the Rental table into the Line
table maintains the information content and makes for a simpler query structure.
This, of course, creates a second normal form violation in the Lines table.
Normalization is not desirable in a data mart.
• Promote dates to the fact table and create a date dimension to replace the
actual date value. The date dimension should contain the date from the operational
file, but it should also contain properties of the date that are important for
understanding. Retrospective analysis often requires grouping values by data
properties that are difficult to derive from the SQL data stored in the transaction
system. Questions like “How many sales were made on Mondays?” or “Which
holiday had the most video rentals?” require a complicated algorithm to compute
date information from the standard date representation. In a data mart these useful
attributes are generally pre-calculated.

When there are several dates, like there are in the GB Video example, the fact table
should contain a surrogate key for each one of them. The actual database should
implement these several copies of the date dimension as views or virtual tables.
This both saves space and makes the generation of date attributes more consistent.
• Identify the facts in the fact table. Facts are attributes that describe the records in
the fact table. The most useful facts are numeric and additive. Additive facts are
those that can be added up across records and still have meaning. In sales marts
like the GB Video mart the most common facts are cost and quantity. Most order
systems can record price, quantity and cost. Price, for example, is not additive, but
quantity and cost are. In the example above, OD charge is an additive fact. One-
day fee, extra days, and weekend are not additive. The mart designer converted
these into OneDayCost, ExtraDayCost and WeekendCost and moved them from the
Video table into the fact table. The designer also added DaysReserved and
DaysOverdue by calculating the duration of the rental from the dates in the system.
These are additive facts; they pre-calculate useful information and make it easier to
derive price if necessary.

The promotion of pricing information from the Video table into the Line table also
solves another problem. Prices for videos often change as they get older. Hot new
titles command a higher fee than old classics. Entries in dimension tables do not
change over time. Values in the fact table represent the actual costs as they change
over time.
• Identify the attributes in the dimension table. These are properties that are
relevant to the purpose of the mart. These attributes are usually text or categorical
variables. Occasionally numeric variables appear as attributes in dimension tables
when there is no expectation of doing arithmetic with the values. Attributes that are
irrelevant or that might provide a security risk should not be included. That is the
reason that credit card number and expiration date do not appear in the mart.

In addition to keeping existing attributes, the designer should replace codes with
useful values. That is why the vendor number has been replaced with vendor name
in the Title table. The Store attribute has been added to the rental table so that
reports can be generated by store if needed.
• Add null records to the dimension. Joins with dimensional tables should not
permit null values. Null values can create some very misleading results when one of
the dimensions does not have a valid connection to the fact table. To avoid this
problem we add at least one record that represents “Not Valid” or “No Connection”
relationships so that all relationships with dimension tables can be mandatory.
• Manage slowly changing attributes. A slowly changing attribute is an attribute in a
dimension that will change over time but much less frequently than values in the fact
table do. If these values change frequently then they should be included as
attributes in the fact table, in essentially the same way that the OneDayCost,
ExtraDayCost and WeekendCost were included in the fact table rather than the
Video table. Customer address is another situation. Address may change, but it
would be a waste of space to include it in the fact table. There are three rational
approaches called Type 1, Type 2 and Type 3 approaches.
Type 1: keep only the current values.
Type 2: create another table to keep each change.
Type 3: add an attribute to the dimension table to keep the previous value.
The choice depends on the purpose of the mart. In the GB Video case it would be
useful to understand where sales have come from based on where the customer
lived at the time of the rental. The design above treats address as a Type 2 variable
which is why there is a separate address table.
• Remove unnecessary dimensions. Dimensions that are left with only a single
attribute are really not needed. It takes less space and less processing power to
simply include that attribute in the fact table. In this example the Video dimension
that originally contained information about the individual tape has only the Video No
attribute in it. Since that attribute has no value to the purpose of the mart, the
designer has chosen to omit that dimension.

The design in Figure X represents a data mart for GB Video. It contains essentially the
same information as the E-R diagram for the GB Video reservation system but with a
very different purpose. The dimensional model is heavily denormalized to improve
retrieval performance rather than optimize update performance as in the E-R model.

In addition to the design, the dimensional model needs to be supported by appropriate


metadata.

1. A description of data sources

Business IS Data Source


Source Owner Owner Platform Location Description
GB Operation IS Director Corporate Home Office VB system for
Video s Server transaction
Rental processing.
System

2. A description of the source to target extraction tables used to create the initial
source for editing. These are working tables similar to the ultimate presentation tables
that will be used in the data mart. They contain data in the form that it appears in the
source system.

Target Target Data Le Target Sourc Source Source Data


Table Column Type n Column e Table / Col / Field Txform
Description Syste File Notes
m
Custom CustID Integ 16 Surrog
er ate
Custom Cust No Num 8 Old Renta Custom Cust No
er primary l er
key
Custom F Name Text 20 First Name Renta Custom F Name
er l er
Custom L Name Text 30 Last Name Renta Custom F Name
er l er
Addres Address Integ 16 Surrog
s ID ate
Addres Cust No Num 8 Customer Renta Custom Cust No Delete
s primary l er after
key surroga
te key
Addres Address Text 40 Address Renta Custom Ads1
s 1 line 1 l er
Target Target Data Le Target Sourc Source Source Data
Table Column Type n Column e Table / Col / Field Txform
Description Syste File Notes
m
Addres Adress2 40 Address Renta Custom Ads2
s line 1 l er
Addres City 25 City Renta Custom City
s l er
Addres State 25 State Renta Custom State
s l er
Addres Zip 10 ZIP Renta Custom ZIP
s l er
Addres Phone Integ 10 Customer Renta Custom Phone
s Phone l er
Rental RentalID Integ 16 Surrog
ate
Rental Rental 16 ID for Renta Rental Rental No
No rental l
receipt
Rental Clerk No 16 Clerk Renta Rental Clerk No
employee l
number
Rental Store 40 Store Exter Look
Name nal Up
Rental Pay 25 Method of Renta Line Pay Type
Type payment l
Title TitleID Integ 16 Surrog
ate
Title TitleNo Text 10 Video Renta Title Title No
Primary l
Key
Title Title 70 Video Title Renta Title Title
Name l
Title Cost Curr 10 Video cost Title Cost
Line CustID Integ 16 Customer Surrog
SK ate
Line Address Integ 16 Address Surrog
ID SK ate
Line RentalID Integ 16 Rental SK Surrog
ate
Line TitleID Integ 16 Title SK Surrog
ate
Line Cust No Text 8 Customer Renta Rental Cust No Delete
Original l after
Foreignorig Replac
inal Key e with
surroga
te
Line Rental Rental Renta Line Rental No Delete
No original l after
Target Target Data Le Target Sourc Source Source Data
Table Column Type n Column e Table / Col / Field Txform
Description Syste File Notes
m
foreign key Replac
e with
surroga
te
Line Title No Title Renta Video Title No Delete
Original l after
Foreign Replac
key e with
surroga
te
Line OD Curr 12 Charge for Renta Line OD Delete
Charge overdue l Charge after
returns transfor
m
proces
s
Line One- Curr 12 Daily fee Renta Video One-Day Delete
Day Fee for the first l Fee after
dy transfor
m
proces
s
Line Extra Curr 12 Daily fee Renta Video Extra days Delete
Day Fee for each l after
day after transfor
the first m
proces
s
Line Weeken Curr 12 Premium Renta Video Weekend Delete
d Fee charged for l after
weekend transfor
days m
proces
s
Line OneDay Curr 12 One day Calcula
Charge charge for te
video
Line ExtraDa Curr 12 Charge for Calcula
ysCharg days after te
e the first
Line Weeken Curr 12 Premium Calcula
dCharge for te
weekend
use
Line DaysRe Integ 8 Number of Calcula
served days te
Target Target Data Le Target Sourc Source Source Data
Table Column Type n Column e Table / Col / Field Txform
Description Syste File Notes
m
reserved
Line DaysOv Integ 8 Number of Calcula
erdue days te
overdue
Line RentalD Integ 16 Date Surrog
ateID surrogate ate
for rental
date
Line DueDAt Integ 16 Date Surrog
eID surrogate ate
for due
date
Line ReturnD Integ 16 Date Surrog
ateID surrogate ate
for return
date
Line SQL Date 20 Rental Renta Rental Date Replac
Rental Date l e with
Date surroga
te
Line SQL Date 20 Due date Renta Line Due Date Replac
Due l e with
Date surroga
te
Line SQL Date 20 Returned Renta Line Return Replac
Return date l Date e with
Date surroga
te
Date DateID Integ 16 Surrog
ate
Date SQLDat Date 20 SQL date Calcula
e value te
Date Year Int 4 Year Calcula
te
Date Day Text 1 Day of Calcula
week te
Date Week Integ 4 Week of Calcula
year te
Date Quarter Integ 4 Quarter of Calcula
year te
Date Holiday Text 30 Holiday Calcula
Name te
3. A description of the transformation analysis needed to clean or reformat the
source data.

Derived Agg Transf


Fact Name Derived Fact T Rul Formula Constrai or-
Description ype e nts mation
s

OneDayCh Charge for the first One Day


arge day Rate
ExtraDays Charge for (Extra
Charge additional reserved Days)*(Day
days sReserved
– 1)
WeekendC Additional charge Weekend*(I
harge for weekend use f reserved
over
weekend)
DaysReser Total days reserved Due Date –
ved Rental
Date
DaysOver Days kept in Return
due addition to Date –
reserved days Rental
Date

4. A description of the ultimate data mart tables.

Table Column D Len Column Description PK


Name Name ata Null PK O FK
Typ s? rder
e
Customer CustID Inte 16 N Y 1
g
Customer Cust No Nu 8 Old primary key
m
Customer F Name Text 20 N First Name
Customer L Name Text 30 N Last Name
Address AddressI Inte 16 Y 1
D g
Rental Address1 Text 40 Address line 1
Rental Adress2 Text 40 Address line 1
Rental City Text 25 City
Rental State Text 25 State
Rental Zip Text 10 ZIP
Title TitleID Inte 16 N Y 1
Table Column D Len Column Description PK
Name Name ata Null PK O FK
Typ s? rder
e
g
Title TitleNo Text 10 Video ID
Title Title 70 Video Title
Title Cost Curr 10 Video cost
Line CustID Inte 16 N Y 1 Y
g
Line AddressI Inte 16 N Y 2 Y
D g
Line RentalID Inte 16 N Y 3 Y
g
Line TitleID Inte 16 N Y 4 Y
g
Line OD Curr 12 Overdue Charges
Charge
Line RentalDat Inte 16 N Date surrogate for Y
eID g rental date
Line DueDAteI Inte 16 N Date surrogate for due Y
D g date
Line ReturnDat Inte 16 N Date surrogate for Y
eID g return date
Date DateID Inte 16 N Y 11
g
Line SQLDate D 20 N
ate
Line Year Int 4 N Year
Line Day Text 1 N Day of week
Line Week Inte 4 N Week of year
g
Line Quarter Inte 4 N Quarter of year
g
Line Holiday Inte 4 Holiday Name
g
Source Tables Extract
Data
(Table 2) Generate
Surrogate
Keys

Working Tables

Insert Surrogate Tables


Surrogate
Keys

Transform
Data
(Table 3)

Data Mart Load Data


into Mart
(Table 4)

The data flow diagram above describes the operational sequence necessary to actually
load the data mart. The transformations described here are quite simple and do not
include any attribute reformatting or data editing that may be needed. In practice, data
preparation might be quite extensive and lead to an expansion of the Transform Data
process into a complex sub-process.

You might also like