C04 Warehouse Development
C04 Warehouse Development
C04 Warehouse Development
Data warehouses are typically updated in batch mode so that massive normalization is
not an advantage and massive indexes are sustainable. Data warehouse designs follow
a dimensional model rather than a traditional Entity/Relationship model. The
dimensional modeling principle derives from work done by Codd at about the same time
that his work on relational databases was published. Dimensional models maximize
user understanding and ease of retrieval.
The typical dimensional structure is a data mart. Data marts are designed around a
central fact table which contains numeric values for analysis and dimension tables which
keep track of properties that determine categories and groupings. The following
example from the GB Video case illustrates this principle.
GB Video is a video rental club that has the objective of renting videos and dvds to
members at or below the price of commercial video rental stores. The organization has
had a transaction rental system built to support day to day rental activities of the
members. The ER diagram for it is displayed belob.
The chairman of GB Video has come to the team with additional requirements. He
wants a system that will allow him to examine the purchase patterns of members so that
GB can create more attractive packages and better target their video inventories.
Customer Rental Owner of
#Cust No #Rental No Line
F Name Requestor Date #Line No
L Name of Clerk No Due Date
Ads1 Pay Type Return Date
Ads2 CC No OD charge
City Expire Pay type
State CC Approval
Zip
Tel No Holder of
CC No
Expire Name for
Title Video
#Title No #Video No
Name One-day fee
Vendor No Extra days
Cost Weekend
Rental Date
RentalDateID
Rental Due Date
SQLDate
RentalID DueDateIDReturn Date
Day
Rental No SQLDate ReturnDateID
Week SQLDate
Clerk No Day
Quarter Day
Store Week
Holiday
Pay Type Quarter Week
Holiday Quarter
Holiday
Note: It is not clear what the charge algorithm is. What do we charge for overdue
videos? How do you calculate weekend charges?
The diagram above is a data mart derived from the data model used by the GB Video
Rental system. It is not intended to be used in the everyday processing of video rentals,
but is intended for managers to use to examine the rental behavior of customers. The
fact table “Line” is generated from the many-to-many association table with the same
name on the GB Video Rental system. The fact table contains “facts” that describe
properties of each rental line and foreign keys that connect the fact records to records in
associated “dimension” tables. Dimension models can be directly implemented in
relational database engines and queried by SQL statements. More specialized models
are implemented in multidimensional data cubes that can be queried by spreadsheet-like
queries.
The process of converting the E-R diagram for the rental system into a data mart
proceeds as follows.
• Select a many-to-many association table as the central fact table for the mart.
If there are more than one association tables in the E-R diagram then the system will
probably generate more than one potential data mart. The records in the association
table define the grain of the fact table. The designer should explicitly describe what
that grain is in operational terms. In this case the grain of the mart is: “An individual
video rental line describing a video rented to a specific customer on a specific rental
form.”
• Replace all primary keys with artificially generated surrogate keys. Using the
operational primary keys opens up the design to potential problems because of the
differing lengths of time that the operational system and data mart must span. The
primary key in the operational system must remain constant over the life of the
database. Rental numbers could be reused each year without confusion in the
operational database for GB Video. The data mart should contain data from several
years causing a confusion in primary keys. For this reason a good data mart will
replace the operational primary keys. Technically, once the surrogate keys have
been inserted into the tables there is no need to keep the original primary keys in the
table. The primary keys have been retained in this design in case there is a need to
research where some values have come from after the mart is in place.
Surrogate keys are generated in a surrogate key index table for each dimension
table in the final design. This table has three columns: the original primary key, the
automatically generated key, and the permanent surrogate key. The automatically
generated key column allows an automatic numbering function to continue
generating keys as processing continues and rows are added to the dimension
tables. These numbers will not be duplicates. If two different primary key values in
the transaction system refer to the same thing, the third column can be adjusted to
preserve the same surrogate key. If primary key values are reused over time, then
the designer may have to manage them separately.
• Create dimension tables from entities by promoting foreign keys into the fact
table. For example, the relationship “Requestor of” connecting records in the
Customer table to individual Rental occurrences would be implemented by a foreign
key in the Rental table. Moving that foreign key from the Rental table into the Line
table maintains the information content and makes for a simpler query structure.
This, of course, creates a second normal form violation in the Lines table.
Normalization is not desirable in a data mart.
• Promote dates to the fact table and create a date dimension to replace the
actual date value. The date dimension should contain the date from the operational
file, but it should also contain properties of the date that are important for
understanding. Retrospective analysis often requires grouping values by data
properties that are difficult to derive from the SQL data stored in the transaction
system. Questions like “How many sales were made on Mondays?” or “Which
holiday had the most video rentals?” require a complicated algorithm to compute
date information from the standard date representation. In a data mart these useful
attributes are generally pre-calculated.
When there are several dates, like there are in the GB Video example, the fact table
should contain a surrogate key for each one of them. The actual database should
implement these several copies of the date dimension as views or virtual tables.
This both saves space and makes the generation of date attributes more consistent.
• Identify the facts in the fact table. Facts are attributes that describe the records in
the fact table. The most useful facts are numeric and additive. Additive facts are
those that can be added up across records and still have meaning. In sales marts
like the GB Video mart the most common facts are cost and quantity. Most order
systems can record price, quantity and cost. Price, for example, is not additive, but
quantity and cost are. In the example above, OD charge is an additive fact. One-
day fee, extra days, and weekend are not additive. The mart designer converted
these into OneDayCost, ExtraDayCost and WeekendCost and moved them from the
Video table into the fact table. The designer also added DaysReserved and
DaysOverdue by calculating the duration of the rental from the dates in the system.
These are additive facts; they pre-calculate useful information and make it easier to
derive price if necessary.
The promotion of pricing information from the Video table into the Line table also
solves another problem. Prices for videos often change as they get older. Hot new
titles command a higher fee than old classics. Entries in dimension tables do not
change over time. Values in the fact table represent the actual costs as they change
over time.
• Identify the attributes in the dimension table. These are properties that are
relevant to the purpose of the mart. These attributes are usually text or categorical
variables. Occasionally numeric variables appear as attributes in dimension tables
when there is no expectation of doing arithmetic with the values. Attributes that are
irrelevant or that might provide a security risk should not be included. That is the
reason that credit card number and expiration date do not appear in the mart.
In addition to keeping existing attributes, the designer should replace codes with
useful values. That is why the vendor number has been replaced with vendor name
in the Title table. The Store attribute has been added to the rental table so that
reports can be generated by store if needed.
• Add null records to the dimension. Joins with dimensional tables should not
permit null values. Null values can create some very misleading results when one of
the dimensions does not have a valid connection to the fact table. To avoid this
problem we add at least one record that represents “Not Valid” or “No Connection”
relationships so that all relationships with dimension tables can be mandatory.
• Manage slowly changing attributes. A slowly changing attribute is an attribute in a
dimension that will change over time but much less frequently than values in the fact
table do. If these values change frequently then they should be included as
attributes in the fact table, in essentially the same way that the OneDayCost,
ExtraDayCost and WeekendCost were included in the fact table rather than the
Video table. Customer address is another situation. Address may change, but it
would be a waste of space to include it in the fact table. There are three rational
approaches called Type 1, Type 2 and Type 3 approaches.
Type 1: keep only the current values.
Type 2: create another table to keep each change.
Type 3: add an attribute to the dimension table to keep the previous value.
The choice depends on the purpose of the mart. In the GB Video case it would be
useful to understand where sales have come from based on where the customer
lived at the time of the rental. The design above treats address as a Type 2 variable
which is why there is a separate address table.
• Remove unnecessary dimensions. Dimensions that are left with only a single
attribute are really not needed. It takes less space and less processing power to
simply include that attribute in the fact table. In this example the Video dimension
that originally contained information about the individual tape has only the Video No
attribute in it. Since that attribute has no value to the purpose of the mart, the
designer has chosen to omit that dimension.
The design in Figure X represents a data mart for GB Video. It contains essentially the
same information as the E-R diagram for the GB Video reservation system but with a
very different purpose. The dimensional model is heavily denormalized to improve
retrieval performance rather than optimize update performance as in the E-R model.
2. A description of the source to target extraction tables used to create the initial
source for editing. These are working tables similar to the ultimate presentation tables
that will be used in the data mart. They contain data in the form that it appears in the
source system.
Working Tables
Transform
Data
(Table 3)
The data flow diagram above describes the operational sequence necessary to actually
load the data mart. The transformations described here are quite simple and do not
include any attribute reformatting or data editing that may be needed. In practice, data
preparation might be quite extensive and lead to an expansion of the Transform Data
process into a complex sub-process.