Basics of Dimensional Modeling
Basics of Dimensional Modeling
Basics of Dimensional Modeling
Data warehouse and OLAP tools are based on a dimensional data model.
A dimensional model is based on dimensions, facts, cubes, and schemas
such as star and snowflake.
For a retail store, the important measurement or fact is the sales units. The
business dimensions might be time, promotion, product, and store. For an
insurance company, the important measurement or fact might be claims,
and the business dimensions are agent, policy, insured party, status, and
time.
Facts
Fact tables tend to be deep in terms of the number of rows but narrow in terms of
the number of columns.
All fact tables have two or more foreign keys that connect to the dimension
tables’ primary keys. When all the keys in the fact table match their
respective primary keys correctly in the corresponding dimension tables,
we say that the tables satisfy referential integrity. We access the fact table
via the dimension tables joined to it.
The fact table itself generally has its primary key made up of a subset of
the foreign keys. This key is called a composite or concatenated key.
Every fact table in a dimensional model has a composite key, and
conversely, every table that has a composite key is a fact table.
Another way to say this is that in a dimensional model, every table that
expresses a many-to-many relationship must be a fact table. All other
tables are dimension tables.
The model that brings the dimensions and facts together is termed as the
dimensional model. In this model, the fact table consisting of numeric
measurements is joined to a set of dimension tables filled with descriptive
attributes.
In the model, the fact table is at the center and the dimension tables are
hung around like a star. Hence, this characteristic structure is often termed
as star schema.
Dimension
Dimension
Fact
The simplicity also has performance benefits. Fewer joins are necessary for
query processing. A database engine can make strong assumptions about
first constraining the heavily indexed dimension tables, and then attacking
the fact table all at once with the Cartesian product of the dimension table
keys satisfying the user’s constraints.
In all of the cases above, existing data access applications will continue to
run without yielding different results. Data would not have to be reloaded.
Another way of thinking about the simplistic nature of star schema is to see
how the dimensions and facts contribute to the report. The dimension table
attributes supply the report labeling, whereas the fact tables supply the
report’s numeric values.
Data cubes can be translated into star schema and vice versa. However,
high level aggregation of data is efficiently stored as cubes; having been
pre-calculated; alternative roll-ups across changing dimensions are more
efficiently and flexibly performed by star schema, based on available
details.
In the data warehouse literature, 1-D, 2-D, 3-D cube and so on are in
general referred to as a cuboid. Given a set of dimensions, we can
construct a set of cuboids, each showing the data at a different level of
summarization. The cuboid that holds the lowest level of summarization is
called the base cuboid. For example, the 4-D cuboid below is the base
cuboid for the given time, item, location, and supplier dimensions. The apex
cuboid is typically denoted by all.
In a data warehouse or data mart, measures are stored in the fact table in
such detail that users can roll-up in various levels of summarization. This is
called aggregation.
For example, if sales data in a grocery store are kept in the level of a single
customer buying a particular item in a particular day in a particular store,
then we can summarize or aggregate the data for various days, weeks,
months, quarters, and years; and all of these for a store, zone, state, and
country; as well as by products, product group, department, and so on.
Only the sales data in the lowest level are kept in the fact table, but the
descriptions of various levels of data are kept in the dimension tables, so
that appropriate tools can be used to summarize data in various levels.
The dimensional levels form a tree-like structure, and the members in the
lowest level of the hierarchy are called leaf members. There is only one
member in the topmost level. A dimension can not exist without leaf
members, but it is possible to have a dimension with nothing but leaf
members – that is, with only one level.
For example, date, day, month, and year are stored in a Date dimension;
while product, brand, category, and department are stored in the Product
dimension. The example below illustrates a Retail Store database schema
and the associated Date and Product dimension tables.
For the above example, if we want to find the total Sales Quantity and the
Sales Dollar Amount for each of the two departments, Bakery and Frozen
Food, we first select Bakery and Frozen Food from the Product
Dimension table and then add up all the values of Sales Quantity and
Sales Dollar Amount from the Fact Table (not shown) corresponding to the
two products. This requires adding up separately, fact values for Product
key = 1, 2, 3, and 4, and Product key = 5, 6, 7, 8, and 9, for all possible
values of other keys in the Fact table. The result is shown below.
Department Description Sales Quantity Sales Dollar Amount
Bakery 5,088 $12,331
Frozen Food 15,565 $31,776
This organization provides users with the ability to view data from different
perspectives. A number of data cube operations exist to materialize the
different views, allowing interactive querying and analysis of the data.
Following are some typical OLAP operations for multidimensional data.
Slice and Dice: The slice operation performs a selection on one dimension
of the given cube, resulting in a sub-cube. For example, we can select all
sales data for various cities and items for a particular quarter = Q1.
Pivot (Rotate): Pivot is a visualization operation that rotates the data axes
(in view) in order to provide an alternative presentation of the data.