Surrogate Key in Data Warehouse
Surrogate keys are a widely used and accepted design standard in data warehouses. A surrogate key is a sequentially generated unique
number attached to each and every record in a Dimension table in a Data Warehouse. It joins the fact and
dimension tables and is necessary for handling changes in dimension table attributes.
A Surrogate Key (SK) is a sequentially generated, meaningless unique number attached to each and every record in a table
in a Data Warehouse (DW).
It is UNIQUE since it is a sequentially generated integer assigned to each record as it is inserted into the table.
It is MEANINGLESS since it does not carry any business meaning about the record it is attached to.
It is SEQUENTIAL since it is assigned in sequential order as and when new records are created in the table, starting with
one and going up to the highest number that is needed.
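As a rough illustration of how sequential SKs are assigned, here is a minimal Python sketch. The dimension name, column names and in-memory structures are purely hypothetical; in a real DW the SK would typically come from a database sequence or an IDENTITY column.

```python
# Minimal sketch of sequential SK assignment during a dimension load.
# "customer_dim", the column names and the in-memory dict are hypothetical;
# a real DW would normally use a database sequence or IDENTITY column.

customer_dim = {}   # natural key (NK) -> dimension record carrying its SK
next_sk = 1         # SKs start at one and grow sequentially

def load_dimension_record(natural_key, attributes):
    """Assign a new SK only when this natural key has not been seen before."""
    global next_sk
    if natural_key in customer_dim:
        return customer_dim[natural_key]["customer_sk"]   # reuse the existing SK
    record = {"customer_sk": next_sk, "customer_nk": natural_key, **attributes}
    customer_dim[natural_key] = record
    next_sk += 1
    return record["customer_sk"]

print(load_dimension_record("CUST-0001", {"name": "Alice"}))  # 1
print(load_dimension_record("CUST-0002", {"name": "Bob"}))    # 2
print(load_dimension_record("CUST-0001", {"name": "Alice"}))  # 1 (same record again)
```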
During the FACT table load, the different dimensional attributes are looked up in the corresponding Dimensions and the SKs are
fetched from there. These SKs should be fetched from the most recent versions of the dimension records. Finally, the
FACT table in the DW contains the factual data along with the corresponding SKs from the Dimension tables.
The diagram below shows how the FACT table is loaded from the source.
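As a complement to the diagram, here is a minimal Python sketch of the SK lookup step during a FACT load. The dimension contents, column names and the is_current flag are hypothetical; the point is that only the current version of each dimension record supplies the SK.

```python
# Minimal sketch of the SK lookup step during a FACT load. The dimension
# contents, column names and the is_current flag are hypothetical; only the
# current (most recent) version of each dimension record supplies the SK.

customer_dim = [
    {"customer_sk": 1, "customer_nk": "CUST-0001", "is_current": False},  # old version
    {"customer_sk": 7, "customer_nk": "CUST-0001", "is_current": True},   # current version
]

def lookup_current_sk(dimension, natural_key):
    """Return the SK of the most recent (current) version for the given NK."""
    for record in dimension:
        if record["customer_nk"] == natural_key and record["is_current"]:
            return record["customer_sk"]
    raise KeyError(f"No current dimension record for {natural_key!r}")

def load_fact_row(source_row):
    """Swap the source NK for the dimension SK before the fact insert."""
    return {
        "customer_sk": lookup_current_sk(customer_dim, source_row["customer_nk"]),
        "sales_amount": source_row["sales_amount"],
    }

print(load_fact_row({"customer_nk": "CUST-0001", "sales_amount": 250.0}))
# -> {'customer_sk': 7, 'sales_amount': 250.0}
```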
Why Should We Use a Surrogate Key?
Basically, an SK is an artificial key that is used as a substitute for a Natural Key (NK). We should have an NK defined in our tables
as per the business requirement, and that key should be able to uniquely identify any record. The SK, however, is just an integer
attached to a record for the purpose of joining different tables in a Star or Snowflake schema based DW. An SK is especially
needed when we have a very long NK or when the datatype of the NK is not suitable for indexing.
The image below shows a typical Star Schema, joining different Dimensions with the Fact using SKs.
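The same idea can be sketched with a tiny in-memory SQLite database. The table and column names below are hypothetical; the point is simply that the fact carries only compact integer SKs, and the star join brings the descriptive attributes back from the dimensions.

```python
# Minimal sqlite3 sketch of a star-schema join through integer SKs.
# Table and column names are hypothetical; any relational engine works the same way.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (customer_sk INTEGER PRIMARY KEY,
                               customer_nk TEXT, customer_name TEXT);
    CREATE TABLE dim_product  (product_sk  INTEGER PRIMARY KEY,
                               product_nk  TEXT, product_name  TEXT);
    CREATE TABLE fact_sales   (customer_sk INTEGER, product_sk INTEGER,
                               sales_amount REAL);

    INSERT INTO dim_customer VALUES (1, 'CUST-0001', 'Alice');
    INSERT INTO dim_product  VALUES (1, 'PROD-XYZ-9', 'Widget');
    INSERT INTO fact_sales   VALUES (1, 1, 250.0);
""")

# The fact holds only integer SKs; the star join recovers the descriptive attributes.
for row in conn.execute("""
    SELECT c.customer_name, p.product_name, f.sales_amount
    FROM fact_sales f
    JOIN dim_customer c ON c.customer_sk = f.customer_sk
    JOIN dim_product  p ON p.product_sk  = f.product_sk
"""):
    print(row)   # ('Alice', 'Widget', 250.0)

conn.close()
```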
Ralph Kimball places particular emphasis on the abstraction of the NK. According to him, Surrogate Keys should NOT be:
Smart, where you can tell something about the record just by looking at the key.
Composed of natural keys glued together.
Implemented as multiple parallel joins between the dimension table and the fact table; so-called double or triple
barreled joins.
As per Thomas Kejser, a “good key” is a column that has the following properties:
It is forced to be unique
It is small
It is an integer
Once assigned to a row, it never changes
Even if deleted, it will never be re-used to refer to a new row
It is a single column
It is stupid
It is not intended as being remembered by users
If the above-mentioned properties are taken into account, the SK is a great candidate for a good key in a DW.
Apart from these, a few more reasons for choosing the SK approach are:
If we replace the NK with a single integer, we should be able to save a substantial amount of storage space. The
SKs of the different Dimensions are stored as Foreign Keys (FK) in the Fact tables to maintain Referential
Integrity (RI), and storing concise SKs there instead of those big or huge NKs means less space is needed (see the
rough calculation after this list). The UNIQUE index built on the SK will also take less space than a UNIQUE index
built on the NK, which may be alphanumeric.
Replacing big, ugly NKs and composite keys with tight integer SKs is bound to improve join
performance, since joining two integer columns is faster. So, it gives the ETL an extra edge by
speeding up data retrieval and lookups.
An advantage of a four-byte integer key is that it can represent more than 2 billion different values (2^31 - 1 =
2,147,483,647 for a signed integer), which is enough for any dimension; the SK will not run out of values, not even
for a big or monster dimension.
The SK is usually independent of the data contained in the record, so we cannot understand anything about the data in
a record simply by looking at the SK. Hence it provides data abstraction.
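As promised above, here is a back-of-the-envelope sketch of the storage saving. The row count and NK length are made-up assumptions purely for illustration.

```python
# Back-of-the-envelope sketch of the storage saving. The row count and the
# NK length are made-up assumptions purely for illustration.
fact_rows = 1_000_000_000   # one billion fact rows (assumed)
nk_bytes  = 20              # e.g. a 20-character alphanumeric NK (assumed)
sk_bytes  = 4               # a four-byte integer SK

saving_gb = fact_rows * (nk_bytes - sk_bytes) / 1024**3
print(f"Approximate saving per foreign key column: {saving_gb:.1f} GB")
# -> Approximate saving per foreign key column: 14.9 GB
```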
So, apart from the abstraction of the critical business data involved in the NK, storage space reduction is another
advantage of implementing SKs in our DW. It has become standard practice to associate an SK with a table in the
DW, irrespective of whether it is a Dimension, Fact, Bridge or Aggregate table.
There are a number of disadvantages as well while working with SKs. Let’s see them one by one:
The values of SKs have no relationship with the real-world meaning of the data held in a row. Therefore,
overuse of SKs leads to the problem of disassociation.
Generating and attaching the SK creates extra ETL burden. Sometimes the actual piece of
transformation code is short and simple, but generating the SK and carrying it forward to the target adds
overhead to the code.
During Horizontal Data Integration (DI), where multiple source systems load data into a single Dimension,
we have to maintain a single SK-generating area to enforce the uniqueness of the SK. This may come as an extra
overhead on the ETL.
Even query optimization becomes difficult: since the SK takes the place of the PK, the unique index is applied on that
column, and any query based on the NK leads to a Full Table Scan (FTS) because it cannot take advantage of the
unique index on the SK (the query-plan sketch after this list illustrates this).
Replication of data from one environment to another, i.e. data migration, becomes difficult: since SKs from
different Dimension tables are used as FKs in the Fact table, and SKs are DW-specific, any mismatch in the SK
for a particular Dimension results in no data or erroneous data when we join them in a Star Schema.
If duplicate records come from the source, there is a potential risk of duplicates
being loaded into the target, since the unique constraint is defined on the SK and not on the NK.
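To illustrate the full-table-scan point above, here is a minimal SQLite sketch. The table and column names are hypothetical, and the exact plan wording varies by engine and version; the behaviour shown is the general one.

```python
# Minimal SQLite sketch of the full-table-scan issue. The table and column
# names are hypothetical, and the exact plan text varies by engine/version.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (customer_sk INTEGER PRIMARY KEY,  -- indexed via the PK
                               customer_nk TEXT,                 -- no index of its own
                               customer_name TEXT);
    INSERT INTO dim_customer VALUES (1, 'CUST-0001', 'Alice');
""")

for sql in ("SELECT * FROM dim_customer WHERE customer_sk = 1",
            "SELECT * FROM dim_customer WHERE customer_nk = 'CUST-0001'"):
    detail = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchone()[3]
    print(sql, "->", detail)

# The SK predicate is satisfied by a primary-key search, while the NK predicate
# falls back to scanning the whole table unless a separate index is added on the NK.
conn.close()
```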
SKs should not be implemented just in the name of standardizing your code. An SK is required when we cannot use an NK to
uniquely identify a record, or when using an SK is more suitable because the NK is not a good fit for a PK.