What Is A Data Warehouse
What Is A Data Warehouse
What Is A Data Warehouse
Data warehouse is a relational database that is designed for analysis purpose, rather than
for transactional usage.
Data warehouse is a collection of data marts representing historical data from different
operational data sources.
Time variant
Non volatile
Subject oriented
Integrated database.
Time variant:
DWH is a time variant database because users can analyze the data with different time
periods means they can analyze Over a period of time
Non volatile:
DWH is non volatile database because this is read only database. Once the data entered
into the DWH, we can not apply any DML operations other than select.
Subject oriented:
DWH is subject oriented database which supports the business needs of individual
departments in the enterprise.
Integrated database:
DWH is an integrated database which captures the data from multiple operational
sources and integrates the data in the homogenous format.
What is a Data Mart?
A Data mart is a subset of a data warehouse that can provide data for reporting
and analysis on a particular section or a department.
Data marts are designed to help manager to make strategic decisions about their
business.
In data mart focus is only on the requirements of one department or business
function.
Here source data is directly comes from the Enterprise Data Warehouse.
Here source data is captured from one or more operational systems or external
information providers.
STAR SCHEMA
In designing data models for data warehouses and data marts, the most commonly
used schema types are
• Star Schema
• Snow flake schema.
Whether one uses a star or a snow flake, largely depends on business needs.
In Star schema, there will be a centrally located fact table, surrounded by
dimension tables, which look like a star.
Each dimension is represented as a single table. The primary key in each
dimension table is related to a foreign key in the fact table.
A Star schema can be simple or complex
A simple star schema consists of one fact table, a complex star schema can have
more than one fact table.
In Star schema every dimension table is directly connected with fact table.
For example, pls go through the below diagram.
Customer Dimension
Cust_ID
Cust_No
Name
Address Time Dimension
Orders Fact Time_ID
City
Date
State Time_ID
Year
Phone Cust_ID
Month
Product_ID
Day
Product Dimension Order_No
Quarter
Price
Product_ID Fiscal_Qtr
Quantity
Name Discount
Description
Vendor
Price
DIMENSIONAL MODELLING
BOTTOM UP APPROACH
STAGING AREA
The staging area is one of the important layer in data ware housing project
Architecture.
Staging area is the relational database, where raw data is brought in,
cleaned, combined and exported to one or more data marts.
The purpose of staging area is to get data ready for loading into
presentation server.
We use our staging to hold one cycle of loaded data. If that load is daily
then the table would be truncated daily, monthly load tables are
truncated monthly.
Here we load the (E) raw data from flat files or other tables into staging
area. Once this is done we perform all cleansing operations (T) on that
data, and then we load the data into (L) target tables.
STAR VS SNOW FLAKE SCHEMA
• Confirmed Dimensions
• Junk Dimensions
• Degenerated Dimensions
• Slowly Changing dimensions
Confirmed Dimensions:
Junk Dimension:
• The A junk dimension combines several low-cardinality flags and attributes into
a single dimension table rather than modeling them as separate dimensions.
• The benefit is, instead of keeping multiple keys in to the fact table, if you keep
only one junk key from the dimension table into the fact table that we can get
better performance because of reducing the no of joins.
Degenerated Dimensions
There are in general three ways to solve this type of problem, and they are
categorized as follows
2) Scd type 2:
In type 2, a new record is added to the table to represent the new information.
Therefore both the original and the new record will be present. The new record
gets its own primary key.
Advantage:
This allows us to accurately keep all historical information.
Disadvantages:
This will cause the size of the table to grow fast. In cases where the
number of rows for the table is very high, so storage and performance can be big
concern.
This necessarily complicates the etl process.
Type 2 should be used when it is necessary for the data warehouse to track
historical changes.
3) Scd type 3:
In type 3, there will be two columns to indicate the particular attribute, one
indicating the original value, and another one indicating current value.
Advantages:
This does not increase the size of the table, since new information is updated.
This allows us to keep some part of history.
Disadvantages:
Type 3 will not able to keep all history where an attribute is changed more than
once.Tyep 3 is rarely used in actual practice.
FACTS
Facts are nothing but measurements, which contains numeric data based on
columns in a fact table.
Fact table is the one which contains summarized numerical facts and historical
data .This fact table has a foreign key primary key relation with the dimension
table. Fact table maintains the information in 3 rd normal form.
TYPES OF FACTS