What Is A Data Warehouse

Download as doc, pdf, or txt
Download as doc, pdf, or txt
You are on page 1of 11

What is a Data warehouse?

We can define data warehouse as

Data warehouse is a relational database that is designed for analysis purpose, rather than
for transactional usage.

Data warehouse is a collection of data marts representing historical data from different
operational data sources.

Data warehouse is having characteristics like

 Time variant
 Non volatile
 Subject oriented
 Integrated database.

Time variant:

DWH is a time variant database because users can analyze the data with different time
periods means they can analyze Over a period of time

Non volatile:

DWH is non volatile database because this is read only database. Once the data entered
into the DWH, we can not apply any DML operations other than select.

Subject oriented:

DWH is subject oriented database which supports the business needs of individual
departments in the enterprise.

Integrated database:

DWH is an integrated database which captures the data from multiple operational
sources and integrates the data in the homogenous format.
What is a Data Mart?

 A Data mart is a subset of a data warehouse that can provide data for reporting
and analysis on a particular section or a department.
 Data marts are designed to help manager to make strategic decisions about their
business.
 In data mart focus is only on the requirements of one department or business
function.

Data Marts are two types:

 Dependent Data Marts:

Here source data is directly comes from the Enterprise Data Warehouse.

 Independent Data Marts:

Here source data is captured from one or more operational systems or external
information providers.

STAR SCHEMA

 In designing data models for data warehouses and data marts, the most commonly
used schema types are
• Star Schema
• Snow flake schema.

 Whether one uses a star or a snow flake, largely depends on business needs.
 In Star schema, there will be a centrally located fact table, surrounded by
dimension tables, which look like a star.
 Each dimension is represented as a single table. The primary key in each
dimension table is related to a foreign key in the fact table.
 A Star schema can be simple or complex
 A simple star schema consists of one fact table, a complex star schema can have
more than one fact table.
 In Star schema every dimension table is directly connected with fact table.
 For example, pls go through the below diagram.
Customer Dimension
Cust_ID
Cust_No
Name
Address Time Dimension
Orders Fact Time_ID
City
Date
State Time_ID
Year
Phone Cust_ID
Month
Product_ID
Day
Product Dimension Order_No
Quarter
Price
Product_ID Fiscal_Qtr
Quantity
Name Discount
Description
Vendor
Price

SNOW FLAKE SCHEMA

 Snow flake schema architecture is a more complex variation of a star schema


design.
 The main difference is that, Dimensional tables in a snow flake schema are
normalized.
 In Snow Flake design the dimension table (de-normalized table) will be
further divide into one or more dimensions.
 It is very much useful to organize the information in a better structural
format.
 Snow flake schemas are generally used when a dimensional table becomes
very big and when a star schema can’t represent the complexity of a data
structure.
 For example if a Product dimension table contains millions of rows, the use
of snow flake schemas should significantly improve performance by moving
out some data to other table.
 The problem is that the more normalized the dimension tables, the more
complicated SQL joins must be issued to query them.
 This is because in order for a query to be answered, many tables need to be
joined and aggregates generated.
Customer
Dimension
City Cust_ID
State Dimension Cust_No
Dimension City_ID City_ID
State_ID Name Time Dimension
State_ID
Address Orders Fact
State Name
Time_ID
Phone Time_ID Date
Cust_ID Year
Product_ID Month
Order_No Day
Price Quarter
Product Dimension
Quantity Fiscal_Qtr
Vendor Discount
Dimension Product_ID
Name
Vendor_ID
Vendor_ID
Name
Description
Price

DIMENSIONAL MODELLING

 Dimensional Modeling is a designing concept used by many data warehouse


designers to build their Data warehouse.
 In this design model all the data is stored in two types of tables
---Fact Tables and Dimensional tables
 Fact table contains the facts/measurements of the business.
 Dimension table contains the context of measurements ie, the dimensions on
which the facts are calculated.
 Unlike E-R modeling, the data in Dimensional modeling is denormalized.
 By using Dimensional modeling, the data modeler will implement star
schema or snow flake schema based on the requirement.

TOP DOWN APPROACH

 The Top down approach is coined by Bill Inmon.


 According to his approach, He says “First we need to implement the
Enterprise data warehouse by extracting the data from individual
departments and from the Enterprise data warehouse develop subject
oriented databases called as data marts.

BOTTOM UP APPROACH

 The bottom up approach is coined by Ralph Kimball.


 According to his approach he says “First we need to develop Subject oriented
database called as “Data Marts” then integrate all the Data marts to develop
the Enterprise data warehouse.

STAGING AREA
 The staging area is one of the important layer in data ware housing project
Architecture.
 Staging area is the relational database, where raw data is brought in,
cleaned, combined and exported to one or more data marts.
 The purpose of staging area is to get data ready for loading into
presentation server.
 We use our staging to hold one cycle of loaded data. If that load is daily
then the table would be truncated daily, monthly load tables are
truncated monthly.
 Here we load the (E) raw data from flat files or other tables into staging
area. Once this is done we perform all cleansing operations (T) on that
data, and then we load the data into (L) target tables.
STAR VS SNOW FLAKE SCHEMA

WHAT ARE DIMENSIONS

 Dimension is a data category that is used to organize business data for


retrieval and consolidation of values.
 Each dimension usually contains a hierarchy of related members grouped
within it.
 Simple we can say that dimensions are categories by which we can we
view the summarized data (Facts) available in fact table.
 For example a profit fact table can view by time dimensions.
 Hierarchies are logical structures that use ordered levels as a means of
organizing data.
 A hierarchy can be used to define data aggregation.
 For Example, a time dimension might have a hierarchy that represents data
at the month, quarter and year levels.
TYPES OF DIMENSIONS

 There are different types of dimensions available.

• Confirmed Dimensions
• Junk Dimensions
• Degenerated Dimensions
• Slowly Changing dimensions

 Confirmed Dimensions:

• The dimension which is reusable and fixed in nature is called as


confirmed dimension.
• Confirmed Dimension is a dimension table which is connected
To the multiple fact tables
• A dimension tables which can be shared between multiple data marts
are called as confirmed dimensions..
• In industry identifying confirmed dimensions is a big challenge.
• Because Different departments or different applications may uses the
same table with different attribute names. No of attributes may be
different and may be different in data types and sizes.
• Suppose if you want to build a confirmed dimension, the table should
have same no of columns and column names, data types, sizes and
should have the data in same granularity.

 Junk Dimension:

• The A junk dimension combines several low-cardinality flags and attributes into
a single dimension table rather than modeling them as separate dimensions.
• The benefit is, instead of keeping multiple keys in to the fact table, if you keep
only one junk key from the dimension table into the fact table that we can get
better performance because of reducing the no of joins.
 Degenerated Dimensions

• These are some date elements in the operational system which


are neither fact nor strictly dimension attributes. These are
useful for some kind of analysis.

• Degenerate dimension: A column of the key section of the fact


table that does not have the associated dimension table but
used for reporting and analysis, such column is called
degenerate dimension or line item dimension.

• For ex, we have a fact table with customer_id, product_id,


branch_id, employee_id, bill_no, and date in key section and
price, quantity, amount in measure section. In this fact table,
bill_no from key section is a single value; it has no associated
dimension table. Instead of creating a separate dimension table
for that single value, we can include it in fact table to improve
performance. So here the column, bill_no is a degenerate
dimension or Line item dimension.

 SLOWLY CHANGING DIMENSIONS

There are in general three ways to solve this type of problem, and they are
categorized as follows

1) SCD TYPE 1(Recent Updates)


2) SCD TYPE 2(Full Historical Information)
3) SCD TYPE 3(Partial Historical Information)
1) Scd Type 1:

In Type1, the new information simply overwrites the original information. In


other words, no history is kept. No trace of old record exists.

When to use Type 1:


Type 1 should be used when it is not necessary for the data ware house
to keep track of historical changes.

2) Scd type 2:

In type 2, a new record is added to the table to represent the new information.
Therefore both the original and the new record will be present. The new record
gets its own primary key.

Advantage:
This allows us to accurately keep all historical information.

Disadvantages:
This will cause the size of the table to grow fast. In cases where the
number of rows for the table is very high, so storage and performance can be big
concern.
This necessarily complicates the etl process.

When to use Type 2:

Type 2 should be used when it is necessary for the data warehouse to track
historical changes.

3) Scd type 3:

In type 3, there will be two columns to indicate the particular attribute, one
indicating the original value, and another one indicating current value.

Advantages:

This does not increase the size of the table, since new information is updated.
This allows us to keep some part of history.

Disadvantages:
Type 3 will not able to keep all history where an attribute is changed more than
once.Tyep 3 is rarely used in actual practice.

FACTS
Facts are nothing but measurements, which contains numeric data based on
columns in a fact table.

EX: Salary, Total_sales, Amount_Sold

What is a Fact Table?

Fact table is the one which contains summarized numerical facts and historical
data .This fact table has a foreign key primary key relation with the dimension
table. Fact table maintains the information in 3 rd normal form.
TYPES OF FACTS

There are three types of facts

1) Additive: The facts which can be summed up through all of


the dimensions in the fact table.
2) Semi Additive: The facts which can be summed up for some
of the dimensions in the fact table, but not the others.
3) Non Additive: The facts which can not be summed with any of
the dimensions present in the fact table.

You might also like