DWH Notes


 What is Data Warehouse?

A data warehouse is a relational database that is designed for query and analysis
rather than for transaction processing. It usually contains historical data derived
from transaction data, but it can include data from other sources. It separates
analysis workload from transaction workload and enables an organization to
consolidate data from several sources.

In addition to a relational database, a data warehouse environment includes an extraction, transportation, transformation, and loading (ETL) solution, an online analytical processing (OLAP) engine, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users.

A common way of introducing data warehousing is to refer to the characteristics of a data warehouse as set forth by William Inmon:

 Subject Oriented
 Integrated
 Nonvolatile
 Time Variant

Subject Oriented

Data warehouses are designed to help you analyze data. For example, to learn more
about your company's sales data, you can build a warehouse that concentrates on
sales. Using this warehouse, you can answer questions like "Who was our best
customer for this item last year?" This ability to define a data warehouse by subject
matter, sales in this case, makes the data warehouse subject oriented.
Integrated

Integration is closely related to subject orientation. Data warehouses must put data
from disparate sources into a consistent format. They must resolve such problems
as naming conflicts and inconsistencies among units of measure. When they
achieve this, they are said to be integrated.

Nonvolatile

Nonvolatile means that, once entered into the warehouse, data should not change.
This is logical because the purpose of a warehouse is to enable you to analyze what
has occurred.
Time Variant

In order to discover trends in business, analysts need large amounts of data. This is
very much in contrast to online transaction processing (OLTP) systems, where
performance requirements demand that historical data be moved to an archive. A
data warehouse's focus on change over time is what is meant by the term time
variant.

[Figure: an OLTP system typically retains 60-90 days of data, while a data warehouse retains 5-10 years.]


 Need of Data Warehouse

Cloud-based technology has revolutionized the business world, allowing companies to easily retrieve and store valuable data about their customers, products and employees. This data is used to inform important business decisions.

Many global corporations have turned to data warehousing to organize data that
streams in from corporate branches and operations centers around the world. It’s
essential for IT students to understand how data warehousing helps businesses
remain competitive in a quickly evolving global marketplace.

Data warehousing is an increasingly important business intelligence tool, allowing organizations to:

1. Ensure consistency. Data warehouses are programmed to apply a uniform format to all collected data, which makes it easier for corporate decision-makers to analyze and share data insights with their colleagues around the globe. Standardizing data from different sources also reduces the risk of error in interpretation and improves overall accuracy.
2. Make better business decisions. Successful business leaders develop
data-driven strategies and rarely make decisions without consulting the
facts. Data warehousing improves the speed and efficiency of accessing
different data sets and makes it easier for corporate decision-makers to
derive insights that will guide the business and marketing strategies that set
them apart from their competitors.
3. Improve their bottom line. Data warehouse platforms allow business
leaders to quickly access their organization's historical activities and
evaluate initiatives that have been successful — or unsuccessful — in the
past. This allows executives to see where they can adjust their strategy to
decrease costs, maximize efficiency and increase sales to improve their
bottom line.
 Data Warehouse Architecture

Another definition: A data warehouse is a repository (data & metadata) that contains integrated, cleansed, and reconciled data from disparate sources for decision support applications, with an emphasis on online analytical processing. Typically the data is multidimensional, historical, and non-volatile.

Regardless of which architecture you choose, the layers remain the same:

 Source Layer
 Staging Layer
 Storage Layer
 Presentation Layer

1. Data Source Layer

 The Data Source Layer is the layer where the data from the source is
encountered and subsequently sent to the other layers for desired operations.
 The data can be of any type.
 The Source Data can be a database, a spreadsheet or any other kind of text file.
 The Source Data can be of any format; we cannot expect to get data in the same format, considering the sources are vastly different.
 In real life, some examples of Source Data can be:
 Log files of each specific application or job, or entries of employees in a company.
 Survey data, stock exchange data, etc.
 Web browser data and many more.
2. Data Staging Layer
The following steps take place in Data Staging Layer.

Step #1: Data Extraction

 The data received from the Source Layer is fed into the Staging Layer, where the first process that takes place with the acquired data is extraction.

Step #2: Landing Database

 The extracted data is temporarily stored in a landing database.
 The landing database holds the data in its raw form until the staging area retrieves it.

Step #3: Staging Area

 The data in the landing database is taken, and several quality checks and staging operations are performed on it in the staging area.
 The structure and schema are also identified, and adjustments are made to unordered data, to bring about a commonality among all the data that has been acquired.
 Having a place set up for the data just before transformation and changes is an added advantage that makes the staging process very important.
 It makes data processing easier.

Step #4: ETL

 ETL stands for Extract, Transform, and Load.
 ETL tools are used for the integration and processing of data; here, logic is applied to the raw but somewhat ordered data.
 The data is extracted according to the analytical requirements and transformed into a form deemed fit to be stored in the Data Warehouse.
 After transformation, the data, or rather information, is finally loaded into the data warehouse.
 Some examples of ETL tools are Informatica, SSIS, etc. (A minimal sketch of these four steps follows.)
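The following is a minimal Python sketch of the four staging steps above, using sqlite3 as a stand-in for both the landing database and the warehouse. The file name sales.csv, its columns, and the quality check are hypothetical, not taken from any particular tool.

import csv
import sqlite3

def extract(path):
    # Step #1: read raw rows from the source file (assumed headers: id, amount)
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def land(conn, rows):
    # Step #2: store the extracted rows as-is in a landing table
    conn.execute("CREATE TABLE IF NOT EXISTS landing (id TEXT, amount TEXT)")
    conn.executemany("INSERT INTO landing VALUES (:id, :amount)", rows)

def stage(conn):
    # Step #3: quality checks -- drop rows with a missing id, fix the types
    rows = conn.execute("SELECT id, amount FROM landing").fetchall()
    return [(r[0], float(r[1])) for r in rows if r[0]]

def load(conn, rows):
    # Step #4: load the cleaned, typed rows into the warehouse table
    conn.execute("CREATE TABLE IF NOT EXISTS fact_sales (id TEXT, amount REAL)")
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?)", rows)

conn = sqlite3.connect(":memory:")
land(conn, extract("sales.csv"))  # "sales.csv" is a hypothetical source file
load(conn, stage(conn))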
3. Data Storage Layer

 The processed data is stored in the Data Warehouse.
 This data is cleansed, transformed, and prepared with a definite structure, and thus provides opportunities for users to work with the data as required by the business.
 Depending upon the approach of the architecture, the data will be stored in the Data Warehouse as well as in Data Marts. Data Marts are discussed in a later section.
 Some architectures also include an Operational Data Store (ODS).

4. Data Presentation Layer

 This is the layer where users interact with the data stored in the data warehouse.
 Queries and several tools are employed to get different types of information based on the data.
 The information reaches the user through graphical representations of the data.
 Reporting tools are used to get business data, and business logic is applied to gather several kinds of information.
 Metadata information and system operations and performance are also maintained and viewed in this layer.

 Introduction to OLTP, ETL and OLAP Systems

Overview of OLTP

OLTP or Online Transaction Processing is a type of data processing approach where transactions play the major role in data manipulation in the database. It provides the facts and statistics needed for the Business Intelligence behind the business decision-making process. OLTP is usually applied in applications related to e-commerce, online banking, online shopping, sales and service platforms, etc. This type of data processing is known for its high performance, fast accessibility and reliable, consistent data.
Understanding OLTP

OLTP supports only those operations that are already defined. For example, in the case of online airline booking, booking a flight corresponds to an insertion in the database. OLTP ensures availability (for example, of the items in a cart) and concurrency when a large number of users access the same website at the same time. This is done using the following characteristics of OLTP:

1. Predefined operations – It strictly performs only the predefined operations, on a small number of records.
2. Direct updates – Updating of databases is directly accessible to end-users.
3. A small number of records – It processes only a handful of records at a time. For example, retrieving money from an ATM.
4. Maintaining limited historical data – It stores only the records of the last few days or weeks, enough to successfully perform the transactions.

Why do we need OLTP

We need OLTP to perform the tasks that are performed frequently by the system and that touch only a small number of records. These tasks involve the insertion, updating or deletion of data in databases. Consistency and concurrency are required to perform such tasks and to ensure high availability. OLTP maintains normalized databases and a decentralized system, which provides greater availability and consistency, and it maintains concurrency across the database. OLTP works differently in the case of batch processing and grid computing. By contrast, OLAP (Online Analytical Processing) handles a far greater volume of data, running the complex analytical queries used in the main business intelligence tasks. (A minimal transaction sketch follows.)
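To make the "predefined operation on a small number of records" idea concrete, here is a minimal OLTP-style transaction sketch in Python with sqlite3; the accounts table and the withdraw operation are hypothetical, echoing the ATM example above.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.execute("INSERT INTO accounts VALUES (1, 500.0)")

def withdraw(conn, account_id, amount):
    # A predefined operation touching a single record, run atomically:
    # either the balance check and the update both happen, or neither does.
    with conn:  # opens a transaction; commits on success, rolls back on error
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE id = ?", (account_id,)
        ).fetchone()
        if balance < amount:
            raise ValueError("insufficient funds")
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE id = ?",
            (amount, account_id),
        )

withdraw(conn, 1, 100.0)  # balance is now 400.0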

Overview of OLAP

OLAP stands for Online Analytical Processing; as the name indicates, OLAP exists for data analysis, and it enables us to analyze information from multiple database systems at the same time. In other words, it is a computing method that allows users to easily extract and query data in order to analyze it from different points of view. It is typically based on a huge store of data called the data warehouse: it collects the required data from the data warehouse and performs the analyses the business needs in order to make decisions that improve profit, sales, brand, marketing, and so on. It is therefore used in business intelligence, with queries that aid trend analysis, sales forecasting, financial reporting, planning and budgeting, among other things.

What is OLAP

Alongside OLAP, the other term we have is OLTP, that is, online transaction processing; both are online processing systems. OLTP is a transaction processing system mainly concerned with transactional tasks, while OLAP is an analytical processing system mainly concerned with analysis and reporting, giving users valuable insight to improve the business.

What is ETL

It is a process in data warehousing used to extract data from databases or source systems and, after transforming it, place the data into the data warehouse. It is a combination of three database functions, i.e. Extract, Transform and Load.
 Extract: This is the process of reading data from single or multiple databases, where the sources can be homogeneous or heterogeneous. All data acquired from the different sources is converted into the same data warehouse format and passed on for transformation.
 Transform: This is the process of transforming the extracted data into the form required as output, or into a form suitable for placing in another database.
 Load: This is the process of writing the desired output into the target database.

Why do we Need ETL?

 Helps in taking decisions by analyzing data.
 It can handle complex problems which cannot be handled by traditional databases.
 It provides a common data repository.
 Loads data from different sources into the target database.
 Data warehouse automatically updates according to the changes in the data
source.
 Verify data transformation, calculations and aggregation rules.
 Compares source and target systems data.
 Improves productivity.

ETL Overview

ETL stands for Extract, Transform and Load. It is a programming tool consisting of several functions that extract data from specified relational database source systems, transform the acquired data into the desired form by applying various methods, and then load or write the resulting data to the target database. ETL is a data integration process for gathering data from multiple data sources and converting it into one common format in order to build a Data Warehouse, a Database or any data storage system. As the name suggests, it uses three steps: Extract means to collect the data from all the data sources as required; Transform means to convert the data from multiple sources, with multiple formats, into a single common format that can be used for analysis and reporting purposes; and Load means to store all the transformed data in the Database or Data Warehouse system.
 Difference between OLTP and OLAP
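Drawing on the sections above, the key contrasts are:

 OLTP: handles day-to-day transactions (insert, update, delete) on a small number of records; predefined operations; normalized databases; keeps only a short history (days to weeks); serves applications such as banking and e-commerce.
 OLAP: handles analysis and reporting; complex queries over large volumes of historical, multidimensional, non-volatile data; denormalized schemas such as star and snowflake; keeps years of history; serves business intelligence, forecasting and budgeting.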

 Data Marts
Introduction to DATA MART

A pattern used in a data warehouse environment to deliver data to business users is called a data mart. It is a structure specific to the data warehouse environment, used by a particular business domain or team. An organization can have multiple data marts, each located in the data warehouse repository and serving one subject area. The different types of data mart are dependent, independent and hybrid data marts. Dependent data marts take data that has already been created in the central data warehouse, independent data marts take data directly from external or operational sources, and hybrid data marts combine both. We can think of data marts as logical subsets of a data warehouse.

Data Mart vs Data Warehouse

A data warehouse is a warehouse with a collection of data covering multiple subject areas. The maintenance and control part, like the collection of raw data and its processing, is mainly handled by corporate Information Technology (IT) groups, which provide various services to the parent organization.

The data warehouse is also referred to as a central or enterprise data warehouse. So a data warehouse has multiple sources, in contrast to a data mart, which in some cases is a subset of the data warehouse.

Types of Data Mart

1) Dependent Data Mart

A dependent data mart is built purely from the data warehouse; all the dependent marts together are grouped to form the enterprise data warehouse. It is purely a subset of the data warehouse, since it is created from the central DW.

Since clean and summarized data is already present in the central data warehouse, the ETT (Extract, Transform and Transportation) process is simplified. We just need to identify the particular subset and perform ETT on top of it.

These data marts are typically built to achieve better availability and a lot of
improved performance with better control and efficiency.

2) Independent Data Mart

This is not created from the central data warehouse, and its sources can be different. Since the data comes from somewhere other than the central DW, the ETL process is a bit different.

An independent data mart is typically used by a smaller group within the organization, and its sources are also limited. An independent data mart is generally created when a solution is needed in a relatively short time frame.

3) Hybrid Data Mart

A hybrid data mart allows you to group data from sources other than the central data warehouse (DW) together with data from the DW itself. This is of great benefit when dealing with ad hoc integration, for example when products or systems are added to the organization externally.
 ODS [Operational Data Store]

 What Is an ODS?
An operational data store (or “ODS”) is another paradigm for integrating enterprise
data that is relatively simpler than a data warehouse (DW). The ODS is a central
database (DB) where data that has been integrated from disparate sources
undergoes specific functions.

Like a DW (in which an ODS is often a component), data in an ODS can come from batch inputs, and is processed by extract, transform, and load (ETL) and data quality operations.

 ODS Example

This example compares and contrasts the ODS and the DW. Here, the ODS acts as a batch-oriented DW, updating and replacing each datum that resides in it (and adding new data), but it does not keep a running history of the measures it stores. You can implement this kind of ODS with batch-oriented middleware, reporting and OLAP tools. Or you can use one platform, like IRI Voracity, to connect to the sources, administer the DB(s), prototype and perform ETL jobs, cleanse, and report.
 Introduction to Data warehouse Schemas

The Data Warehouse Schema is a structure that logically defines the contents of the Data Warehouse and facilitates both the operations performed on the Data Warehouse and the maintenance activities of the Data Warehouse system. It usually includes a detailed description of the databases, tables, views, indexes, and data, which are regularly structured using predefined design types such as the Star Schema, the Snowflake Schema, and the Galaxy Schema (also known as the Fact Constellation Schema).

A schema is a logical description of the entire database. In the data warehouse it includes the names and descriptions of records. It covers all data items and also the different aggregates associated with the data. Just as a database has a schema, a schema must be maintained for a data warehouse as well. There are different schemas based on the setup and the data maintained in a data warehouse.

 Star Schema and Snowflake Schema

1. Star Schema
Here are some of the basic points of star schema which are as follows:

In a star schema there is one fact table in the middle and a number of associated dimension tables around it. The structure resembles a star, and hence it is known as a star schema.

The fact table here contains the primary information in the data warehouse. It is surrounded by the smaller dimension lookup tables, which hold the descriptive details referenced by the fact table. The primary key present in each dimension is related to a foreign key present in the fact table.

This means that a fact table has two types of columns: foreign keys to dimension tables, and measures, which contain numeric facts. At the center of the star there is a fact table, and the points of the star are the dimension tables.

The fact table is in 3NF form and the dimension tables are in denormalized form. Every dimension in a star schema is represented by only one dimension table. Each dimension table is joined to the fact table. The fact table has a key and measures.

• A single fact table and for each dimension one single dimension table.

• Every fact points to one tuple in each of the dimension tables and has additional
attributes

• Does not capture hierarchies directly

• Generated keys are used for performance and maintenance reasons
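A minimal sketch of a star schema in Python with sqlite3; the table and column names (fact_sales, dim_customer, dim_product) are illustrative rather than taken from any particular warehouse.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One denormalized dimension table per dimension
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT, country TEXT);
CREATE TABLE dim_product  (product_key  INTEGER PRIMARY KEY, name TEXT, category TEXT);

-- The fact table: foreign keys to each dimension plus numeric measures
CREATE TABLE fact_sales (
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    product_key  INTEGER REFERENCES dim_product(product_key),
    quantity     INTEGER,
    amount       REAL
);

INSERT INTO dim_customer VALUES (1, 'Bob', 'United Kingdom');
INSERT INTO dim_product  VALUES (1, 'Widget', 'Hardware');
INSERT INTO fact_sales   VALUES (1, 1, 3, 29.97);
""")

# A typical star-schema query: join the fact to its dimensions and aggregate
for row in conn.execute("""
    SELECT c.country, p.category, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_customer c ON f.customer_key = c.customer_key
    JOIN dim_product  p ON f.product_key  = p.product_key
    GROUP BY c.country, p.category
"""):
    print(row)  # ('United Kingdom', 'Hardware', 29.97)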


2. Snowflake Schema
Here are some of the basic points of snowflake schema which are as follows:

The snowflake schema is an extended version of the star schema, with additional dimension tables added around the star schema's dimensions. This schema is known as a snowflake due to its structure.

In this schema, the centralized fact table is connected to multiple dimensions. The dimensions are held in normalized form, spread across multiple related tables. The snowflake structure is more detailed and structured when compared to a star schema.

There are multiple levels of relationships, with child tables that have multiple parent tables. In a snowflake schema, only the dimension tables are affected by this normalization, not the fact tables.

The difference between the star and snowflake schema is that the dimensions of a snowflake schema are maintained in such a way that they reduce data redundancy. The tables are easy to manage and maintain, and they also save storage space.

However, because of this, more joins are needed in the query in order to execute it. The further expansion of dimension tables into separate tables is what leads to snowflaking: when a dimension table's low-cardinality attributes are split out into their own tables, the dimension is said to be snowflaked.

The dimension tables are divided into segregated normalized tables. Once segregated, they are joined back to the original dimension table through referential constraints. This schema may hamper performance, since more tables, and therefore more joins, are required to satisfy a query.

The advantage of the snowflake schema is that it uses less disk space, and implementing new dimensions or attributes is easy, since the normalized attribute tables can be shared by different dimensions.
• Represent dimensional hierarchies directly by normalizing the dimension tables

• Easy to maintain

• Save storage, but it is alleged that it reduces effectiveness of browsing
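Continuing the star schema sketch above (reusing its conn and fact_sales), the same customer dimension in snowflaked form; dim_country is a hypothetical normalized child table, and the same rollup now needs one extra join.

conn.executescript("""
-- The dimension is normalized: country moves into its own table
CREATE TABLE dim_country (country_key INTEGER PRIMARY KEY, country TEXT);
CREATE TABLE dim_customer_sf (
    customer_key INTEGER PRIMARY KEY,
    name         TEXT,
    country_key  INTEGER REFERENCES dim_country(country_key)
);
INSERT INTO dim_country VALUES (1, 'United Kingdom');
INSERT INTO dim_customer_sf VALUES (1, 'Bob', 1);
""")

# The country rollup now takes an extra join through dim_country
rows = conn.execute("""
    SELECT co.country, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_customer_sf c ON f.customer_key = c.customer_key
    JOIN dim_country    co ON c.country_key  = co.country_key
    GROUP BY co.country
""").fetchall()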

 What is fact table

A fact table in a Data Warehouse system is the table that contains all the facts, or business measures, which can be subjected to analysis and reporting activities when required. These tables hold fields that represent the facts directly, as well as the foreign key fields that connect the fact table to the dimension tables in the Data Warehouse system. A Data Warehouse system can have one or more fact tables, depending on the model used to design the Data Warehouse.
Characteristics of Fact Table

Below are the characteristics of the fact table:

 Keys: It has a primary key which is the accumulation of all the primary keys of the dimension tables linked with it. That key is known as a concatenated key, and it uniquely identifies the row.

 Fact Table Grain: The grain of a fact table is the level of detail, or the depth of information, contained in that table. The finer the grain, the more detail the table captures.

 Additive Measures: Measures can be fully additive, semi-additive or non-additive. Fully additive (or simply additive) measures can be summed across all the dimensions. Semi-additive measures can be summed across some of the dimensions but not all of them. Non-additive measures are stored as fundamental units of measurement for a business process and cannot be summed at all. (See the sketch after this list.)

 Sparse Data: Some records have attributes or measures containing null values; these provide no information.

 Shrunken Rollup Dimensions: Shrunken rollup dimensions are subdivisions of the base dimension.
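As a small illustration of the additivity distinction, a plain-Python sketch follows; the sales and account-balance figures are made-up sample data.

# Each record: (day, amount). Sales are fully additive: summing across any
# dimension (here, across days) is meaningful.
sales = [("Mon", 100.0), ("Tue", 150.0), ("Wed", 50.0)]
total_sales = sum(amount for _, amount in sales)  # 300.0 -- a valid total

# An account balance is semi-additive: it can be summed across accounts on
# one day, but summing one account's balance across days is meaningless.
balances = [("Mon", 500.0), ("Tue", 450.0), ("Wed", 480.0)]
wrong = sum(b for _, b in balances)       # 1430.0 -- NOT a real balance
end_of_week = balances[-1][1]             # 480.0 -- take the last value instead
avg_balance = sum(b for _, b in balances) / len(balances)  # or average it

# A non-additive measure, such as a profit-margin ratio, cannot be summed at
# all; it must be recomputed from its additive components.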


 What is the Dimension table

Introduction to Dimension Table

This is a table in the star schema of a data warehouse. Data warehouses are built using dimensional data models, which consist of fact and dimension tables. Dimension tables are used to describe dimensions; they contain dimension keys, values and attributes.

Types of Dimensions

1. SCD (Slowly Changing Dimensions)

Dimension attributes that tend to change slowly over time, rather than at a regular interval, are called slowly changing dimensions. For example, an address or phone number changes, but not regularly. Consider a man who travels to different countries and needs to change his address according to the country. Such changes can be handled in six ways:

What are slowly changing dimensions?

When organising a data warehouse into Kimball-style star schemas, you relate fact records to a specific dimension record with its related attributes. But what if the information in the dimension changes? Do you now associate all fact records with the new value? Do you ignore the change to keep historical accuracy? Or do you treat facts before the dimension change differently to those after?

It is this decision that determines whether to make your dimension a slowly changing one. There are several different types of SCD, depending on how you treat incoming change.

What are the types of SCD?

Very simply, there are 6 types of Slowly Changing Dimension that are commonly used; they are as follows:

 Type 0 – Fixed Dimension
 No changes allowed; the dimension never changes.
 Type 1 – No History
 Update the record directly; there is no record of historical values, only the current state.
 Type 2 – Row Versioning
 Track changes as version records, with a current flag, active dates and other metadata.
 Type 3 – Previous Value Column
 Track a change to a specific attribute; add a column to show the previous value, which is updated as further changes occur.
 Type 4 – History Table
 Show the current value in the dimension table, but track all changes in a separate table.
 Type 6 – Hybrid SCD
 Utilise techniques from SCD Types 1, 2 and 3 to track change.

In reality, only types 0, 1 and 2 are widely used, with the others reserved for very specific requirements. Confusingly, there is no SCD type 5 in commonly agreed definitions.

After you have implemented your chosen dimension type, you can then point your
fact records at the relevant business or surrogate key. Surrogate keys in these
examples relate to a specific historical version of the record, removing join
complexity from later data structures.

Practical Examples
We have a very simple ‘customer’ dimension, with just 2 attributes – Customer
Name and Country:
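An illustrative version of that starting table (values reconstructed for the example):

ID | Customer Name | Country
1 | Bob | United Kingdom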

However, Bob has just informed us that he has now moved to the US and we want
to update our dimension record to reflect this. We can see how the different SCD
types will handle this change and the pro/cons of each method.

Type 0

Our table remains the same. This means our existing reports will continue to show the same figures; perhaps it is a business requirement that each customer is always allocated to the country they signed up from.

All future transactions associated to Bob will also be allocated to the ‘United
Kingdom’ country.

Type 1

The table is updated to reflect Bob’s new country:
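(illustrative values)

ID | Customer Name | Country
1 | Bob | United States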

All fact records associated with Bob will now be associated with the ‘United
States’ country, regardless of when they occurred.

We often just want to see the current value of a dimension attribute – it could be that the only dimension changes that occur are corrections to mistakes, or maybe there is no requirement for historical reporting.

Type 2

In order to support Type 2 changes, we need to add four columns to our table:

· Surrogate Key – the original ID will no longer be sufficient to identify the specific record we require; we therefore need to create a new ID that the fact records can join to specifically.

· Current Flag – a quick method of returning only the current version of each record.

· Start Date – the date from which the specific historical version is active.

· End Date – the date to which the specific historical version is active.

With these elements in place, our table will now look like:
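(illustrative values; dates shown as placeholders)

Surrogate Key | ID | Customer Name | Country | Current Flag | Start Date | End Date
1 | 1 | Bob | United Kingdom | N | (sign-up date) | (date of move)
2 | 1 | Bob | United States | Y | (date of move) | NULL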

This method is very powerful – you maintain the history for the entire record and
can easily perform change-over-time analysis. However, it also comes with more
maintenance overhead, increased storage requirement and potential performance
impacts if used on very large dimensions.

Type 2 is the most common method of tracking change in data warehouses.
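A minimal sketch of applying a Type 2 change in Python; the in-memory list stands in for the dimension table, the column names mirror the four added above, and the dates are illustrative. This is an illustration of the idea, not a production implementation.

from datetime import date

dim_customer = [
    {"surrogate_key": 1, "id": 1, "name": "Bob", "country": "United Kingdom",
     "current": True, "start_date": date(2020, 1, 1), "end_date": None},
]

def apply_type2_change(dim, natural_id, new_values, change_date):
    # Expire the current version of the record and append a new version.
    new_key = max(row["surrogate_key"] for row in dim) + 1
    for row in dim:
        if row["id"] == natural_id and row["current"]:
            row["current"] = False           # expire the old version
            row["end_date"] = change_date
            new_row = {**row, **new_values,  # carry unchanged attributes over
                       "surrogate_key": new_key,
                       "current": True,
                       "start_date": change_date,
                       "end_date": None}
            dim.append(new_row)
            return new_row

apply_type2_change(dim_customer, 1, {"country": "United States"}, date(2021, 6, 1))
# dim_customer now holds both versions; fact rows join on surrogate_key.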

Type 3

Here, we add a new column called “Previous Country” to track what the last value
for our attribute was.
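(illustrative values)

ID | Customer Name | Country | Previous Country
1 | Bob | United States | United Kingdom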

Note how this will only provide a single historical value for Country. If the
customer changes his name, we will not be able to track it without adding a new
column. Likewise, if Bob moved country again, we would either need to add
further “Previous Previous Country” columns or lose the fact that he once lived in
the United Kingdom.

Type 4

There is no change to the structure of our existing table here; we simply update the record as if a Type 1 change had occurred. However, we simultaneously maintain a history table to keep track of these changes:

Our Dimension table reads:
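(illustrative values)

ID | Customer Name | Country
1 | Bob | United States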


Whilst our Type 4 historical table is created as:
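(illustrative values; dates shown as placeholders)

ID | Customer Name | Country | Start Date | End Date
1 | Bob | United Kingdom | (sign-up date) | (date of move)
1 | Bob | United States | (date of move) | NULL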

Depending on your requirements, you may place both ID and Surrogate Key onto
the fact record so that you can optimise performance whilst maintaining
functionality.

Separating the historical data makes your dimensions smaller and therefore reduces
complexity and improves performance if the majority of uses only need the current
value.

However, if you do require historical values, this structure adds complexity and
data redundancy overheads. It is generally assumed that the system will use Type 1
or Type 2 rather than Type 4.

Type 6

The ‘Hybrid’ method simply takes SCD types 1, 2 and 3 and applies all techniques.
We would maintain a history of all changes whilst simultaneously updating a
“current value” column on all records.

This gives you the ability to provide an element of change comparison without
additional calculation, whilst still maintaining a full, detailed history of all changes
in the system.

Personally, if this requirement came up, I would avoid the data redundancy of this extra column and simply calculate the current value using the "LAST_VALUE()" window function at run-time, although this depends on your priorities between data storage and direct query performance.

Conformed Dimension

A conformed dimension is a dimension that has exactly the same meaning and content when referred to from different fact tables. A conformed dimension can
refer to multiple tables in multiple data marts within the same organization. For
two dimension tables to be considered as conformed, they must either be identical
or one must be a subset of another. There cannot be any other type of difference
between the two tables. For example, two dimension tables that are exactly the
same except for the primary key are not considered conformed dimensions.

Degenerate Dimension

A degenerate dimension arises when a dimension attribute is stored as part of the fact table and not in a separate dimension table. For example, in a sales fact table, the product ID comes from the product dimension table, but the invoice number is a standalone attribute with no other attributes associated with it. The invoice number can still be crucial, since the business may want to analyze the quantity of products sold per invoice.
Junk Dimension

A junk dimension is a single table with a combination of different and unrelated attributes, used to avoid having a large number of foreign keys in the fact table. Junk dimensions are often created to manage the foreign keys created by rapidly changing dimensions.

Junk Dimension Tables

Junk dimensions are used to reduce the number of dimensions in the dimensional model and the number of columns in the fact table. A junk dimension combines two or more related low-cardinality flags into a single dimension. An example of this may be car color (red, black, blue, etc.) and body style (sedan, van, SUV, etc.). As you can see, these are limited in number and, if created as separate dimensions, each would be limited to a single attribute. In order to eliminate these small dimensions, we create a single "junk" dimension which cross joins all possible attribute values into a single dimension that is then used in the fact table.
By combining these into a single dimension we have made the model easier to
understand and use by both IT and business users.

These types of dimensions are useful and powerful in creating data models that are easier to use and understand. (A short sketch of the junk dimension follows.)
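A minimal sketch of building such a junk dimension by cross joining the two example attributes; purely illustrative.

from itertools import product

colors = ["red", "black", "blue"]
body_styles = ["sedan", "van", "SUV"]

# Cross join all possible combinations into one small "junk" dimension;
# the fact table then carries a single junk_key instead of two flag columns.
junk_dimension = [
    {"junk_key": key, "color": color, "body_style": style}
    for key, (color, style) in enumerate(product(colors, body_styles), start=1)
]

for row in junk_dimension:
    print(row)  # 9 rows: every color/body-style combination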

Role play dimension

A role-playing dimension is a table that has multiple valid relationships with a fact table. For example, a fact table may include foreign keys for both a ship date and a delivery date. The same attributes apply to each foreign key, so the same date dimension table can be joined to both foreign keys, playing a different role in each case.

Conformed vs Role-Playing Dimension

 Conformed: the same dimension used in different fact tables, with the same meaning in each, e.g. CustomerID.
 Role-Playing: the same dimension used multiple times within the same fact table, but with different meanings, e.g. Date. (See the sketch below.)
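A short illustration of a role-playing date dimension in Python with sqlite3; the table names fact_order and dim_date are assumed for the example.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, full_date TEXT);
CREATE TABLE fact_order (
    order_id          INTEGER PRIMARY KEY,
    ship_date_key     INTEGER REFERENCES dim_date(date_key),
    delivery_date_key INTEGER REFERENCES dim_date(date_key)
);
INSERT INTO dim_date VALUES (1, '2021-06-01'), (2, '2021-06-04');
INSERT INTO fact_order VALUES (100, 1, 2);
""")

# The single dim_date table plays two roles, joined once per foreign key
row = conn.execute("""
    SELECT o.order_id, ship.full_date AS shipped, dlv.full_date AS delivered
    FROM fact_order o
    JOIN dim_date ship ON o.ship_date_key     = ship.date_key
    JOIN dim_date dlv  ON o.delivery_date_key = dlv.date_key
""").fetchone()
print(row)  # (100, '2021-06-01', '2021-06-04')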
