DWH Notes
A data warehouse is a relational database that is designed for query and analysis
rather than for transaction processing. It usually contains historical data derived
from transaction data, but it can include data from other sources. It separates
analysis workload from transaction workload and enables an organization to
consolidate data from several sources.
A data warehouse has four defining characteristics:
Subject Oriented
Integrated
Nonvolatile
Time Variant
Subject Oriented
Data warehouses are designed to help you analyze data. For example, to learn more
about your company's sales data, you can build a warehouse that concentrates on
sales. Using this warehouse, you can answer questions like "Who was our best
customer for this item last year?" This ability to define a data warehouse by subject
matter, sales in this case, makes the data warehouse subject oriented.
Integrated
Integration is closely related to subject orientation. Data warehouses must put data
from disparate sources into a consistent format. They must resolve such problems
as naming conflicts and inconsistencies among units of measure. When they
achieve this, they are said to be integrated.
Nonvolatile
Nonvolatile means that, once entered into the warehouse, data should not change.
This is logical because the purpose of a warehouse is to enable you to analyze what
has occurred.
Time Variant
In order to discover trends in business, analysts need large amounts of data. This is
very much in contrast to online transaction processing (OLTP) systems, where
performance requirements demand that historical data be moved to an archive. A
data warehouse's focus on change over time is what is meant by the term time
variant.
Many global corporations have turned to data warehousing to organize data that
streams in from corporate branches and operations centers around the world. It’s
essential for IT students to understand how data warehousing helps businesses
remain competitive in a quickly evolving global marketplace.
A typical data warehouse architecture consists of four layers:
Source Layer
Staging Layer
Storage Layer
Presentation Layer
1. Data Source Layer
The Data Source Layer is where data from the sources first enters the warehouse environment before being passed to the other layers for processing.
The data can be of any type.
The source data can be a database, a spreadsheet, or any other kind of file.
The source data can be in any format; we cannot expect a uniform format, considering how different the sources are.
In real life, some examples of source data are:
Log files of specific applications or jobs, or employee records in a company.
Survey data, stock exchange data, etc.
Web browser data, and many more.
2. Data Staging Layer
The following steps take place in the Data Staging Layer:
The data received from the Source Layer is fed into the Staging Layer, where the first process applied to the acquired data is extraction.
The data in the landing database is then put through several quality checks and staging operations in the staging area.
The structure and schema are also identified, and adjustments are made to unordered data, so as to bring about commonality among the data that has been acquired.
Having a place set up for the data just before transformation and other changes is an added advantage that makes the staging process very important.
It makes data processing easier.
4. Data Presentation Layer
This is the layer where users interact with the data stored in the data warehouse.
Queries and various tools are employed to obtain different types of information based on the data.
The information reaches the user through graphical representations of the data.
Reporting tools are used to get business data, and business logic is applied to gather several kinds of information.
Metadata information and system operations and performance are also maintained and viewed in this layer.
Overview of OLTP
OLTP supports only those operations that are already defined. For example, in the case of online airline booking, booking a flight corresponds to an insertion in the database. OLTP ensures availability (for example, that the seat being booked is still available) and concurrency when a large number of users access the same website at the same time. The following characteristics of OLTP make this possible.
We need OLTP to perform tasks that the system carries out frequently and that touch only a small number of records. These tasks involve the insertion, updating, or deletion of data in databases. Consistency and concurrency are required to perform such tasks, and they ensure greater availability. OLTP maintains normalized databases and a decentralized system, which provides greater availability and consistency and also maintains concurrency across the database. OLTP often works differently in the case of batch processing and grid computing. By contrast, OLAP (Online Analytical Processing) works on complex queries over large volumes of data, used mainly in business intelligence tasks.
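As a minimal sketch of such a short, predefined transaction, the following uses Python's built-in sqlite3 module; the bookings table, its columns, and the seat-conflict rule are all hypothetical:

```python
import sqlite3

conn = sqlite3.connect("airline.db")  # hypothetical OLTP database
conn.execute("""CREATE TABLE IF NOT EXISTS bookings (
    booking_id INTEGER PRIMARY KEY,
    passenger  TEXT NOT NULL,
    flight_no  TEXT NOT NULL,
    seat       TEXT NOT NULL,
    UNIQUE (flight_no, seat)  -- consistency: no seat is sold twice
)""")
conn.commit()

try:
    with conn:  # one short transaction: commits on success, rolls back on error
        conn.execute(
            "INSERT INTO bookings (passenger, flight_no, seat) VALUES (?, ?, ?)",
            ("Alice", "BA117", "12A"),
        )
except sqlite3.IntegrityError:
    print("Seat already taken - the insert is rejected to keep the data consistent")

conn.close()
```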
Overview of OLAP
OLAP stands for Online Analytical Processing; as the name indicates, OLAP exists for data analysis, and it enables us to analyze information from multiple database systems at the same time. In other words, it is a computing method that allows users to easily extract and query data in order to analyze it from different points of view. It is based on the huge body of data called the data warehouse: OLAP collects the required data from the data warehouse and performs the analysis the business requires to make decisions that improve profit, sales, brand, marketing, and so on. It is therefore used in business intelligence, where its queries aid trend analysis, sales forecasting, financial reporting, planning, and budgeting, among other things.
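To make "analyzing from different points of view" concrete, here is a small sketch using Python's sqlite3 that aggregates the same made-up sales facts first by region and then by quarter; all names and figures are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, quarter TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("EMEA", "Q1", 120.0), ("EMEA", "Q2", 90.0),
     ("APAC", "Q1", 75.0), ("APAC", "Q2", 110.0)],
)

# The same facts, viewed from two different angles.
for dim in ("region", "quarter"):
    print(f"-- total sales by {dim} --")
    for row in conn.execute(f"SELECT {dim}, SUM(amount) FROM sales GROUP BY {dim}"):
        print(row)
```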
OLAP vs OLTP
Alongside OLAP, the other term we have is OLTP, Online Transaction Processing. Both are online processing systems: OLTP is a transaction processing system mainly concerned with transactional tasks, while OLAP is an analytical processing system mainly concerned with analysis and reporting, giving users valuable insights to improve the business.
What is ETL
It is a process in data warehousing used to extract data from databases or source systems and, after transforming it, place the data into the data warehouse. It is a combination of three database functions, i.e. Extract, Transform and Load.
Extract: This is the process of reading data from one or more databases, where the sources can be homogeneous or heterogeneous. All data acquired from the different sources is converted into the same data warehouse format and passed on for transformation.
Transform: This is the process of transforming the extracted data into the form required as output, or into a form suitable for placing in another database.
Load: This is the process of writing the desired output into the target database.
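A minimal end-to-end sketch of the three steps in Python; the CSV source file, its columns, and the target table are all invented for illustration:

```python
import csv
import sqlite3

def extract(path):
    """Extract: read raw rows from a source (here, a CSV file)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transform: coerce types and normalize values into the warehouse format."""
    return [
        (row["order_id"], row["customer"].strip().title(), float(row["amount_usd"]))
        for row in rows
    ]

def load(records, conn):
    """Load: write the transformed records into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS fact_orders "
                 "(order_id TEXT, customer TEXT, amount_usd REAL)")
    conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", records)
    conn.commit()

conn = sqlite3.connect("warehouse.db")
load(transform(extract("orders.csv")), conn)
```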
Data Marts
Introduction to Data Mart
A data mart is a pattern used in the data warehouse environment to retrieve client data. It is a structure specific to the data warehouse, used by a particular business domain or team. Each such business area typically has its own data mart, located in the data warehouse repository. The different types of data mart are dependent, independent, and hybrid. Dependent data marts take data that has already been created in the central data warehouse, whereas independent data marts take data directly from external sources without relying on the data warehouse. We can think of data marts as logical subsets of the data warehouse.
Since clean and summarised data is already present in the central data warehouse, the ETT process (Extract, Transform and Transportation) is simplified: we just need to identify the particular subset and perform ETT on top of it.
Data marts are typically built to achieve better availability and much improved performance, with better control and efficiency.
An independent data mart is mostly used by a smaller group within an organization, and its sources are also limited. It is generally created when a solution is needed within a relatively short time frame.
A hybrid data mart allows you to combine data from the central data warehouse (DW) with data from all other sources. This is of great benefit when we deal with ad hoc integration, for example when products are added to the organization externally.
ODS [Operational Data Store]
What Is an ODS?
An operational data store (or "ODS") is another paradigm for integrating enterprise data, one that is relatively simpler than a data warehouse (DW). The ODS is a central database (DB) where data that has been integrated from disparate sources undergoes specific functions.
As in a DW (of which an ODS is often a component), data in an ODS can come from batch inputs and is processed by extract, transform, and load (ETL) and data quality operations.
ODS Example
This example compares and contrasts the ODS and DW as well. Here, the ODS acts as a batch-oriented DW, updating and replacing each datum that resides in it (and adding new data), but it does not keep a running history of the measures it stores. You can implement this kind of ODS with batch-oriented middleware, reporting, and OLAP tools. Or you can use one platform, like IRI Voracity, to connect to the sources, administer the DB(s), prototype and perform ETL jobs, cleanse, and report.
Introduction to Data Warehouse Schemas
The data warehouse schema is a structure that logically defines the contents of the data warehouse, facilitating both the operations performed on the data warehouse and the maintenance activities of the data warehouse system. It usually includes detailed descriptions of the databases, tables, views, indexes, and data, which are regularly structured using predefined design types such as the Star Schema, the Snowflake Schema, and the Galaxy Schema (also known as the Fact Constellation Schema).
A schema is a logical description of the entire database. In the data warehouse it includes the names and descriptions of records, all data items, and the different aggregates associated with the data. Just as a database has a schema, a schema must be maintained for a data warehouse as well. There are different schemas depending on the setup and the data maintained in a data warehouse.
1. Star Schema
Here are some of the basic points of the star schema:
In a star schema there is one fact table in the middle and a number of associated dimension tables around it. Since this structure resembles a star, it is known as a star schema.
The fact table holds the primary information in the data warehouse. It is surrounded by the smaller dimension lookup tables, which hold the details for the dimensions. The primary key present in each dimension table is related to a foreign key present in the fact table.
This means the fact table has two types of columns: foreign keys to the dimension tables, and measures, which contain numeric facts. At the center of the star is the fact table, and the points of the star are the dimension tables.
The fact tables are in 3NF form and the dimension tables are in denormalized form.
Every dimension in a star schema should be represented by only one dimension table. Each dimension table should be joined to the fact table. The fact table should have a key and measures.
• A single fact table, with one dimension table for each dimension.
• Every fact points to one tuple in each of the dimension tables and has additional attributes.
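As a sketch of this structure, the following builds a tiny star schema with Python's built-in sqlite3 module; the sales example and every table and column name are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT, country TEXT);
CREATE TABLE dim_product  (product_key  INTEGER PRIMARY KEY, name TEXT, category TEXT);

-- The fact table: foreign keys to each dimension plus numeric measures.
CREATE TABLE fact_sales (
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    product_key  INTEGER REFERENCES dim_product(product_key),
    quantity     INTEGER,
    amount       REAL
);
""")

# A typical star-schema query: join the fact table to its dimensions and aggregate.
query = """
SELECT c.country, p.category, SUM(f.amount)
FROM fact_sales f
JOIN dim_customer c ON f.customer_key = c.customer_key
JOIN dim_product  p ON f.product_key  = p.product_key
GROUP BY c.country, p.category
"""
print(conn.execute(query).fetchall())  # empty until the tables are populated
```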
2. Snowflake Schema
The snowflake schema acts like an extended version of the star schema: additional dimension tables are added to the star schema. The schema is known as a snowflake due to its structure.
In this schema, the centralized fact table is connected to multiple dimensions. The dimensions are present in normalized form, spread across multiple related tables. The snowflake structure is more detailed and structured compared to the star schema.
There are multiple levels of relationships, with child tables that have multiple parent tables. In the snowflake schema, only the dimension tables are affected, not the fact tables.
The difference between the star and snowflake schemas is that the dimensions of the snowflake schema are maintained in a way that reduces data redundancy. The tables are easy to manage and maintain, and they also save storage space. However, because of this, more joins are needed in order to execute a query. The further expansion of the tables leads to snowflaking: when a dimension table's low-cardinality attributes are split out into their own tables, the dimension is said to be snowflaked.
The dimension tables are divided into segregated normalized tables. Once segregated, they are joined back to the original dimension table through referential constraints. This schema may hamper performance, as more tables are required and more joins must be satisfied.
The advantage of the snowflake schema is that it uses less disk space. Adding new dimensions to this schema is easy, and the same set of attributes can be shared by different sources.
• Represent dimensional hierarchies directly by normalizing the dimension tables
• Easy to maintain
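The sketch below shows what snowflaking one dimension looks like: the hypothetical product dimension from the star example is split into normalized product and category tables, at the cost of one extra join:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Snowflaked: the low-cardinality category attribute moves into its own table.
CREATE TABLE dim_category (category_key INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT,
    category_key INTEGER REFERENCES dim_category(category_key)
);
CREATE TABLE fact_sales (product_key INTEGER, amount REAL);
""")

# The query now needs an extra join to reach the category attribute.
query = """
SELECT c.category_name, SUM(f.amount)
FROM fact_sales f
JOIN dim_product  p ON f.product_key  = p.product_key
JOIN dim_category c ON p.category_key = c.category_key
GROUP BY c.category_name
"""
print(conn.execute(query).fetchall())  # empty until the tables are populated
```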
Fact Table
A fact table in a data warehouse system is the table that contains all the facts, or business information, which can be subjected to analysis and reporting activities when required. These tables hold the fields that represent the direct facts, as well as the foreign key fields used to connect the fact table with the dimension tables in the data warehouse system. A data warehouse system can have one or more fact tables, depending on the model type used to design the data warehouse.
Characteristics of Fact Table
Keys: A fact table has a primary key which is the concatenation of the primary keys of all the dimension tables linked with it. This key is known as a concatenated key.
Grain: The grain of a fact table depicts the level of detail, or the depth of the information, that is contained in the table.
Measures: Additive measures can be summed across all the dimensions. Semi-additive measures can be added across some of the dimensions but not all of them (for example, an account balance can be summed across accounts but not across time), and non-additive measures cannot be meaningfully added across any dimension.
Sparse Data: There can be records that have attributes containing null values.
Dimension Table
A dimension table is a table in the star schema of a data warehouse. Data warehouses are built using dimensional data models, which consist of fact and dimension tables. The latter are used to describe dimensions: they contain dimension keys, values, and attributes.
Types of Dimensions
Very simply, there are six types of Slowly Changing Dimension that are commonly used. They are Type 0, Type 1, Type 2, Type 3, Type 4, and Type 6.
After you have implemented your chosen dimension type, you can then point your fact records at the relevant business or surrogate key. Surrogate keys in these examples relate to a specific historical version of the record, removing join complexity from later data structures.
Practical Examples
We have a very simple 'customer' dimension, with just two attributes, Customer Name and Country:
Customer Name    Country
Bob              United Kingdom
However, Bob has just informed us that he has now moved to the US, and we want to update our dimension record to reflect this. We can see how the different SCD types will handle this change, and the pros and cons of each method.
Type 0
Our table remains the same, which means our existing reports will continue to show the same figures; maybe it is a business requirement that each customer is always allocated to the country they signed up from.
All future transactions associated with Bob will also be allocated to the 'United Kingdom' country.
Type 1
All fact records associated with Bob will now be associated with the 'United States' country, regardless of when they occurred.
We often just want to see the current value of a dimension attribute; it could be that the only dimension changes that occur are corrections of mistakes, or maybe there is no requirement for historical reporting.
Type 2
In order to support Type 2 changes, we need to add four columns to our table:
· Surrogate Key – A new key that identifies each specific historical version of the record
· Current Flag – A quick method of returning only the current version of each record
· Start Date – The date from which the specific historical version is active
· End Date – The date to which the specific historical version is active
With these elements in place, our table now holds one row per historical version of a record: Bob's original 'United Kingdom' row is end-dated and flagged as no longer current, and a new current row records his move to the 'United States'.
This method is very powerful: you maintain the history for the entire record and can easily perform change-over-time analysis. However, it also comes with more maintenance overhead, increased storage requirements, and potential performance impacts if used on very large dimensions.
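A minimal sketch of applying this Type 2 change with Python's built-in sqlite3 module; the surrogate keys and dates are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE dim_customer (
    surrogate_key INTEGER PRIMARY KEY,
    customer_name TEXT,
    country       TEXT,
    current_flag  INTEGER,
    start_date    TEXT,
    end_date      TEXT
)""")
# Bob's original record (dates are invented for the example).
conn.execute(
    "INSERT INTO dim_customer VALUES (1, 'Bob', 'United Kingdom', 1, '2015-01-01', NULL)")

# Type 2 change: end-date the old version, then insert the new current version.
conn.execute("""UPDATE dim_customer
                SET current_flag = 0, end_date = '2021-06-01'
                WHERE customer_name = 'Bob' AND current_flag = 1""")
conn.execute(
    "INSERT INTO dim_customer VALUES (2, 'Bob', 'United States', 1, '2021-06-01', NULL)")

for row in conn.execute("SELECT * FROM dim_customer"):
    print(row)  # both historical versions remain queryable
```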
Type 3
Here, we add a new column called “Previous Country” to track what the last value
for our attribute was.
Note how this will only provide a single historical value for Country. If the
customer changes his name, we will not be able to track it without adding a new
column. Likewise, if Bob moved country again, we would either need to add
further “Previous Previous Country” columns or lose the fact that he once lived in
the United Kingdom.
Type 4
There is no change to our existing table here; we simply update the record as if a Type 1 change had occurred. However, we simultaneously maintain a history table to keep track of these changes.
Depending on your requirements, you may place both ID and Surrogate Key onto
the fact record so that you can optimise performance whilst maintaining
functionality.
Separating the historical data makes your dimensions smaller, which reduces complexity and improves performance if the majority of users only need the current value.
However, if you do require historical values, this structure adds complexity and data redundancy overheads. It is generally assumed that a system will use Type 1 or Type 2 rather than Type 4.
Type 6
The ‘Hybrid’ method simply takes SCD types 1, 2 and 3 and applies all techniques.
We would maintain a history of all changes whilst simultaneously updating a
“current value” column on all records.
This gives you the ability to provide an element of change comparison without
additional calculation, whilst still maintaining a full, detailed history of all changes
in the system.
Personally, if this requirement came up, I would avoid the data redundancy of this extra column and simply calculate the current value using the LAST_VALUE() window function at run-time, although this depends on your priorities between data storage and direct querying performance.
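A sketch of that run-time calculation over a Type 2 style table, again using Python's sqlite3 (window functions require SQLite 3.25+); the table, columns, and dates are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_customer (customer_name TEXT, country TEXT, start_date TEXT)")
conn.executemany("INSERT INTO dim_customer VALUES (?, ?, ?)", [
    ("Bob", "United Kingdom", "2015-01-01"),  # dates invented for the example
    ("Bob", "United States",  "2021-06-01"),
])

# Derive the current country per customer at query time instead of storing it.
query = """
SELECT DISTINCT customer_name,
       LAST_VALUE(country) OVER (
           PARTITION BY customer_name ORDER BY start_date
           ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
       ) AS current_country
FROM dim_customer
"""
print(conn.execute(query).fetchall())  # [('Bob', 'United States')]
```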
Conformed Dimension
A conformed dimension is a dimension that has exactly the same meaning and content when referred to from different fact tables. A conformed dimension can be referenced by multiple fact tables in multiple data marts within the same organization. For two dimension tables to be considered conformed, they must either be identical or one must be a subset of the other. There cannot be any other type of difference between the two tables. For example, two dimension tables that are exactly the same except for the primary key are not considered conformed dimensions.
Junk Dimension
Junk dimensions are used to reduce the number of dimensions in the dimensional model and to reduce the number of columns in the fact table. A junk dimension combines two or more related low-cardinality flags into a single dimension. An example of this may be car color (red, black, blue, etc.) and body style (sedan, van, SUV, etc.). As you can see, these are limited in number and, if created as single dimensions, each would be limited to a single attribute. In order to eliminate these small dimensions, we create a single "junk" dimension which cross joins all possible attribute values into a single dimension that is then used in the fact table.
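A minimal sketch of building such a cross-joined junk dimension in Python; the attribute values follow the example above, while the key name is invented:

```python
from itertools import product  # cross join of the low-cardinality attributes

colors = ["red", "black", "blue"]
body_styles = ["sedan", "van", "SUV"]

# Each combination becomes one row of the junk dimension, keyed by a surrogate id.
junk_dimension = [
    {"junk_key": key, "color": color, "body_style": style}
    for key, (color, style) in enumerate(product(colors, body_styles), start=1)
]

for row in junk_dimension:
    print(row)  # 3 x 3 = 9 rows; the fact table then stores only junk_key
```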
By combining these into a single dimension, we have made the model easier to understand and use for both IT and business users.
Both of these dimension types are useful and powerful in creating data models that are easier to use and understand.
Role-Playing Dimension
A role-playing dimension is a table that has multiple valid relationships with a fact table. For example, a fact table may include foreign keys for both a ship date and a delivery date. The same attributes apply to each foreign key, so the same table can be joined to both foreign keys.
Conformed: the same dimension used in different fact tables, with the same meaning, e.g. CustomerID.
Role-Playing: the same dimension used multiple times within the same fact table, but with different meanings, e.g. Date.
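A minimal sketch of a role-playing date dimension joined twice to one fact table, using Python's sqlite3; all names and values are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, full_date TEXT);
CREATE TABLE fact_orders (
    order_id          INTEGER,
    ship_date_key     INTEGER REFERENCES dim_date(date_key),
    delivery_date_key INTEGER REFERENCES dim_date(date_key)
);
""")
conn.executemany("INSERT INTO dim_date VALUES (?, ?)",
                 [(1, "2021-06-01"), (2, "2021-06-04")])
conn.execute("INSERT INTO fact_orders VALUES (100, 1, 2)")

# One physical dim_date, joined twice under two different roles.
query = """
SELECT o.order_id, ship.full_date AS shipped, delivery.full_date AS delivered
FROM fact_orders o
JOIN dim_date ship     ON o.ship_date_key     = ship.date_key
JOIN dim_date delivery ON o.delivery_date_key = delivery.date_key
"""
print(conn.execute(query).fetchall())  # [(100, '2021-06-01', '2021-06-04')]
```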