DWM Lecture 1
DWM Lecture 1
DWM Lecture 1
Mining
Lecture 1
Lecture 1
Introduction and general information about the
subject
Introduction to Data Warehouse Concepts and
Designs
Reading
Data Warehousing Fundamentals, Paulraj Ponniah,
John Wiley & Sons, 2001.
The Data Warehouse Lifecycle Toolkit: Expert Methods
for Designing, Developing and Deploying Data
Warehouses, Ralph Kimball, John Wiley & Sons, 1998.
Handouts
Assessment
Activity Weightage
Mid-Term 20%
Final Exam 30%
Case Studies/CP 25%
Final Project 15%
Quizzes 10%
Total 100%
Requirements for Passing
Minimum 80% attendance
50% or more in the final components
which is
50% or more for the assignment
50% or more for the exams (Mid and
final)
Topics to be covered
Introduction and uses of a data
warehouse
Data Warehouse Concepts and Design
Extraction, Transformation and Loading
(ETL)
On-Line Analytic Processing (OLAP) and
On-Line Transactional Processing
(OLTP)
Introduction to DW Concepts
and Design
What is a Data Warehouse?
The purpose of a DW
An Operational Database V a Data
Warehouse
Schema Types for Data Warehouses
What is a Data Warehouse?
A Decision Support Database that is
maintained separately from the
organizations operational database
OR
Support Information Processing by
providing a solid platform for historical
consolidated data
What is a Data Warehouse?
A Data Warehouse is
Subject Oriented
Integrated
Time-variant
Non-Volatile
5
Coronel, C and Rob, P., Database Systems Design, Implementation &
Management, Thompson Learning, 2000.
Integrated
The sources of a data warehouse can be more than one
and of different types (relational database,
transactions etc)
However, the data must be standardized before it is
input inside the DW
For example, consider the following two data sources:
SOURCE A
ID Name Gende Heigh Data cleaning and data
r t integration techniques are
123 John Male 1.82 applied.
4 Smith SOURCE B
ID FNam LNam Gende Height
e e r
C00 David Jones M 178
1
Time Variant
DW provides a time-based perspective of
historical data;
At least one unit of time is mandatory;
Each part of the DW should have been accurate
at one point in time i.e. changes stored as new
entries;
Time-dependent aggregations (i.e. expenses)
are re-done after new uploads
Uploaded expenses for Jan= 100$
Uploaded expenses for Feb= 100$
The total will be re-calculated = 200$
Non-Volatile
Data is stored separately from the
Operational Environment
If the data has been entered into the
system, changing or modifying it is
strictly controlled. The purpose is to
store time-variant data for a long time.
Updating data in ODS is different from
Updating Data in a DW.
If any updating is required, the data is
uploaded with a new timestamp.
Historical data can be kept this way.
++Non-Volatile
Unlike the ODS, a DW does not allow you
22
Differences between OLTP and OLAP
Database design
OLTP
Entity relationship (ER) data model
Application oriented design
OLAP
Star or snowflake model (multidimensional -discuss later)
Subject oriented database design
View
OLTP
Current data within enterprise or department
Does not refer to historical data or from other organisations
OLAP
Multiple versions of database schema
Information from other organisations
24
Differences between OLTP and OLAP
Access patterns
OLTP
Short,atomic transactions
Hence requires concurrency control and
recovery mechanisms
OLAP
Mostlyread-only operations
Complex queries over large intervals of time
25
Feature OLTP OLAP
characteristic operational processing informational processing
orientation transaction analysis
clerk, DBA, database
user professional knowledge worker
function day to day operation decision support
star/snowflake, subject-
DB design ER based, application oriented oriented
historical, maintained over
data current, guaranteed up to date time
summarisation high detail summarised, consolidated
view detailed, flat relational multidimensional
unit of work short, simple transaction complex query
access read/write mostly read
focus data in information out
number of users thousands hundreds
DB size 100MB to GB 100GB to TB
high performance, high
26
priority availability high flexibility
+++Data Warehouses DSS/BI
Why is understanding data warehouse
concepts and design important for the
business manager/analyst?
To know what business problems are solvable
using data warehouse technology
To have an understanding of the workload
involved in designing and implementing a
data warehouse solution
To converse with database designers who
also need to be aware of the business needs
of their users
27
Data Warehouse as a Support for DSS
With a correctly implemented data warehouse
the DSS analyst should not have to worry
about:
Searching for the definitive source of data
Creating special extract programs from existing
systems
Dealing with un-integrated data
Compiling and linking detailed and summary data and
the linkage between the two
Finding an appropriate time basis of data (finding
historical data)
Management constantly changing its mind about what
needs to be looked at next
In addition,
9 the DSS
Source: Inmon, W.H., analyst has
Building the DataaWarehouse,
rich supply of2005
Wiley,
28 summary data available.9
Introduction to Data Warehouse
Concepts and Design
29
Schema types for Data
Warehouses
Star schema
One fact document, multiple not normalized
dimensions
Snowflake
One fact document, multiple normalized
dimensions
Fact constellation
Multiple fact documents, multiple dimensions
30
STAR SCHEMA DATA
WAREHOUSE
Time dimension
Item dimension
Time_key
Day_of_the_week Item_key
Month Sales Fact Table Item_name
Quarter Brand
Semester Time_key Supplier
Year Item_key
Branch_key
Location_key
Branch dimension Dollars_sold Location dimension
Units_sold
Branch_key Location_key
Branch_name Street
Branch_type City
Branch_address State
Country
31
SNOWFLAKE DATA WAREHOUSE
Time dimension
Item dimension Supplier
Time_key dimension
Day_of_the_week Item_key
Month Item_name Supplier_key
Quarter Sales Fact Table Brand Supplier_name
Semester Supplier_key Supplier_address
Year Time_key
Item_key
Branch_key
Location_key City
Dollars_sold dimension
Branch dimension Units_sold City_key
Location City_name
dimension State
Branch_key
country
Branch_name Location_key
Branch_type Street
Branch_address City_key ***
dimension
***
dimension .
.
32
FACT CONSTELLATION DATA WAREHOUSE
Supplier
Item dimension
dimension
Sales Fact Table
Item_key
Time dimension Supplier_key
Time_key Item_name
Supplier_name
Item_key Brand
Time_key Supplier_address
Branch_key Supplier_key
Day_of_the_week
Month Location_key
Quarter Dollars_sold
Semester Units_sold Location
Year dimension
Location_key
Street
Shipping fact
City
table
State
Branch dimension Country
Item_key
Time_key
Branch_key Shipper_key Shipper dimension
Branch_name From_location
Branch_type To_location Shipper_key
Branch_address Dollars_cost Shipper_name
Units_shipped Location_key
Shipper_type
Shipper_address
33