CS408: Data Warehousing: Welcome To Course
CS408: Data Warehousing: Welcome To Course
CS408: Data Warehousing: Welcome To Course
2
Course Information
Level: 4
8
Why should you be here?
• Bad decisions can lead to disaster
– Data Warehousing is at
the base of decision
support systems
Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 10
Why should you be here?
• Data Warehousing & OLAP are important
• They help to
– Understand the information hidden
within an organization’s data
• See data from different angles:
product,client,time,geographical area
• Get adequate statistics to get
your point of argumentation
across
• Get a glimpse of the future…
Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 11
Why should you be here?
12
Course Contents and Schedule
Week LECTURE LAB
Introduction to course, Introduction to Data
1 • Introduction with lab course, Introduction with Tool
Warehouses
• Database and SQL Revision
2 DW Life Cycle and Basic Architecture
• Lab task based on SQL queries
• PSQL introduction and lab tasks based on PSQL
3 Storage Architecture
• Lab task based on PSQL
• Will follow a Data set in in our labs
4 DW Modeling – Conceptual Model • In this step we will design data warehouse architecture
• Create staging Database in SQL Server 2012.
5 Logical Model • Create Production Data warehouse in SQL Server 2012.
• Data will be extracted from source data and load in staging tables
6 Physical Model
using SSIS
• Different transformations will be applied on the data and data will
7 Project Brief stored in production tables using SSIS and implementation of slowly
changing dimension.
Indexes • OLAP queries writing
8
• Query based tasks will be assigned to perform with in the lab.
Indexes (continued) • OLAP Query Optimizing
9
• Query based tasks will be assigned to perform with in the lab.
10 Optimisation • Multidimensional cube will be design using SSAS
11 Optimisation (continued) • Formulas, aggregation applied on the built cube
12 OLAP Queries • Data visualization and reporting based on Datawarehouse
Mid 15%
Final
35%
Project 30%
14
Recommended Literature
Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 15
Lecture 01
16
This week
Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 17
What is a data warehouse?
• Basically a very large da ta ba se…
– Not all very large databases are data warehouses,
but all data warehouses are pretty large databases
– Nowadays a warehouse is considered
to start at around 800 GB and goes
up to severalTB
– It spans over several servers and
needs an impressive amount of
computing power
Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 18
What is a data warehouse?
• More specific, a collective data repository
– Containing snapshots of the operational data
(history)
– Obtained through data cleansing (Extract-
Transform- Load)
– Useful for analytics
Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 19
What is a data warehouse?
• Compared to other solutions it…
– Is suitable for tactical/strateg ic focus
– Implies a sm all num ber of transactions
– Implies large transactions spanning over a
long period of time
Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 20
Some Definitions
• Experts say…
– Ralph Kimball:“a copy of transaction
data specifically structured for query and
analysis”
– Bill Inmon: “A data warehouse is a:
– Subject oriented
– Integrated
– Non-volatile
– Time variant
collection of data in support of management’s decisions.”
Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 21
Inmon Definition (cont’d.)
• S ubject oriented
– The data in the data warehouse is organized so that
all the data elements relating to the same real-world
event or object are linked together
• Typical subject areas in DWs are
Customer,Product,Order,Claim,
Account,…
Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 22
Inmon Definition (cont’d.)
• S ubject oriented
– Example:customer as subject in a DW
• DW is organized in this case by the customer
• It may consist of 10,100 or more physical tables, all
related
CUSTOMER
Base customer
Data 2003 - 2005
Customer activity
Base customer
2001 - 2004
Data 2000 - 2002
Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 23
Inmon Definition (cont’d.)
• Integ ra ted
– The data warehouse contains data from most or all
of an organization's operational systems and this data is
made consistent
– E.g. gender,measurement,conflicting keys,
consistency,…
operational encoding DW
appl A – M,F M,F
appl B – 1,0
appl C – male,female
appl A – cm cm
appl B – inches
Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 24
Inmon Definition (cont’d.)
• Non-vola tile
– Data in the data warehouse is never over-written
or deleted - once committed, the data is static, read-
only, and retained for future reporting
– Data is loaded,but not updated
– When subsequent changes occur,a new snapshot record is
written
Operational DW
insert change
delete access
access load
Record-by-record Mass load/
manipulation access of data access
Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 25
Inmon Definition (cont’d.)
• T im e-va rying
– The changes to the data in the data warehouse are
tracked and recorded so that reports can be produced
showing changes over tim e
– Different environments have different tim e horizons
• associated
• While for operational systems a 60-
to-90 day time horizon is normal,
data warehouse has a
• 5-to-10 year horizon
Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 26
General Definition
• More general,a DW is a
– Repository of an
org anization’s
electronically stored
data
– Desig ned to
facilitate
reporting and
analysis
Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 27
Typical Features
• DW typically…
– Reside on computers dedicated to this function
– Run on DB MS such as Oracle, IBM DB2, Teradata or
Microsoft SQL Server
– Retain data for long periods of tim e
– Consolidate data obtained from a variety of
sources
– Are built around their own carefully desig ned data
m odel
Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 28
Use case
• DW stands for big data volume,so lets take an
example of 2 big com pa nies,a retailer,say
Walmart and a RDBMS vendor,Sybase:
– Walmart CEO:I want to keep track of sales in
all my stores simultaneously
– Sybase consultant:You need our wonderful RDBMS software.
You can stuff data in as sales are rung up at cash registers and
simultaneously query data right in your office
– SoWalmart buys a $1 milion Sun E10000 multi-CPU
server,a $500 000 Sybase license,a book
“Database Design for Smarties”,and build
themselves a normalized SQL data model
Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 29
Use case (cont’d.)
• After a few months of stuffing data into the tables…
aWalmart executive asks…
– I have noticed that there was a Colgate promotion recently,
directed to people who live in small towns.How much
toothpaste did we sell in those towns yesterday?
– Translation to a query:
select sum(sales.quantity_sold) from sales,products,product_categories, manufacturers,
stores,cities where manufacturer_name =‘Colgate’
and product_category_name =‘toothpaste’ and cities.population < 40 000
and trunc(sales.date_time_of_sale) = trunc(sysdate-1) and sales.product_id =
products.product_id
and sales.store_id = stores.store_id
and products.product_category_id = product_categories.product_category_id and
products.manufacturer_id = manufacturers.manufacturer_id
and stores.city_id = cities.city_id
Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 30
Use case (cont’d.)
Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 31
Use case (cont’d.)
Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 32
Use case (cont’d.)
– Walmart CEO: It sounds like a bug.
– Sybase support: Actually it is a feature. We call it
pessimistic locking.
– Walmart CEO: Can you fix your system so that it doesn’t
lock up???
– Sybase support: No. But we made this great loader tool
so that you can copy everything from your OLTP system
into a separate Data Warehouse system at 100 GB/hour
• After a while…
Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 33
OLTP vs. DW
Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 34
OLTP vs. DW (cont’d.)
Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 35
OLTP vs. DW (cont’d.)
• Properties
Operational DB DW
Mostly updates Mostly reads
Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 36
OLTP vs. DW (cont’d.)
Sales_unit_D
(PK) Name
Product Address
Status City
Product_ID (PK) Postal_cod
Status_code (PK) Name e
Status Description Telephone_number
Cost Email
Sales_unit_ID (FK)
Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 37
OLTP vs. DW (cont’d.)
• If we were to set up a DW for that store,we
would start by building the following star schema
Customer Product
Customer_ID (PK) Product_ID (PK)
Name Name
Address Description
City Cost
Postal_code Sales_facts
Product_ID (FK)
Customer_ID (FK)
Time_key (FK)
Sales_unit_ID (FK)
Sales_unit Total Time_period
Sales_tax
Sales_unit_ID (PK) Shipping_charge Time_key
Name (PK)
Address City Description
Postal_code Day
Telephone Fiscal_week
Fiscal_period
Fiscal_year
Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 38
OLTP vs. DW (cont’d.)
Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 39
OLTP vs. DW (cont’d.)
Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 40
Applications of DW
• Typica l questions which can be answered
with DW & OLAP
– How much did sales unitA earn in January?
– How much did sales unit B earn in February?
What was their combined sales amount for the first
quarter?
• Answering these questions with S QL -queries is
difficult
– Complex query formulation necessary
– Process is likely to be slow due to complex joins and
multiple scans
Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 41
Applications of DW (cont’d.)
• Why such questions can be answered better with
a DW?
– Because in a DW tables are rearranged and
pre- ag g reg ated (known as computing cubes)
e.g. Years, weeks, days etc.
• The tables arrangement is subject oriented,usually some
sta r schem a
Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 42
Applications of DW (cont’d.)
• A DW is the base repository for front-end
a na lytics
– OLAP
– KDD (Knowledge Discovery in Databases) a
data mining process
– Data visualization
– Reporting
Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 43
Applications of DW (cont’d.)
Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 44
Applications of DW (cont’d.)
• KDD (Data Mining)
– Constructs m odels of the data in question
• Models can be viewed as high level summaries of the
underlying data
Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 45
Applications of DW (cont’d.)
– Based on this example a query returns the data that
fulfills the constraints
• SELECT * FROM CUSTOMER_TABLE WHERE
TOTAL_SPENT > €100;
– Data mining might return the following set of rules
for customers spending more than €100:
• IF AGE > 35 AND CAR =‘MINIVAN’ THEN TOTAL SPENT
> €100
• IF SEX =‘M’ AND ZIP = 38106 THEN TOTAL SPENT >
€100
Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 46
Applications of DW (cont’d.)
– It answers questions like
• Which products or customers are more profitable
• Which outlets have sold the least this year
– In consequence it motivates decisions like
• Which products should have their production increased
• Which customers should be targeted for special
promotions
• Which outlets should be closed
Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 47
Who is the user?
• Users of DW are called DS S a na lysts
and usually are business persons
– Their primary job is to define and discover
information used in corporate decision-m aking
– The way they think
• “Give me what I say I want,and then I can tell you what I
really want”
• They work in explorative manner
Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 48
Who is the user? (cont’d.)
– Typical explorative line of work
• “Ah! Now that I see what the possibilities are,I can tell
what I really want to see.But until I know what the
possibilities are,I cannot describe exactly what I want…”
Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 49
Summary
• Data Warehouse - Introduction
– DW Definitions
– OLTP vs. DW
– Applications of DW
Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 50
Next lecture
Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 51