CS408: Data Warehousing: Welcome To Course

Download as pdf or txt
Download as pdf or txt
You are on page 1of 45

Welcome to course

CS408: Data Warehousing


Outlines
• About Me
• Course information
• Teaching team
• Learning outcomes
• Course contents and schedule
• Assessments
• Resources
• Lecture 01

2
Course Information

Course Title: Data Warehousing

Course Code: CS408

Database Systems, Data


Prerequisites:
Structures, Operating Systems

Level: 4

8
Why should you be here?
• Bad decisions can lead to disaster
– Data Warehousing is at
the base of decision
support systems

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 10
Why should you be here?
• Data Warehousing & OLAP are important
• They help to
– Understand the information hidden
within an organization’s data
• See data from different angles:
product,client,time,geographical area
• Get adequate statistics to get
your point of argumentation
across
• Get a glimpse of the future…

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 11
Why should you be here?

12
Course Contents and Schedule
Week LECTURE LAB
Introduction to course, Introduction to Data
1 • Introduction with lab course, Introduction with Tool
Warehouses
• Database and SQL Revision
2 DW Life Cycle and Basic Architecture
• Lab task based on SQL queries
• PSQL introduction and lab tasks based on PSQL
3 Storage Architecture
• Lab task based on PSQL
• Will follow a Data set in in our labs
4 DW Modeling – Conceptual Model • In this step we will design data warehouse architecture
• Create staging Database in SQL Server 2012.
5 Logical Model • Create Production Data warehouse in SQL Server 2012.
• Data will be extracted from source data and load in staging tables
6 Physical Model
using SSIS
• Different transformations will be applied on the data and data will
7 Project Brief stored in production tables using SSIS and implementation of slowly
changing dimension.
Indexes • OLAP queries writing
8
• Query based tasks will be assigned to perform with in the lab.
Indexes (continued) • OLAP Query Optimizing
9
• Query based tasks will be assigned to perform with in the lab.
10 Optimisation • Multidimensional cube will be design using SSAS
11 Optimisation (continued) • Formulas, aggregation applied on the built cube
12 OLAP Queries • Data visualization and reporting based on Datawarehouse

13 OLAP Queries (continued) • Data visualization and reporting based on Cube

14 Building the DW • Reports and Dashboard management on Powerbi.com


Real-time Data Warehousing • Introduction to Hadoop and other Big Data tools and techniques
15
Project submission • DWH lab revision
16 Big Data – Challenges and Opportunities • Semester Project discussion 13
Assessments

Assessment type Marks weight

Lab Tasks 20%

Mid 15%

Final
35%

Project 30%

14
Recommended Literature

• Building the Data Warehouse


– William H. Inmon
– Wiley, ISBN 978-0-7645-9944-6
• The Data Warehouse Toolkit
– Ralph Kimball & Margy Ross
– Wiley, ISBN 978-1-118-53080-1

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 15
Lecture 01

Introduction to Data Warehouse

16
This week

1.What is a data warehouse?


2. Applications and users

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 17
What is a data warehouse?
• Basically a very large da ta ba se…
– Not all very large databases are data warehouses,
but all data warehouses are pretty large databases
– Nowadays a warehouse is considered
to start at around 800 GB and goes
up to severalTB
– It spans over several servers and
needs an impressive amount of
computing power

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 18
What is a data warehouse?
• More specific, a collective data repository
– Containing snapshots of the operational data
(history)
– Obtained through data cleansing (Extract-
Transform- Load)
– Useful for analytics

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 19
What is a data warehouse?
• Compared to other solutions it…
– Is suitable for tactical/strateg ic focus
– Implies a sm all num ber of transactions
– Implies large transactions spanning over a
long period of time

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 20
Some Definitions
• Experts say…
– Ralph Kimball:“a copy of transaction
data specifically structured for query and
analysis”
– Bill Inmon: “A data warehouse is a:
– Subject oriented
– Integrated
– Non-volatile
– Time variant
collection of data in support of management’s decisions.”

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 21
Inmon Definition (cont’d.)
• S ubject oriented
– The data in the data warehouse is organized so that
all the data elements relating to the same real-world
event or object are linked together
• Typical subject areas in DWs are
Customer,Product,Order,Claim,
Account,…

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 22
Inmon Definition (cont’d.)

• S ubject oriented
– Example:customer as subject in a DW
• DW is organized in this case by the customer
• It may consist of 10,100 or more physical tables, all
related
CUSTOMER

Base customer
Data 2003 - 2005
Customer activity
Base customer
2001 - 2004
Data 2000 - 2002

Customer activity Customer activity


Detail 2002 - 2004 Detail 2005 - 2006

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 23
Inmon Definition (cont’d.)
• Integ ra ted
– The data warehouse contains data from most or all
of an organization's operational systems and this data is
made consistent
– E.g. gender,measurement,conflicting keys,
consistency,…
operational encoding DW
appl A – M,F M,F
appl B – 1,0
appl C – male,female

appl A – cm cm
appl B – inches

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 24
Inmon Definition (cont’d.)
• Non-vola tile
– Data in the data warehouse is never over-written
or deleted - once committed, the data is static, read-
only, and retained for future reporting
– Data is loaded,but not updated
– When subsequent changes occur,a new snapshot record is
written

Operational DW

insert change

delete access
access load
Record-by-record Mass load/
manipulation access of data access

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 25
Inmon Definition (cont’d.)
• T im e-va rying
– The changes to the data in the data warehouse are
tracked and recorded so that reports can be produced
showing changes over tim e
– Different environments have different tim e horizons
• associated
• While for operational systems a 60-
to-90 day time horizon is normal,
data warehouse has a
• 5-to-10 year horizon

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 26
General Definition
• More general,a DW is a
– Repository of an
org anization’s
electronically stored
data
– Desig ned to
facilitate
reporting and
analysis

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 27
Typical Features
• DW typically…
– Reside on computers dedicated to this function
– Run on DB MS such as Oracle, IBM DB2, Teradata or
Microsoft SQL Server
– Retain data for long periods of tim e
– Consolidate data obtained from a variety of
sources
– Are built around their own carefully desig ned data
m odel

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 28
Use case
• DW stands for big data volume,so lets take an
example of 2 big com pa nies,a retailer,say
Walmart and a RDBMS vendor,Sybase:
– Walmart CEO:I want to keep track of sales in
all my stores simultaneously
– Sybase consultant:You need our wonderful RDBMS software.
You can stuff data in as sales are rung up at cash registers and
simultaneously query data right in your office
– SoWalmart buys a $1 milion Sun E10000 multi-CPU
server,a $500 000 Sybase license,a book
“Database Design for Smarties”,and build
themselves a normalized SQL data model

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 29
Use case (cont’d.)
• After a few months of stuffing data into the tables…
aWalmart executive asks…
– I have noticed that there was a Colgate promotion recently,
directed to people who live in small towns.How much
toothpaste did we sell in those towns yesterday?
– Translation to a query:
select sum(sales.quantity_sold) from sales,products,product_categories, manufacturers,
stores,cities where manufacturer_name =‘Colgate’
and product_category_name =‘toothpaste’ and cities.population < 40 000
and trunc(sales.date_time_of_sale) = trunc(sysdate-1) and sales.product_id =
products.product_id
and sales.store_id = stores.store_id
and products.product_category_id = product_categories.product_category_id and
products.manufacturer_id = manufacturers.manufacturer_id
and stores.city_id = cities.city_id

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 30
Use case (cont’d.)

– The tables contain large volumes of data and the


query implies a 6 way join so it will take some
time to execute
– The tables are at the sam e tim e also updated
by new sales
– Soon after executive start their quest for marketing
information store employees notice that there are times
during the day when it is impossible to process a sale
Any attempt to update the database results in freezing the
computer up for 20 minutes

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 31
Use case (cont’d.)

• In minutes…theWalmart CEO calls Sybase


tech support
• Walmart CEO: WE TYPE IN THE TOOTHPASTE
QUERY AND OUR SYSTEM HANGS!!!
• Sybase support: Of course it does! You built an on-
line transaction processing (OLTP) system. You
can’t feed it a decision support system (DSS)
query and expect things to work!
– Walmart CEO:!@%$#.I thought this was the whole point of SQL
and your RDBMS…to query and insert simultaneously!!
– Sybase support:Uh,not exactly.If you’re reading from the database,
nobody can write to the database.If you’re writing to the database,
nobody can read from the database.So if you’ve got a query that takes
20 minutes to run and don’t specify special locking instructions,
nobody can update those tables for 20 minutes.

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 32
Use case (cont’d.)
– Walmart CEO: It sounds like a bug.
– Sybase support: Actually it is a feature. We call it
pessimistic locking.
– Walmart CEO: Can you fix your system so that it doesn’t
lock up???
– Sybase support: No. But we made this great loader tool
so that you can copy everything from your OLTP system
into a separate Data Warehouse system at 100 GB/hour
• After a while…

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 33
OLTP vs. DW

• OLTP (OnL ine T ransaction P rocessing):


– Also known under the name of operational data,
it represents day-to-day operational business activities:
• Purchasing,sales,production distribution, …
– Typically for data entr y and retrieval
transaction processing
– Reflects only the current state
of the data

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 34
OLTP vs. DW (cont’d.)

• OLAP (OnL ine A nalytical P rocessing):


– Represents front-end analytics based on a
DW repository
– It provides information for activities like
• Resource planning,capital budgeting,marketing initiatives,…
– It is decision oriented

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 35
OLTP vs. DW (cont’d.)
• Properties
Operational DB DW
Mostly updates Mostly reads

Many small transactions Queries long, complex

MB-TB of data GB-PB of data

Raw data Summarized data

Clerical users Decision makers

Up-to-date data May be slightly outdated

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 36
OLTP vs. DW (cont’d.)

• Consider a norm a lized da ta ba se for a store


– The tables would look like this…
Shipment
Invoice Invoice
Invoice_line_item
Customer

Customer_ID (PK) Invoice_number (PK) Invoice_number (FK)


Name Date Item_seq_number
Address Customer_ID (FK) Product_ID (FK)
City Status_code (FK) Units
Postal_code Total Unit_cost
Sales_tax
Shipping_charge
Sales_unit

Sales_unit_D
(PK) Name
Product Address
Status City
Product_ID (PK) Postal_cod
Status_code (PK) Name e
Status Description Telephone_number
Cost Email
Sales_unit_ID (FK)

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 37
OLTP vs. DW (cont’d.)
• If we were to set up a DW for that store,we
would start by building the following star schema
Customer Product
Customer_ID (PK) Product_ID (PK)
Name Name
Address Description
City Cost
Postal_code Sales_facts

Product_ID (FK)
Customer_ID (FK)
Time_key (FK)
Sales_unit_ID (FK)
Sales_unit Total Time_period
Sales_tax
Sales_unit_ID (PK) Shipping_charge Time_key
Name (PK)
Address City Description
Postal_code Day
Telephone Fiscal_week
Fiscal_period
Fiscal_year

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 38
OLTP vs. DW (cont’d.)

• B a sic insig hts from comparing OLTP and DWs


– A DW is a separate (RDB MS ) installation that
contains copies of data from on-line systems
• Physically separate hardware may not be absolutely
necessary if you have lots of extra com puting power,
but it is recommended
– With an optim istic locking DBMS you might even
be able to get away for a while with keeping just one
copy of your data

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 39
OLTP vs. DW (cont’d.)

• There is an essentially different pattern of


ha rdwa re utiliz a tion between on-line
and analytical processing

Operational Data warehouse

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 40
Applications of DW
• Typica l questions which can be answered
with DW & OLAP
– How much did sales unitA earn in January?
– How much did sales unit B earn in February?
What was their combined sales amount for the first
quarter?
• Answering these questions with S QL -queries is
difficult
– Complex query formulation necessary
– Process is likely to be slow due to complex joins and
multiple scans

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 41
Applications of DW (cont’d.)
• Why such questions can be answered better with
a DW?
– Because in a DW tables are rearranged and
pre- ag g reg ated (known as computing cubes)
e.g. Years, weeks, days etc.
• The tables arrangement is subject oriented,usually some
sta r schem a

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 42
Applications of DW (cont’d.)
• A DW is the base repository for front-end
a na lytics
– OLAP
– KDD (Knowledge Discovery in Databases) a
data mining process
– Data visualization
– Reporting

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 43
Applications of DW (cont’d.)

• OLAP is a form of information processing and


thus needs to provide timely,accurate and
understandable information
– timely is however a relative term:
• In OLTP we expect an update to go
through in a matter of seconds
• In OLAP the time to answer a query
can take m inutes, hours or even longer
• There are many flavors of OLAP
– ROLAP, DOLAP, MOLAP, WOLAP, HOLAP,…

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 44
Applications of DW (cont’d.)
• KDD (Data Mining)
– Constructs m odels of the data in question
• Models can be viewed as high level summaries of the
underlying data

ID Name ZIP Sex Age Income Children Car Spent

12 Peter 38106 M 35 € 55,000 2 Mini Van € 210.00


15 Gabriel 38100 M 32 € 56,000 0 SUV € 30.00
… … … … … … … … …
122 Claire 38106 F 21 € 42,000 0 Coupe € 50.00

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 45
Applications of DW (cont’d.)
– Based on this example a query returns the data that
fulfills the constraints
• SELECT * FROM CUSTOMER_TABLE WHERE
TOTAL_SPENT > €100;
– Data mining might return the following set of rules
for customers spending more than €100:
• IF AGE > 35 AND CAR =‘MINIVAN’ THEN TOTAL SPENT
> €100
• IF SEX =‘M’ AND ZIP = 38106 THEN TOTAL SPENT >
€100

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 46
Applications of DW (cont’d.)
– It answers questions like
• Which products or customers are more profitable
• Which outlets have sold the least this year
– In consequence it motivates decisions like
• Which products should have their production increased
• Which customers should be targeted for special
promotions
• Which outlets should be closed

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 47
Who is the user?
• Users of DW are called DS S a na lysts
and usually are business persons
– Their primary job is to define and discover
information used in corporate decision-m aking
– The way they think
• “Give me what I say I want,and then I can tell you what I
really want”
• They work in explorative manner

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 48
Who is the user? (cont’d.)
– Typical explorative line of work
• “Ah! Now that I see what the possibilities are,I can tell
what I really want to see.But until I know what the
possibilities are,I cannot describe exactly what I want…”

– This usage has fundamental effect on the way a DW


is developed
• The classical system developm ent life cycle
assumes that the requirements are known at the start of
design
• The DSS analyst starts with existing requirements,but
fa ctoring in new requirem ents with time

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 49
Summary
• Data Warehouse - Introduction
– DW Definitions
– OLTP vs. DW
– Applications of DW

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 50
Next lecture

• Data Warehouse Life Cycle

Acknowledgment - Thanks to Wolf-Tilo Balke and Silviu Homoceanu - TU Braunschweig for the slides 51

You might also like