Data Warehousing

Download as ppt, pdf, or txt
Download as ppt, pdf, or txt
You are on page 1of 41

Data Warehousing: Data Models and OLAP operations

By Kishore Jaladi [email protected]

Topics Covered
1. Understanding the term Data Warehousing 2. Three-tier Decision Support Systems Three3. Approaches to OLAP servers 4. Multi-dimensional data model Multi5. ROLAP 6. MOLAP 7. HOLAP 8. Which to choose: Compare and Contrast 9. Conclusion

Understanding the term Data Warehousing


Data Warehouse:
The term Data Warehouse was coined by Bill Inmon in 1990, which he defined in the following way: "A warehouse is a subject-oriented, subjectintegrated, time-variant and non-volatile collection of data in support timenonof management's decision making process". He defined the terms in the sentence as follows:

Subject Oriented:
Data that gives information about a particular subject instead of about a company's ongoing operations.

Integrated:
Data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole.

Time-variant: TimeAll data in the data warehouse is identified with a particular time period.

Non-volatile NonData is stable in a data warehouse. More data is added but data is never removed. This enables management to gain a consistent picture of the business.

Data Warehouse Architecture

Other important terminology


Enterprise Data warehouse
collects all information about subjects (customers,products,sales,assets, personnel) that span the entire personnel) organization

Data Mart
Departmental subsets that focus on selected subjects

Decision Support System (DSS)


Information technology to help the knowledge worker (executive, manager, analyst) make faster & better decisions

Online Analytical Processing (OLAP)


an element of decision support systems (DSS)

ThreeThree-Tier Decision Support Systems


Warehouse database server
Almost always a relational DBMS, rarely flat files

OLAP servers
Relational OLAP (ROLAP): extended relational DBMS that maps operations on multidimensional data to standard relational operators Multidimensional OLAP (MOLAP): special-purpose server specialthat directly implements multidimensional data and operations

Clients
Query and reporting tools Analysis tools Data mining tools

The Complete Decision Support System


Information Sources Data Warehouse Server (Tier 1) OLAP Servers (Tier 2)
e.g., MOLAP Semistructured Sources Data Warehouse extract transform load refresh etc. Operational DBs

Clients (Tier 3)
OLAP serve Query/Reporting

serve
e.g., ROLAP Data Mining

serve

Data Marts

Approaches to OLAP Servers


Three possibilities for OLAP servers (1) Relational OLAP (ROLAP) Relational and specialized relational DBMS to store and manage warehouse data OLAP middleware to support missing pieces (2) Multidimensional OLAP (MOLAP) Array-based storage structures ArrayDirect access to array data structures (3) Hybrid OLAP (HOLAP)

Storing detailed data in RDBMS Storing aggregated data in MDBMS User access via MOLAP tools

The Multi-Dimensional Data Model MultiSales by product line over the past six months Sales by store between 1990 and 1995
Store Info

Key columns joining fact table Numerical Measures to dimension tables


Prod Code Time Code Store Code Sales Qty

Product Info

Fact table for measures

Dimension tables

Time Info

...

ROLAP: Dimensional Modeling Using Relational DBMS


Special schema design: star, snowflake Special indexes: bitmap, multi-table join multiProven technology (relational model, DBMS), tend to outperform specialized MDDB especially on large data sets Products IBM DB2, Oracle, Sybase IQ, RedBrick, Informix

Star Schema (in RDBMS)

Star Schema Example

The Classic Star Schema


Store Dimension STORE KEY
Store Description City State District ID District Desc. Region_ID Region Desc. Regional Mgr. Level

Fact Table STORE KEY PRODUCT KEY PERIOD KEY


Dollars Units Price

Time Dimension PERIOD KEY Period Desc Year Quarter Month Day Current Flag Resolution Sequence

Product Dimension PRODUCT KEY


Product Desc. Brand Color Size Manufacturer Level

 

A single fact table, with detail and summary data Fact table primary key has only one key column per dimension Each key is generated Each dimension is a single table, highly dedenormalized

Benefits: Easy to understand, easy to define hierarchies, reduces # of physical joins, low maintenance, very simple metadata

Star Schema with Sample Data

The Snowflake Schema


Store Dimension STORE KEY
Store Description City State District ID Region_ID Regional Mgr.

District_ID
District Desc. Region_ID

Region_ID
Region Desc. Regional Mgr.

Store Fact Table STORE KEY PRODUCT KEY PERIOD KEY


Dollars Units Price

Aggregation in a Single Fact Table


Store Dimension STORE KEY
Store Description City State District ID District Desc. Region_ID Region Desc. Regional Mgr. Level

Fact Table STORE KEY PRODUCT KEY PERIOD KEY


Dollars Units Price

Time Dimension PERIOD KEY Period Desc Year Quarter Month Day Current Flag Resolution Sequence

Product Dimension PRODUCT KEY


Product Desc. Brand Color Size Manufacturer Level

Drawbacks: Summary data in the fact table yields poorer performance for summary levels, huge dimension tables a problem

The Fact Constellation Schema


Store Dimension STORE KEY
Store Description City State District ID District Desc. Region_ID Region Desc. Regional Mgr.

Fact Table STORE KEY PRODUCT KEY PERIOD KEY


Dollars Units Price

Time Dimension PERIOD KEY Period Desc Year Quarter Month Day Current Flag Sequence District Fact Table District_ID PRODUCT_KEY PERIOD_KEY Dollars Units Price
Region Fact Table Region_ID PRODUCT_KEY PERIOD_KEY Dollars Units Price

Product Dimension PRODUCT KEY


Product Desc. Brand Color Size Manufacturer

Aggregations using Snowflake Schema and Multiple Fact Tables


Store Dimension STORE KEY
Store Description Cit y State District ID District Desc. Region_ ID Region Desc. Regional Mgr.

District_ ID
Dist rict Desc. Region_ID

Region_ID
Region Desc. Regional Mgr.

Store Fact Table STORE KEY PRODUCT KEY PERIOD KEY


Dollars Units Price

Dist rict Fact Table District_ID PRODUCT_KEY PERIOD_KEY


Dollars Unit s Price

RegionFact Table Region_ID PRODUCT_KEY PERIOD_KEY Dollars Unit s Price

No LEVEL in dimension tables Dimension tables are normalized by decomposing at the attribute level Each dimension table has one key for each level of the dimensions hierarchy The lowest level key joins the dimension table to both the fact table and the lower level attribute table

How does it work? The best way is for the query to be built by understanding which summary levels exist, and finding the proper snowflaked attribute tables, constraining there for keys, then selecting from the fact table.

Aggregation Contd
Store Dimension STORE KEY
Store Description Cit y State District ID District Desc. Region_ ID Region Desc. Regional Mgr.

District_ ID
Dist rict Desc. Region_ID

Region_ID
Region Desc. Regional Mgr.

Store Fact Table STORE KEY PRODUCT KEY PERIOD KEY


Dollars Units Price

Dist rict Fact Table District_ID PRODUCT_KEY PERIOD_KEY


Dollars Unit s Price

RegionFact Table Region_ID PRODUCT_KEY PERIOD_KEY Dollars Unit s Price

Advantage: Best performance when queries involve aggregation Disadvantage: Complicated maintenance and metadata, explosion in the number of tables in the database

Aggregates
y Add up amounts for day 1 y In SQL: SELECT sum(amt) FROM SALE WHERE date = 1
sale prodId storeId p1 s1 p2 s1 p1 s3 p2 s2 p1 s1 p1 s2 date 1 1 1 1 2 2 amt 12 11 50 8 44 4

81

Aggregates
y Add up amounts by day y In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date
sale prodId p1 p2 p1 p2 p1 p1 storeId s1 s1 s3 s2 s1 s2 date 1 1 1 1 2 2 amt 12 11 50 8 44 4

ans

date 1 2

sum 81 48

Another Example
y Add up amounts by day, product y In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date, prodId
sale prodId p1 p2 p1 p2 p1 p1 storeId s1 s1 s3 s2 s1 s2 date 1 1 1 1 2 2 amt 12 11 50 8 44 4

sale

prodId p1 p2 p1

date 1 1 2

amt 62 19 48

rollup drill-down

Points to be noticed about ROLAP


Defines complex, multi-dimensional data with simple multimodel Reduces the number of joins a query has to process Allows the data warehouse to evolve with rel. low maintenance Can contain both detailed and summarized data. ROLAP is based on familiar, proven, and already selected technologies. BUT!!! SQL for multi-dimensional manipulation of multicalculations.

MOLAP: Dimensional Modeling Using the Multi Dimensional Model


MDDB: a special-purpose data model specialFacts stored in multi-dimensional arrays multiDimensions used to index array Sometimes on top of relational DB Products
Pilot, Arbor Essbase, Gentia

The MOLAP Cube


Fact table view:
sale prodId p1 p2 p1 p2 storeId s1 s1 s3 s2 amt 12 11 50 8

Multi-dimensional cube:
p1 p2 s1 12 11 s2 8 s3 50

dimensions = 2

3-D Cube
Fact table view:
sale prodId p1 p2 p1 p2 p1 p1 storeId s1 s1 s3 s2 s1 s2 date 1 1 1 1 2 2 amt 12 11 50 8 44 4

Multi-dimensional cube:

day 2 day 1

p1 p2 s1 p1 12 p2 11

s1 44 s2 8

s2 4 s3 50

s3

dimensions = 3

Example
roll-up to region

NY SF LA Juice Milk Coke Cream Soap Bread 10 34 56 32 12 56 M T W Th F S S

Dimensions: Time, Product, Store roll-up to brand Attributes: Product (upc, price, ) Store Hierarchies: Product p Brand p Day p Week p Quarter roll-up to week Store p Region p Country

Product

Time
56 units of bread sold in LA on M

Cube Aggregation: Roll-up RollExample: computing sums


day 2 day 1
p1 p2 s1 p1 12 p2 11 s1 44 s2 8 s2 4 s3 50 s3

...

sum
p1 p2 s1 56 11 s2 4 8 s3 50

s1 67

s2 12

s3 50

129
p1 p2 sum 110 19

rollup drill-down

Cube Operators for Roll-up Rollday 2 day 1


p1 p2 s1 p1 12 p2 11 s1 44 s2 8 s2 4 s3 50 s3

... sale(s1,*,*)
sum s1 67 s2 12 s3 50

p1 p2

s1 56 11

s2 4 8

s3 50

129
p1 p2 sum 110 19

sale(s2,p2,*)

sale(*,*,*)

Extended Cube
*

day 2

p1 p2 * s1 p1 p2 s1 * 12 11 23 44
s2 44 8 8

s1 56 11 67 s2 4
s3 4 50 50

s2 4 8 12 s3
* 62 19 81

s3 50 50 * 48 48

* 110 19 129

day 1

p1 p2 *

sale(*,p2,*)

Aggregation Using Hierarchies


day 2 day 1
p1 p2 s1 p1 12 p2 11 s1 44 s2 8 s2 4 s3 50 s3

store region country

p1 p2

region A region B 56 54 11 8

(store s1 in Region A; stores s2, s3 in Region B)

Points to be noticed about MOLAP


PrePre-calculating or pre-consolidating transactional data preimproves speed. BUT Fully pre-consolidating incoming data, MDDs require an preenormous amount of overhead both in processing time and in storage. An input file of 200MB can easily expand to 5GB MDDs are great candidates for the <50GB department data marts. Rolling up and Drilling down through aggregate data. With MDDs, application design is essentially the definition of dimensions and calculation rules, while the RDBMS requires that the database schema be a star or snowflake.

Hybrid OLAP (HOLAP)


HOLAP = Hybrid OLAP:
Best of both worlds Storing detailed data in RDBMS Storing aggregated data in MDBMS User access via MOLAP tools

Data Flow in HOLAP


RDBMS Server MDBMS Server
Multidimensional access Multidimensional data SQL-Reach Through

Client

SQL-Read User data

Meta data Derived data

Multidimensional Viewer

Relational Viewer
SQL-Read

When deciding which technology to go for, consider:


1) Performance:
How fast will the system appear to the end-user? endMDD server vendors believe this is a key point in their favor.

2) Data volume and scalability:


While MDD servers can handle up to 50GB of storage, RDBMS servers can handle hundreds of gigabytes and terabytes.

An experiment with Relational and the Multidimensional models on a data set


The analysis of the authors example illustrates the following differences between the best Relational alternative and the Multidimensional approach. relational MultiMultiImprovement dimensional Disk space requirement (Gigabytes) Retrieve the corporate measures Actual Vs Budget, by month (I/O s) Calculation of Variance Budget/Actual for the whole database (I/O time in hours) 17 240 10 1 1.7 240

237

2*

110*

* This may include the calculation of many other derived data without any additional I/O. Reference: http://dimlab.usc.edu/csci599/Fall2002/paper/I2_P064.pdf

WhatWhat-if analysis
IF A. You require write access B. Your data is under 50 GB C. Your timetable to implement is 60-90 days 60D. Lowest level already aggregated E. Data access on aggregated level F. You re developing a general-purpose application for inventory movement or assets management generalTHEN Consider an MDD /MOLAP solution for your data mart IF A. Your data is over 100 GB B. You have a "read-only" requirement "readC. Historical data at the lowest level of granularity D. Detailed access, long-running queries longE. Data assigned to lowest level elements THEN Consider an RDBMS/ROLAP solution for your data mart. IF A. OLAP on aggregated and detailed data B. Different user groups C. Ease of use and detailed data THEN Consider an HOLAP for your data mart

Examples
ROLAP
Telecommunication startup: call data records (CDRs) ECommerce Site Credit Card Company

MOLAP
Analysis and budgeting in a financial department Sales analysis

HOLAP
Sales department of a multi-national company multiBanks and Financial Service Providers

Tools available
ROLAP:
ORACLE 8i ORACLE Reports; ORACLE Discoverer ORACLE Warehouse Builder Arbors Software s Essbase

MOLAP:
ORACLE Express Server ORACLE Express Clients (C/S and Web) MicroStrategy s DSS server Platinum Technologies Plantinum InfoBeacon

HOLAP:
ORACLE 8i ORACLE Express Serve ORACLE Relational Access Manager ORACLE Express Clients (C/S and Web)

Conclusion
ROLAP: RDBMS -> star/snowflake schema MOLAP: MDD -> Cube structures ROLAP or MOLAP: Data models used play major role in performance differences MOLAP: for summarized and relatively lesser volumes of data (10-50GB) (10ROLAP: for detailed and larger volumes of data Both storage methods have strengths and weaknesses The choice is requirement specific, though currently data warehouses are predominantly built using RDBMSs/ROLAP.

References
http://dimlab.usc.edu/csci599/Fall2002/paper/I2_P064.pdf
OLAP, Relational, and Multidimensional Database Systems, by George Colliat, Arbor Software Corporation

http://www.donmeyer.com/art3.html
Data warehousing Services, Data Mining & Analysis, LLC Services,

http://www.cs.man.ac.uk/~franconi/teaching/2001/CS636/CS636http://www.cs.man.ac.uk/~franconi/teaching/2001/CS636/CS636olap.ppt
Data Warehouse Models and OLAP Operations, by Enrico Franconi Operations,

http://www.promatis.com/mediacenter/papers - ROLAP, MOLAP, HOLAP: How to determine which to technology is appropriate, by Holger Frietch, PROMATIS Corporation

You might also like