Data Warehousing
Data Warehousing
Data Warehousing
Topics Covered
1. Understanding the term Data Warehousing 2. Three-tier Decision Support Systems Three3. Approaches to OLAP servers 4. Multi-dimensional data model Multi5. ROLAP 6. MOLAP 7. HOLAP 8. Which to choose: Compare and Contrast 9. Conclusion
Subject Oriented:
Data that gives information about a particular subject instead of about a company's ongoing operations.
Integrated:
Data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole.
Time-variant: TimeAll data in the data warehouse is identified with a particular time period.
Non-volatile NonData is stable in a data warehouse. More data is added but data is never removed. This enables management to gain a consistent picture of the business.
Data Mart
Departmental subsets that focus on selected subjects
OLAP servers
Relational OLAP (ROLAP): extended relational DBMS that maps operations on multidimensional data to standard relational operators Multidimensional OLAP (MOLAP): special-purpose server specialthat directly implements multidimensional data and operations
Clients
Query and reporting tools Analysis tools Data mining tools
Clients (Tier 3)
OLAP serve Query/Reporting
serve
e.g., ROLAP Data Mining
serve
Data Marts
Storing detailed data in RDBMS Storing aggregated data in MDBMS User access via MOLAP tools
The Multi-Dimensional Data Model MultiSales by product line over the past six months Sales by store between 1990 and 1995
Store Info
Product Info
Dimension tables
Time Info
...
Time Dimension PERIOD KEY Period Desc Year Quarter Month Day Current Flag Resolution Sequence
A single fact table, with detail and summary data Fact table primary key has only one key column per dimension Each key is generated Each dimension is a single table, highly dedenormalized
Benefits: Easy to understand, easy to define hierarchies, reduces # of physical joins, low maintenance, very simple metadata
District_ID
District Desc. Region_ID
Region_ID
Region Desc. Regional Mgr.
Time Dimension PERIOD KEY Period Desc Year Quarter Month Day Current Flag Resolution Sequence
Drawbacks: Summary data in the fact table yields poorer performance for summary levels, huge dimension tables a problem
Time Dimension PERIOD KEY Period Desc Year Quarter Month Day Current Flag Sequence District Fact Table District_ID PRODUCT_KEY PERIOD_KEY Dollars Units Price
Region Fact Table Region_ID PRODUCT_KEY PERIOD_KEY Dollars Units Price
District_ ID
Dist rict Desc. Region_ID
Region_ID
Region Desc. Regional Mgr.
No LEVEL in dimension tables Dimension tables are normalized by decomposing at the attribute level Each dimension table has one key for each level of the dimensions hierarchy The lowest level key joins the dimension table to both the fact table and the lower level attribute table
How does it work? The best way is for the query to be built by understanding which summary levels exist, and finding the proper snowflaked attribute tables, constraining there for keys, then selecting from the fact table.
Aggregation Contd
Store Dimension STORE KEY
Store Description Cit y State District ID District Desc. Region_ ID Region Desc. Regional Mgr.
District_ ID
Dist rict Desc. Region_ID
Region_ID
Region Desc. Regional Mgr.
Advantage: Best performance when queries involve aggregation Disadvantage: Complicated maintenance and metadata, explosion in the number of tables in the database
Aggregates
y Add up amounts for day 1 y In SQL: SELECT sum(amt) FROM SALE WHERE date = 1
sale prodId storeId p1 s1 p2 s1 p1 s3 p2 s2 p1 s1 p1 s2 date 1 1 1 1 2 2 amt 12 11 50 8 44 4
81
Aggregates
y Add up amounts by day y In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date
sale prodId p1 p2 p1 p2 p1 p1 storeId s1 s1 s3 s2 s1 s2 date 1 1 1 1 2 2 amt 12 11 50 8 44 4
ans
date 1 2
sum 81 48
Another Example
y Add up amounts by day, product y In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date, prodId
sale prodId p1 p2 p1 p2 p1 p1 storeId s1 s1 s3 s2 s1 s2 date 1 1 1 1 2 2 amt 12 11 50 8 44 4
sale
prodId p1 p2 p1
date 1 1 2
amt 62 19 48
rollup drill-down
Multi-dimensional cube:
p1 p2 s1 12 11 s2 8 s3 50
dimensions = 2
3-D Cube
Fact table view:
sale prodId p1 p2 p1 p2 p1 p1 storeId s1 s1 s3 s2 s1 s2 date 1 1 1 1 2 2 amt 12 11 50 8 44 4
Multi-dimensional cube:
day 2 day 1
p1 p2 s1 p1 12 p2 11
s1 44 s2 8
s2 4 s3 50
s3
dimensions = 3
Example
roll-up to region
Dimensions: Time, Product, Store roll-up to brand Attributes: Product (upc, price, ) Store Hierarchies: Product p Brand p Day p Week p Quarter roll-up to week Store p Region p Country
Product
Time
56 units of bread sold in LA on M
...
sum
p1 p2 s1 56 11 s2 4 8 s3 50
s1 67
s2 12
s3 50
129
p1 p2 sum 110 19
rollup drill-down
... sale(s1,*,*)
sum s1 67 s2 12 s3 50
p1 p2
s1 56 11
s2 4 8
s3 50
129
p1 p2 sum 110 19
sale(s2,p2,*)
sale(*,*,*)
Extended Cube
*
day 2
p1 p2 * s1 p1 p2 s1 * 12 11 23 44
s2 44 8 8
s1 56 11 67 s2 4
s3 4 50 50
s2 4 8 12 s3
* 62 19 81
s3 50 50 * 48 48
* 110 19 129
day 1
p1 p2 *
sale(*,p2,*)
p1 p2
region A region B 56 54 11 8
Client
Multidimensional Viewer
Relational Viewer
SQL-Read
237
2*
110*
* This may include the calculation of many other derived data without any additional I/O. Reference: http://dimlab.usc.edu/csci599/Fall2002/paper/I2_P064.pdf
WhatWhat-if analysis
IF A. You require write access B. Your data is under 50 GB C. Your timetable to implement is 60-90 days 60D. Lowest level already aggregated E. Data access on aggregated level F. You re developing a general-purpose application for inventory movement or assets management generalTHEN Consider an MDD /MOLAP solution for your data mart IF A. Your data is over 100 GB B. You have a "read-only" requirement "readC. Historical data at the lowest level of granularity D. Detailed access, long-running queries longE. Data assigned to lowest level elements THEN Consider an RDBMS/ROLAP solution for your data mart. IF A. OLAP on aggregated and detailed data B. Different user groups C. Ease of use and detailed data THEN Consider an HOLAP for your data mart
Examples
ROLAP
Telecommunication startup: call data records (CDRs) ECommerce Site Credit Card Company
MOLAP
Analysis and budgeting in a financial department Sales analysis
HOLAP
Sales department of a multi-national company multiBanks and Financial Service Providers
Tools available
ROLAP:
ORACLE 8i ORACLE Reports; ORACLE Discoverer ORACLE Warehouse Builder Arbors Software s Essbase
MOLAP:
ORACLE Express Server ORACLE Express Clients (C/S and Web) MicroStrategy s DSS server Platinum Technologies Plantinum InfoBeacon
HOLAP:
ORACLE 8i ORACLE Express Serve ORACLE Relational Access Manager ORACLE Express Clients (C/S and Web)
Conclusion
ROLAP: RDBMS -> star/snowflake schema MOLAP: MDD -> Cube structures ROLAP or MOLAP: Data models used play major role in performance differences MOLAP: for summarized and relatively lesser volumes of data (10-50GB) (10ROLAP: for detailed and larger volumes of data Both storage methods have strengths and weaknesses The choice is requirement specific, though currently data warehouses are predominantly built using RDBMSs/ROLAP.
References
http://dimlab.usc.edu/csci599/Fall2002/paper/I2_P064.pdf
OLAP, Relational, and Multidimensional Database Systems, by George Colliat, Arbor Software Corporation
http://www.donmeyer.com/art3.html
Data warehousing Services, Data Mining & Analysis, LLC Services,
http://www.cs.man.ac.uk/~franconi/teaching/2001/CS636/CS636http://www.cs.man.ac.uk/~franconi/teaching/2001/CS636/CS636olap.ppt
Data Warehouse Models and OLAP Operations, by Enrico Franconi Operations,
http://www.promatis.com/mediacenter/papers - ROLAP, MOLAP, HOLAP: How to determine which to technology is appropriate, by Holger Frietch, PROMATIS Corporation