Note2 3
Note2 3
Note2 3
7 October 2003 1
Lectures Outline
Pat I: Overview on DM and DW
1. Introduction (ch1) Ass1 Due: Sep 23 Tue
2. Data preprocessing (ch3)
Part II: DW and OLAP
3. Data warehousing and OLAP (Ch2) Ass2: Sep 23 – Oct 14
Part III: Data Mining Methods/Algorithms
4. Data mining primitives (ch4)
5. Classification data mining (ch7) Ass3: Oct 7 – Oct 21
6. Association data mining (ch6) Ass4: Oct 21 – Nov 5
7. Characterization data mining (ch5)
8. Clustering data mining (ch8)
Part IV: Mining Complex Types of Data
9. Mining the Web (Ch9)
10. Mining spatial data (Ch9)
Project Presentations
Project Due: Dec 8
7 October 2003 2
Reservation of the LCD Lab:
7 October 2003 3
2. DATA WAREHOUSING AND OLAP
(Ch2)
Objectives of DW/OLAP
What is a DW?
Multidimensional Data Model
DW Schemas
Aggregations
OLAP Operations
DW Architecture
From data warehousing to data mining
7 October 2003 4
How to define DW schema: a data mining query
language: DMQL
7 October 2003 5
Example of Star Schema
time
time_key item
day item_key
day_of_the_week Sales Fact Table item_name
month brand
quarter time_key type
year supplier_type
item_key
branch_key
branch location
location_key
branch_key location_key
branch_name units_sold street
branch_type city
dollars_sold state_or_province
country
avg_sales
Measures
7 October 2003 6
Defining a Star Schema in DMQL
7 October 2003 7
Example of Snowflake Schema
time
time_key item
day item_key supplier
day_of_the_week Sales Fact Table item_name supplier_key
month brand supplier_type
quarter time_key type
year item_key supplier_key
branch_key
branch location
location_key
location_key
branch_key
units_sold street
branch_name
city_key
branch_type dollars_sold city
city_key
avg_sales city
state_or_province
Measures country
7 October 2003 8
Defining a Snowflake Schema in DMQL
7 October 2003 9
Example of Fact Constellation
time
time_key item Shipping Fact Table
day item_key
day_of_the_week Sales Fact Table item_name time_key
month brand
quarter time_key type item_key
year supplier_type shipper_key
item_key
branch_key from_location
7 October 2003 11
How hierarchical data are materialized in a data warehouse ?
7 October 2003 12
Aggregations
- To measure a business event
What do I want to look at? What am I trying to compare?
* define a grouping (i.e. determine a cuboid of the data cube),
* measure the fact about the event (I.e., the cuboid)
|
retrieval a pre-calculated value, or invoke an aggregate function
* OLAP query: Dimension-value pairs.
E.g., dimension: <time="Q1", location="Vancouver", item="Computer">
value (measured): sales=sum (the data set).
7 October 2003 13
Measures: Three Categories
Distributive and algebraic aggregate functions are most frequently used and can be
calculated efficiently. In contrast holistic aggregate functions can not be efficiently calculated
in general which are not used in data warehouses.
7 October 2003 14
Pre-aggregation vs. On-line aggregation
7 October 2003 15
Efficient Data Cube Computation
Data cube can be viewed as a lattice of cuboids
The bottom-most cuboid is the base cuboid
The top-most cuboid (apex) contains only one cell
How many cuboids in an n-dimensional cube with L levels?
n
T = ∏ ( Li + 1)
i =1
E.g. The cube has 10 dimensions and 4 levels for each dimension:
5^10 = 9.8 x 10^6.
Materialization of data cube
Materialize every (cuboid) (full materialization), none (no materialization),
or some (partial materialization)
Selection of which cuboids to materialize
Based on size, sharing, access frequency, etc.
7 October 2003 16
Cube: A Lattice of Cuboids
all
0-D(apex) cuboid
time,location,supplier
time,item,location 3-D cuboids
time,item,supplier item,location,supplier
4-D(base) cuboid
time, item, location, supplier
7 October 2003 17
OLAP Operations
Roll up (drill-up): summarize data
by climbing up hierarchy or by dimension reduction
Drill down (roll down): reverse of roll-up
from higher level summary to lower level summary or detailed data, or
introducing new dimensions
Slice and dice:
project and select
Pivot (rotate):
reorient the cube, visualization, 3D to series of 2D planes.
Other operations
drill across: involving (across) more than one fact table, etc
7 October 2003 18
A Star-Net Query Model
Customer Orders
Shipping Method
Customer
CONTRACTS
AIR-EXPRESS
ORDER
TRUCK
PRODUCT LINE
Time Product
ANNUALY QTRLY DAILY PRODUCT ITEM PRODUCT GROUP
CITY
SALES PERSON
COUNTRY
DISTRICT
REGION
DIVISION
Location Each circle is
called a footprint Promotion Organization
7 October 2003 19
7 October 2003 20
Example of data warehousing using MS SQL server 2000
7 October 2003 21
7 October 2003 22
7 October 2003 23
Drill down to see product categories.
7 October 2003 24
Drill down to see product “Clams” sales information
7 October 2003 25
DW Development Procedure
7 October 2003 26
Data Warehouse Development: An
Incremental Approach
Multi-Tier Data
Warehouse
Distributed
Data Marts
Enterprise
Data Data
Data
Mart Mart
Warehouse
The abstraction hierarchy of data and its description helps users navigate around a data
warehouse. As data gets more abstract, it generally gets less voluminous.
7 October 2003 28
The architecture of data (cont)
7 October 2003 29
Multitiers architecture:
• Client site: The end user can query and visualize data on the local computer or
connect up to a display server that has access to the DW.
• Middle server: Logically, OLAP engines present the users with multidimensional
data from DWs or data marts. However, the physical architecture implementation
issues must be considered for OLAP engines.
• DW server: Data warehouse generated from relational or operational databases,
gateways for extraction and integration of multiple data sources: ODBC (Open
Database Connection), and OLEDB (Open Linking and Embedding for Databases), and
JDBC (Java Database Connections), etc
7 October 2003 30
Multi-Tiered Architecture
Monitor
& OLAP Server
other Metadata
sources Integrator
Analysis
Operational Extract Query
Transform Data Serve Reports
DBs
Load
Refresh
Warehouse Data mining
Data Marts
Data extraction:
get data from multiple, heterogeneous, and external sources
Data cleaning:
detect errors in the data and rectify them when possible
Data transformation:
convert data from legacy or host format to warehouse format
Load:
sort, summarize, consolidate, compute views, check integrity, and
build indicies and partitions
Refresh
propagate the updates from the data sources to the warehouse
7 October 2003 32
OLAP Server Architectures
Can be processed and queried with traditional RDBMS technology (I.e. indexes and
joins etc)
Greater scalability
No “built-in” indexing
E.g. The same data stored in a multidimensional array for MOLAP, and multi-tables
for RLOAP (the distributed sheet).
Hybrid OLAP (HOLAP)
User flexibility, e.g., low level: relational, high-level: array
7 October 2003 33
From On-Line Analytical Processing
to On Line Analytical Mining (OLAM)
Why online analytical mining?
High quality of data in data warehouses
DW contains integrated, consistent, cleaned data
7 October 2003 34
An OLAM Architecture
Mining query Mining result Layer4
User Interface
User GUI API
Layer3
OLAM OLAP
Engine Engine OLAP/OLAM
Layer2
MDDB
MDDB
Meta Data
Data warehouse
A subject-oriented, integrated, time-variant, and nonvolatile collection of
data in support of management’s decision-making process
A multi-dimensional model of a data warehouse
Multidimensional data model
Star schema, snowflake schema, fact constellations
A data cube consists of identifier dimensions & measure dimension
Concept hierarchies
OLAP operations: drilling, rolling, slicing, dicing and pivoting
OLAP servers: ROLAP, MOLAP, HOLAP
…
7 October 2003 36