Note2 3


CSCI6405 Fall 2003

Data Mining and Data Warehousing


„ Instructor: Qigang Gao, Office: CS219,
Tel:494-3356, Email: qggao@cs.dal.ca
„ Teaching Assistant: Christopher Jordan,
Email: cjordan@cs.dal.ca
„ Office Hours: TR, 1:30 - 3:00 PM

7 October 2003 1
Lectures Outline
„ Part I: Overview of DM and DW
1. Introduction (ch1) Ass1 Due: Sep 23 Tue
2. Data preprocessing (ch3)
„ Part II: DW and OLAP
3. Data warehousing and OLAP (Ch2) Ass2: Sep 23 – Oct 14
„ Part III: Data Mining Methods/Algorithms
4. Data mining primitives (ch4)
5. Classification data mining (ch7) Ass3: Oct 7 – Oct 21
6. Association data mining (ch6) Ass4: Oct 21 – Nov 5
7. Characterization data mining (ch5)
8. Clustering data mining (ch8)
„ Part IV: Mining Complex Types of Data
9. Mining the Web (Ch9)
10. Mining spatial data (Ch9)
„ Project Presentations
Project Due: Dec 8

Reservation of the LCD Lab:

Wed: 8:30 am – 2:00 pm


Sat: 12:00 pm - 6:00 pm
Sun: 12:00 pm – 6:00 pm

2. DATA WAREHOUSING AND OLAP
(Ch2)
„ Objectives of DW/OLAP
„ What is a DW?
„ Multidimensional Data Model
„ DW Schemas
„ Aggregations
„ OLAP Operations
„ DW Architecture
„ From data warehousing to data mining

Defining a DW schema in DMQL, a data mining query language

„ Cube Definition (Fact Table)
define cube <cube_name> [<dimension_list>]: <measure_list>
„ Dimension Definition (Dimension Table)
define dimension <dimension_name> as (<attribute_or_subdimension_list>)
„ Special Case (Shared Dimension Tables)
„ The first time, the dimension is defined as part of a cube definition
„ Afterwards, it is referenced as:
define dimension <dimension_name> as <dimension_name_first_time> in cube <cube_name_first_time>

Example of Star Schema
Sales Fact Table: time_key, item_key, branch_key, location_key
Measures: units_sold, dollars_sold, avg_sales
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city, state_or_province, country

Defining a Star Schema in DMQL

define cube sales_star [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars),
units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state,
country)
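
To see how the measures of a star schema come out of a join between the fact table and its dimension tables, here is a minimal sketch using Python's built-in sqlite3 module. It is not DMQL: the table names time_dim and item_dim, the reduced column lists, and all sample rows are assumptions made for illustration, and only two of the four dimensions are included to keep it short.

```python
import sqlite3

# In-memory database; table names and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT, type TEXT);
CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES time_dim(time_key),
    item_key INTEGER REFERENCES item_dim(item_key),
    sales_in_dollars REAL
);
INSERT INTO time_dim VALUES (1, 'Q1', 2003), (2, 'Q2', 2003);
INSERT INTO item_dim VALUES (10, 'laptop', 'computer'), (11, 'handset', 'phone');
INSERT INTO sales_fact VALUES
    (1, 10, 1000.0), (1, 10, 500.0), (1, 11, 200.0), (2, 10, 800.0);
""")

# Star join: the fact table is joined to each dimension on its key,
# then the three measures from the slide are computed per group.
rows = conn.execute("""
SELECT t.quarter, i.type,
       SUM(f.sales_in_dollars) AS dollars_sold,
       AVG(f.sales_in_dollars) AS avg_sales,
       COUNT(*)                AS units_sold
FROM sales_fact AS f
JOIN time_dim AS t ON f.time_key = t.time_key
JOIN item_dim AS i ON f.item_key = i.item_key
GROUP BY t.quarter, i.type
ORDER BY t.quarter, i.type
""").fetchall()

for row in rows:
    print(row)
```

Each output row is one cell of the (quarter, type) cuboid, carrying dollars_sold, avg_sales, and units_sold.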

Example of Snowflake Schema
Sales Fact Table: time_key, item_key, branch_key, location_key
Measures: units_sold, dollars_sold, avg_sales
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_key
supplier: supplier_key, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city_key
city: city_key, city, state_or_province, country

Defining a Snowflake Schema in DMQL

define cube sales_snowflake [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars),
units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type,
supplier(supplier_key, supplier_type))
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city(city_key,
province_or_state, country))
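
The difference from the star schema is that a snowflake dimension is normalized, so reading a full location row takes one extra lookup through city_key. A minimal Python sketch of that lookup follows; the dictionaries, keys, and sample values are hypothetical and only mirror the location and city tables above.

```python
# Hypothetical sample rows; keys and values are made up for illustration.
city = {
    1: {"city": "Vancouver", "province_or_state": "BC", "country": "Canada"},
    2: {"city": "Halifax", "province_or_state": "NS", "country": "Canada"},
}
location = {
    100: {"street": "1 Main St", "city_key": 1},
    101: {"street": "2 Water St", "city_key": 2},
    102: {"street": "3 Robson St", "city_key": 1},
}


def resolve_location(location_key):
    """Follow location -> city to rebuild the flat (star-schema) row."""
    loc = location[location_key]
    flat = {"location_key": location_key, "street": loc["street"]}
    flat.update(city[loc["city_key"]])  # the extra join a snowflake requires
    return flat


print(resolve_location(102))
```

The city attributes are stored once per city rather than once per street address, which saves space at the cost of the extra join.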

Example of Fact Constellation
Sales Fact Table: time_key, item_key, branch_key, location_key
Measures: units_sold, dollars_sold, avg_sales
Shipping Fact Table: time_key, item_key, shipper_key, from_location, to_location
Measures: dollars_cost, units_shipped
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city, province_or_state, country
shipper: shipper_key, shipper_name, location_key, shipper_type

Defining a Fact Constellation in DMQL
define cube sales [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars),
units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state,
country)
define cube shipping [time, item, shipper, from_location, to_location]:
dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper_key, shipper_name, location as location in
cube sales, shipper_type)
define dimension from_location as location in cube sales
define dimension to_location as location in cube sales

How are hierarchical data materialized in a data warehouse?

Aggregations
- To measure a business event, ask:
What do I want to look at? What am I trying to compare?
* Define a grouping (i.e., determine a cuboid of the data cube).
* Measure the fact about the event (i.e., the cuboid): either retrieve a
pre-calculated value or invoke an aggregate function.
* OLAP query: dimension-value pairs.
E.g., dimensions: <time="Q1", location="Vancouver", item="Computer">
value (measured): sales = sum(the data set).

A measure value is computed for a defined cuboid by aggregating the
data corresponding to the respective dimension-value pairs defining the
given event.
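
The idea of an OLAP query as a set of dimension-value pairs can be sketched in a few lines of Python. The fact rows, the sales figures, and the helper name measure are assumptions made for illustration; only the query <time="Q1", location="Vancouver", item="Computer"> comes from the slide.

```python
# Hypothetical fact rows; the sales figures are made up.
facts = [
    {"time": "Q1", "location": "Vancouver", "item": "Computer", "sales": 400.0},
    {"time": "Q1", "location": "Vancouver", "item": "Computer", "sales": 600.0},
    {"time": "Q1", "location": "Toronto", "item": "Computer", "sales": 300.0},
    {"time": "Q2", "location": "Vancouver", "item": "Computer", "sales": 250.0},
    {"time": "Q1", "location": "Vancouver", "item": "Phone", "sales": 100.0},
]


def measure(facts, aggregate=sum, **dimension_values):
    """Aggregate 'sales' over the rows matching every dimension-value pair."""
    selected = [
        row["sales"]
        for row in facts
        if all(row[dim] == value for dim, value in dimension_values.items())
    ]
    return aggregate(selected)


# The query from the slide: one cell of the (time, location, item) cuboid.
q1_vancouver_computer = measure(
    facts, time="Q1", location="Vancouver", item="Computer"
)
print(q1_vancouver_computer)

# Fewer pairs address a cell of a more aggregated cuboid.
print(measure(facts, time="Q1"))
```

Leaving a dimension out of the query is the same as aggregating over all of its values.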

Measures: Three Categories

* Distributive functions: An aggregate function is distributive if, when a set is divided
into n subsets, applying the function to each subset and then combining the n partial
results gives the same result as applying the function to the whole set.
E.g., count(), sum(), min(), max().

* Algebraic functions: An aggregate function is algebraic if it can be calculated by an
algebraic function with M arguments, each of which is obtained by a distributive
aggregate function.
E.g., avg() = sum() / count(), standard_deviation(), ...

* Holistic functions: An aggregate function is holistic if it characterizes one or more
elements of a set relative to the other elements of the set and cannot be obtained by an
algebraic calculation over partial results.
E.g., rank(), median(), ...

Distributive and algebraic aggregate functions are the most frequently used and can be
calculated efficiently. In contrast, holistic aggregate functions cannot in general be
calculated efficiently and are therefore generally not used in data warehouses.
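
The three categories can be checked on a small example. The sketch below uses made-up numbers split into two partitions: sum() is recombined from partial sums (distributive), avg() from partial (sum, count) pairs (algebraic), while combining partial medians does not give the true median (holistic).

```python
from statistics import median

# Made-up data, split into two partitions as a warehouse might store it.
data = [4, 8, 15, 16, 23, 42]
partitions = [data[:2], data[2:]]  # [4, 8] and [15, 16, 23, 42]

# Distributive: the sum of the partial sums equals the sum of the whole set.
total = sum(sum(part) for part in partitions)

# Algebraic: avg() is computed from two distributive results, sum and count.
partials = [(sum(part), len(part)) for part in partitions]
avg = sum(s for s, _ in partials) / sum(c for _, c in partials)

# Holistic: partial medians cannot simply be combined into the true median.
median_of_medians = median([median(part) for part in partitions])
true_median = median(data)

print(total, avg, median_of_medians, true_median)
```

Only a fixed amount of state per partition (one number for sum, two for avg) is needed in the first two cases, which is why such measures are cheap to pre-aggregate.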

Pre-aggregation vs. On-line aggregation

Pre-aggregation: all needed calculations are done in advance by a batch process.

On-line aggregation: the aggregates are computed on-line, at query time.

The main issue is that the data volume to be aggregated is normally very large,
and on-line aggregation has to aggregate it in real time.

The manager's rule of thumb:
- An average aggregation query should get a response from the data warehousing
system in 20 seconds or less.
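
The trade-off can be illustrated with a toy Python sketch: the on-line version scans all facts for every query, while the pre-aggregated version pays for one batch pass and then answers by lookup. The fact rows and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical facts: (quarter, city, sales_in_dollars).
facts = [
    ("Q1", "Vancouver", 400.0),
    ("Q1", "Toronto", 300.0),
    ("Q2", "Vancouver", 250.0),
    ("Q1", "Vancouver", 600.0),
]


def online_total(quarter):
    """On-line aggregation: scan every fact at query time."""
    return sum(sales for q, _, sales in facts if q == quarter)


def build_preaggregate():
    """Pre-aggregation: one batch pass, results stored for later lookups."""
    totals = defaultdict(float)
    for q, _, sales in facts:
        totals[q] += sales
    return dict(totals)


sales_by_quarter = build_preaggregate()  # would run in a nightly batch job

print(online_total("Q1"))      # cost grows with the number of facts
print(sales_by_quarter["Q1"])  # constant-time lookup
```

Both paths return the same value; they differ in when the work is done and in how stale the stored result may be.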

Efficient Data Cube Computation
„ Data cube can be viewed as a lattice of cuboids
„ The bottom-most cuboid is the base cuboid
„ The top-most cuboid (apex) contains only one cell
„ How many cuboids in an n-dimensional cube with L_i levels in dimension i?
$T = \prod_{i=1}^{n} (L_i + 1)$
E.g., a cube with 10 dimensions and 4 levels for each dimension has
5^10 ≈ 9.8 x 10^6 cuboids.
„ Materialization of data cube
„ Materialize every (cuboid) (full materialization), none (no materialization),
or some (partial materialization)
„ Selection of which cuboids to materialize
„ Based on size, sharing, access frequency, etc.
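
The cuboid count, the product of (L_i + 1) over all n dimensions, is easy to evaluate directly. The short Python sketch below uses a hypothetical helper name, count_cuboids, and reproduces the slide's example of 10 dimensions with 4 levels each.

```python
from math import prod


def count_cuboids(levels_per_dimension):
    """Total cuboids: product of (L_i + 1); the +1 is the virtual level 'all'."""
    return prod(levels + 1 for levels in levels_per_dimension)


# 10 dimensions with 4 levels each, as in the example: 5^10.
print(count_cuboids([4] * 10))

# 4 dimensions without hierarchies (1 level each): 2^4 cuboids.
print(count_cuboids([1] * 4))
```

The rapid growth of this number is the reason partial materialization is usually preferred over materializing every cuboid.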

Cube: A Lattice of Cuboids
0-D (apex) cuboid: all
1-D cuboids: time; item; location; supplier
2-D cuboids: time,item; time,location; time,supplier; item,location;
item,supplier; location,supplier
3-D cuboids: time,item,location; time,item,supplier; time,location,supplier;
item,location,supplier
4-D (base) cuboid: time, item, location, supplier
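
The lattice for the four dimensions time, item, location, and supplier can be enumerated mechanically: every subset of the dimensions is one cuboid. A minimal Python sketch follows; the function name cuboid_lattice is an assumption, and concept hierarchies within a dimension are ignored.

```python
from itertools import combinations

dimensions = ("time", "item", "location", "supplier")


def cuboid_lattice(dimensions):
    """Map k to the list of k-D cuboids, each a tuple of dimension names."""
    return {
        k: list(combinations(dimensions, k))
        for k in range(len(dimensions) + 1)
    }


lattice = cuboid_lattice(dimensions)
for k, cuboids in lattice.items():
    print(f"{k}-D cuboids ({len(cuboids)}):", cuboids)
```

The empty tuple at level 0 corresponds to the apex cuboid "all", and the single tuple at level 4 is the base cuboid.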
OLAP Operations
„ Roll up (drill-up): summarize data
„ by climbing up hierarchy or by dimension reduction
„ Drill down (roll down): reverse of roll-up
„ from higher level summary to lower level summary or detailed data, or
introducing new dimensions
„ Slice and dice:
„ project and select
„ Pivot (rotate):
„ reorient the cube for visualization, e.g., turn a 3-D view into a series of 2-D planes
„ Other operations
„ drill across: involving (across) more than one fact table, etc
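
As a rough illustration of roll-up, slice, and dice, the Python sketch below stores a tiny cube as a dictionary from (time, location, item) to a sales value. The cell values, the month-to-quarter hierarchy, and all function names are assumptions made for the example; pivot and drill-across are not shown.

```python
from collections import defaultdict

DIMS = ("time", "location", "item")

# Hypothetical cube cells: (month, city, item) -> dollars_sold.
cells = {
    ("2003-01", "Vancouver", "Computer"): 400.0,
    ("2003-02", "Vancouver", "Computer"): 600.0,
    ("2003-04", "Vancouver", "Phone"): 100.0,
    ("2003-01", "Toronto", "Computer"): 300.0,
}


def month_to_quarter(month):
    """Concept hierarchy step month -> quarter, e.g. '2003-04' -> 'Q2'."""
    return f"Q{(int(month[-2:]) - 1) // 3 + 1}"


def roll_up_time(cells):
    """Roll up by climbing the time hierarchy from month to quarter."""
    out = defaultdict(float)
    for (month, loc, item), value in cells.items():
        out[(month_to_quarter(month), loc, item)] += value
    return dict(out)


def drop_dimension(cells, dim):
    """Roll up by dimension reduction: aggregate one dimension away."""
    idx = DIMS.index(dim)
    out = defaultdict(float)
    for key, value in cells.items():
        out[key[:idx] + key[idx + 1:]] += value
    return dict(out)


def slice_(cells, dim, value):
    """Slice: select a single value on one dimension."""
    idx = DIMS.index(dim)
    return {k: v for k, v in cells.items() if k[idx] == value}


def dice(cells, **allowed):
    """Dice: select a set of values on two or more dimensions."""
    return {
        k: v
        for k, v in cells.items()
        if all(k[DIMS.index(d)] in values for d, values in allowed.items())
    }


print(roll_up_time(cells))
print(slice_(cells, "location", "Toronto"))
print(dice(cells, location={"Vancouver"}, item={"Computer"}))
```

Drill-down is the reverse of roll_up_time and needs the detailed cells to still be available, which is one reason the base cuboid is kept.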

A Star-Net Query Model
[Figure: star-net with one radial line per dimension; each circle on a line is called a footprint]
Customer Orders: CONTRACTS, ORDER
Shipping Method: AIR-EXPRESS, TRUCK
Time: ANNUALLY, QTRLY, DAILY
Product: PRODUCT ITEM, PRODUCT GROUP, PRODUCT LINE
Location: CITY, COUNTRY, REGION
Organization: SALES PERSON, DISTRICT, DIVISION
Promotion
Customer

Example of data warehousing using MS SQL Server 2000

Drill down to see product categories.

Drill down to see product “Clams” sales information.

DW Development Procedure

1. Choose a business process to model, understand the complexity of the data, determine a
data schema to use, etc.
2. Decide on the subject(s) and choose the measures that will populate each fact table record.
3. Choose the fact table, i.e., the grain and measures of the subject.
The grain is the fundamental, atomic level of data to be represented in the fact table,
such as daily or weekly sales.
4. Choose the dimensions that will apply to each fact table record.

Data Warehouse Development: An Incremental Approach
[Figure: development starts by defining a high-level corporate data model; through model
refinement it yields data marts and an enterprise data warehouse, then distributed data
marts, and finally a multi-tier data warehouse]

Data Warehouse Architecture
The architecture of data:
Abstraction level
Business rules
|
Metadata
|
Schema
|
Summary data
|
Operational data

The abstraction hierarchy of data and its description helps users navigate around a data
warehouse. As data gets more abstract, it generally gets less voluminous.

The architecture of data (cont)

- Operational data: who, what, where, and when
- Summary data: summaries by who, what, where, and when
- Schema: physical layout of the data, tables, fields, indexes, types
- Metadata: logical model and mappings to physical layout and sources
(by defining the data in business terms)
- Business rules: what's been learned from the data

Multi-tier architecture:

• Client site: The end user can query and visualize data on the local computer or
connect to a display server that has access to the DW.

• Middle server: Logically, OLAP engines present users with multidimensional
data from DWs or data marts. However, physical architecture and implementation
issues must be considered for OLAP engines.
• DW server: The data warehouse is generated from relational or operational databases
through gateways for extraction and integration of multiple data sources, e.g., ODBC
(Open Database Connectivity), OLE DB (Object Linking and Embedding, Database), and
JDBC (Java Database Connectivity).

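Multi-Tiered Architecture
[Figure: four tiers, from left to right]
Data Sources: operational DBs and other sources
Data Storage: data are extracted, transformed, loaded, and refreshed into the data
warehouse and data marts, together with metadata and a monitor & integrator
OLAP Engine: OLAP server, served from the data warehouse
Front-End Tools: analysis, query, reports, data mining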


Data Warehouse Back-End Tools and Utilities

„ Data extraction:
„ get data from multiple, heterogeneous, and external sources

„ Data cleaning:
„ detect errors in the data and rectify them when possible

„ Data transformation:
„ convert data from legacy or host format to warehouse format

„ Load:
„ sort, summarize, consolidate, compute views, check integrity, and
build indices and partitions
„ Refresh
„ propagate the updates from the data sources to the warehouse
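
The back-end steps above can be strung together as a small pipeline. The Python sketch below is only a toy model under stated assumptions: the two "legacy" sources, the field names city and amount, the rule that missing or negative amounts are errors, and the summary produced at load time are all hypothetical, and refresh is not shown.

```python
def extract(sources):
    """Extraction: pull records from several (here in-memory) sources."""
    for source in sources:
        yield from source


def clean(records):
    """Cleaning: detect bad records; this sketch drops them instead of repairing."""
    for record in records:
        if record.get("amount") is None or record["amount"] < 0:
            continue
        yield record


def transform(records):
    """Transformation: convert each record to one warehouse format."""
    for record in records:
        yield {
            "city": record["city"].strip().title(),
            "sales_in_dollars": float(record["amount"]),
        }


def load(records):
    """Load: summarize and sort while writing into the warehouse table."""
    summary = {}
    for record in records:
        city = record["city"]
        summary[city] = summary.get(city, 0.0) + record["sales_in_dollars"]
    return dict(sorted(summary.items()))


# Hypothetical heterogeneous sources with inconsistent formatting and errors.
legacy_a = [{"city": " vancouver ", "amount": 400}, {"city": "HALIFAX", "amount": None}]
legacy_b = [
    {"city": "Vancouver", "amount": 600},
    {"city": "halifax", "amount": 150},
    {"city": "Toronto", "amount": -5},
]

warehouse = load(transform(clean(extract([legacy_a, legacy_b]))))
print(warehouse)
```

Using generators keeps each stage independent, so a stage can be replaced (for example, a cleaner that repairs records) without touching the others.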

OLAP Server Architectures

„ Multidimensional OLAP (MOLAP)


„ Implemented as a large multidimensional array

„ Fast indexing to pre-computed summarized data (with built-in indexing)

„ Not proven to scale effectively to large, high-dimensionality data sets

„ Relational OLAP (ROLAP)

„ Implemented as a collection of relational tables

„ Can be processed and queried with traditional RDBMS technology (i.e., indexes, joins, etc.)
„ Greater scalability

„ No “built-in” indexing

E.g., the same data are stored in a multidimensional array for MOLAP and in multiple tables
for ROLAP (see the distributed sheet).
„ Hybrid OLAP (HOLAP)
„ User flexibility, e.g., low level: relational, high-level: array

„ MS SQL Server 2000
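
The MOLAP versus ROLAP storage difference can be made concrete with a toy example that holds the same data both ways. In this Python sketch the dimension values, the sales figures, and the lookup helpers are assumptions; real servers use far more elaborate array compression and indexing.

```python
# Dimension members; their positions serve as array coordinates.
times = ["Q1", "Q2"]
items = ["Computer", "Phone"]

# ROLAP view: one relational row per non-empty cell (hypothetical data).
rolap_rows = [
    ("Q1", "Computer", 1000.0),
    ("Q1", "Phone", 100.0),
    ("Q2", "Computer", 250.0),
]

# MOLAP view: a dense array indexed by dimension positions.
molap = [[0.0] * len(items) for _ in times]
for t, i, value in rolap_rows:
    molap[times.index(t)][items.index(i)] = value


def rolap_lookup(t, i):
    """Relational access: find matching rows by value comparison (a scan here)."""
    return sum(v for rt, ri, v in rolap_rows if rt == t and ri == i)


def molap_lookup(t, i):
    """Array access: the coordinates address the cell directly."""
    return molap[times.index(t)][items.index(i)]


print(molap)
print(rolap_lookup("Q1", "Phone"), molap_lookup("Q1", "Phone"))
```

Note that the array also stores the empty cell (Q2, Phone); with many sparse dimensions such empty cells dominate, which is one way to read the scalability concern listed for MOLAP.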

From On-Line Analytical Processing
to On-Line Analytical Mining (OLAM)
„ Why online analytical mining?
„ High quality of data in data warehouses
„ DW contains integrated, consistent, cleaned data

„ Available information processing structure surrounding data warehouses
„ ODBC (Open Database Connectivity), Web access, service facilities,
reporting and OLAP tools
„ OLAP-based exploratory data analysis
„ mining with drilling, dicing, pivoting, etc.
„ On-line selection of data mining functions
„ integration and swapping of multiple mining functions, algorithms, and tasks


„ Architecture of OLAM

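An OLAM Architecture
[Figure: four-layer OLAM architecture, top to bottom]
Layer 4, User Interface: user GUI API; accepts mining queries and returns mining results
Layer 3, OLAP/OLAM: OLAM engine and OLAP engine
(Data Cube API between Layer 3 and Layer 2)
Layer 2, MDDB: multidimensional database and metadata
(Database API between Layer 2 and Layer 1: filtering and integration)
Layer 1, Data Repository: databases and data warehouse, fed by data cleaning and
data integration

Summary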

„ Data warehouse
„ A subject-oriented, integrated, time-variant, and nonvolatile collection of
data in support of management’s decision-making process
„ A multi-dimensional model of a data warehouse
„ Multidimensional data model
„ Star schema, snowflake schema, fact constellations
„ A data cube consists of identifier dimensions & measure dimension
„ Concept hierarchies
„ OLAP operations: drilling, rolling, slicing, dicing and pivoting
„ OLAP servers: ROLAP, MOLAP, HOLAP
„ …
