Data Mining: Concepts and Techniques
Data Mining: Concepts and Techniques
Data Mining: Concepts and Techniques
Concepts and
Techniques
1
Knowledge discovery
3
Data mining system
4
Data mining system
Pattern evaluation
Data
Databases Warehouse
6
What is a data warehouse?
7
What is Data Warehouse?
9
Data Warehouse—Integrated
Constructed by integrating multiple,
heterogeneous data sources
relational databases, flat files, on-line
transaction records
Data cleaning and data integration techniques
are applied.
Ensure consistency in naming conventions,
15
OLTP vs. OLAP
OLTP OLAP
users clerk, IT professional knowledge worker
function day to day operations decision support
DB design application-oriented subject-oriented
data current, up-to-date historical,
detailed, flat relational summarized, multidimensional
isolated integrated, consolidated
usage repetitive ad-hoc
access read/write lots of scans
index/hash on prim. key
unit of work short, simple transaction complex query
# records accessed tens millions
#users thousands hundreds
DB size 100MB-GB 100GB-TB
metric transaction throughput query throughput, response
16
Why Separate Data Warehouse?
High performance for both systems
DBMS— tuned for OLTP: access methods,
18
From Tables and
Spreadsheets to Data
Cubes
Adata warehouse is based on a multidimensional
data model which views data in the form of a
data cube
A data cube (such as sales) allows data to be
modeled and viewed in multiple dimensions. It is
defined by dimensions and facts.
Dimensions are the perspectives or entities wrt
which an organization wants to keep records.
e.g. A sales data warehouse to keep records of
the store’s sales wrt dimensions time, item,
branch and location
19
From Tables and Spreadsheets to
Data Cubes
21
Cube: A Lattice of
Cuboids
all
0-D(apex) cuboid
time,location,supplier
time,item,location 3-D cuboids
time,item,supplier item,location,supplier
4-D(base) cuboid
time, item, location, supplier
22
Conceptual Modeling
of Data Warehouses
Modeling data warehouses: dimensions &
measures
Star schema: A fact table in the middle
connected to a set of dimension tables
Snowflake schema: A refinement of star
schema where some dimensional hierarchy is
normalized into a set of smaller dimension
tables, forming a shape similar to snowflake.
Fact constellations: Multiple fact tables share
dimension tables, viewed as a collection of
23
Example of Star Schema
time
time_key item
day item_key
day_of_the_week Sales Fact Table item_name
month brand
quarter time_key type
year supplier_type
item_key
branch_key
branch location
location_key
branch_key location_key
branch_name units_sold street
branch_type city
dollars_sold province_or_street
country
avg_sales
Measures
24
Example of Snowflake
Schema
time
time_key item
day item_key supplier
day_of_the_week Sales Fact Table item_name supplier_key
month brand supplier_type
quarter time_key type
year item_key supplier_key
branch_key
location
branch location_key
location_key
branch_key
units_sold street
branch_name
city_key city
branch_type
dollars_sold
city_key
avg_sales city
province_or_street
Measures country
25
Example of Fact
Constellation
time
time_key item Shipping Fact Table
day item_key
day_of_the_week Sales Fact Table item_name time_key
month brand
quarter time_key type item_key
year supplier_type shipper_key
item_key
from_location
branch_key
branch location_key location to_location
branch_key location_key dollars_cost
branch_name units_sold
street
branch_type dollars_sold city units_shipped
province_or_street
avg_sales country shipper
Measures shipper_key
shipper_name
location_key
shipper_type 26
Data warehouse and data mart
<dimension_name_first_time> in cube
<cube_name_first_time>
28
Defining a Star Schema in
DMQL
33
A Concept Hierarchy: Dimension
(location)
all all
34
A Concept Hierarchy
year
country
quarter
Province_or_sta
te month week
city
Street
day
Partial order,
lattice
35
View of Warehouses and
Hierarchies
Specification of
hierarchies
Schema
hierarchy
day <
{month <
quarter;
week} <
year
Set_grouping
hierarchy
{1..10} <
inexpensi
ve
36
Multidimensional Data
Sales volume as a function of product, month,
and region
Dimensions: Product, Location, Time
Hierarchical summarization paths
o n
gi
Office Day
Month
37
A Sample Data Cube
Total annual sales
Date of TV in U.S.A.
1Qtr 2Qtr 3Qtr 4Qtr sum
c t
TV
du
PC U.S.A
o
Pr
VCR
Country
sum
Canada
Mexico
sum
38
Cuboids Corresponding to the
Cube
all
0-D(apex) cuboid
product date country
1-D cuboids
3-D(base) cuboid
product, date, country
39
Typical OLAP Operations
40
Typical OLAP Operations
Pivot (rotate):
reorient the cube, visualization, 3D to series
of 2D planes.
Other operations
drill across: involving (across) more than one
fact table
drill through: through the bottom level of the
cube to its back-end relational tables (using
SQL)
41
A Star-Net Model
location
customer
Country
State group
City category
street name
day item
month Name brand category
quarte type
y r
ear Each line consists of footprints
time (circles) representing abstraction
levels of dimension
42
Data warehouse architecture
43
Design of a Data Warehouse: A
Business Analysis Framework
Four views regarding the design of a data
warehouse
Top-down view
allows selection of the relevant information necessary
for the data warehouse
Data source view
exposes the information being captured, stored, and
managed by operational systems
Data warehouse view
consists of fact tables and dimension tables
Business query view
sees the perspectives of data in the warehouse from
44
Data Warehouse Design
Process
(mature)
Bottom-up: Starts with experiments and prototypes
(rapid)
From software engineering point of view
Waterfall: structured and systematic analysis at each
invoices, etc.
Choose the grain (atomic level of data) of the business
process 45
Multi-Tiered Architecture
Monitor
& OLAP Server
other Metadata
source Integrator
s Analysis
Operational Extract Query
Transform Data Serve Reports
DBs Load
Refresh
Warehouse Data mining
Data Marts
Enterprise
Data Data
Data
Mart Mart
Warehouse
matrix techniques)
fast indexing to pre-computed summarized data
50
Efficient Data Cube Computation
Data cube can be viewed as a lattice of cuboids
The bottom-most cuboid is the base cuboid
The top-most cuboid (apex) contains only one
cell
How many cuboids
n
T = ∏ ( Li +1)
in an n-dimensional cube
with L levels?
i =1
C c3 61
c2 45
62 63 64
46 47 48
c1 29 30 31 32
c0
B13 14 15 16 60
b3 44
B 28 56
b2 9
40
24 52
b1 5
36
20
b0 1 2 3 4
a0 a1 a2 a3
A
55
Multi-way Array Aggregation
for Cube Computation
C c3 61
c2 45
62 63 64
46 47 48
c1 29 30 31 32
c0
B13 14 15 16 60
b3 44
B 28 56
b2 9
40
24 52
b1 5
36
20
b0 1 2 3 4
a0 a1 a2 a3
A
56
Multi-Way Array Aggregation for
Cube Computation (Cont.)
dimensions 59
Efficient Processing OLAP
Queries
technology
63
Discovery-Driven Exploration of
Data Cubes
65
Complex Aggregation at Multiple
Granularities: Multi-Feature Cubes
mining
67
Data Warehouse Usage
Three kinds of data warehouse applications
Information processing
supports querying, basic statistical analysis, and
reporting using crosstabs, tables, charts and graphs
Analytical processing
multidimensional analysis of data warehouse data
supports basic OLAP operations, slice-dice, drilling,
pivoting
Data mining
knowledge discovery from hidden patterns
supports associations, constructing analytical
models, performing classification and prediction,
and presenting the mining results using
68
From On-Line Analytical
Processing to On Line
Analytical Mining (OLAM)
Layer2
MDDB
MDDB
Meta
Data
Filtering&Integration Database API Filtering
Layer1
Data cleaning Data
Databases Data
Data integration Warehouse Repository
70