Lecture 1
• Course syllabus
• Overview of data warehousing and mining
Motivation:
“Necessity is the Mother of Invention”
What Is Data Mining?
Examples: What is (not) Data Mining?
• Databases to be mined
– Relational, transactional, object-oriented, object-relational,
active, spatial, time-series, text, multi-media, heterogeneous,
legacy, WWW, etc.
• Knowledge to be mined
– Characterization, discrimination, association, classification,
clustering, trend, deviation and outlier analysis, etc.
– Multiple/integrated functions and mining at multiple levels
• Techniques utilized
– Database-oriented, data warehouse (OLAP), machine learning,
statistics, visualization, neural network, etc.
• Applications adapted
– Retail, telecommunication, banking, fraud analysis, DNA mining, stock
market analysis, Web mining, Weblog analysis, etc.
Data Mining Tasks
• Prediction Tasks
– Use some variables to predict unknown or future values of other
variables
• Description Tasks
– Find human-interpretable patterns that describe the data.
Classification Example
[Figure: a training set of labeled records (attributes such as marital status and taxable income, with a Yes/No class label) is fed to a learning algorithm to induce a model; the model is then applied as a classifier to an unseen test set.]
Classification: Application 1
• Direct Marketing
– Goal: Reduce cost of mailing by targeting a set of
consumers likely to buy a new cell-phone product.
– Approach:
• Use the data for a similar product introduced before.
• We know which customers decided to buy and which decided
otherwise. This {buy, don’t buy} decision forms the class
attribute.
• Collect various demographic, lifestyle, and company-
interaction related information about all such customers.
– Type of business, where they live, how much they earn, etc.
• Use this information as input attributes to learn a classifier
model.
Classification: Application 2
• Fraud Detection
– Goal: Predict fraudulent cases in credit card
transactions.
– Approach:
• Use credit card transactions and the information on the
account holder as attributes.
– When does a customer buy, what does he buy, how often does
he pay on time, etc.
• Label past transactions as fraud or fair transactions. This
forms the class attribute.
• Learn a model for the class of the transactions.
• Use this model to detect fraud by observing credit card
transactions on an account.
Classification: Application 3
• Customer Attrition/Churn:
– Goal: To predict whether a customer is likely to be
lost to a competitor.
– Approach:
• Use detailed records of transactions with each of the past and
present customers to find attributes.
– How often the customer calls, where he calls, what time of day
he calls most, his financial status, marital status, etc.
• Label the customers as loyal or disloyal.
• Find a model for loyalty.
Classification: Application 4
Classifying Galaxies
• Class: stage of formation (early, intermediate, late)
• Attributes: image features, characteristics of light
waves received, etc.
• Data size: 72 million stars, 20 million galaxies;
object catalog: 9 GB; image database: 150 GB
Clustering Definition
Illustrating Clustering
Clustering: Application 1
• Market Segmentation:
– Goal: subdivide a market into distinct subsets of
customers where any subset may conceivably be
selected as a market target to be reached with a
distinct marketing mix.
– Approach:
• Collect different attributes of customers based on their
geographical and lifestyle related information.
• Find clusters of similar customers.
• Measure the clustering quality by observing buying patterns
of customers in same cluster vs. those from different clusters.
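The segmentation idea above can be sketched with a minimal k-means loop. Everything here is illustrative: the customer points, the two attributes (annual spend, visits per month), and the starting centroids are all made up for the sketch; a real segmentation would use many attributes and a library implementation.

```python
# Minimal k-means sketch for market segmentation (toy data, fixed
# initial centroids so the run is deterministic).

def kmeans(points, centroids, iterations=10):
    """Assign each point to its nearest centroid, then recompute centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # squared Euclidean distance to each centroid
            dists = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
            if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids, clusters

customers = [(100, 2), (120, 3), (90, 1),      # low-spend segment
             (900, 12), (950, 14), (880, 11)]  # high-spend segment
centroids, clusters = kmeans(customers, centroids=[(0, 0), (1000, 20)])
print(len(clusters[0]), len(clusters[1]))  # 3 3
```

The two obvious segments fall into separate clusters, which is exactly the "similar customers" grouping the slide describes.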
Clustering: Application 2
• Document Clustering:
– Goal: To find groups of documents that are similar to
each other based on the important terms appearing in
them.
– Approach: To identify frequently occurring terms in
each document. Form a similarity measure based on
the frequencies of different terms. Use it to cluster.
– Gain: Information Retrieval can utilize the clusters to
relate a new document or search term to clustered
documents.
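The similarity measure described above can be sketched as cosine similarity over term-frequency vectors. The three toy documents are invented for the sketch; real document clustering would also weight terms (e.g., TF-IDF) and remove stop words.

```python
# Term-frequency cosine similarity, the core of the approach above.
import math
from collections import Counter

def cosine(doc_a, doc_b):
    """Cosine similarity between the term-frequency vectors of two texts."""
    ta, tb = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(ta[t] * tb[t] for t in ta)
    norm = math.sqrt(sum(v * v for v in ta.values())) * \
           math.sqrt(sum(v * v for v in tb.values()))
    return dot / norm if norm else 0.0

d1 = "data mining finds patterns in data"
d2 = "mining data for patterns"
d3 = "golf championship this weekend"
print(cosine(d1, d2) > cosine(d1, d3))  # True
```

Documents sharing frequent terms score high; documents with no common terms score zero, so thresholding this measure groups similar documents together.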
Association Rule Discovery: Definition
TID | Items
1   | Bread, Coke, Milk
2   | Beer, Bread
3   | Beer, Coke, Diaper, Milk
4   | Beer, Bread, Diaper, Milk
5   | Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
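The two discovered rules can be checked by computing support and confidence directly over the five transactions (a minimal sketch of rule evaluation, not a rule-discovery algorithm):

```python
# Support and confidence for the rules discovered above.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """Of the transactions containing lhs, the fraction also containing rhs."""
    return support(lhs | rhs) / support(lhs)

print(support({"Milk", "Coke"}))                           # 0.6
print(confidence({"Milk"}, {"Coke"}))                      # 0.75
print(round(confidence({"Diaper", "Milk"}, {"Beer"}), 2))  # 0.67
```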
Association Rule Discovery: Application 1
Regression
Deviation/Anomaly Detection
– Network Intrusion
Detection
Data Mining and Induction Principle
Induction vs Deduction
The Problems with Induction
Prototypical example:
– All observed European swans are white.
– Induce: "All swans are white" as a general rule.
– Then discover Australia and its black swans...
– Problem: the set of examples is not random and representative
• Exploratory-based method:
– Try to make sense of a bunch of data without an a priori
hypothesis!
– The only prevention against false results is significance:
• ensure statistical significance (using train and test etc.)
• ensure domain significance (i.e., make sure that the results make
sense to a domain expert)
Data Mining: A KDD Process
• Data mining is the core of the knowledge discovery process.
• Typical flow:
Databases → Data Cleaning → Data Integration → Data Warehouse →
Data Selection (task-relevant data) → Data Mining → Pattern Evaluation
Steps of a KDD Process
Data Exploration
Statistical Analysis, Querying and Reporting
• Relational databases
• Data warehouses
• Transactional databases
• Advanced DB and information repositories
– Object-oriented and object-relational databases
– Spatial databases
– Time-series data and temporal data
– Text databases and multimedia databases
– Heterogeneous and legacy databases
– WWW
Data Mining: Confluence of Multiple
Disciplines
Data mining draws on database technology, statistics, machine
learning, visualization, information science, and other disciplines.
Data Mining vs. Statistical Analysis
Statistical Analysis:
• Ill-suited for Nominal and Structured Data Types
• Completely data driven - incorporation of domain knowledge not
possible
• Interpretation of results is difficult and daunting
• Requires expert user guidance
Data Mining:
• Large Data sets
• Efficiency of Algorithms is important
• Scalability of Algorithms is important
• Real World Data
• Lots of Missing Values
• Pre-existing data - not user generated
• Data not static - prone to updates
• Efficient methods for data retrieval available for use
Data Mining vs. DBMS
Example of DBMS, OLAP and Data
Mining: Weather Data
DBMS:
Day outlook temperature humidity windy play
1 sunny 85 85 false no
2 sunny 80 90 true no
3 overcast 83 86 false yes
4 rainy 70 96 false yes
5 rainy 68 80 false yes
6 rainy 65 70 true no
7 overcast 64 65 true yes
8 sunny 72 95 false no
9 sunny 69 70 false yes
10 rainy 75 80 false yes
11 sunny 75 70 true yes
12 overcast 72 90 true yes
13 overcast 81 75 false yes
14 rainy 71 91 true no
Example of DBMS, OLAP and Data
Mining: Weather Data
OLAP:
• Using OLAP we can create a Multidimensional Model of our data
(Data Cube).
• For example using the dimensions: time, outlook and play we can
create the following model.
Example of DBMS, OLAP and Data
Mining: Weather Data
Data Mining:
• outlook = sunny
– humidity = high: no
– humidity = normal: yes
• outlook = overcast: yes
• outlook = rainy
– windy = true: no
– windy = false: yes
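The mined tree above can be written as a plain function and checked against the 14-row weather table from the DBMS slide. One assumption is made for the sketch: humidity is treated as "high" when above 75, since the table stores numeric humidity while the tree uses high/normal.

```python
# The decision tree above as code, validated against the training table.
weather = [  # (outlook, temperature, humidity, windy, play)
    ("sunny", 85, 85, False, "no"),     ("sunny", 80, 90, True, "no"),
    ("overcast", 83, 86, False, "yes"), ("rainy", 70, 96, False, "yes"),
    ("rainy", 68, 80, False, "yes"),    ("rainy", 65, 70, True, "no"),
    ("overcast", 64, 65, True, "yes"),  ("sunny", 72, 95, False, "no"),
    ("sunny", 69, 70, False, "yes"),    ("rainy", 75, 80, False, "yes"),
    ("sunny", 75, 70, True, "yes"),     ("overcast", 72, 90, True, "yes"),
    ("overcast", 81, 75, False, "yes"), ("rainy", 71, 91, True, "no"),
]

def predict(outlook, humidity, windy):
    if outlook == "sunny":
        return "no" if humidity > 75 else "yes"  # high vs. normal humidity
    if outlook == "overcast":
        return "yes"
    return "no" if windy else "yes"              # rainy branch

correct = sum(predict(o, h, w) == play for o, t, h, w, play in weather)
print(correct, "/", len(weather))  # 14 / 14
```

Under that humidity threshold the tree reproduces every row of the table, which is what makes it the pattern a decision-tree learner would extract from this data.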
Major Issues in Data Warehousing and
Mining
• Mining methodology and user interaction
– Mining different kinds of knowledge in databases
– Interactive mining of knowledge at multiple levels of abstraction
– Incorporation of background knowledge
– Data mining query languages and ad-hoc data mining
– Expression and visualization of data mining results
– Handling noise and incomplete data
– Pattern evaluation: the interestingness problem
• Performance and scalability
– Efficiency and scalability of data mining algorithms
– Parallel, distributed and incremental mining methods
Major Issues in Data Warehousing and
Mining
• Issues relating to the diversity of data types
– Handling relational and complex types of data
– Mining information from heterogeneous databases and global
information systems (WWW)
• Issues related to applications and social impacts
– Application of discovered knowledge
• Domain-specific data mining tools
• Intelligent query answering
• Process control and decision making
– Integration of the discovered knowledge with existing knowledge:
A knowledge fusion problem
– Protection of data security, integrity, and privacy
Multi-Tiered Architecture of data warehousing and data mining
• Bottom tier: operational databases and other external sources feed the
warehouse through extract, transform, load, and refresh operations
(coordinated by an integrator and monitor).
• Middle tier: the data warehouse and its data marts, described by
metadata, are served by an OLAP server.
• Top tier: front-end tools for analysis, querying, reporting, and
data mining.
2008/1/9 2
XML is Extensible
• The tags used to mark up HTML documents and
the structure of HTML documents are
predefined; XML, by contrast, lets authors define
their own tags and document structure.
XML Syntax
• An example XML document.
<?xml version="1.0"?>
<note>
<to>Tan Siew Teng</to>
<from>Lee Sim Wee</from>
<heading>Reminder</heading>
<body>Don't forget the Golf Championship this
weekend!</body>
</note>
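The note document above parses directly with Python's standard-library XML parser, which is a quick way to experiment with the examples in these slides:

```python
# Parsing the note example with xml.etree.ElementTree.
import xml.etree.ElementTree as ET

doc = """<?xml version="1.0"?>
<note>
<to>Tan Siew Teng</to>
<from>Lee Sim Wee</from>
<heading>Reminder</heading>
<body>Don't forget the Golf Championship this weekend!</body>
</note>"""

root = ET.fromstring(doc)       # parses the document, returns the root element
print(root.tag)                 # note
print(root.find("to").text)     # Tan Siew Teng
print(len(root))                # 4 child elements
```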
Example (cont’d)
• The first line in the document: The XML
declaration should always be included.
• It defines the XML version of the document.
• In this case the document conforms to the 1.0
specification of XML.
<?xml version="1.0"?>
• The next line defines the first element of the
document (the root element):
<note>
Example (cont’d)
• The next lines define 4 child elements of the root
(to, from, heading, and body).
• The last line defines the end of the root element:
</note>
What is an XML element?
• An XML element is made up of a start tag, an
end tag, and data in between.
<Sport>Golf</Sport>
• The name of the element is enclosed by the less
than and greater than characters, and these are
called tags.
• The start and end tags describe the data within the
tags, which is considered the value of the
element.
• For example, the following XML element is a
<player> element with the value “Tiger Wood.”
<player>Tiger Wood</player>
There are 3 types of tags
• Start-Tag
– In the example <Sport> is the start tag. It defines
type of the element and possible attribute
specifications
<Player firstname="Wood" lastname="Tiger">
• End-Tag
– In the example </Sport> is the end tag. It
identifies the type of element that the tag is ending.
Unlike the start tag, the end tag cannot contain
attribute specifications.
• Empty Element Tag
– Like a start tag this has attribute specifications, but it
does not need an end tag. It denotes that the element is
empty (does not contain any other elements). Note
the symbol '/' before '>' that marks the tag as ending.
<Player firstname="Wood" lastname="Tiger"/>
XML elements must have a closing
tag
In HTML some elements do not have to have a closing tag.
Rules for Naming Elements
• XML names should start with a letter or the
underscore character.
• Rest of the name can contain letters, digits,
dots, underscores or hyphens.
• No spaces in names are allowed.
• Names cannot start with 'xml' which is a
reserved word.
XML tags are case sensitive
• XML tags are case sensitive. The tag <Message> is
different from the tag <message>.
<message>This is correct</message>
<Message>This is incorrect</message>
All XML elements must be
properly nested
In HTML some elements can be improperly nested within each other
like this:
In XML all elements must be properly nested within each other like this
XML documents must have a root tag
• Documents must contain a single tag pair to
define the root element.
• All other elements must be nested within the root
element.
• All elements can have sub (children) elements.
• Sub elements must be in pairs and correctly
nested within their parent element:
<root>
<child>
<subchild>
</subchild>
</child>
</root>
XML Attributes
• XML attributes are normally used to describe
XML elements, or to provide additional
information about elements.
• An element can optionally contain one or more
attributes. An attribute is a name-value pair
separated by an equal sign (=).
• Usually, or most common, attributes are used to
provide information that is not a part of the
content of the XML document.
• Often the attribute data is more important to the
XML parser than to the reader.
XML Attributes (cont’d)
• Attributes are always contained within the start
tag of an element. Here are some examples:
<Player firstname="Wood" lastname="Tiger" />
Player - Element Name
Firstname - Attribute Name
Wood - Attribute Value
• HTML examples:
<img src="computer.gif">
<a href="demo.asp">
• XML examples:
<file type="gif">
<person id="3344">
Attribute values must always be
quoted
• XML elements can have attributes in name/value
pairs just like in HTML.
• An element can optionally contain one or more
attributes.
• In XML the attribute value must always be
quoted.
• An attribute is a name-value pair separated by an
equal sign (=).
<CITY ZIP="01085">Westfield</CITY>
• ZIP="01085" is an attribute of the <CITY> element.
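With ElementTree, the attribute from the CITY example is read through the element's `attrib` dictionary:

```python
# Reading an attribute value from the CITY example above.
import xml.etree.ElementTree as ET

city = ET.fromstring('<CITY ZIP="01085">Westfield</CITY>')
print(city.attrib["ZIP"])  # 01085
print(city.text)           # Westfield
```

Note that the attribute value comes back as a string; the leading zero in "01085" survives precisely because it was quoted attribute text, not a number.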
What is a Comment ?
• Comments are informational help for the
reader.
What is a DTD ?
• A Document Type Definition (DTD) is a
mechanism (set of rules) to describe the
structure, syntax, and vocabulary of XML
documents.
Document Type Definition (DTD)
• Define the legal building blocks of an XML
document.
• Set of rules to define document structure with a
list of legal elements.
• Declared inline in the XML document or as an
external reference.
• All names are user defined.
• Derived from SGML.
• One DTD can be used for multiple documents.
• Has ASCII format.
• DOCTYPE keyword.
Element Declaration
• The following lines show the possible syntaxes for
element declarations:
<!ELEMENT reports (employee*)>
<!ELEMENT employee (ss_number, first_name,
middle_name, last_name, email, extension, birthdate,
salary)>
<!ELEMENT email (#PCDATA)>
<!ELEMENT extension EMPTY>
#PCDATA - Parsed Character Data, meaning that the
element can contain text. No child elements may be
present in an element declared with #PCDATA.
EMPTY - Indicates a leaf element, which cannot
contain any children.
Occurrence
• There are notations to specify the number of
occurrences that a child element can occur within
the parent element.
• These notations can appear at the end of each
child element
+ - Element can appear one or several times
* - Element can appear zero or several times
? - Element can appear zero or one time
nothing - Element can appear only once (also, it
must appear once)
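These occurrence indicators behave like regular-expression quantifiers, so a content model can be checked against a sequence of child-element names with ordinary regex matching. This is an illustrative analogy only, not a real DTD validator; the element names are taken from the declarations above.

```python
# DTD occurrence indicators as regex quantifiers (* + ? and "nothing").
import re

# <!ELEMENT reports (employee*)> -- employee may appear zero or more times
star = re.compile(r"(employee )*")
print(bool(star.fullmatch("")))                    # True: zero occurrences allowed
print(bool(star.fullmatch("employee employee ")))  # True

# '+' requires at least one occurrence
plus = re.compile(r"(item )+")
print(bool(plus.fullmatch("")))       # False: '+' needs one occurrence
print(bool(plus.fullmatch("item ")))  # True
```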
Separators
, - Elements on the right and left of
comma must appear in the same
order.
<?xml version="1.0"?>
<TITLE>
<title>A Well-formed Document</title>
<first>
This is a simple
<bold>well-formed</bold>
document.
</first>
</TITLE>
Source: L1.xml
Rules for Well-formed
Documents
• The first line of a well-formed XML document
must be an XML declaration.
• All non-empty elements must have start tags and
end tags with matching element names.
• All empty elements must end with />.
• All documents must contain exactly one root element.
• Nested elements must be completely nested
within their higher-level elements.
• The only predefined entity references are &amp;,
&lt;, &gt;, &apos;, and &quot;.
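A standard XML parser rejects documents that break these rules, so ElementTree can serve as a quick well-formedness check (the sample strings below are made up for the sketch):

```python
# Checking well-formedness by attempting to parse.
import xml.etree.ElementTree as ET

def well_formed(doc):
    """Return True iff the string parses as well-formed XML."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False

print(well_formed("<root><child/></root>"))      # True
print(well_formed("<b><i>bad nesting</b></i>"))  # False: improper nesting
print(well_formed("<note>no end tag"))           # False: missing </note>
```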
DTD Graph
• Given the DTD of the XML to be stored, we can create a structure
called the DTD graph that mirrors the structure of the DTD. Each node
in the DTD graph represents an XML element (drawn as a rectangle), an
XML attribute (a semicircle), or an operator (a circle). They are put
together in a hierarchical containment under a root element node, with
element nodes under a parent element node, separated by occurrence
indicators in circles.
Extended Entity Relationship Model DTD Graph
[Figure: an Extended Entity Relationship model for a customer sales application (entities Invoice, Customer, Item, and Monthly Sales with their attributes, relationships, and cardinalities) mapped to the corresponding DTD graph.]
A Sample DTD graph for Customer Sales
[Figure: DTD graph with root element Sales (occurrence +) containing child elements Invoice, Customer, Monthly Sales, and Item (each with occurrence *), their attributes (e.g., Invoice_no, Customer_no, Item_no, Year, Month, Quantity, Total), and ID/IDREF links between them.]
The mapped DTD from DTD Graph
<!ELEMENT Sales (Invoice*, Customer*, Item*, Monthly_sales*)>
<!ATTLIST Sales
Status (New | Updated | History) #REQUIRED>
<!ELEMENT Invoice (Invoice_item*)>
<!ATTLIST Invoice
Invoice_no CDATA #REQUIRED
Quantity CDATA #REQUIRED
Invoice_amount CDATA #REQUIRED
Invoice_date CDATA #REQUIRED
Shipment_date CDATA #IMPLIED
Customer_idref IDREF #REQUIRED>
<!ELEMENT Customer (Customer_address*)>
<!ATTLIST Customer
Customer_id ID #REQUIRED
Customer_name CDATA #REQUIRED
Customer_no CDATA #REQUIRED
Sex CDATA #IMPLIED
Postal_code CDATA #IMPLIED
Telephone CDATA #IMPLIED
Email CDATA #IMPLIED>
<!ELEMENT Customer_address EMPTY>
<!ATTLIST Customer_address
Address_type (Home|Office) #REQUIRED
Address NMTOKENS #REQUIRED
City CDATA #IMPLIED
State CDATA #IMPLIED
Country CDATA #IMPLIED
Customer_idref IDREF #REQUIRED
Is_default (Y|N) "Y">
<!ELEMENT Invoice_item EMPTY>
<!ATTLIST Invoice_item
Quantity CDATA #REQUIRED
Unit_price CDATA #REQUIRED
Invoice_price CDATA #REQUIRED
Discount CDATA #REQUIRED
Item_idref IDREF #REQUIRED>
<!ELEMENT Item EMPTY>
<!ATTLIST Item
Item_id ID #REQUIRED
Item_name CDATA #REQUIRED
Author CDATA #IMPLIED
Publisher CDATA #IMPLIED
Item_price CDATA #REQUIRED>
<!ELEMENT Monthly_sales (Item_sales*, Customer_sales*)>
<!ATTLIST Monthly_sales
Year CDATA #REQUIRED
Month CDATA #REQUIRED
Quantity CDATA #REQUIRED
Total CDATA #REQUIRED>
<!ELEMENT Item_sales EMPTY>
<!ATTLIST Item_sales
Quantity CDATA #REQUIRED
Total CDATA #REQUIRED
Item_idref IDREF #REQUIRED>
<!ELEMENT Customer_sales EMPTY>
<!ATTLIST Customer_sales
Quantity CDATA #REQUIRED
Total CDATA #REQUIRED
Customer_idref IDREF #REQUIRED>
Review Question 1
What are the similarities and differences
between a DTD and a well-formed document?
Tutorial question 1
Map the following Extended Entity Relationship Model into
a DTD graph and a Document Type Definition (DTD).
[EER model: Department (Department_ID, Salary) 1—has—n Trip (Trip_ID); Trip 1—taken—n Car_rental (Staff_ID, Car_model).]
Reading assignment
Chapter 3 Schema Translation in “Information
Systems Reengineering and Integration”
Second Edition, by Joseph Fong, published
by Springer, 2006, pp.142-154.
Extended Entity Relationship (EER) Model
Example
A president leads a nation.
Relational Model:
Relation President (President_name, Race, *Nation_name)
Relation Nation (Nation_name, Nation_size)
Where underlined are primary keys and "*" prefixed are foreign keys
2023/2/5 2
Cardinality: Many-to-one relationship
A many-to-one relationship from set A to set B is defined as: For all a in A, there exists at most one b in B
such that a and b are related, and for all b in B, there exists zero or more a in A such that a and b are related.
Relational Model:
Relation Director (Director_name, Age)
Relation Movies (Movie_name, Sales_volume, *Director_name)
Cardinality: Many-to-many relationship
A many-to-many relationship between set A and set B is defined as: For all a in A, there exists zero
or more b in B such that a and b are related, and vice versa.
Example
Many students take many courses such that a student can take many courses and a course can be taken by
many students.
Relational Model:
Relation Student (Student_id, Student_name)
Relation Course (Course_id, Course_name)
Relation take (*Student_id, *Course_id)
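The three relations above can be rendered as SQL tables, with the `take` relation carrying two foreign keys that together form its primary key. This is a sketch using sqlite3 with invented sample rows; the table and column names follow the relational model above.

```python
# Many-to-many relationship as a relation with a composite key.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.execute("CREATE TABLE Student (Student_id TEXT PRIMARY KEY, Student_name TEXT)")
con.execute("CREATE TABLE Course (Course_id TEXT PRIMARY KEY, Course_name TEXT)")
con.execute("""CREATE TABLE take (
    Student_id TEXT REFERENCES Student,
    Course_id  TEXT REFERENCES Course,
    PRIMARY KEY (Student_id, Course_id))""")

con.execute("INSERT INTO Student VALUES ('s1', 'Ann'), ('s2', 'Bob')")
con.execute("INSERT INTO Course VALUES ('c1', 'Databases')")
con.execute("INSERT INTO take VALUES ('s1', 'c1'), ('s2', 'c1')")  # one course, many students

rows = con.execute("""SELECT Student_name FROM Student
    JOIN take USING (Student_id) WHERE Course_id = 'c1'
    ORDER BY Student_name""").fetchall()
print(rows)  # [('Ann',), ('Bob',)]
```

Because the composite primary key is (Student_id, Course_id), the same student can take many courses and the same course can be taken by many students, but each pairing appears at most once.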
Data Semantic: Is-a (Subtype) relationship
The relationship A isa B is defined as: A is a special kind of B.
Example
Father is Male.
Relational Model:
Relation Male (Name, Height)
Relation Father (*Name, Birth_date)
Data Semantic: Disjoint Generalization
-Generalization is to classify similar entities into a single entity. More than one is-a
relationship can form data abstraction (i.e. super-class and subclasses) among entities.
- A subclass entity is a subset of its super-class entity. There are two kinds of generalization.
- The first is disjoint generalization such that subclass entities are mutually exclusive.
- The second is overlap generalization such that subclass entities can overlap each other.
Relational Model:
Relation Boat_person (Name, Birth_date,
Birth_place)
Relation Refugee (*Name, Open_center)
Relation Non-refugee (*Name, Detention_center)
Data Semantic: Overlap Generalization
Example of Overlap Generalization
A computer programmer and a system analyst can both be a computer professional, and a computer
programmer can also be a system analyst, and vice versa.
Relational Model:
Relation Computer_professional (Employee_id, Salary)
Relation Computer_programmer (*Employee_id, Language_skill)
Relation System_analyst (*Employee_id, Application_system)
Data Semantic: Categorization Relationship
In some cases the need arises to model a single superclass/subclass relationship with more
than one superclass, where the superclasses represent different entity types. In this case, we
call the subclass a category.
Relational Model:
Relation Department (Borrower_card, Department_id)
Relation Doctor (Borrower_card, Doctor_name)
Relation Hospital (Borrower_card, Hospital_name)
Relation Borrower (*Borrower_card, Return_date, File_id)
Data Semantic: Aggregation Relationship
Aggregation is a method to form a composite object from its components. It aggregates
attribute values of an entity to form a whole entity.
Example
The process of a student taking a course can form a composite entity (aggregation) that may be
graded by an instructor if the student completes the course.
Relational Model:
Relation Student (Student_no, Student_name)
Relation Course (Course_no, Course_name)
Relation Takes (*Student_no, *Course_no, *Instructor_name)
Relation Instructor (Instructor_name, Department)
Extended Entity Relationship Model
Data Semantic: Total Participation
An entity is in total participation with another entity provided that all data occurrences of the
entity must participate in a relationship with the other entity.
Example
An employee must be hired by a department.
Relational Model:
Relation Department (Department_id, Department_name)
Relation Employee (Employee_id, Employee_name, *Department_id)
Data Semantic: Partial Participation
An entity is in partial participation with another entity provided that the data occurrences of the
entity do not all participate in a relationship with the other entity.
Example
An employee may be hired by a department.
Relational Model:
Relation Department (Department_id, Department_name)
Relation Employee (Employee_no, Employee_name, &Department_id)
Where & means that null value is allowed
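Total vs. partial participation maps naturally onto column constraints: a NOT NULL foreign key enforces total participation, while a nullable foreign key allows partial participation. A sqlite3 sketch with invented sample rows:

```python
# Total participation: every employee must reference a department,
# so the foreign key column is declared NOT NULL.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Department (Department_id TEXT PRIMARY KEY, Department_name TEXT)")
con.execute("""CREATE TABLE Employee (
    Employee_id TEXT PRIMARY KEY, Employee_name TEXT,
    Department_id TEXT NOT NULL REFERENCES Department)""")

con.execute("INSERT INTO Department VALUES ('d1', 'Sales')")
con.execute("INSERT INTO Employee VALUES ('e1', 'Ann', 'd1')")  # accepted
try:
    con.execute("INSERT INTO Employee VALUES ('e2', 'Bob', NULL)")
except sqlite3.IntegrityError:
    print("rejected: employee without a department")
```

Dropping the NOT NULL from Department_id turns the same schema into the partial-participation case above, where the `&` notation marks the nullable foreign key.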
Data Semantic: Weak Entity
The existence of a weak entity depends on its strong entity.
Example
A hotel room must concatenate hotel name for identification.
Relational Model:
Relation Hotel (Hotel_name, Ranking)
Relation Room (*Hotel_name, Room_no, Room_size)
Cardinality: N-ary Relationship
Multiple entities relate to each other in an n-ary relationship.
Example
Employees use a wide range of different skills on each project they are associated with.
Relational Model:
Relation Engineer (Employee_id, Employee_name)
Relation Skill (Skill_name, Years_experience)
Relation Project (Project_id, Start_date, End_date)
Relation Skill_used (*Employee_id, *Skill_name, *Project_id)
Architecture of multiple databases integration
[Figure: relational databases 1..n are translated into local schemas and merged into an integrated schema serving a global database.]
• Secondary relation: a relation whose primary key is
fully or partially formed by concatenation of primary
keys of other relations.
– Secondary relation - Type 1 (SR1). If the key of the
secondary relation is formed by concatenation of primary
keys of primary relations, it is of Type 1 or SR1.
– Secondary relation - Type 2 (SR2). Secondary relations that
are not of Type 1.
• Key attribute - Primary (KAP). This is an attribute
in the primary key of a relation, and is also a
foreign key of another relation.
• Key attribute - General (KAG). These are all the
other primary key attributes in a secondary
relation that are not of the KAP type.
• Foreign key attribute (FKA). This is a non-primary-
key attribute of a primary relation that is a foreign
key.
• Nonkey attribute (NKA). The rest of the non-
primary-key attributes.
Step 3 Map each PR2 into a subclass entity or
weak entity
[Figure: Case 1 maps a PR2 to a weak entity B related 1:n to its strong entity; Case 2 maps a PR2 to a subclass entity B under its superclass via an ISA relationship.]
Step 4 Map SR1 into binary/n-ary relationship
[Figure: the concatenated key attributes of an SR1 become the participating entities of a binary/n-ary relationship R.]
Step 5 Map SR2 into binary/n-ary relationship
[Figure: an SR2 becomes an entity C related to the entities of its key attributes through relationships R1 and R2 with n:1 and m:1 cardinalities.]
Step 6 Map each FKA into relationship
Relations and attributes classification table
Relation | Rel  | Primary Key               | KAP                       | KAG       | FKA     | NKA
Name     | Type |                           |                           |           |         |
DEPT     | PR1  | Dept#                     |                           |           |         | Dept_name
INST     | PR2  | Dept#, Inst_name          | Dept#                     | Inst_name |         | Inst_addr
COUR     | PR1  | Course#                   |                           |           |         | Course_location
STUD     | PR1  | Student#                  |                           |           |         | Stud_name
PREP     | PR1  | Prer#                     |                           |           | Course# | Prer_title
SECT     | SR2  | Course#, Dept#, Section#, | Course#, Dept#,           | Section#  |         |
         |      | Inst_name                 | Inst_name                 |           |         |
GRADE    | SR1  | Inst_name, Course#,       | Inst_name, Course#,       |           |         | Grade
         |      | Student#, Dept#, Section# | Student#, Dept#, Section# |           |         |
Step 2. Map each PR1 into entity
Step 3. Map each PR2 into weak entity.
[Figure: Department (Dept#, Dept_name) 1—hire—n Instructor (Dept#, Inst_name, Inst_addr), with Instructor as a weak entity.]
Step 4. Map SR1 into binary/n-
ary relationship.
[Figure: Student m—grade—n Section.]
Step 5. Map SR2 into binary/n-ary
relationship
[Figure: Instructor (Dept#, Inst_name, Inst_addr) 1—teach—n Section and Course (Course#, Course_Location) 1—has—n Section, with Section identified by Dept#, Inst_name, Course#, and Section#.]
Step 6. Map each FKA into relationship
Step 7. Map each inclusion dependency
into semantics (binary/n-ary relationship)
Given derived inclusion dependency          | Derived semantics
Instructor.Dept# ⊆ Department.Dept#         | n:1 relationship between entities Instructor and Department
Section.Dept# ⊆ Department.Dept#            |
Section.Inst_name ⊆ Instructor.Inst_name    | 1:n relationships between entities Instructor and Section,
Section.Course# ⊆ Course.Course#            | and between Course and Section
Grade.Dept# ⊆ Section.Dept#                 |
Grade.Inst_name ⊆ Section.Inst_name         |
Grade.Course# ⊆ Section.Course#             |
Grade.Student# ⊆ Student.Student#           | m:n relationship between relationship Section and entity Student
Prerequisite.Course# ⊆ Course.Course#       |
Course.Prer# ⊆ Prerequisite.Prer#           | 1:1 relationship between Course and Prerequisite
Step 8. Draw EER model.
[Figure: resulting EER model fragment — Department (Dept#, Dept_name) 1—hire—n Instructor; Prerequisite (Prer#, Prer_title) 1—pre-course—1 Course.]
Review question 2
What are the major differences between
Generalization and Categorization in terms
of data volume (data occurrences) in their
related superclass entity/entities and
subclass entity/entities?
Tutorial Question 2
In a reverse engineering approach, translate the following relational
schema to an Entity-relationship model.
where underlined are primary keys and prefixed with ‘*’ are foreign keys.
Step 1 Resolve conflicts among EER model
[Figure: Customer (name, date) in one schema and Loan Borrower (name, date) in another; the conflicting attribute "date" is resolved by renaming it open-acct-date in one schema and loan-begin-date in the other.]
Resolve conflicts on data types
[Figure: an attribute in one schema that appears as an entity in another is resolved by converting the attribute into an entity related by a relationship R, with 1:n, 1:1, or m:n cardinality as appropriate.]
Example
[Figure: in schema A, "customer name" is an attribute of Loan Contract; in schema B, Customer is an entity. In the transformed schema A', Customer becomes an entity related to Loan Contract by a "book" relationship (shown for the n:1, 1:1, and n:m cases).]
Resolve conflicts on key
Example
[Figure: Customer is keyed by name in schema A and by customer# in schema B; the transformed schema keys Customer by customer#, keeping name as an attribute.]
Resolve conflicts on cardinality
[Figure: a relationship R between Entity x and Entity y that is 1:1 in one schema and 1:n in another is resolved to 1:n in the integrated schema.]
Resolve conflicts on weak entity
[Figure: an entity that is weak in one schema and regular in the other is merged into a single entity in the transformed schema.]
Resolve conflicts on subtype entity
[Figure: a subtype conflict between schemas A and B is resolved so that the 1:1 relationship R between Entity x and Entity y is preserved in the integrated schema X.]
Example
2023/2/5 15
Step 2 Merge entities
Example
[Figure: Customer (customer#, address, phone) from schema A and Customer (customer#, age) from schema B merge into Customer (customer#, address, phone, age).]
Merge EER models by Generalization
[Figure: entities sharing a key merge under a generalized superclass. Example 1: Local Customer (name, local_addr) and Overseas Customer (name, overseas_addr) become subclasses of Customer under a disjoint generalization. Example 2: Commercial Loan Borrower (name, loan_amt, prime_rate) and Mortgage Loan Borrower (name, loan_amt, mortgage_rate) become subclasses of Loan Borrower (loan_amt) under an overlap generalization.]
Merge EER models by Subtype
Relationship
Example
[Figure: Interest Rate (key K) from schema A and Fixed Interest Rate (key K) from schema B merge via a subtype relationship: Fixed Rate isa Rate.]
Merge entities by Aggregation
[Figure: Entity x from schema A and the relationship Entity y 1—R—n Entity z from schema B merge by aggregation: the aggregate of Entity y—R1—Entity z is related to Entity x through a relationship R2.]
Example
[Figure: Customer 1—book—n Loan Contract (schema B) is aggregated and related to Loan Security (schema A) through a 1—secured by—n relationship.]
Merge entities by categorization
[Figure: superclasses from schema A and an entity from schema B merge so that Entity Z becomes a category under multiple superclasses.]
Example
[Figure: Mortgage loan and Commercial loan (schema A) and Loan Contract (schema B) merge by categorisation: Loan Contract becomes a category of superclasses Mortgage loan and Commercial loan.]
Merge entities by Implied Binary
Relationship
[Figure: an attribute AttrX shared by Entity a (schema A) and Entity b (schema B) implies a binary relationship; AttrX is replaced by a relationship R between the entities (n:1 when AttrX is a key of Entity b only, 1:1 when it is a key of both).]
Example
(Diagram) Loan Contract (loan#, customer#) in Schema A and Customer (customer#) in Schema B share customer#, implying an n:1 "book" relationship between Loan Contract (X1) and Customer (X2). When customer# is unique within Loan Contract, the implied "book" relationship is 1:1.
Step 3 Merge relationships
Merge relationships by subtype relationship
(Diagram) Schema A relates EntityX 1:1 (R) to EntityY; Schema B relates EntityX 1:1 (R) to EntityY. In the merged Schema X, EntityX (X1) is related 1:1 by R to EntityZ (X3), and EntityY (X2) isa EntityZ.
Example
Merge relationships by overlap generalization
(Diagram) Schema A has EntityX 1:n R1 EntityY; Schema B has EntityX 1:n R2 EntityY. In the merged Schema X, EntityX (X1) keeps R1 to EntityZ1 (X3) and R2 to EntityZ2 (X4), and EntityZ1 and EntityZ2 are generalised with overlap (O) into EntityY (X2).
Example
(Diagram) Schema A has Bank 1:n home loan Customer; Schema B has Bank 1:n auto loan Customer. In the merged Schema X, Bank (X1) keeps home loan to Home loan borrower (X3) and auto loan to Auto loan borrower (X4), which are generalised with overlap (o = overlap generalisation) into Customer (X2).
Absorbing a Lower-Degree Relationship into a Higher-Degree Relationship
(Diagram) Schema A has an m:n:n ternary relationship R1 among EntityX, EntityY and EntityZ; Schema B has EntityX 1:n R2 EntityY. In the merged Schema X, the binary R2 is absorbed into the ternary R1 among EntityX (X1), EntityY (X2) and EntityZ (X3).
Example
Case Study: In a bank, there are existing databases with different
schemas: one for the customers, another for the mortgage loan contracts
and a third for the index interest rate. However, there is a need to
integrate them for an international banking loan system.
(Diagram: the three existing schemas)
- Mortgage loan contract (loan contract#, begin_date, mature_date, loan-status), related 1:n to Customer (ID#, customer_name, date) and 1:1 or 1:n to the entities below
- Mortgage loan drawdown (loan contract#, drawdown_date, drawdown_amt)
- Mortgage loan type (loan contract#)
- Mortgage loan interest (loan contract#, Interest_effective_date, fixed_rate, accrued_interest)
- Loan account balance (account#, loan_bal_date, balance_amt)
- Mortgage loan history (past loan#, past_loan_status)
- Mortgage loan repayment (loan contract#, repayment_date, repayment)
- Index interest type (Interest_effective_date, index_rate, accrued interest type)
Step 2 Merge entities by Implied binary relationship and Generalization
Since the data item ID# appears in both entity Mortgage Loan Contract and entity Customer, they are in one-to-many cardinality with the foreign key on the "many" side.
(Diagram) Mortgage loan contract (loan contract#, begin_date, mature_date) m:1 book Customer (ID#, customer_name, date)
Since Fixed and Index Interest Rate both have the same key, Interest_effective_date, they can be generalized into disjoint subtypes: either fixed or index rate.
(Diagram) Loan interest type (Interest_effective_date, Interest_type) with disjoint (d) subtypes: fixed interest rate (Interest_effective_date, fixed_rate, accrued_interest) and index interest type (Interest_effective_date, Index_rate, Index_name, accrued_interest).
Thus, we can derive cardinality from the implied relationship between
these entities, and integrate the two schemas into one EER model.
(Diagram: integrated EER model) Mortgage loan contract (loan contract#, begin_date, mature_date) m:1 book Customer (ID#, customer_name, date); the loan contract is related 1:1 to fixed interest rate (Loan_contract#, Interest_effective_date, fixed_rate, accrued_interest) and to index interest type (Loan_contract#, Interest_effective_date, Index_rate, accrued_interest).
Reading assignment
“Information Systems Reengineering and
Integration” by Joseph Fong, published by
Springer Verlag, 2006, pp. 282-310.
Review question 3
Can two Relational Schemas be integrated into one
Relational Schema? Explain your answer.
Tutorial question 3
Provide an integrated schema for the following two views, which are merged to create a bibliographic database. During
identification of correspondences between the two views, the users discover the following:
RESEARCHER and AUTHOR are synonyms,
CONTRIBUTED_BY and WRITTEN_IN are synonyms,
ARTICLE belongs to a SUBJECT,
ARTICLE and BOOK can be generalized as PUBLICATION.
Hint: given that two subclass entities have the same relationship(s), the two subclass entities can be generalized into a
superclass entity, and the subclass relationship(s) can also be generalized into a superclass relationship.
View 1
(Diagram) ARTICLE (Title, Size) n:1 Published_in JOURNAL (Volume, Number); ARTICLE Contributed_by RESEARCHER (Full_Name).
View 2
(Diagram) BOOK (Title, Publisher) n:1 Belongs_to SUBJECT (Classification_id, Name); BOOK n:m Written_in AUTHOR (Full_Name).
Methodology for Data warehousing with OLAP
2008/2/19 1
Data conversion: Customized Program Approach
(Diagram) Definitions of source, target, and mapping drive an interpretive transformer that converts the source database directly into the target database.
Interpretive transformer
For instance, the following Cobol structure
(Figure: a three-level Cobol record structure rooted at Level 0 PERSON, with fields such as NAME and LIC#.)
To translate the above three-level structure into the following two-level data structure:
(Figure: the two-level structure rooted at Level 0 PERSON.)
Translator generator
(Diagram) Definitions of source, target, and mapping are fed to a translator generator, which produces a specialized program that converts the source database into the target database, using a convert catalog.
(Diagram) A CONVERT program (Statement 1, Statement 2, ...) is processed in a compiler phase: the DEFINE compiler and the Restructurer produce PL/1 procedures (COP 1, COP 2, ...), driven by the CONVERT catalogue and an execution schedule.
As an example, consider the following hierarchical database:
(Figure: a hierarchical database with segments such as DEPT, EMP and PROJ, and fields such as ITEMNO and DESC.)
Its DEFINE statements can be described in the following where
for each DEFINE statement, a code is generated to allocate a
new subtree in the internal buffer.
GROUP DEPT:
OCCURS FROM 1 TIMES;
FOLLOWED BY EOF;
PRECEDED BY HEX '01';
:
END EMP;
GROUP PROJ:
OCCURS FROM 0 TIMES;
PRECEDED BY HEX '03';
:
END PROJ;
END DEPT;
For each user-written CONVERT statement, we can produce a customized
program. Take the DEPT group from the above DEFINE statements:
(Diagram) source relational schema → target relational schema; selection?
Business requirements
A company has two regions A and B.
•Each region forms its own departments.
•Each department approves many trips in a year.
•Each staff makes many trips in a year.
•In each trip, a staff needs to hire cars for
transportation.
•Each hired car can carry many staff for each trip.
•A staff can be either a manager or an engineer.
Data Requirements
Where ID = Inclusion Dependence, Underlined are primary keys and “*” prefixed are foreign keys.
Extended Entity Relationship Model for database Trip
(Diagram: EER model) Region_A and Region_B (each with Department_id, Classification) are generalized into Department (Department_id, Salary). Department 1:m R1 Trip (Trip_id); Trip m:1 R2 People (Staff_id, Name, DOB); People m:n Assignment Car (Car_model, Size, Description); Manager (Staff_id, Title) and Engineer (Staff_id, Title) are subclasses of People.
Schema for target relational database Trip
Create Table Car
(CAR_MODEL character (10),
SIZE character (10),
DESCRIPT character (20),
STAFF_ID character (5),
primary key (CAR_MODEL))
UPDATE trip_ora52.TRIP
SET DEPT_ID= 'AA001'
, CAR_MODEL= 'MZ-18'
, STAFF_ID= 'B001'
WHERE TRIP_ID='T0002'

UPDATE trip_ora52.TRIP
SET DEPT_ID= 'AB001'
, CAR_MODEL= 'SA-38'
, STAFF_ID= 'A002'
WHERE TRIP_ID='T0003'

UPDATE trip_ora52.TRIP
SET DEPT_ID= 'BB001'
, CAR_MODEL= 'SA-38'
, STAFF_ID= 'D002'
WHERE TRIP_ID='T0004'
Data Integration
Step 1: Merge by Union
(Tables) Relation Ra (A1, A2) with rows (a11, a21) and (a12, a22) is merged by union with a second relation into Relation Rx (A1, A2, A3); attribute values missing in a source relation are padded with null, e.g. row (a11, a21, null).
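The union step can be sketched in code. A hypothetical Python sketch (not from the slides), assuming each relation is a list of dicts keyed by attribute name:

```python
# Sketch: "merge by union" over relations with different attribute sets --
# take the union of the columns and pad missing values with None (null).
def merge_by_union(ra, rb):
    columns = sorted({k for row in ra + rb for k in row})
    return [{c: row.get(c) for c in columns} for row in ra + rb]

ra = [{"A1": "a11", "A2": "a21"}, {"A1": "a12", "A2": "a22"}]
rb = [{"A1": "a14", "A3": "a32"}]   # made-up second relation
rx = merge_by_union(ra, rb)
# every Rx row now has columns A1, A2, A3, with None where no value exists
```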
Step 2: Merge classes by
generalization
(Tables) Relations Ra (A1, A2) and Rb are merged by generalization into a superclass relation Rx (A1, A2, A3) with subclass relations Rx1 and Rx2; values not present in a source relation become null.
Step 3: Merge classes by
inheritance
(Tables) Relation Rb (A1, A2) and Relation Ra (A1, A3) are merged by inheritance: subclass relation Rxb (A1, A2, A3) inherits the attributes of Rxa (A1, A3).
Step 4 Merge classes by
aggregation
(Tables) Relation Ra (A1, A2) and Relation Rx1 (A1, A3) are merged by aggregation: Relation Rx' (A1, A3, *A5) carries the foreign key of the aggregate, e.g. rows (a11, a31, a51) and (a12, a32, a51).
Step 5 Merge classes by
categorization
(Tables) Relation Ra (A1, A2) with rows (a11, a21) and (a12, a22) is merged by categorization with Relation Rx (A1, A4), which holds the category row (a11, a41).
Step 6 Merge classes by implied
relationship
Step 7 Merge relationship by
subtype
Relation Ra (A1, A2):   a11 a21 / a12 a22
Relation Rb (A3, *A1):  a31 a11 / a32 a12 / a33 null
Relation Xa (A1, A2):   a11 a21 / a12 a22
Relation Xc (*A3, *A1): a31 a11 / a32 a12
Data conversion from relational into XML
Step 3:
(Diagram) Relational Database → Data Conversion → XML Document
Step 1: Reverse Engineering Relational
Schema into an EER Model
Step 2: Data Conversion from Relational into
XML document
Step 2.1 Defining a Root Element
To select a root element, we must put its relevant information into an XML schema. Relevance concerns the entities that are related to the entity selected by the user. The relevant classes include the selected entity and all its relevant entities that are navigable from it.
(Diagram) In the EER model, the selected entity EntityA is related through relationships R1-R7 to EntityB ... EntityH. In the mapped XML view, the selected entity becomes the root Element E, and the relevant, navigable entities become nested elements (Element A, B, C, D, F, G, H), with "*" marking repeating elements.
Step 2.1 Mapping Cardinality
from RDB to XML
One-to-one cardinality
(Diagram) EER model: Entity A (A1, A2) 1:1 Entity B (B1, B2). Schema translation yields a DTD graph in which Element A (A1, A2) has a single child Element B (B1, B2).
One-to-many cardinality
(Diagram) EER model: Entity A (A1, A2) 1:n R Entity B (B1, B2). Schema translation yields a DTD graph in which Element A (A1, A2) has a repeating ("*") child Element B (B1, B2).
Many-to-many cardinality
(Diagram) EER model: Entity A (A1, A2) m:n R Entity B (B1, B2). Schema translation yields elements A, R and B linked by ID/IDREF attributes (A_id and B_id referenced by A_idref and B_idref).

Relational Schema
Relation A(A1, A2)
Relation B(B1, B2)
Relation R(*A1, *B1)

DTD
<!ELEMENT A EMPTY>
<!ATTLIST A A1 CDATA #REQUIRED>
<!ATTLIST A A2 CDATA #REQUIRED>
<!ATTLIST A A_id ID #REQUIRED>
<!ELEMENT R EMPTY>
<!ATTLIST R A_idref IDREF #REQUIRED>
<!ATTLIST R B_idref IDREF #REQUIRED>
<!ELEMENT B EMPTY>
<!ATTLIST B B1 CDATA #REQUIRED>
<!ATTLIST B B2 CDATA #REQUIRED>
<!ATTLIST B B_id ID #REQUIRED>
Many-to-many cardinality
Relation A (A1, A2): a11 a21 / a12 a22
Relation B (B1, B2): b11 b21 / b12 b22

Data conversion produces the XML document:
<A A1="a11" A2="a21" A_id="1"></A>
<B B1="b11" B2="b21" B_id="2"></B>
<R A_idref="1" B_idref="2"></R>
<A A1="a12" A2="a22" A_id="3"></A>
<B B1="b12" B2="b22" B_id="4"></B>
<R A_idref="3" B_idref="4"></R>
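The many-to-many conversion can be sketched with the standard library. A hypothetical Python sketch (not from the slides): rows of A, B and R become empty elements with ID/IDREF attributes following the DTD above; note the ID numbering here is sequential per table, which differs slightly from the slide's interleaved numbering.

```python
# Sketch: relational rows -> empty XML elements with ID/IDREF attributes.
import xml.etree.ElementTree as ET

A = [("a11", "a21"), ("a12", "a22")]
B = [("b11", "b21"), ("b12", "b22")]
R = [("a11", "b11"), ("a12", "b12")]   # (*A1, *B1) pairs

root = ET.Element("doc")
ids = {}                               # (table, key) -> generated ID value
for n, (a1, a2) in enumerate(A, 1):
    ids["A", a1] = str(n)
    ET.SubElement(root, "A", A1=a1, A2=a2, A_id=str(n))
for n, (b1, b2) in enumerate(B, len(A) + 1):
    ids["B", b1] = str(n)
    ET.SubElement(root, "B", B1=b1, B2=b2, B_id=str(n))
for a1, b1 in R:
    ET.SubElement(root, "R", A_idref=ids["A", a1], B_idref=ids["B", b1])

xml_text = ET.tostring(root, encoding="unicode")
```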
Case Study
Consider a case study of a Hospital Database
System. In this system, a patient can have
many record folders. Each record folder can
contain many different medical records of the
patient. A country has many patients. Once a
record folder is borrowed, a loan history is
created to record the details about it.
Hospital Relational Schema
Relation Patient (HK_ID, Patient_Name)
Relation Record_Folder (Folder_No, Location, *HK_ID)
Relation Medical_Record (Medical_Rec_No,
Create_Date, Sub_Type, *Folder_No)
Relation Borrower (Borrower_No, Borrower_Name)
Relation Borrow (*Borrower_No, *Folder_No)
2008/3/1 43
Table Borrower:
Borrower_no  Borrower_name
B1           Johnson
B11          Choy
B21          Fung
B22          Lok
Step 1 Reverse engineer relational database into an EER model
(Diagram) Medical Record n:1 contain Record Folder; Record Folder n:1 belong Patient; Record Folder m:n Borrow (has/by) Borrower.
Step 2 Translate EER model into
DTD Graph and DTD
In this case study, suppose we are concerned with the patient medical records, so the entity Patient is selected. Then we define a meaningful name for the root element, called Patient_Records. We start from the entity Patient in the EER model and then find the relevant entities for it. The relevant entities include the related entities that are navigable from the parent entity.
Translated Document Type Definition Graph
(Diagram) Patient_Records → Patient → Record_Folder ("*"), which contains Medical_Record ("*") and Borrow ("*") → Borrower.
Translated Document Type Definition
Transformed XML document
<Patient_Records>
<Patient Country_No="C0001" HKID="E3766849" Patient_Name="Smith">
<Record_Folder Folder_No="F_21" Location="Hong Kong">
<Borrow Borrower_No="B1">
<Borrower Borrower_name="Johnson" />
</Borrow>
<Borrow Borrower_No="B11">
<Borrower Borrower_name="Choy" />
</Borrow>
<Borrow Borrower_No="B21">
<Borrower Borrower_name="Fung" />
</Borrow>
<Borrow Borrower_No="B22">
<Borrower Borrower_name="Lok" />
</Borrow>
<Medical_Record Medical_Rec_No="M_311999" Create_Date="Jan-1-1999" Sub_Type="W"></Medical_Record>
<Medical_Record Medical_Rec_No="M_322000" Create_Date="Nov-12-1998" Sub_Type="W"></Medical_Record>
<Medical_Record Medical_Rec_No="M_352001" Create_Date="Jan-15-2001" Sub_Type="A"></Medical_Record>
<Medical_Record Medical_Rec_No="M_362001" Create_Date="Feb-01-2001" Sub_Type="A"></Medical_Record>
</Record_Folder>
<Record_Folder Folder_No="F_24" Location="New Territories">
<Borrow Borrower_No="B22">
<Borrower Borrower_name="Lok" />
</Borrow>
<Medical_Record Medical_Rec_No="M_333333" Create_Date="Mar-03-01" Sub_Type="A"></Medical_Record>
</Record_Folder>
</Patient>
</Patient_Records>
Reading Assignment
Chapter 4 Data Conversion of “Information
Systems Reengineering and Integration”
by Joseph Fong, Springer Verlag, pp.160-
198.
Lecture review question 6
How do you compare the pros and cons of
using “Logical Level Translation Approach”
with “Customized Program Approach” in
data conversion?
CS5483 Tutorial Question 6
Convert the following relational database into an XML document:
Relation Car_rental
Car_model Staff_ID *Trip_ID
MZ-18 A002 T0001
MZ-18 B001 T0002
R-023 B004 T0001
R-023 C001 T0004
SA-38 A001 T0003
SA-38 A002 T0001
Relation Trip
Trip_ID *Department_ID
T0001 AA001
T0002 AA001
T0003 AB001
T0004 BA001
Relation Department
Department_ID Salary
AA001 35670
AB001 30010
BA001 22500
Fall 2004, CIS, Temple University
Lecture 2
Chapter 2: Data Warehousing and
OLAP Technology for Data Mining
What is Data Warehouse?
Data Warehouse—Integrated
◼ Constructed by integrating multiple, heterogeneous
data sources
◼ relational databases, flat files, on-line transaction
records
◼ Data cleaning and data integration techniques are
applied.
◼ Ensure consistency in naming conventions, encoding
structures, attribute measures, etc. among different
data sources
◼ E.g., Hotel price: currency, tax, breakfast covered, etc.
◼ When data is moved to the warehouse, it is
converted.
Data Warehouse—Time Variant
Data Warehouse—Non-Volatile
Data Warehouse vs. Heterogeneous DBMS
Data Warehouse vs. Operational DBMS
◼ OLTP (on-line transaction processing)
◼ Major task of traditional relational DBMS
◼ Day-to-day operations: purchasing, inventory, banking,
manufacturing, payroll, registration, accounting, etc.
◼ OLAP (on-line analytical processing)
◼ Major task of data warehouse system
◼ Data analysis and decision making
◼ Distinct features (OLTP vs. OLAP):
◼ User and system orientation: customer vs. market
◼ Data contents: current, detailed vs. historical, consolidated
◼ Database design: ER + application vs. star + subject
◼ View: current, local vs. evolutionary, integrated
◼ Access patterns: update vs. read-only but complex queries
Why Separate Data Warehouse?
◼ High performance for both systems
◼ DBMS— tuned for OLTP: access methods, indexing,
concurrency control, recovery
◼ Warehouse—tuned for OLAP: complex OLAP queries,
multidimensional view, consolidation.
◼ Different functions and different data:
◼ Decision support requires historical data which
operational DBs do not typically maintain
◼ Decision Support requires consolidation (aggregation,
summarization) of data from heterogeneous sources
◼ Different sources typically use inconsistent data
representations, codes and formats which have to be
reconciled
Chapter 2: Data Warehousing and
OLAP Technology for Data Mining
A Multi-Dimensional Data Model
(Figure: a 3-D data cube with a Country dimension (Canada, Mexico, ...) and "sum" cells along each dimension.)
4-D Data Cube
(Figure: a 4-D data cube shown as a series of 3-D cubes, one per supplier: Supplier 1, Supplier 2, Supplier 3.)
Cube: A Lattice of Cuboids
(Figure: lattice of cuboids for dimensions time, item, location, supplier)
◼ 0-D (apex) cuboid: all
◼ 3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
◼ 4-D (base) cuboid: time, item, location, supplier
Conceptual Modeling of Data Warehouses
(Figure: schema for a sales data warehouse)
Sales Fact Table: time_key, item_key, branch_key, location_key, units_sold, dollars_sold, avg_sales (Measures)
Dimension tables:
◼ time (time_key, day, day_of_the_week, month, quarter, year)
◼ item (item_key, item_name, brand, type, supplier_key)
◼ supplier (supplier_key, supplier_type)
◼ branch (branch_key, branch_name, branch_type)
◼ location (location_key, street, city_key)
◼ city (city_key, city, province, country)
Example of Fact Constellation
(Figure: fact constellation) A Shipping Fact Table (time_key, item_key, shipper_key, from_location, ...) shares the time and item dimension tables with the Sales Fact Table.
(DMQL fragment) ... <dimension_name_first_time> in cube <cube_name_first_time>
Defining a Star Schema in DMQL
Defining a Snowflake Schema in DMQL
Measures: Three Categories
Measure: a function evaluated on aggregated data
corresponding to given dimension-value pairs.
Measures can be:
◼ distributive: if the measure can be calculated in a
distributive manner.
◼ E.g., count(), sum(), min(), max().
◼ algebraic: if it can be computed from arguments obtained
by applying distributive aggregate functions.
◼ E.g., avg()=sum()/count(), min_N(), standard_deviation().
◼ holistic: if it is not algebraic.
◼ E.g., median(), mode(), rank().
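The distinction between distributive and algebraic measures can be shown numerically. A small Python sketch (not from the slides; the partition data is made up): avg() cannot be combined from per-partition averages, but it can be computed from the distributive measures sum() and count() kept per partition.

```python
# Sketch: distributive vs. algebraic measures over partitioned data.
partitions = [[3, 5], [10, 20, 30]]            # made-up partitions

# distributive: combine partial results with the same function
total = sum(sum(p) for p in partitions)        # 8 + 60 = 68

# algebraic: avg from per-partition (sum, count) pairs
pairs = [(sum(p), len(p)) for p in partitions]
avg = sum(s for s, _ in pairs) / sum(c for _, c in pairs)   # 68 / 5 = 13.6

# averaging the per-partition averages gives the WRONG answer
wrong = sum(sum(p) / len(p) for p in partitions) / len(partitions)  # (4+20)/2 = 12.0
```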
Measures: Three Categories
Browsing a Data Cube
◼ Visualization
◼ OLAP capabilities
◼ Interactive manipulation
A Concept Hierarchy
• Concept hierarchies allow data to be handled
at varying levels of abstraction
(Figure: example concept hierarchies, e.g. Office for the location dimension and Day → Month for the time dimension.)
Typical OLAP Operations (Fig 2.10)
◼ Roll up (drill-up): summarize data
◼ by climbing up concept hierarchy or by dimension reduction
◼ Drill down (roll down): reverse of roll-up
◼ from higher level summary to lower level summary or detailed
data, or introducing new dimensions
◼ Slice and dice:
◼ project and select
◼ Pivot (rotate):
◼ reorient the cube, visualization, 3D to series of 2D planes.
◼ Other operations
◼ drill across: involving (across) more than one fact table
◼ drill through: through the bottom level of the cube to its back-
end relational tables (using SQL)
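Roll-up can be sketched as re-aggregation along a concept hierarchy. A hypothetical Python sketch (not from the slides; the city-to-country mapping and sales figures are made up):

```python
# Sketch of roll-up: climb the hierarchy city -> country by re-aggregating.
from collections import defaultdict

hierarchy = {"Vancouver": "Canada", "Toronto": "Canada", "Chicago": "USA"}
sales_by_city = {"Vancouver": 100, "Toronto": 150, "Chicago": 200}

sales_by_country = defaultdict(int)
for city, amount in sales_by_city.items():
    sales_by_country[hierarchy[city]] += amount   # summarize one level up
```

Drill-down is the reverse: it needs the finer-grained data (or a new dimension) and cannot be derived from the rolled-up totals alone.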
Querying Using a Star-Net Model
(Figure: star-net query model) Radial lines, one per dimension, meet at a center; each circle on a line is called a footprint (an abstraction level). The footprints shown include: Customer Orders (CONTRACTS, ORDER), Shipping Method (AIR-EXPRESS, TRUCK), Time (ANNUALLY, QTRLY, DAILY), Product (PRODUCT LINE, PRODUCT ITEM, PRODUCT GROUP), Location (CITY, COUNTRY, REGION), Organization (SALES PERSON, DISTRICT, DIVISION), and Promotion.
Chapter 2: Data Warehousing and
OLAP Technology for Data Mining
Data Warehouse Design Process
◼ Choose the dimensions that will apply to each fact table record
◼ Choose the measure that will populate each fact table record
Multi-Tiered Architecture
(Figure: multi-tiered data warehouse architecture) Operational DBs and other sources are Extracted, Transformed, Loaded and Refreshed, via a Monitor & Integrator and Metadata, into the Data Warehouse and Data Marts (with materialized views); an OLAP Server then serves the front-end tools for Analysis, Query, Reports and Data mining.
OLAP Server Architectures
◼ Relational OLAP (ROLAP)
◼ Use relational or extended-relational DBMS to store and
manage warehouse data, with OLAP middleware, tools
and services
◼ greater scalability
◼ Multidimensional OLAP (MOLAP)
◼ Array-based multidimensional storage engine (sparse
matrix techniques)
◼ fast indexing to pre-computed summarized data
◼ Specialized SQL servers: specialized support for SQL queries
over star/snowflake schemas
Chapter 2: Data Warehousing and
OLAP Technology for Data Mining
Efficient Data Cube Computation
Multiway Array Aggregation for MOLAP
◼ Partition arrays into chunks (a small subcube which fits in memory).
◼ Compressed sparse array addressing: (chunk_id, offset)
◼ Compute aggregates in “multiway” by visiting cube cells in the order
which minimizes the # of times to visit each cell, and reduces
memory access and storage cost.
(Figure: a 3-D array over dimensions A (a0-a3), B (b0-b3) and C (c0-c3), partitioned into 64 chunks numbered 1-64.) What is the best traversing order to do multi-way aggregation?
Multiway Array Aggregation for MOLAP
(Figure: the same chunked array.) After scanning chunks {1,2,3,4}:
◼ the b0c0 chunk is computed
◼ a0c0 and a0b0 are not computed
Multiway Array Aggregation for MOLAP
(Figure: the same chunked array.) With this traversal we need to keep a single b-c chunk, 4 a-c chunks, and 16 a-b chunks in memory. After scanning chunks 1-13:
◼ the a0c0 and b0c0 chunks are computed
◼ a0b0 is not computed (we will need to scan chunks 1-49)
Multiway Array Aggregation for MOLAP
Indexing OLAP Data: Bitmap Index
◼ Suitable for low cardinality domains
◼ Index on a particular column
◼ Each value in the column has a bit vector: bit-op is fast
◼ The length of the bit vector: # of records in the base table
◼ The i-th bit is set if the i-th row of the base table has the value
for the indexed column
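The bitmap index idea can be sketched with Python integers as bit vectors (a hypothetical sketch, not from the slides; the column data is made up):

```python
# Sketch: a bitmap index over a low-cardinality column.
# Bit i of a value's vector is set iff row i of the base table holds it.
def bitmap_index(column):
    index = {}
    for i, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << i)
    return index

region = ["Asia", "Europe", "Asia", "America", "Asia"]
type_  = ["Retail", "Retail", "Dealer", "Dealer", "Retail"]

by_region, by_type = bitmap_index(region), bitmap_index(type_)
hits = by_region["Asia"] & by_type["Retail"]   # AND is a single fast bit-op
rows = [i for i in range(len(region)) if hits >> i & 1]
```

Combining predicates on several indexed columns is just a bitwise AND/OR, which is why bitmap indexes suit OLAP-style selections.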
Online Aggregation
Efficient Processing of OLAP Queries
Discovery-Driven Exploration of Data Cubes
◼ Hypothesis-driven: exploration by user, huge search space
◼ Discovery-driven (Sarawagi et al.’98)
◼ pre-compute measures indicating exceptions, guide user in the
data analysis, at all levels of aggregation
◼ Exception: significantly different from the value anticipated,
based on a statistical model
◼ Visual cues such as background color are used to reflect the
degree of exception of each cell
◼ Computation of exception indicator can be overlapped with cube
construction
Examples: Discovery-Driven Data Cubes
Chapter 2: Data Warehousing and
OLAP Technology for Data Mining
Data Warehouse Usage
◼ Three kinds of data warehouse applications
◼ Information processing
◼ supports querying, basic statistical analysis, and reporting
using crosstabs, tables, charts and graphs
◼ Analytical processing
◼ multidimensional analysis of data warehouse data
◼ supports basic OLAP operations, slice-dice, drilling, pivoting
◼ Data mining
◼ knowledge discovery from hidden patterns
◼ supports associations, constructing analytical models,
performing classification and prediction, and presenting the
mining results using visualization tools.
◼ Differences among the three tasks
From On-Line Analytical Processing
to On Line Analytical Mining (OLAM)
◼ Why online analytical mining?
◼ High quality of data in data warehouses
◼ DW contains integrated, consistent, cleaned data
Fall 2004, CIS, Temple University
Lecture 3
• An attribute is a property or characteristic of an object
– Examples: eye color of a person, temperature, etc.
– Attribute is also known as variable, field, characteristic, or feature
• A collection of attributes describe an object
– Object is also known as record, point, case, sample, entity, or instance

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Attribute Values
• Attribute values are numbers or symbols assigned
to an attribute
Properties of Attribute Values
• The type of an attribute depends on which of the following properties it possesses:
– Distinctness: = and ≠
– Order: < and >
– Addition: + and -
– Multiplication: * and /

(Table fragment: attribute type, description, examples, statistics)
Nominal: the values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another (=, ≠). Examples: zip codes, employee ID numbers, eye color, sex: {male, female}. Statistics: mode, entropy, contingency correlation, χ² test.
Ratio: for ratio variables, both differences and ratios are meaningful (*, /). Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current. Statistics: geometric mean, harmonic mean, percent variation.
(Table fragment: attribute level, allowed transformation, comments)
Interval: new_value = a * old_value + b, where a and b are constants. Thus, the Fahrenheit and Celsius temperature scales differ in terms of where their zero value is and the size of a unit (degree).
Discrete and Continuous Attributes
• Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a collection
of documents
– Often represented as integer variables.
– Note: binary attributes are a special case of discrete attributes
• Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight.
– Practically, real values can only be measured and represented
using a finite number of digits.
– Continuous attributes are typically represented as floating-point
variables.
Types of Data Sets
• Record
– Data Matrix
– Document Data
– Transaction Data
• Multi-Relational
– Star or snowflake schema
• Graph
– World Wide Web
– Molecular Structures
• Ordered
– Spatial Data
– Temporal Data
– Sequential Data
Important Characteristics of Structured Data
– Dimensionality
• Number of attributes each object is described with
• Challenge: high dimensionality (curse of dimensionality)
– Sparsity
• Sparse data: values of most attributes are zero
• Challenge: sparse data call for special handling
– Resolution
• Data properties often could be measured with different
resolutions
• Challenge: decide on the most appropriate resolution (e.g.
“Can’t See the Forest for the Trees”)
Record Data
• Data that consists of a collection of records, each of which consists of a fixed set of attributes
(Table: the Tid / Refund / Marital Status / Taxable Income / Cheat table shown earlier.)
Document Data
• Each document becomes a 'term' vector,
– each term is a component (attribute) of the vector,
– the value of each component is the number of times the corresponding term occurs in the document.

             team  coach  play  ball  score  game  win  lost  timeout  season
Document 1    3     0      5     0     2      6     0    2     0        2
Document 2    0     7      0     2     1      0     0    3     0        0
Document 3    0     1      0     0     1      2     2    0     3        0
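Building such term vectors takes only a few lines. A minimal Python sketch (not from the slides; the documents and vocabulary are made up):

```python
# Sketch: documents -> term-count vectors over a shared vocabulary.
docs = ["team play team win", "coach lost game"]
vocab = sorted({w for d in docs for w in d.split()})   # vector components

def term_vector(doc):
    words = doc.split()
    return [words.count(term) for term in vocab]       # occurrences per term

vectors = [term_vector(d) for d in docs]
```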
Transaction Data
• A special type of record data, where
– each record (transaction) involves a set of items.
– E.g., consider a grocery store. The set of products
purchased by a customer during one shopping trip
constitute a transaction, while the individual products
that were purchased are the items.
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
Multi-Relational Data
• Attributes are objects themselves
Graph Data
• Examples: Generic graph and HTML Links
(Figure: a generic graph with weighted edges (2, 5, 1, ...) and a hyperlinked HTML fragment:)
<a href="papers/papers.html#bbbb">
Data Mining </a>
<li>
<a href="papers/papers.html#aaaa">
Graph Partitioning </a>
<li>
<a href="papers/papers.html#aaaa">
Parallel Solution of Sparse Linear System of Equations </a>
<li>
<a href="papers/papers.html#ffff">
N-Body Computation and Dense Linear System Solvers
Chemical Data
• Benzene Molecule: C6H6
Ordered Data
• Sequences of transactions
(Figure: a sequence of transactions; each element of the sequence is a set of items/events.)
Ordered Data
• Genomic sequence data
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Ordered Data
• Spatial-Temporal Data
(Figure: average monthly temperature of land and ocean.)
Data Quality
• What kinds of data quality problems?
• How can we detect problems with the data?
• What can we do about these problems?
Noise
• Noise refers to modification of original values
– Examples: distortion of a person’s voice when talking
on a poor phone and “snow” on television screen
Missing Values
• Reasons for missing values
– Information is not collected
(e.g., people decline to give their age and weight)
– Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)
Duplicate Data
• Examples:
– Same person with multiple email addresses
• Data cleaning
– Process of dealing with duplicate data issues
Data Preprocessing
• Aggregation
• Sampling
• Dimensionality Reduction
• Feature subset selection
• Feature creation
• Discretization and Binarization
• Attribute Transformation
Aggregation
• Combining two or more attributes (or objects)
into a single attribute (or object)
• Purpose
– Data reduction
• Reduce the number of attributes or objects
– Change of scale
• Cities aggregated into regions, states, countries, etc
– More “stable” data
• Aggregated data tends to have less variability
Sampling
• Sampling is the main technique employed for data
selection.
– It is often used for both the preliminary investigation
of the data and the final data analysis.
Sampling …
• The key principle for effective sampling is the following:
– using a sample will work almost as well as using the
entire data set, if the sample is representative
Types of Sampling
• Simple Random Sampling
– There is an equal probability of selecting any particular item
• Stratified sampling
– Split the data into several partitions; then draw random samples
from each partition
Sample Size
Sample Size
• What sample size is necessary to get at least one
object from each of 10 groups?
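This question can be explored by simulation (a hypothetical Python sketch, not from the slides; it draws uniformly from 10 equal-sized groups, which is the coupon-collector setting):

```python
# Sketch: how many uniform draws until every one of 10 groups is seen?
import random

def draws_to_cover(groups=10, rng=random):
    seen, n = set(), 0
    while len(seen) < groups:
        seen.add(rng.randrange(groups))   # draw one object's group
        n += 1
    return n

random.seed(1)
trials = [draws_to_cover() for _ in range(2000)]
average = sum(trials) / len(trials)       # close to 10 * (1 + 1/2 + ... + 1/10)
```

The expected value is 10 · H₁₀ ≈ 29.3 draws, noticeably more than the 10 one might naively guess.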
Curse of Dimensionality
• When dimensionality
increases, data becomes
increasingly sparse in the
space that it occupies
• Techniques
– Principal Component Analysis
– Singular Value Decomposition
– Others: supervised and non-linear techniques
Dimensionality Reduction: PCA
• Goal is to find a projection that captures the largest amount of variation in data
(Figure: data points in the x1-x2 plane with the principal direction drawn through them.)
Dimensionality Reduction: PCA
• Find the eigenvectors of the covariance matrix
• The eigenvectors define the new space
(Figure: the same point cloud with the eigenvector axes drawn.)
Dimensionality Reduction: ISOMAP
By: Tenenbaum, de Silva, Langford (2000)
Feature Subset Selection
• Redundant features
– duplicate much or all of the information contained in
one or more other attributes
– Example: purchase price of a product and the amount
of sales tax paid
• Irrelevant features
– contain no information that is useful for the data
mining task at hand
– Example: students' ID is often irrelevant to the task of
predicting students' GPA
Feature Subset Selection
• Techniques:
– Brute-force approach:
• Try all possible feature subsets as input to data mining algorithm
– Embedded approaches:
• Feature selection occurs naturally as part of the data mining
algorithm
– Filter approaches:
• Features are selected before data mining algorithm is run
– Wrapper approaches:
• Use the data mining algorithm as a black box to find best
subset of attributes
Feature Creation
• Create new attributes that can capture the
important information in a data set much more
efficiently than the original attributes
Example: Mapping Data to a New
Space
• Fourier transform
• Wavelet transform
Discretization Using Class Labels
• Entropy based approach
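The entropy-based approach picks the split point minimizing the weighted class entropy of the two sides. A hypothetical Python sketch (not from the slides; the toy values/labels are made up):

```python
# Sketch: entropy-based discretization -- score candidate split points.
import math

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in (labels.count(l) for l in set(labels)) if c)

def split_score(values, labels, threshold):
    left  = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    n = len(labels)
    return len(left) / n * entropy(left) + len(right) / n * entropy(right)

values = [1, 2, 3, 10, 11, 12]
labels = ["no", "no", "no", "yes", "yes", "yes"]
best = min(values, key=lambda v: split_score(values, labels, v))
```

Here the classes separate perfectly at 3, so that threshold scores zero entropy.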
Fall 2004, CIS, Temple University
Lecture 4
Market-Basket transactions
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Association Rules
{Diaper} → {Beer},
{Milk, Bread} → {Eggs, Coke},
{Beer, Bread} → {Milk},

Implication means co-occurrence, not causality!
Applications: Association Rule Mining
• * Maintenance Agreement
– What the store should do to boost Maintenance
Agreement sales
• Home Electronics *
– What other products should the store stock up on?
• Attached mailing in direct marketing
• Detecting “ping-ponging” of patients
• Marketing and Sales Promotion
• Supermarket shelf management
Definition: Frequent Itemset
• Itemset
– A collection of one or more items
• Example: {Milk, Bread, Diaper}
– k-itemset
• An itemset that contains k items
• Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf
thresholds
Computationally prohibitive!
Computational Complexity
• Given d unique items:
– Total number of itemsets = 2^d
– Total number of possible association rules:

  R = \sum_{k=1}^{d-1} \binom{d}{k} \sum_{j=1}^{d-k} \binom{d-k}{j}
    = 3^d - 2^{d+1} + 1
Observations:
• All the above rules are binary partitions of the same itemset:
{Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but
can have different confidence
• Thus, we may decouple the support and confidence requirements
Mining Association Rules
• Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup
2. Rule Generation
– Generate high confidence rules from each frequent itemset,
where each rule is a binary partitioning of a frequent itemset
[Figure: the itemset lattice over {A, B, C, D, E}. If {A, B} is found
to be infrequent, all of its supersets are pruned.]
Illustrating Apriori Principle
• Method:
– Let k=1
– Generate frequent itemsets of length 1
– Repeat until no new frequent itemsets are identified
• Generate length (k+1) candidate itemsets from length k
frequent itemsets
• Prune candidate itemsets containing subsets of length k that
are infrequent
• Count the support of each candidate by scanning the DB
• Eliminate candidates that are infrequent, leaving only those
that are frequent
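The loop above can be sketched in Python; this is a minimal illustration (not an optimized implementation), using the market-basket transactions from the earlier slide:

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return all frequent itemsets (as frozensets) with support >= minsup."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Length-1 frequent itemsets
    items = {i for t in transactions for i in t}
    frequent = {frozenset([i]) for i in items if support(frozenset([i])) >= minsup}
    result = set(frequent)

    k = 1
    while frequent:
        # Generate length-(k+1) candidates by joining length-k frequent itemsets
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k + 1}
        # Prune candidates that have an infrequent length-k subset (Apriori principle)
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k))}
        # Count support by scanning the transactions; eliminate infrequent candidates
        frequent = {c for c in candidates if support(c) >= minsup}
        result |= frequent
        k += 1
    return result

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
frequent = apriori(transactions, minsup=0.6)
```

With minsup = 0.6 this yields four frequent 1-itemsets and four frequent 2-itemsets; no 3-itemset survives.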
Apriori: Reducing Number of Comparisons
• Candidate counting:
– Scan the database of transactions to determine the support of
each candidate itemset
– To reduce the number of comparisons, store the candidates in a
hash structure
• Instead of matching each transaction against every candidate,
match it against candidates contained in the hashed buckets
[Figure: a hash tree holding 15 candidate 3-itemsets. The transaction
{1, 2, 3, 5, 6} is hashed on successive items (1+2356, 2+356, 3+56,
12+356, 13+56, 15+6, ...) down to the leaf buckets, so it is matched
against only 11 out of 15 candidates.]
Apriori: Alternative Search Methods
[Figure: three ways to traverse the itemset lattice built over
{a1, a2, ..., an}, relative to the frequent-itemset border:
(a) general-to-specific, (b) specific-to-general, (c) bidirectional]
Apriori: Alternative Search Methods
Horizontal Data Layout        Vertical Data Layout (TID-lists)
  TID  Items                    A: 1, 4, 5, 6, 7, 8, 9
  1    A,B,E                    B: 1, 2, 5, 7, 8, 10
  2    B,C,D                    C: 2, 3, 4, 5, 8, 9
  3    C,E                      D: 2, 4, 5, 9
  4    A,C,D                    E: 1, 3, 6
  5    A,B,C,D
  6    A,E
  7    A,B
  8    A,B,C
  9    A,C,D
  10   B
ECLAT: Another Method for Frequent Itemset
Generation
• Determine support of any k-itemset by intersecting tid-
lists of two of its (k-1) subsets.
    TID-list(A) = {1, 4, 5, 6, 7, 8, 9}
    TID-list(B) = {1, 2, 5, 7, 8, 10}

    TID-list(A) ∩ TID-list(B) = TID-list(AB) = {1, 5, 7, 8}
• 3 traversal approaches:
– top-down, bottom-up and hybrid
• Advantage: very fast support counting
• Disadvantage: intermediate tid-lists may become too
large for memory
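A minimal sketch of the tid-list idea in Python, using the ten transactions from the vertical-layout slide:

```python
def vertical_layout(transactions):
    """Build TID-lists: item -> set of transaction ids containing it."""
    tidlists = {}
    for tid, items in transactions.items():
        for item in items:
            tidlists.setdefault(item, set()).add(tid)
    return tidlists

transactions = {
    1: {"A", "B", "E"}, 2: {"B", "C", "D"}, 3: {"C", "E"},
    4: {"A", "C", "D"}, 5: {"A", "B", "C", "D"}, 6: {"A", "E"},
    7: {"A", "B"}, 8: {"A", "B", "C"}, 9: {"A", "C", "D"}, 10: {"B"},
}
tidlists = vertical_layout(transactions)
# Support of the 2-itemset {A, B} is the size of the TID-list intersection
ab = tidlists["A"] & tidlists["B"]
```

`ab` comes out as {1, 5, 7, 8}, i.e. support(AB) = 4, matching the worked example above.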
FP-growth: Another Method for Frequent
Itemset Generation
FP-Tree Construction

Transaction Database:
  TID  Items
  1    {A,B}
  2    {B,C,D}
  3    {A,C,D,E}
  4    {A,D,E}
  5    {A,B,C}
  6    {A,B,C,D}
  7    {B,C}
  8    {A,B,C}
  9    {A,B,D}
  10   {B,C,E}

[Figure: the FP-tree built from this database — each transaction is
inserted as a path from the null root, merging shared prefixes and
incrementing node counts (e.g., A:7 and B:3 as children of the root)]
FP-growth
• Conditional tree for A within D within E: the count for A is 2, so
  {A,D,E} is a frequent itemset
• Next step: construct the conditional tree for C within the
  conditional tree for E
• Continue until reaching the conditional tree for A (which contains
  only the node A)
Benefits of the FP-tree Structure
• Performance study shows
– FP-growth is an order of magnitude faster than Apriori
  [Figure: run time (sec.) vs. support threshold (%), 0–3%; FP-growth
  stays flat while Apriori's run time grows sharply as the threshold
  decreases]
• Reasoning
– No candidate generation, no candidate test
– Eliminates repeated database scans
• Number of frequent itemsets = 3 × Σ_{k=1}^{10} C(10, k)
Maximal Itemsets
[Figure: itemset lattice over {A, B, C, D, E}; a border separates the
frequent itemsets from the infrequent ones, and the maximal frequent
itemsets lie just inside the border]
Closed Itemset
  TID  Items
  1    ABC
  2    ABCD
  3    BCE
  4    ACDE
  5    DE

[Figure: itemset lattice annotated with the TIDs supporting each
itemset; for this database # Closed = 9 and # Maximal = 4]
Maximal vs Closed Itemsets
[Figure: Venn diagram — maximal frequent itemsets ⊆ closed frequent
itemsets ⊆ frequent itemsets]
Rule Generation
Lecture 6
• Clustering
1
General Applications of Clustering
• Pattern Recognition
• Spatial Data Analysis
– create thematic maps in GIS by clustering feature
spaces
– detect spatial clusters and explain them in spatial data
mining
• Image Processing
• Economic Science (especially market research)
• WWW
– Document classification
– Cluster Weblog data to discover groups of similar
access patterns
3
Examples of Clustering Applications
• Marketing: Help marketers discover distinct groups in their
customer bases, and then use this knowledge to develop
targeted marketing programs
• Land use: Identification of areas of similar land use in an
earth observation database
• Insurance: Identifying groups of motor insurance policy
holders with a high average claim cost
• City-planning: Identifying groups of houses according to
their house type, value, and geographical location
• Earth-quake studies: Observed earth quake epicenters
should be clustered along continent faults
4
What Is Good Clustering?
5
Requirements of Clustering in Data
Mining
• Scalability
• Ability to deal with different types of attributes
• Discovery of clusters with arbitrary shape
• Minimal requirements for domain knowledge to
determine input parameters
• Able to deal with noise and outliers
• Insensitive to order of input records
• High dimensionality
• Incorporation of user-specified constraints
• Interpretability and usability
6
Data Structures in Clustering
• Dissimilarity matrix (one-mode):

    |   0                              |
    | d(2,1)   0                       |
    | d(3,1)  d(3,2)   0               |
    |   :       :       :              |
    | d(n,1)  d(n,2)   ...   ...   0   |
7
Measuring Similarity
• Dissimilarity/Similarity metric: Similarity is expressed in
terms of a distance function, which is typically metric:
d(i, j)
• There is a separate “quality” function that measures the
“goodness” of a cluster.
• The definitions of distance functions are usually very
different for interval-scaled, boolean, categorical, ordinal
and ratio variables.
• Weights should be associated with different variables
based on applications and data semantics.
• It is hard to define “similar enough” or “good enough”
– the answer is typically highly subjective.
8
Interval-valued variables
• Standardize data
– Calculate the mean squared deviation:

    s_f = (1/n) ( |x_1f − m_f|² + |x_2f − m_f|² + ... + |x_nf − m_f|² )
9
Similarity and Dissimilarity Between
Objects
• Distances are normally used to measure the similarity or
dissimilarity between two data objects
• Some popular ones include: Minkowski distance:
    d(i, j) = ( |x_i1 − x_j1|^q + |x_i2 − x_j2|^q + ... + |x_ip − x_jp|^q )^(1/q)

  where i = (x_i1, x_i2, ..., x_ip) and j = (x_j1, x_j2, ..., x_jp) are
  two p-dimensional data objects, and q is a positive integer
• If q = 1, d is the Manhattan distance:

    d(i, j) = |x_i1 − x_j1| + |x_i2 − x_j2| + ... + |x_ip − x_jp|
10
Similarity and Dissimilarity Between
Objects
• If q = 2, d is the Euclidean distance:

    d(i, j) = sqrt( |x_i1 − x_j1|² + |x_i2 − x_j2|² + ... + |x_ip − x_jp|² )

– Properties
• d(i, j) ≥ 0
• d(i, i) = 0
• d(i, j) = d(j, i)
• d(i, j) ≤ d(i, k) + d(k, j)
• Also one can use weighted distance, parametric Pearson
product-moment correlation, or other dissimilarity measures.
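The Minkowski family is easy to sketch directly; q = 1 gives Manhattan and q = 2 gives Euclidean distance (the example points are made up):

```python
def minkowski(x, y, q):
    """Minkowski distance between two equal-length numeric vectors."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)

i = (1.0, 3.0)
j = (4.0, 7.0)
manhattan = minkowski(i, j, 1)  # |1-4| + |3-7| = 7
euclidean = minkowski(i, j, 2)  # sqrt(9 + 16) = 5
```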
11
Mahalanobis Distance
    mahalanobis(p, q) = (p − q) Σ⁻¹ (p − q)ᵀ

  where Σ is the covariance matrix of the input data:

    Σ_jk = (1 / (n − 1)) Σ_{i=1}^{n} (X_ij − X̄_j)(X_ik − X̄_k)

• Example, with

    Σ = | 0.3  0.2 |
        | 0.2  0.3 |

  and points A = (0.5, 0.5), B = (0, 1), C = (1.5, 1.5):

    Mahal(A, B) = 5
    Mahal(A, C) = 4
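These values can be checked directly; a stdlib-only sketch for 2-D points (note the slide's numbers match the squared form of the distance):

```python
def mahalanobis_sq(p, q, cov):
    """Squared Mahalanobis distance (p − q) Σ⁻¹ (p − q)ᵀ for 2-D points."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = ((d / det, -b / det), (-c / det, a / det))  # inverse of a 2x2 Σ
    dx, dy = p[0] - q[0], p[1] - q[1]
    # (dx, dy) · Σ⁻¹ · (dx, dy)ᵀ
    return ((dx * inv[0][0] + dy * inv[1][0]) * dx
            + (dx * inv[0][1] + dy * inv[1][1]) * dy)

cov = ((0.3, 0.2),
       (0.2, 0.3))
A, B, C = (0.5, 0.5), (0.0, 1.0), (1.5, 1.5)
mab = mahalanobis_sq(A, B, cov)  # ≈ 5
mac = mahalanobis_sq(A, C, cov)  # ≈ 4
```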
13
Cosine Similarity
• Example:
    d1 = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
    d2 = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)

    d1 • d2 = 3·1 + 2·0 + 0·0 + 5·0 + 0·0 + 0·0 + 0·0 + 2·1 + 0·0 + 0·2 = 5
    ||d1|| = (3² + 2² + 5² + 2²)^0.5 = 42^0.5 ≈ 6.481
    ||d2|| = (1² + 1² + 2²)^0.5 = 6^0.5 ≈ 2.449

    cos(d1, d2) = 5 / (6.481 × 2.449) ≈ 0.315
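The same computation in Python:

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
sim = cosine_similarity(d1, d2)  # 5 / (sqrt(42) * sqrt(6)) ≈ 0.315
```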
14
Correlation Measure
Scatter plots
showing the
similarity from
–1 to 1.
15
Binary Variables
• A contingency table for binary data:

                     Object j
                  1        0        sum
  Object i   1    a        b        a + b
             0    c        d        c + d
           sum    a + c    b + d    p

• Simple matching coefficient (invariant, if the binary variable is
  symmetric):

    d(i, j) = (b + c) / (a + b + c + d)

• Jaccard coefficient (noninvariant if the binary variable is
  asymmetric):

    d(i, j) = (b + c) / (a + b + c)
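A small sketch computing both coefficients from two 0/1 vectors (the vectors themselves are made up):

```python
def binary_dissimilarities(x, y):
    """SMC- and Jaccard-based dissimilarities for two 0/1 vectors.

    a = both 1, b = 1 in x only, c = 1 in y only, d = both 0.
    """
    a = sum(1 for p, q in zip(x, y) if p == 1 and q == 1)
    b = sum(1 for p, q in zip(x, y) if p == 1 and q == 0)
    c = sum(1 for p, q in zip(x, y) if p == 0 and q == 1)
    d = sum(1 for p, q in zip(x, y) if p == 0 and q == 0)
    smc = (b + c) / (a + b + c + d)   # for symmetric binary variables
    jaccard = (b + c) / (a + b + c)   # for asymmetric binary variables
    return smc, jaccard

x = [1, 0, 0, 0, 0, 0]
y = [0, 1, 0, 0, 0, 0]
smc, jac = binary_dissimilarities(x, y)  # 2/6 ≈ 0.333 and 2/2 = 1.0
```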
Dissimilarity between Binary Variables
• Example
    d(i, j) = (p − m) / p

  where m is the number of matches and p is the total number of
  variables (simple matching)
18
Ordinal Variables
• An ordinal variable can be discrete or continuous
• order is important, e.g., rank
• Can be treated like interval-scaled
– replace x_if by its rank r_if ∈ {1, ..., M_f}
– map the range of each variable onto [0, 1] by replacing the i-th
object in the f-th variable by

    z_if = (r_if − 1) / (M_f − 1)

– compute the dissimilarity using methods for interval-scaled
variables
19
Ratio-Scaled Variables
20
Variables of Mixed Types
• A database may contain all the six types of variables
– symmetric binary, asymmetric binary, nominal, ordinal, interval
and ratio.
• One may use a weighted formula to combine their
effects:

    d(i, j) = ( Σ_{f=1}^{p} δ_ij^(f) d_ij^(f) ) / ( Σ_{f=1}^{p} δ_ij^(f) )

– f is binary or nominal:
  d_ij^(f) = 0 if x_if = x_jf, otherwise d_ij^(f) = 1
– f is interval-based: use the normalized distance
– f is ordinal or ratio-scaled:
• compute the ranks r_if and z_if = (r_if − 1) / (M_f − 1)
• treat z_if as interval-scaled
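A minimal sketch of the weighted formula; attribute names and values are made up, and numeric attributes are assumed to be pre-normalized to [0, 1] so |x − y| is a valid contribution:

```python
def mixed_dissimilarity(x, y, types):
    """Weighted dissimilarity over mixed-type attribute vectors.

    types[f] is "nominal" or "numeric"; a None value means the
    attribute is missing (its delta is 0, so it is skipped).
    """
    num, den = 0.0, 0.0
    for xf, yf, t in zip(x, y, types):
        if xf is None or yf is None:   # delta_ij^(f) = 0: skip this attribute
            continue
        d = (0.0 if xf == yf else 1.0) if t == "nominal" else abs(xf - yf)
        num += d
        den += 1.0                     # delta_ij^(f) = 1
    return num / den

x = ["red", 0.2, 0.9]
y = ["blue", 0.2, 0.4]
types = ["nominal", "numeric", "numeric"]
d = mixed_dissimilarity(x, y, types)  # (1 + 0 + 0.5) / 3 ≈ 0.5
```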
21
Notion of a Cluster can be Ambiguous
22
Other Distinctions Between Sets of
Clusters
• Exclusive versus non-exclusive
– In non-exclusive clusterings, points may belong to multiple
clusters.
– Can represent multiple classes or ‘border’ points
• Fuzzy versus non-fuzzy
– In fuzzy clustering, a point belongs to every cluster with some
weight between 0 and 1
– Weights must sum to 1
– Probabilistic clustering has similar characteristics
• Partial versus complete
– In some cases, we only want to cluster some of the data
• Heterogeneous versus homogeneous
– Cluster of widely different sizes, shapes, and densities
23
Types of Clusters
• Well-separated clusters
• Center-based clusters
• Contiguous clusters
• Density-based clusters
• Property or Conceptual
3 well-separated clusters
25
Types of Clusters: Center-Based
• Center-based
– A cluster is a set of objects such that an object in a cluster is
closer (more similar) to the “center” of a cluster, than to the
center of any other cluster
– The center of a cluster is often a centroid, the average of all
the points in the cluster, or a medoid, the most
“representative” point of a cluster
4 center-based clusters
26
Types of Clusters: Contiguity-Based
8 contiguous clusters
27
Types of Clusters: Density-Based
• Density-based
– A cluster is a dense region of points, which is separated by
low-density regions, from other regions of high density.
– Used when the clusters are irregular or intertwined, and when
noise and outliers are present.
6 density-based clusters
28
Types of Clusters: Conceptual Clusters
2 Overlapping Circles
29
Major Clustering Approaches
31
K-means Clustering – Details
• Initial centroids are often chosen randomly.
– Clusters produced vary from one run to another.
• The centroid is (typically) the mean of the points in the
cluster.
• ‘Closeness’ is measured by Euclidean distance, cosine
similarity, correlation, etc.
• K-means will converge for common similarity measures
mentioned above.
• Most of the convergence happens in the first few
iterations.
– Often the stopping condition is changed to ‘Until relatively few
points change clusters’
• Complexity is O( n * K * I * d )
– n = number of points, K = number of clusters,
I = number of iterations, d = number of attributes
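The algorithm can be sketched in a few lines; this version uses a deterministic initialization (first k points) so the run is reproducible, whereas in practice the initial centroids are often chosen randomly as noted above. The sample points are made up:

```python
def kmeans(points, k, iters=100):
    """Plain K-means on 2-D points with Euclidean distance (minimal sketch)."""
    centroids = points[:k]  # deterministic init for this sketch; usually random
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:     # assignment step: closest centroid (squared distance)
            i = min(range(k),
                    key=lambda c: (p[0] - centroids[c][0]) ** 2
                                + (p[1] - centroids[c][1]) ** 2)
            clusters[i].append(p)
        # update step: centroid = mean of assigned points (keep old if empty)
        new = [(sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
               if cl else centroids[i] for i, cl in enumerate(clusters)]
        if new == centroids:  # converged: centroids stopped moving
            break
        centroids = new
    return centroids, clusters

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, 2)
```

On these two obvious blobs the loop converges in a few iterations, splitting the six points 3/3.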
32
Two different K-means Clusterings
[Figure: a set of original points and two K-means results on it — an
optimal clustering and a sub-optimal clustering produced by different
initial centroids]
35
Handling Empty Clusters
• Several strategies for choosing a replacement centroid:
– Choose the point that contributes most to SSE
– Choose a point from the cluster with the highest SSE
– If there are several empty clusters, the above can be
repeated several times.
36
Pre-processing and Post-processing
• Pre-processing
– Normalize the data
– Eliminate outliers
• Post-processing
– Eliminate small clusters that may represent outliers
– Split ‘loose’ clusters, i.e., clusters with relatively high
SSE
– Merge clusters that are ‘close’ and that have relatively
low SSE
– Can use these steps during the clustering process
• ISODATA 37
Bisecting K-means
38
Bisecting K-means Example
39
Limitations of K-means
40
Limitations of K-means: Differing Sizes
41
Limitations of K-means: Differing Density
42
Limitations of K-means: Non-globular
Shapes
43
Overcoming K-means Limitations
45
Variations of the K-Means Method
• A few variants of the k-means which differ in
– Selection of the initial k means
– Dissimilarity calculations
– Strategies to calculate cluster means
• Handling categorical data: k-modes (Huang’98)
– Replacing means of clusters with modes
– Using new dissimilarity measures to deal with categorical objects
– Using a frequency-based method to update modes of clusters
• Handling a mixture of categorical and numerical data: k-
prototype method
46
The K-Medoids Clustering Method
[Figure: six labeled points in the plane and the dendrogram showing
the order in which they are merged, with merge heights between 0 and
0.2]
48
Strengths of Hierarchical Clustering
49
Hierarchical Clustering
– Divisive:
• Start with one, all-inclusive cluster
• At each step, split a cluster until each cluster contains a point (or
there are k clusters)
[Figure: starting situation for agglomerative clustering — each point
p1, p2, ..., p12 is its own cluster, with the full proximity matrix]
52
Intermediate Situation
• After some merging steps, we have some clusters
[Figure: clusters C1–C5 and their proximity matrix]
53
Intermediate Situation
• We want to merge the two closest clusters (C2 and C5) and
update the proximity matrix.
[Figure: clusters C1–C5 and their proximity matrix, with C2 and C5
marked as the pair to merge]
After Merging
• The question is “How do we update the proximity matrix?”
[Figure: the proximity matrix after merging C2 and C5 into a single
cluster C2 ∪ C5; the entries between C2 ∪ C5 and the remaining
clusters are marked "?"]
55
How to Define Inter-Cluster Similarity
[Figure: two clusters of points p1–p5 and their proximity matrix;
which entries define the similarity between the two clusters?]

• MIN (single link)
• MAX (complete link)
• Group Average
• Distance Between Centroids
• Other methods driven by an objective function
– Ward's Method uses squared error
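A naive agglomerative sketch showing how the linkage choice (MIN, i.e. single link, vs. MAX, i.e. complete link) plugs into the merge loop; the point coordinates are made up:

```python
def agglomerative(points, k, linkage="min"):
    """Naive agglomerative clustering down to k clusters.

    linkage is "min" (single link) or "max" (complete link).
    """
    clusters = [[p] for p in points]  # start: every point is its own cluster

    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    def cluster_dist(c1, c2):
        pairs = [dist(p, q) for p in c1 for q in c2]
        return min(pairs) if linkage == "min" else max(pairs)

    while len(clusters) > k:
        # find the two closest clusters under the chosen linkage and merge them
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: cluster_dist(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters.pop(j)
    return clusters

points = [(0, 0), (0, 1), (5, 5), (5, 6)]
result = agglomerative(points, 2)
```

On this toy data both linkages merge the two nearby pairs first, leaving two clusters of two points each.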
Hierarchical Clustering: Comparison
[Figure: the same six points clustered hierarchically with MIN, MAX,
Group Average, and Ward's Method; the four linkage choices produce
different cluster assignments]
Hierarchical Clustering: Time and Space
requirements
• O(N2) space since it uses the proximity matrix.
– N is the number of points.
62
Hierarchical Clustering: Problems and
Limitations
• Once a decision is made to combine two
clusters, it cannot be undone
64
MST: Divisive Hierarchical Clustering
65
More on Hierarchical Clustering Methods
66
One Alternative: BIRCH
68
DBSCAN
69
DBSCAN: Core, Border, and Noise Points
70
DBSCAN Algorithm
• Eliminate noise points
• Perform clustering on the remaining points
71
DBSCAN: Core, Border and Noise Points
• Resistant to Noise
• Can handle clusters of different shapes and sizes
73
When DBSCAN Does NOT Work Well
[Figure: the original points and two DBSCAN runs on them, one with
(MinPts=4, Eps=9.75) and one with (MinPts=4, Eps=9.92)]
• Varying densities
• High-dimensional data
74
DBSCAN: Determining EPS and MinPts
• Idea is that for points in a cluster, their kth nearest
neighbors are at roughly the same distance
• Noise points have the kth nearest neighbor at farther
distance
• So, plot sorted distance of every point to its kth
nearest neighbor
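The k-distance idea can be sketched as follows (brute force, made-up points); in the sorted output the noise point stands out as the sharp jump at the end:

```python
def kth_nearest_distances(points, k):
    """For each point, the distance to its k-th nearest neighbor.

    Sorting these values gives the "knee" plot used to pick Eps.
    """
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    out = []
    for p in points:
        ds = sorted(dist(p, q) for q in points if q is not p)
        out.append(ds[k - 1])
    return sorted(out)

# Two dense pairs plus one far-away noise point
points = [(0, 0), (0, 1), (5, 5), (5, 6), (50, 50)]
dists = kth_nearest_distances(points, k=1)
```

The first four values are 1.0 (points inside the dense pairs), while the noise point's nearest-neighbor distance is over 60 — the knee of the sorted curve suggests the Eps threshold.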
75
Graph-Based Clustering
78
Limitations of Current Merging
Schemes
(a)
(b)
(c)
(d)
79
Model-Based Clustering Methods
80
Cluster Validity
• For supervised classification we have a variety of
measures to evaluate how good our model is
– Accuracy, precision, recall
[Figure: scatter plots over the unit square (x, y ∈ [0, 1]) comparing
the original points with the clusterings that K-means and Complete
Link find in them]
Measures of Cluster Validity
• Numerical measures applied to judge various aspects of cluster
validity are classified into the following three types.
– External Index: Used to measure the extent to which cluster labels
match externally supplied class labels.
• Entropy
– Internal Index: Used to measure the goodness of a clustering
structure without respect to external information.
• Sum of Squared Error (SSE)
– Relative Index: Used to compare two different clusterings or
clusters.
• Often an external or internal index is used for this function, e.g., SSE or
entropy
• Sometimes these are referred to as criteria instead of indices
– However, sometimes criterion is the general strategy and index is the
numerical measure that implements the criterion.
83
Internal Measures: Cohesion and
Separation
• Cluster Cohesion: Measures how closely related are
objects in a cluster
– Example: SSE
• Cluster Separation: Measures how distinct or well-
separated a cluster is from other clusters
• Example: Squared Error
– Cohesion is measured by the within-cluster sum of squares (SSE):

    WSS = Σ_i Σ_{x ∈ C_i} (x − m_i)²

– Separation is measured by the between-cluster sum of squares:

    BSS = Σ_i |C_i| (m − m_i)²

  where |C_i| is the size of cluster i, m_i its centroid, and m the
  overall mean
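A small 1-D sketch (the cluster values are made up) that also illustrates the identity total SS = WSS + BSS:

```python
def wss_bss(clusters):
    """Within- and between-cluster sums of squares for 1-D clusters."""
    all_points = [x for c in clusters for x in c]
    m = sum(all_points) / len(all_points)  # overall mean
    wss = bss = 0.0
    for c in clusters:
        mi = sum(c) / len(c)               # cluster centroid
        wss += sum((x - mi) ** 2 for x in c)
        bss += len(c) * (m - mi) ** 2
    return wss, bss

clusters = [[1, 2], [5, 6]]
wss, bss = wss_bss(clusters)  # WSS = 0.5 + 0.5 = 1, BSS = 2*2^2 + 2*2^2 = 16
```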
84
External Measures of Cluster Validity:
Entropy and Purity
85
Final Comment on Cluster Validity
86
What Is Outlier Discovery?
87
Outlier Discovery:
Statistical Approach
89
Outlier Discovery: Deviation-Based
Approach
• Identifies outliers by examining the main
characteristics of objects in a group
• Objects that “deviate” from this description are
considered outliers
• sequential exception technique
– simulates the way in which humans can distinguish unusual
objects from among a series of supposedly like objects
• OLAP data cube technique
– uses data cubes to identify regions of anomalies in large
multidimensional data
90
Fall 2004, CIS, Temple University
Lecture 7
Decision Trees
[Figure: induction workflow — a model is learned from the Training
Set (rows such as Tid 3: No, Small, 70K → No and Tid 6: No, Medium,
60K → No) and then applied to the Test Set (Tid 11: No, Small, 55K →
? and Tid 15: No, Large, 67K → ?)]
Splitting Attributes

  Tid  Refund  Marital Status  Taxable Income  Cheat
  1    Yes     Single          125K            No
  2    No      Married         100K            No
  3    No      Single          70K             No
  4    Yes     Married         120K            No
  5    No      Divorced        95K             Yes
  6    No      Married         60K             No
  7    Yes     Divorced        220K            No
  8    No      Single          85K             Yes
  9    No      Married         75K             No
  10   No      Single          90K             Yes

[Figure: two decision trees fitting this data. One splits on Refund
first (Yes → NO; No → split on MarSt: Married → NO; Single/Divorced
→ split on TaxInc: < 80K → NO, > 80K → YES). The other splits on
MarSt first, then Refund and TaxInc.]

There could be more than one tree that fits the same data!
Test Data

Start from the root of the tree.

  Refund  Marital Status  Taxable Income  Cheat
  No      Married         80K             ?

[Figure: the record is routed down the tree step by step — Refund =
No takes the branch to the MarSt test; Marital Status = Married
takes the branch to the NO leaf.]

Assign Cheat to "No"
Many Algorithms:
– Hunt’s Algorithm (one of the earliest)
– CART
– ID3, C4.5
– SLIQ,SPRINT
© Vipin Kumar CSci 5980 Spring 2004 ‹#›
Tree Induction
Greedy strategy.
– Split the records based on an attribute test
that optimizes certain criterion.
Issues
– Determine how to split the records
◆How to specify the attribute test condition?
◆How to determine the best split?
[Figure: what about this split? A binary grouping on attribute Size,
{Small, Large} vs. {Medium}. For continuous attributes: a binary
test such as Taxable Income > 80K? (Yes/No), or a multi-way split
into ranges such as < 10K, ..., > 80K.]
Greedy approach:
– Nodes with homogeneous class distribution
are preferred
Need a measure of node impurity:
C0: 5 C0: 9
C1: 5 C1: 1
Non-homogeneous, Homogeneous,
High degree of impurity Low degree of impurity
Gini Index
Entropy
Misclassification error
    GINI(t) = 1 − Σ_j [ p(j | t) ]²
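As a quick sketch, the Gini index of a node can be computed from its class counts (the counts below are made up):

```python
def gini(counts):
    """Gini index of a node: GINI(t) = 1 - sum_j p(j|t)^2."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

g_mixed = gini([5, 5])   # maximum impurity for 2 classes: 0.5
g_skew = gini([9, 1])    # nearly pure node: 1 - 0.81 - 0.01 ≈ 0.18
g_pure = gini([10, 0])   # pure node: 0.0
```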
[Table: candidate splits on a sorted continuous attribute — the
Yes/No class counts on each side of every candidate split position
and the resulting Gini values: 0.420, 0.400, 0.375, 0.343, 0.417,
0.400, 0.300, 0.343, 0.375, 0.400, 0.420. The best split has
Gini = 0.300.]
    Entropy(t) = − Σ_j p(j | t) · log₂ p(j | t)
Information Gain:

    GAIN_split = Entropy(p) − Σ_{i=1}^{k} (n_i / n) · Entropy(i)

  where parent node p is split into k partitions and n_i is the
  number of records in partition i

Gain Ratio:

    GainRATIO_split = GAIN_split / SplitINFO

    SplitINFO = − Σ_{i=1}^{k} (n_i / n) · log(n_i / n)
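The two measures can be sketched together; the class counts below for a parent node and its two partitions are made up:

```python
import math

def entropy(counts):
    """Entropy(t) = -sum_j p(j|t) * log2 p(j|t) over a node's class counts."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def information_gain(parent, children):
    """GAIN_split = Entropy(parent) - weighted entropy of the partitions."""
    n = sum(parent)
    return entropy(parent) - sum(sum(ch) / n * entropy(ch) for ch in children)

# Parent node with 10 Yes / 10 No, split into two fairly pure partitions
gain = information_gain([10, 10], [[9, 1], [1, 9]])  # ≈ 0.531
```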
Advantages:
– Inexpensive to construct
– Extremely fast at classifying unknown records
– Easy to interpret for small-sized trees
– Accuracy is comparable to other classification
techniques for many simple data sets
Missing Values
Costs of Classification
Circular points:
    0.5 ≤ sqrt(x1² + x2²) ≤ 1

Triangular points:
    sqrt(x1² + x2²) < 0.5 or
    sqrt(x1² + x2²) > 1
Overfitting
Underfitting: when model is too simple, both training and test errors are large
Lack of data points in the lower half of the diagram makes it difficult
to predict correctly the class labels of that region
- Insufficient number of training records in the region causes the
decision tree to predict the test examples using other training
records that are irrelevant to the classification task
Notes on Overfitting
Post-pruning
– Grow decision tree to its entirety
– Trim the nodes of the decision tree in a
bottom-up fashion
– If generalization error improves after trimming,
replace sub-tree by a leaf node.
– Class label of leaf node is determined from
majority class of instances in the sub-tree
– Can use MDL for post-pruning
[Example: candidate subtrees rooted at attributes A1–A4 with leaf
class counts such as C0: 11 / C1: 3 and C0: 2 / C1: 4 (case 1), and
C0: 14 / C1: 3 and C0: 2 / C1: 2 (case 2). Using the pessimistic
error estimate: don't prune case 1, prune case 2.]
[Figure: points in the unit square classified by a tree that first
tests x < 0.43 and then y thresholds in each branch; the leaf class
counts are shown (e.g., 4:0, 0:4, 0:4, 3:0)]
• The border line between two neighboring regions of different
classes is known as the decision boundary
• The decision boundary is parallel to the axes because each test
condition involves a single attribute at a time
[Figure: an oblique decision boundary x + y < 1 separating
Class = + from Class = −]
                 PREDICTED CLASS
                 Class=Yes   Class=No
ACTUAL Class=Yes    a (TP)     b (FN)
CLASS  Class=No     c (FP)     d (TN)

a: TP (true positive)
b: FN (false negative)
c: FP (false positive)
d: TN (true negative)

                  a + d             TP + TN
    Accuracy = ───────────── = ─────────────────
               a + b + c + d   TP + TN + FP + FN
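The accuracy formula in Python (the confusion-matrix counts are made up):

```python
def accuracy(tp, fn, fp, tn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)

acc = accuracy(tp=40, fn=10, fp=5, tn=45)  # (40 + 45) / 100 = 0.85
```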
Cost-sensitive measure:

    Weighted accuracy = (w₁·a + w₄·d) / (w₁·a + w₂·b + w₃·c + w₄·d)

Confidence interval for accuracy:

    P( Z_{α/2} ≤ (acc − p) / sqrt(p(1 − p)/N) ≤ Z_{1−α/2} ) = 1 − α

    p = ( 2·N·acc + Z²_{α/2} ± Z_{α/2}·sqrt(Z²_{α/2} + 4·N·acc − 4·N·acc²) )
        / ( 2·(N + Z²_{α/2}) )
• Given two models with error rates e1 and e2, assume the error
  rates are normally distributed:

    e1 ~ N(μ₁, σ₁)
    e2 ~ N(μ₂, σ₂)

– Approximate each variance:

    σ̂ᵢ² = eᵢ(1 − eᵢ) / nᵢ

• The variance of the difference d = e1 − e2 is then

    σ_t² ≈ σ̂₁² + σ̂₂²
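The confidence bounds for the true accuracy p can be computed directly; this is a sketch of the interval formula above, with `z = 1.96` for a 95% interval and made-up values of acc and N:

```python
import math

def accuracy_confidence_interval(acc, n, z=1.96):
    """Lower and upper bounds for the true accuracy p (z = 1.96 for 95%)."""
    center = 2 * n * acc + z * z
    spread = z * math.sqrt(z * z + 4 * n * acc - 4 * n * acc * acc)
    denom = 2 * (n + z * z)
    return (center - spread) / denom, (center + spread) / denom

lo, hi = accuracy_confidence_interval(acc=0.8, n=100)  # ≈ (0.711, 0.867)
```

Note the interval is not symmetric around the observed accuracy and narrows as N grows.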