Lecture 1
• Course syllabus
• Overview of data warehousing and mining
Motivation:
“Necessity is the Mother of Invention”
What Is Data Mining?
Examples: What is (not) Data Mining?
• Databases to be mined
– Relational, transactional, object-oriented, object-relational,
active, spatial, time-series, text, multi-media, heterogeneous,
legacy, WWW, etc.
• Knowledge to be mined
– Characterization, discrimination, association, classification,
clustering, trend, deviation and outlier analysis, etc.
– Multiple/integrated functions and mining at multiple levels
• Techniques utilized
– Database-oriented, data warehouse (OLAP), machine learning,
statistics, visualization, neural network, etc.
• Applications adapted
– Retail, telecommunication, banking, fraud analysis, DNA mining, stock
market analysis, Web mining, Weblog analysis, etc.
Data Mining Tasks
• Prediction Tasks
– Use some variables to predict unknown or future values of other
variables
• Description Tasks
– Find human-interpretable patterns that describe the data.
Classification Example
[Figure: a training set of labeled records (attributes such as marital status and taxable income, with a Yes/No class label) is fed to a learning algorithm to induce a model; the model is then applied as a classifier to an unseen test set.]
Classification: Application 1
• Direct Marketing
– Goal: Reduce cost of mailing by targeting a set of
consumers likely to buy a new cell-phone product.
– Approach:
• Use the data for a similar product introduced before.
• We know which customers decided to buy and which decided
otherwise. This {buy, don’t buy} decision forms the class
attribute.
• Collect various demographic, lifestyle, and company-
interaction related information about all such customers.
– Type of business, where they live, how much they earn, etc.
• Use this information as input attributes to learn a classifier
model.
Classification: Application 2
• Fraud Detection
– Goal: Predict fraudulent cases in credit card
transactions.
– Approach:
• Use credit card transactions and the information on the
account holder as attributes.
– When does a customer buy, what does he buy, how often does
he pay on time, etc.
• Label past transactions as fraud or fair transactions. This
forms the class attribute.
• Learn a model for the class of the transactions.
• Use this model to detect fraud by observing credit card
transactions on an account.
Classification: Application 3
• Customer Attrition/Churn:
– Goal: To predict whether a customer is likely to be
lost to a competitor.
– Approach:
• Use detailed records of transactions with each of the past and
present customers to find attributes.
– How often the customer calls, where he calls, what time of day
he calls most, his financial status, marital status, etc.
• Label the customers as loyal or disloyal.
• Find a model for loyalty.
Classification: Application 4
Classifying Galaxies
• Class: stage of formation (early, intermediate, late)
• Attributes: image features, characteristics of light
waves received, etc.
• Data size: 72 million stars, 20 million galaxies;
object catalog: 9 GB; image database: 150 GB
Clustering Definition
Illustrating Clustering
Clustering: Application 1
• Market Segmentation:
– Goal: subdivide a market into distinct subsets of
customers where any subset may conceivably be
selected as a market target to be reached with a
distinct marketing mix.
– Approach:
• Collect different attributes of customers based on their
geographical and lifestyle related information.
• Find clusters of similar customers.
• Measure the clustering quality by observing buying patterns
of customers in same cluster vs. those from different clusters.
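The segmentation idea above can be sketched with a minimal k-means loop. Everything here is illustrative: the customer points, the two attributes (annual spend, visits per month), and the starting centroids are all made up for the sketch; a real segmentation would use many attributes and a library implementation.

```python
# Minimal k-means sketch for market segmentation (toy data, fixed
# initial centroids so the run is deterministic).

def kmeans(points, centroids, iterations=10):
    """Assign each point to its nearest centroid, then recompute centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # squared Euclidean distance to each centroid
            dists = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
            if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids, clusters

customers = [(100, 2), (120, 3), (90, 1),      # low-spend segment
             (900, 12), (950, 14), (880, 11)]  # high-spend segment
centroids, clusters = kmeans(customers, centroids=[(0, 0), (1000, 20)])
print(len(clusters[0]), len(clusters[1]))  # 3 3
```

The two obvious segments fall into separate clusters, which is exactly the "similar customers" grouping the slide describes.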
Clustering: Application 2
• Document Clustering:
– Goal: To find groups of documents that are similar to
each other based on the important terms appearing in
them.
– Approach: To identify frequently occurring terms in
each document. Form a similarity measure based on
the frequencies of different terms. Use it to cluster.
– Gain: Information Retrieval can utilize the clusters to
relate a new document or search term to clustered
documents.
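The similarity measure described above can be sketched as cosine similarity over term-frequency vectors. The three toy documents are invented for the sketch; real document clustering would also weight terms (e.g., TF-IDF) and remove stop words.

```python
# Term-frequency cosine similarity, the core of the approach above.
import math
from collections import Counter

def cosine(doc_a, doc_b):
    """Cosine similarity between the term-frequency vectors of two texts."""
    ta, tb = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(ta[t] * tb[t] for t in ta)
    norm = math.sqrt(sum(v * v for v in ta.values())) * \
           math.sqrt(sum(v * v for v in tb.values()))
    return dot / norm if norm else 0.0

d1 = "data mining finds patterns in data"
d2 = "mining data for patterns"
d3 = "golf championship this weekend"
print(cosine(d1, d2) > cosine(d1, d3))  # True
```

Documents sharing frequent terms score high; documents with no common terms score zero, so thresholding this measure groups similar documents together.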
Association Rule Discovery: Definition
TID | Items
1   | Bread, Coke, Milk
2   | Beer, Bread
3   | Beer, Coke, Diaper, Milk
4   | Beer, Bread, Diaper, Milk
5   | Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
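The two discovered rules can be checked by computing support and confidence directly over the five transactions (a minimal sketch of rule evaluation, not a rule-discovery algorithm):

```python
# Support and confidence for the rules discovered above.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """Of the transactions containing lhs, the fraction also containing rhs."""
    return support(lhs | rhs) / support(lhs)

print(support({"Milk", "Coke"}))                           # 0.6
print(confidence({"Milk"}, {"Coke"}))                      # 0.75
print(round(confidence({"Diaper", "Milk"}, {"Beer"}), 2))  # 0.67
```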
Association Rule Discovery: Application 1
Regression
Deviation/Anomaly Detection
– Network Intrusion
Detection
Data Mining and Induction Principle
Induction vs Deduction
The Problems with Induction
Prototypical example:
– All observed European swans are white.
– Induce: "All swans are white" as a general rule.
– Then discover Australia and its black swans...
– Problem: the set of examples is not random and representative
• Exploratory-based method:
– Try to make sense of a bunch of data without an a priori
hypothesis!
– The only prevention against false results is significance:
• ensure statistical significance (using train and test etc.)
• ensure domain significance (i.e., make sure that the results make
sense to a domain expert)
Data Mining: A KDD Process
• Data mining is the core of the knowledge discovery process.
• Typical flow:
Databases → Data Cleaning → Data Integration → Data Warehouse →
Data Selection (task-relevant data) → Data Mining → Pattern Evaluation
Steps of a KDD Process
Data Exploration
Statistical Analysis, Querying and Reporting
• Relational databases
• Data warehouses
• Transactional databases
• Advanced DB and information repositories
– Object-oriented and object-relational databases
– Spatial databases
– Time-series data and temporal data
– Text databases and multimedia databases
– Heterogeneous and legacy databases
– WWW
Data Mining: Confluence of Multiple
Disciplines
Data mining draws on database technology, statistics, machine
learning, visualization, information science, and other disciplines.
Data Mining vs. Statistical Analysis
Statistical Analysis:
• Ill-suited for Nominal and Structured Data Types
• Completely data driven - incorporation of domain knowledge not
possible
• Interpretation of results is difficult and daunting
• Requires expert user guidance
Data Mining:
• Large Data sets
• Efficiency of Algorithms is important
• Scalability of Algorithms is important
• Real World Data
• Lots of Missing Values
• Pre-existing data - not user generated
• Data not static - prone to updates
• Efficient methods for data retrieval available for use
Data Mining vs. DBMS
Example of DBMS, OLAP and Data
Mining: Weather Data
DBMS:
Day outlook temperature humidity windy play
1 sunny 85 85 false no
2 sunny 80 90 true no
3 overcast 83 86 false yes
4 rainy 70 96 false yes
5 rainy 68 80 false yes
6 rainy 65 70 true no
7 overcast 64 65 true yes
8 sunny 72 95 false no
9 sunny 69 70 false yes
10 rainy 75 80 false yes
11 sunny 75 70 true yes
12 overcast 72 90 true yes
13 overcast 81 75 false yes
14 rainy 71 91 true no
Example of DBMS, OLAP and Data
Mining: Weather Data
OLAP:
• Using OLAP we can create a Multidimensional Model of our data
(Data Cube).
• For example using the dimensions: time, outlook and play we can
create the following model.
Example of DBMS, OLAP and Data
Mining: Weather Data
Data Mining:
• outlook = sunny
– humidity = high: no
– humidity = normal: yes
• outlook = overcast: yes
• outlook = rainy
– windy = true: no
– windy = false: yes
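The mined tree above can be written as a plain function and checked against the 14-row weather table from the DBMS slide. One assumption is made for the sketch: humidity is treated as "high" when above 75, since the table stores numeric humidity while the tree uses high/normal.

```python
# The decision tree above as code, validated against the training table.
weather = [  # (outlook, temperature, humidity, windy, play)
    ("sunny", 85, 85, False, "no"),     ("sunny", 80, 90, True, "no"),
    ("overcast", 83, 86, False, "yes"), ("rainy", 70, 96, False, "yes"),
    ("rainy", 68, 80, False, "yes"),    ("rainy", 65, 70, True, "no"),
    ("overcast", 64, 65, True, "yes"),  ("sunny", 72, 95, False, "no"),
    ("sunny", 69, 70, False, "yes"),    ("rainy", 75, 80, False, "yes"),
    ("sunny", 75, 70, True, "yes"),     ("overcast", 72, 90, True, "yes"),
    ("overcast", 81, 75, False, "yes"), ("rainy", 71, 91, True, "no"),
]

def predict(outlook, humidity, windy):
    if outlook == "sunny":
        return "no" if humidity > 75 else "yes"  # high vs. normal humidity
    if outlook == "overcast":
        return "yes"
    return "no" if windy else "yes"              # rainy branch

correct = sum(predict(o, h, w) == play for o, t, h, w, play in weather)
print(correct, "/", len(weather))  # 14 / 14
```

Under that humidity threshold the tree reproduces every row of the table, which is what makes it the pattern a decision-tree learner would extract from this data.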
Major Issues in Data Warehousing and
Mining
• Mining methodology and user interaction
– Mining different kinds of knowledge in databases
– Interactive mining of knowledge at multiple levels of abstraction
– Incorporation of background knowledge
– Data mining query languages and ad-hoc data mining
– Expression and visualization of data mining results
– Handling noise and incomplete data
– Pattern evaluation: the interestingness problem
• Performance and scalability
– Efficiency and scalability of data mining algorithms
– Parallel, distributed and incremental mining methods
Major Issues in Data Warehousing and
Mining
• Issues relating to the diversity of data types
– Handling relational and complex types of data
– Mining information from heterogeneous databases and global
information systems (WWW)
• Issues related to applications and social impacts
– Application of discovered knowledge
• Domain-specific data mining tools
• Intelligent query answering
• Process control and decision making
– Integration of the discovered knowledge with existing knowledge:
A knowledge fusion problem
– Protection of data security, integrity, and privacy
Multi-Tiered Architecture of data warehousing and data mining
• Bottom tier: operational databases and other external sources feed the
warehouse through extract, transform, load, and refresh operations
(coordinated by an integrator and monitor).
• Middle tier: the data warehouse and its data marts, described by
metadata, are served by an OLAP server.
• Top tier: front-end tools for analysis, querying, reporting, and
data mining.
2008/1/9 2
XML is Extensible
• The tags used to mark up HTML documents and
the structure of HTML documents are
predefined; XML, by contrast, lets authors define
their own tags and document structure.
XML Syntax
• An example XML document.
<?xml version="1.0"?>
<note>
<to>Tan Siew Teng</to>
<from>Lee Sim Wee</from>
<heading>Reminder</heading>
<body>Don't forget the Golf Championship this
weekend!</body>
</note>
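The note document above parses directly with Python's standard-library XML parser, which is a quick way to experiment with the examples in these slides:

```python
# Parsing the note example with xml.etree.ElementTree.
import xml.etree.ElementTree as ET

doc = """<?xml version="1.0"?>
<note>
<to>Tan Siew Teng</to>
<from>Lee Sim Wee</from>
<heading>Reminder</heading>
<body>Don't forget the Golf Championship this weekend!</body>
</note>"""

root = ET.fromstring(doc)       # parses the document, returns the root element
print(root.tag)                 # note
print(root.find("to").text)     # Tan Siew Teng
print(len(root))                # 4 child elements
```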
Example (cont’d)
• The first line in the document: The XML
declaration should always be included.
• It defines the XML version of the document.
• In this case the document conforms to the 1.0
specification of XML.
<?xml version="1.0"?>
• The next line defines the first element of the
document (the root element):
<note>
Example (cont’d)
• The next lines define 4 child elements of the root
(to, from, heading, and body).
• The last line defines the end of the root element:
</note>
What is an XML element?
• An XML element is made up of a start tag, an
end tag, and data in between.
<Sport>Golf</Sport>
• The name of the element is enclosed by the less
than and greater than characters, and these are
called tags.
• The start and end tags describe the data within the
tags, which is considered the value of the
element.
• For example, the following XML element is a
<player> element with the value “Tiger Wood.”
<player>Tiger Wood</player>
There are 3 types of tags
• Start-Tag
– In the example <Sport> is the start tag. It defines
type of the element and possible attribute
specifications
<Player firstname="Wood" lastname="Tiger">
• End-Tag
– In the example </Sport> is the end tag. It
identifies the type of element that the tag is ending.
Unlike the start tag, the end tag cannot contain
attribute specifications.
• Empty Element Tag
– Like a start tag this has attribute specifications, but it
does not need an end tag. It denotes that the element is
empty (does not contain any other elements). Note
the symbol '/' before '>' that marks the tag as ending.
<Player firstname="Wood" lastname="Tiger"/>
XML elements must have a closing
tag
In HTML some elements do not have to have a closing tag.
Rules for Naming Elements
• XML names should start with a letter or the
underscore character.
• Rest of the name can contain letters, digits,
dots, underscores or hyphens.
• No spaces in names are allowed.
• Names cannot start with 'xml' which is a
reserved word.
XML tags are case sensitive
• XML tags are case sensitive. The tag <Message> is
different from the tag <message>.
<message>This is correct</message>
<Message>This is incorrect</message>
All XML elements must be
properly nested
In HTML some elements can be improperly nested within each other
like this:
In XML all elements must be properly nested within each other like this
XML documents must have a root tag
• Documents must contain a single tag pair to
define the root element.
• All other elements must be nested within the root
element.
• All elements can have sub (children) elements.
• Sub elements must be in pairs and correctly
nested within their parent element:
<root>
<child>
<subchild>
</subchild>
</child>
</root>
XML Attributes
• XML attributes are normally used to describe
XML elements, or to provide additional
information about elements.
• An element can optionally contain one or more
attributes. An attribute is a name-value pair
separated by an equal sign (=).
• Usually, or most common, attributes are used to
provide information that is not a part of the
content of the XML document.
• Often the attribute data is more important to the
XML parser than to the reader.
XML Attributes (cont’d)
• Attributes are always contained within the start
tag of an element. Here are some examples:
<Player firstname="Wood" lastname="Tiger" />
Player - Element Name
Firstname - Attribute Name
Wood - Attribute Value
• HTML examples:
<img src="computer.gif">
<a href="demo.asp">
• XML examples:
<file type="gif">
<person id="3344">
Attribute values must always be
quoted
• XML elements can have attributes in name/value
pairs just like in HTML.
• An element can optionally contain one or more
attributes.
• In XML the attribute value must always be
quoted.
• An attribute is a name-value pair separated by an
equal sign (=).
<CITY ZIP="01085">Westfield</CITY>
• ZIP="01085" is an attribute of the <CITY> element.
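With ElementTree, the attribute from the CITY example is read through the element's `attrib` dictionary:

```python
# Reading an attribute value from the CITY example above.
import xml.etree.ElementTree as ET

city = ET.fromstring('<CITY ZIP="01085">Westfield</CITY>')
print(city.attrib["ZIP"])  # 01085
print(city.text)           # Westfield
```

Note that the attribute value comes back as a string; the leading zero in "01085" survives precisely because it was quoted attribute text, not a number.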
What is a Comment ?
• Comments are informational help for the
reader.
What is a DTD ?
• A Document Type Definition (DTD) is a
mechanism (set of rules) to describe the
structure, syntax, and vocabulary of XML
documents.
Document Type Definition (DTD)
• Define the legal building blocks of an XML
document.
• Set of rules to define document structure with a
list of legal elements.
• Declared inline in the XML document or as an
external reference.
• All names are user defined.
• Derived from SGML.
• One DTD can be used for multiple documents.
• Has ASCII format.
• DOCTYPE keyword.
Element Declaration
• The following lines show the possible syntaxes for
element declarations:
<!ELEMENT reports (employee*)>
<!ELEMENT employee (ss_number, first_name,
middle_name, last_name, email, extension, birthdate,
salary)>
<!ELEMENT email (#PCDATA)>
<!ELEMENT extension EMPTY>
#PCDATA - Parsed Character Data, meaning that the
element can contain text. No child elements may be
present in an element declared with #PCDATA.
EMPTY - Indicates a leaf element, which cannot
contain any children.
Occurrence
• There are notations to specify the number of
occurrences that a child element can occur within
the parent element.
• These notations can appear at the end of each
child element
+ - Element can appear one or several times
* - Element can appear zero or several times
? - Element can appear zero or one time
nothing - Element can appear only once (also, it
must appear once)
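These occurrence indicators behave like regular-expression quantifiers, so a content model can be checked against a sequence of child-element names with ordinary regex matching. This is an illustrative analogy only, not a real DTD validator; the element names are taken from the declarations above.

```python
# DTD occurrence indicators as regex quantifiers (* + ? and "nothing").
import re

# <!ELEMENT reports (employee*)> -- employee may appear zero or more times
star = re.compile(r"(employee )*")
print(bool(star.fullmatch("")))                    # True: zero occurrences allowed
print(bool(star.fullmatch("employee employee ")))  # True

# '+' requires at least one occurrence
plus = re.compile(r"(item )+")
print(bool(plus.fullmatch("")))       # False: '+' needs one occurrence
print(bool(plus.fullmatch("item ")))  # True
```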
Separators
, - Elements on the right and left of
comma must appear in the same
order.
<?xml version="1.0"?>
<TITLE>
<title>A Well-formed Document</title>
<first>
This is a simple
<bold>well-formed</bold>
document.
</first>
</TITLE>
Source: L1.xml
Rules for Well-formed
Documents
• The first line of a well-formed XML document
must be an XML declaration.
• All non-empty elements must have start tags and
end tags with matching element names.
• All empty elements must end with />.
• All documents must contain exactly one root element.
• Nested elements must be completely nested
within their higher-level elements.
• The only predefined entity references are &amp;,
&lt;, &gt;, &apos;, and &quot;.
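A standard XML parser rejects documents that break these rules, so ElementTree can serve as a quick well-formedness check (the sample strings below are made up for the sketch):

```python
# Checking well-formedness by attempting to parse.
import xml.etree.ElementTree as ET

def well_formed(doc):
    """Return True iff the string parses as well-formed XML."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False

print(well_formed("<root><child/></root>"))      # True
print(well_formed("<b><i>bad nesting</b></i>"))  # False: improper nesting
print(well_formed("<note>no end tag"))           # False: missing </note>
```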
DTD Graph
• Given the DTD of the XML to be stored, we can create a structure
called the DTD graph that mirrors the structure of the DTD. Each node
in the DTD graph represents an XML element (drawn as a rectangle), an
XML attribute (a semicircle), or an operator (a circle). They are put
together in a hierarchical containment under a root element node, with
element nodes under a parent element node, separated by occurrence
indicators in circles.
Extended Entity Relationship Model DTD Graph
[Figure: an Extended Entity Relationship model for a customer sales application (entities Invoice, Customer, Item, and Monthly Sales with their attributes, relationships, and cardinalities) mapped to the corresponding DTD graph.]
A Sample DTD graph for Customer Sales
[Figure: DTD graph with root element Sales (occurrence +) containing child elements Invoice, Customer, Monthly Sales, and Item (each with occurrence *), their attributes (e.g., Invoice_no, Customer_no, Item_no, Year, Month, Quantity, Total), and ID/IDREF links between them.]
The mapped DTD from DTD Graph
<!ELEMENT Sales (Invoice*, Customer*, Item*, Monthly_sales*)>
<!ATTLIST Sales
Status (New | Updated | History) #REQUIRED>
<!ELEMENT Invoice (Invoice_item*)>
<!ATTLIST Invoice
Invoice_no CDATA #REQUIRED
Quantity CDATA #REQUIRED
Invoice_amount CDATA #REQUIRED
Invoice_date CDATA #REQUIRED
Shipment_date CDATA #IMPLIED
Customer_idref IDREF #REQUIRED>
<!ELEMENT Customer (Customer_address*)>
<!ATTLIST Customer
Customer_id ID #REQUIRED
Customer_name CDATA #REQUIRED
Customer_no CDATA #REQUIRED
Sex CDATA #IMPLIED
Postal_code CDATA #IMPLIED
Telephone CDATA #IMPLIED
Email CDATA #IMPLIED>
<!ELEMENT Customer_address EMPTY>
<!ATTLIST Customer_address
Address_type (Home|Office) #REQUIRED
Address NMTOKENS #REQUIRED
City CDATA #IMPLIED
State CDATA #IMPLIED
Country CDATA #IMPLIED
Customer_idref IDREF #REQUIRED
Is_default (Y|N) "Y">
<!ELEMENT Invoice_item EMPTY>
<!ATTLIST Invoice_item
Quantity CDATA #REQUIRED
Unit_price CDATA #REQUIRED
Invoice_price CDATA #REQUIRED
Discount CDATA #REQUIRED
Item_idref IDREF #REQUIRED>
<!ELEMENT Item EMPTY>
<!ATTLIST Item
Item_id ID #REQUIRED
Item_name CDATA #REQUIRED
Author CDATA #IMPLIED
Publisher CDATA #IMPLIED
Item_price CDATA #REQUIRED>
<!ELEMENT Monthly_sales (Item_sales*, Customer_sales*)>
<!ATTLIST Monthly_sales
Year CDATA #REQUIRED
Month CDATA #REQUIRED
Quantity CDATA #REQUIRED
Total CDATA #REQUIRED>
<!ELEMENT Item_sales EMPTY>
<!ATTLIST Item_sales
Quantity CDATA #REQUIRED
Total CDATA #REQUIRED
Item_idref IDREF #REQUIRED>
<!ELEMENT Customer_sales EMPTY>
<!ATTLIST Customer_sales
Quantity CDATA #REQUIRED
Total CDATA #REQUIRED
Customer_idref IDREF #REQUIRED>
Review Question 1
What are the similarities and differences
between a DTD and a well-formed document?
Tutorial question 1
Map the following Extended Entity Relationship Model into
a DTD graph and a Document Type Definition (DTD).
[EER model: Department (Department_ID, Salary) 1—has—n Trip (Trip_ID); Trip 1—taken—n Car_rental (Staff_ID, Car_model).]
Reading assignment
Chapter 3 Schema Translation in “Information
Systems Reengineering and Integration”
Second Edition, by Joseph Fong, published
by Springer, 2006, pp.142-154.
Extended Entity Relationship (EER) Model
Example
A president leads a nation.
Relational Model:
Relation President (President_name, Race, *Nation_name)
Relation Nation (Nation_name, Nation_size)
Where underlined are primary keys and "*" prefixed are foreign keys
2023/2/5 2
Cardinality: Many-to-one relationship
A many-to-one relationship from set A to set B is defined as: For all a in A, there exists at most one b in B
such that a and b are related, and for all b in B, there exists zero or more a in A such that a and b are related.
Relational Model:
Relation Director (Director_name, Age)
Relation Movies (Movie_name, Sales_volume, *Director_name)
Cardinality: Many-to-many relationship
A many-to-many relationship between set A and set B is defined as: For all a in A, there exists zero
or more b in B such that a and b are related, and vice versa.
Example
Many students take many courses such that a student can take many courses and a course can be taken by
many students.
Relational Model:
Relation Student (Student_id, Student_name)
Relation Course (Course_id, Course_name)
Relation take (*Student_id, *Course_id)
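The three relations above can be rendered as SQL tables, with the `take` relation carrying two foreign keys that together form its primary key. This is a sketch using sqlite3 with invented sample rows; the table and column names follow the relational model above.

```python
# Many-to-many relationship as a relation with a composite key.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.execute("CREATE TABLE Student (Student_id TEXT PRIMARY KEY, Student_name TEXT)")
con.execute("CREATE TABLE Course (Course_id TEXT PRIMARY KEY, Course_name TEXT)")
con.execute("""CREATE TABLE take (
    Student_id TEXT REFERENCES Student,
    Course_id  TEXT REFERENCES Course,
    PRIMARY KEY (Student_id, Course_id))""")

con.execute("INSERT INTO Student VALUES ('s1', 'Ann'), ('s2', 'Bob')")
con.execute("INSERT INTO Course VALUES ('c1', 'Databases')")
con.execute("INSERT INTO take VALUES ('s1', 'c1'), ('s2', 'c1')")  # one course, many students

rows = con.execute("""SELECT Student_name FROM Student
    JOIN take USING (Student_id) WHERE Course_id = 'c1'
    ORDER BY Student_name""").fetchall()
print(rows)  # [('Ann',), ('Bob',)]
```

Because the composite primary key is (Student_id, Course_id), the same student can take many courses and the same course can be taken by many students, but each pairing appears at most once.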
Data Semantic: Is-a (Subtype) relationship
The relationship A isa B is defined as: A is a special kind of B.
Example
Father is Male.
Relational Model:
Relation Male (Name, Height)
Relation Father (*Name, Birth_date)
Data Semantic: Disjoint Generalization
-Generalization is to classify similar entities into a single entity. More than one is-a
relationship can form data abstraction (i.e. super-class and subclasses) among entities.
- A subclass entity is a subset of its super-class entity. There are two kinds of generalization.
- The first is disjoint generalization such that subclass entities are mutually exclusive.
- The second is overlap generalization such that subclass entities can overlap each other.
Relational Model:
Relation Boat_person (Name, Birth_date,
Birth_place)
Relation Refugee (*Name, Open_center)
Relation Non-refugee (*Name, Detention_center)
Data Semantic: Overlap Generalization
Example of Overlap Generalization
A computer programmer and a system analyst can both be a computer professional, and a computer
programmer can also be a system analyst, and vice versa.
Relational Model:
Relation Computer_professional (Employee_id, Salary)
Relation Computer_programmer (*Employee_id, Language_skill)
Relation System_analyst (*Employee_id, Application_system)
Data Semantic: Categorization Relationship
In some cases the need arises to model a single superclass/subclass relationship with more
than one superclass, where the superclasses represent different entity types. In this case, we
call the subclass a category.
Relational Model:
Relation Department (Borrower_card, Department_id)
Relation Doctor (Borrower_card, Doctor_name)
Relation Hospital (Borrower_card, Hospital_name)
Relation Borrower (*Borrower_card, Return_date, File_id)
Data Semantic: Aggregation Relationship
Aggregation is a method to form a composite object from its components. It aggregates
attribute values of an entity to form a whole entity.
Example
The process of a student taking a course can form a composite entity (aggregation) that may be
graded by an instructor if the student completes the course.
Relational Model:
Relation Student (Student_no, Student_name)
Relation Course (Course_no, Course_name)
Relation Takes (*Student_no, *Course_no, *Instructor_name)
Relation Instructor (Instructor_name, Department)
Extended Entity Relationship Model
Data Semantic: Total Participation
An entity is in total participation with another entity provided that all data occurrences of the
entity must participate in a relationship with the other entity.
Example
An employee must be hired by a department.
Relational Model:
Relation Department (Department_id, Department_name)
Relation Employee (Employee_id, Employee_name, *Department_id)
Data Semantic: Partial Participation
An entity is in partial participation with another entity provided that the data occurrences of the
entity do not all participate in a relationship with the other entity.
Example
An employee may be hired by a department.
Relational Model:
Relation Department (Department_id, Department_name)
Relation Employee (Employee_no, Employee_name, &Department_id)
Where & means that null value is allowed
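Total vs. partial participation maps naturally onto column constraints: a NOT NULL foreign key enforces total participation, while a nullable foreign key allows partial participation. A sqlite3 sketch with invented sample rows:

```python
# Total participation: every employee must reference a department,
# so the foreign key column is declared NOT NULL.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Department (Department_id TEXT PRIMARY KEY, Department_name TEXT)")
con.execute("""CREATE TABLE Employee (
    Employee_id TEXT PRIMARY KEY, Employee_name TEXT,
    Department_id TEXT NOT NULL REFERENCES Department)""")

con.execute("INSERT INTO Department VALUES ('d1', 'Sales')")
con.execute("INSERT INTO Employee VALUES ('e1', 'Ann', 'd1')")  # accepted
try:
    con.execute("INSERT INTO Employee VALUES ('e2', 'Bob', NULL)")
except sqlite3.IntegrityError:
    print("rejected: employee without a department")
```

Dropping the NOT NULL from Department_id turns the same schema into the partial-participation case above, where the `&` notation marks the nullable foreign key.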
Data Semantic: Weak Entity
The existence of a weak entity depends on its strong entity.
Example
A hotel room must concatenate hotel name for identification.
Relational Model:
Relation Hotel (Hotel_name, Ranking)
Relation Room (*Hotel_name, Room_no, Room_size)
Cardinality: N-ary Relationship
Multiple entities relate to each other in an n-ary relationship.
Example
Employees use a wide range of different skills on each project they are associated with.
Relational Model:
Relation Engineer (Employee_id, Employee_name)
Relation Skill (Skill_name, Years_experience)
Relation Project (Project_id, Start_date, End_date)
Relation Skill_used (*Employee_id, *Skill_name, *Project_id)
Architecture of multiple databases integration
[Figure: relational databases 1..n are translated into local schemas and merged into an integrated schema serving a global database.]
• Secondary relation: a relation whose primary key is
fully or partially formed by concatenation of primary
keys of other relations.
– Secondary relation - Type 1 (SR1). If the key of the
secondary relation is formed by concatenation of primary
keys of primary relations, it is of Type 1 or SR1.
– Secondary relation - Type 2 (SR2). Secondary relations that
are not of Type 1.
• Key attribute - Primary (KAP). This is an attribute
in the primary key of a relation, and is also a
foreign key of another relation.
• Key attribute - General (KAG). These are all the
other primary key attributes in a secondary
relation that are not of the KAP type.
• Foreign key attribute (FKA). This is a non-primary-
key attribute of a primary relation that is a foreign
key.
• Nonkey attribute (NKA). The rest of the non-
primary-key attributes.
Step 3 Map each PR2 into a subclass entity or
weak entity
[Figure: Case 1 maps a PR2 to a weak entity B related 1:n to its strong entity; Case 2 maps a PR2 to a subclass entity B under its superclass via an ISA relationship.]
Step 4 Map SR1 into binary/n-ary relationship
[Figure: the concatenated key attributes of an SR1 become the participating entities of a binary/n-ary relationship R.]
Step 5 Map SR2 into binary/n-ary relationship
[Figure: an SR2 becomes an entity C related to the entities of its key attributes through relationships R1 and R2 with n:1 and m:1 cardinalities.]
Step 6 Map each FKA into relationship
Relations and attributes classification table
Relation | Rel  | Primary Key               | KAP                       | KAG       | FKA     | NKA
Name     | Type |                           |                           |           |         |
DEPT     | PR1  | Dept#                     |                           |           |         | Dept_name
INST     | PR2  | Dept#, Inst_name          | Dept#                     | Inst_name |         | Inst_addr
COUR     | PR1  | Course#                   |                           |           |         | Course_location
STUD     | PR1  | Student#                  |                           |           |         | Stud_name
PREP     | PR1  | Prer#                     |                           |           | Course# | Prer_title
SECT     | SR2  | Course#, Dept#, Section#, | Course#, Dept#,           | Section#  |         |
         |      | Inst_name                 | Inst_name                 |           |         |
GRADE    | SR1  | Inst_name, Course#,       | Inst_name, Course#,       |           |         | Grade
         |      | Student#, Dept#, Section# | Student#, Dept#, Section# |           |         |
Step 2. Map each PR1 into entity
Step 3. Map each PR2 into weak entity.
[Figure: Department (Dept#, Dept_name) 1—hire—n Instructor (Dept#, Inst_name, Inst_addr), with Instructor as a weak entity.]
Step 4. Map SR1 into binary/n-
ary relationship.
[Figure: Student m—grade—n Section.]
Step 5. Map SR2 into binary/n-ary
relationship
[Figure: Instructor (Dept#, Inst_name, Inst_addr) 1—teach—n Section and Course (Course#, Course_Location) 1—has—n Section, with Section identified by Dept#, Inst_name, Course#, and Section#.]
Step 6. Map each FKA into relationship
Step 7. Map each inclusion dependency
into semantics (binary/n-ary relationship)
Given derived inclusion dependency          | Derived semantics
Instructor.Dept# ⊆ Department.Dept#         | n:1 relationship between entities Instructor and Department
Section.Dept# ⊆ Department.Dept#            |
Section.Inst_name ⊆ Instructor.Inst_name    | 1:n relationships between entities Instructor and Section,
Section.Course# ⊆ Course.Course#            | and between Course and Section
Grade.Dept# ⊆ Section.Dept#                 |
Grade.Inst_name ⊆ Section.Inst_name         |
Grade.Course# ⊆ Section.Course#             |
Grade.Student# ⊆ Student.Student#           | m:n relationship between relationship Section and entity Student
Prerequisite.Course# ⊆ Course.Course#       |
Course.Prer# ⊆ Prerequisite.Prer#           | 1:1 relationship between Course and Prerequisite
Step 8. Draw EER model.
[Figure: resulting EER model fragment — Department (Dept#, Dept_name) 1—hire—n Instructor; Prerequisite (Prer#, Prer_title) 1—pre-course—1 Course.]
Review question 2
What are the major differences between
Generalization and Categorization in terms
of data volume (data occurrences) in their
related superclass entity/entities and
subclass entity/entities?
Tutorial Question 2
In a reverse engineering approach, translate the following relational
schema to an Entity-relationship model.
where underlined are primary keys and prefixed with ‘*’ are foreign keys.
Step 1 Resolve conflicts among EER model
[Figure: Customer (name, date) in one schema and Loan Borrower (name, date) in another; the conflicting attribute "date" is resolved by renaming it open-acct-date in one schema and loan-begin-date in the other.]
Resolve conflicts on data types
[Figure: an attribute in one schema that appears as an entity in another is resolved by converting the attribute into an entity related by a relationship R, with 1:n, 1:1, or m:n cardinality as appropriate.]
Example
[Figure: in schema A, "customer name" is an attribute of Loan Contract; in schema B, Customer is an entity. In the transformed schema A', Customer becomes an entity related to Loan Contract by a "book" relationship (shown for the n:1, 1:1, and n:m cases).]
Resolve conflicts on key
Example
[Figure: Customer is keyed by name in schema A and by customer# in schema B; the transformed schema keys Customer by customer#, keeping name as an attribute.]
Resolve conflicts on cardinality
[Figure: a relationship R between Entity x and Entity y that is 1:1 in one schema and 1:n in another is resolved to 1:n in the integrated schema.]
Resolve conflicts on weak entity
[Figure: an entity that is weak in one schema and regular in the other is merged into a single entity in the transformed schema.]
Resolve conflicts on subtype entity
[Figure: a subtype conflict between schemas A and B is resolved so that the 1:1 relationship R between Entity x and Entity y is preserved in the integrated schema X.]
Example
2023/2/5 15
Step 2 Merge entities
Example
[Figure: Customer (customer#, address, phone) from schema A and Customer (customer#, age) from schema B merge into Customer (customer#, address, phone, age).]
Merge EER models by Generalization
[Figure: entities sharing a key merge under a generalized superclass. Example 1: Local Customer (name, local_addr) and Overseas Customer (name, overseas_addr) become subclasses of Customer under a disjoint generalization. Example 2: Commercial Loan Borrower (name, loan_amt, prime_rate) and Mortgage Loan Borrower (name, loan_amt, mortgage_rate) become subclasses of Loan Borrower (loan_amt) under an overlap generalization.]
Merge EER models by Subtype
Relationship
Example
[Figure: Interest Rate (key K) from schema A and Fixed Interest Rate (key K) from schema B merge via a subtype relationship: Fixed Rate isa Rate.]
Merge entities by Aggregation
[Figure: Entity x from schema A and the relationship Entity y 1—R—n Entity z from schema B merge by aggregation: the aggregate of Entity y—R1—Entity z is related to Entity x through a relationship R2.]
Example
[Figure: Customer 1—book—n Loan Contract (schema B) is aggregated and related to Loan Security (schema A) through a 1—secured by—n relationship.]
Merge entities by categorization
[Figure: superclasses from schema A and an entity from schema B merge so that Entity Z becomes a category under multiple superclasses.]
Example
[Figure: Mortgage loan and Commercial loan (schema A) and Loan Contract (schema B) merge by categorisation: Loan Contract becomes a category of superclasses Mortgage loan and Commercial loan.]
Merge entities by Implied Binary
Relationship
[Figure: an attribute AttrX shared by Entity a (schema A) and Entity b (schema B) implies a binary relationship; AttrX is replaced by a relationship R between the entities (n:1 when AttrX is a key of Entity b only, 1:1 when it is a key of both).]
Example
(Diagram) Loan Contract (loan#, customer#) in Schema A and Customer (customer#) in Schema B share customer#, implying an n:1 "book" relationship between Loan Contract (X1) and Customer (X2). When customer# is unique within Loan Contract, the implied "book" relationship is 1:1.
Step 3 Merge relationships
Merge relationships by subtype relationship
(Diagram) Schema A relates EntityX 1:1 (R) to EntityY; Schema B relates EntityX 1:1 (R) to EntityY. In the merged Schema X, EntityX (X1) is related 1:1 by R to EntityZ (X3), and EntityY (X2) isa EntityZ.
Example
Merge relationships by overlap generalization
(Diagram) Schema A has EntityX 1:n R1 EntityY; Schema B has EntityX 1:n R2 EntityY. In the merged Schema X, EntityX (X1) keeps R1 to EntityZ1 (X3) and R2 to EntityZ2 (X4), and EntityZ1 and EntityZ2 are generalised with overlap (O) into EntityY (X2).
Example
(Diagram) Schema A has Bank 1:n home loan Customer; Schema B has Bank 1:n auto loan Customer. In the merged Schema X, Bank (X1) keeps home loan to Home loan borrower (X3) and auto loan to Auto loan borrower (X4), which are generalised with overlap (o = overlap generalisation) into Customer (X2).
Absorbing a Lower-Degree Relationship into a Higher-Degree Relationship
(Diagram) Schema A has an m:n:n ternary relationship R1 among EntityX, EntityY and EntityZ; Schema B has EntityX 1:n R2 EntityY. In the merged Schema X, the binary R2 is absorbed into the ternary R1 among EntityX (X1), EntityY (X2) and EntityZ (X3).
Example
Case Study: In a bank, there are existing databases with different
schemas: one for the customers, another for the mortgage loan contracts
and a third for the index interest rate. However, there is a need to
integrate them for an international banking loan system.
(Diagram: the three existing schemas)
- Mortgage loan contract (loan contract#, begin_date, mature_date, loan-status), related 1:n to Customer (ID#, customer_name, date) and 1:1 or 1:n to the entities below
- Mortgage loan drawdown (loan contract#, drawdown_date, drawdown_amt)
- Mortgage loan type (loan contract#)
- Mortgage loan interest (loan contract#, Interest_effective_date, fixed_rate, accrued_interest)
- Loan account balance (account#, loan_bal_date, balance_amt)
- Mortgage loan history (past loan#, past_loan_status)
- Mortgage loan repayment (loan contract#, repayment_date, repayment)
- Index interest type (Interest_effective_date, index_rate, accrued interest type)
Step 2 Merge entities by Implied binary relationship and Generalization
Since the data item ID# appears in both entity Mortgage Loan Contract and entity Customer, they are in one-to-many cardinality with the foreign key on the "many" side.
(Diagram) Mortgage loan contract (loan contract#, begin_date, mature_date) m:1 book Customer (ID#, customer_name, date)
Since Fixed and Index Interest Rate both have the same key, Interest_effective_date, they can be generalized into disjoint subtypes: either fixed or index rate.
(Diagram) Loan interest type (Interest_effective_date, Interest_type) with disjoint (d) subtypes: fixed interest rate (Interest_effective_date, fixed_rate, accrued_interest) and index interest type (Interest_effective_date, Index_rate, Index_name, accrued_interest).
Thus, we can derive cardinality from the implied relationship between
these entities, and integrate the two schemas into one EER model.
(Diagram: integrated EER model) Mortgage loan contract (loan contract#, begin_date, mature_date) m:1 book Customer (ID#, customer_name, date); the loan contract is related 1:1 to fixed interest rate (Loan_contract#, Interest_effective_date, fixed_rate, accrued_interest) and to index interest type (Loan_contract#, Interest_effective_date, Index_rate, accrued_interest).
Reading assignment
“Information Systems Reengineering and
Integration” by Joseph Fong, published by
Springer Verlag, 2006, pp. 282-310.
Review question 3
Can two Relational Schemas be integrated into one
Relational Schema? Explain your answer.
Tutorial question 3
Provide an integrated schema for the following two views, which are merged to create a bibliographic database. During
identification of correspondences between the two views, the users discover the following:
RESEARCHER and AUTHOR are synonyms,
CONTRIBUTED_BY and WRITTEN_IN are synonyms,
ARTICLE belongs to a SUBJECT,
ARTICLE and BOOK can be generalized as PUBLICATION.
Hint: given that two subclass entities have the same relationship(s), the two subclass entities can be generalized into a
superclass entity, and the subclass relationship(s) can also be generalized into a superclass relationship.
View 1
(Diagram) ARTICLE (Title, Size) n:1 Published_in JOURNAL (Volume, Number); ARTICLE Contributed_by RESEARCHER (Full_Name).
View 2
(Diagram) BOOK (Title, Publisher) n:1 Belongs_to SUBJECT (Classification_id, Name); BOOK n:m Written_in AUTHOR (Full_Name).
Methodology for Data warehousing with OLAP
2008/2/19 1
Data conversion: Customized Program Approach
(Diagram) Definitions of source, target, and mapping drive an interpretive transformer that converts the source database directly into the target database.
Interpretive transformer
For instance, the following Cobol structure
(Figure: a three-level Cobol record structure rooted at Level 0 PERSON, with fields such as NAME and LIC#.)
To translate the above three-level structure into the following two-level data structure:
(Figure: the two-level structure rooted at Level 0 PERSON.)
Translator generator
(Diagram) Definitions of source, target, and mapping are fed to a translator generator, which produces a specialized program that converts the source database into the target database, using a convert catalog.
(Diagram) A CONVERT program (Statement 1, Statement 2, ...) is processed in a compiler phase: the DEFINE compiler and the Restructurer produce PL/1 procedures (COP 1, COP 2, ...), driven by the CONVERT catalogue and an execution schedule.
As an example, consider the following hierarchical database:
(Figure: a hierarchical database with segments such as DEPT, EMP and PROJ, and fields such as ITEMNO and DESC.)
Its DEFINE statements can be described in the following where
for each DEFINE statement, a code is generated to allocate a
new subtree in the internal buffer.
GROUP DEPT:
OCCURS FROM 1 TIMES;
FOLLOWED BY EOF;
PRECEDED BY HEX '01';
:
END EMP;
GROUP PROJ:
OCCURS FROM 0 TIMES;
PRECEDED BY HEX '03';
:
END PROJ;
END DEPT;
For each user-written CONVERT statement, we can produce a customized
program. Take the DEPT group from the above DEFINE statements:
(Diagram) source relational schema → target relational schema; selection?
Business requirements
A company has two regions A and B.
•Each region forms its own departments.
•Each department approves many trips in a year.
•Each staff makes many trips in a year.
•In each trip, a staff needs to hire cars for
transportation.
•Each hired car can carry many staff for each trip.
•A staff can be either a manager or an engineer.
Data Requirements
Where ID = Inclusion Dependence, Underlined are primary keys and “*” prefixed are foreign keys.
Extended Entity Relationship Model for database Trip
(Diagram: EER model) Region_A and Region_B (each with Department_id, Classification) are generalized into Department (Department_id, Salary). Department 1:m R1 Trip (Trip_id); Trip m:1 R2 People (Staff_id, Name, DOB); People m:n Assignment Car (Car_model, Size, Description); Manager (Staff_id, Title) and Engineer (Staff_id, Title) are subclasses of People.
Schema for target relational database Trip
Create Table Car
(CAR_MODEL character (10),
SIZE character (10),
DESCRIPT character (20),
STAFF_ID character (5),
primary key (CAR_MODEL))
UPDATE trip_ora52.TRIP
SET DEPT_ID= 'AA001'
, CAR_MODEL= 'MZ-18'
, STAFF_ID= 'B001'
WHERE TRIP_ID='T0002'

UPDATE trip_ora52.TRIP
SET DEPT_ID= 'AB001'
, CAR_MODEL= 'SA-38'
, STAFF_ID= 'A002'
WHERE TRIP_ID='T0003'

UPDATE trip_ora52.TRIP
SET DEPT_ID= 'BB001'
, CAR_MODEL= 'SA-38'
, STAFF_ID= 'D002'
WHERE TRIP_ID='T0004'
Data Integration
Step 1: Merge by Union
(Tables) Relation Ra (A1, A2) with rows (a11, a21) and (a12, a22) is merged by union with a second relation into Relation Rx (A1, A2, A3); attribute values missing in a source relation are padded with null, e.g. row (a11, a21, null).
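The union step can be sketched in code. A hypothetical Python sketch (not from the slides), assuming each relation is a list of dicts keyed by attribute name:

```python
# Sketch: "merge by union" over relations with different attribute sets --
# take the union of the columns and pad missing values with None (null).
def merge_by_union(ra, rb):
    columns = sorted({k for row in ra + rb for k in row})
    return [{c: row.get(c) for c in columns} for row in ra + rb]

ra = [{"A1": "a11", "A2": "a21"}, {"A1": "a12", "A2": "a22"}]
rb = [{"A1": "a14", "A3": "a32"}]   # made-up second relation
rx = merge_by_union(ra, rb)
# every Rx row now has columns A1, A2, A3, with None where no value exists
```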
Step 2: Merge classes by
generalization
(Tables) Relations Ra (A1, A2) and Rb are merged by generalization into a superclass relation Rx (A1, A2, A3) with subclass relations Rx1 and Rx2; values not present in a source relation become null.
Step 3: Merge classes by
inheritance
(Tables) Relation Rb (A1, A2) and Relation Ra (A1, A3) are merged by inheritance: subclass relation Rxb (A1, A2, A3) inherits the attributes of Rxa (A1, A3).
Step 4 Merge classes by
aggregation
(Tables) Relation Ra (A1, A2) and Relation Rx1 (A1, A3) are merged by aggregation: Relation Rx' (A1, A3, *A5) carries the foreign key of the aggregate, e.g. rows (a11, a31, a51) and (a12, a32, a51).
Step 5 Merge classes by
categorization
(Tables) Relation Ra (A1, A2) with rows (a11, a21) and (a12, a22) is merged by categorization with Relation Rx (A1, A4), which holds the category row (a11, a41).
Step 6 Merge classes by implied
relationship
Step 7 Merge relationship by
subtype
Relation Ra (A1, A2):   a11 a21 / a12 a22
Relation Rb (A3, *A1):  a31 a11 / a32 a12 / a33 null
Relation Xa (A1, A2):   a11 a21 / a12 a22
Relation Xc (*A3, *A1): a31 a11 / a32 a12
Data conversion from relational into XML
Step 3:
(Diagram) Relational Database → Data Conversion → XML Document
Step 1: Reverse Engineering Relational
Schema into an EER Model
Step 2: Data Conversion from Relational into
XML document
Step 2.1 Defining a Root Element
To select a root element, we must put its relevant information into an XML schema. Relevance concerns the entities that are related to the entity selected by the user. The relevant classes include the selected entity and all its relevant entities that are navigable from it.
(Diagram) In the EER model, the selected entity EntityA is related through relationships R1-R7 to EntityB ... EntityH. In the mapped XML view, the selected entity becomes the root Element E, and the relevant, navigable entities become nested elements (Element A, B, C, D, F, G, H), with "*" marking repeating elements.
Step 2.1 Mapping Cardinality
from RDB to XML
One-to-one cardinality
(Diagram) EER model: Entity A (A1, A2) 1:1 Entity B (B1, B2). Schema translation yields a DTD graph in which Element A (A1, A2) has a single child Element B (B1, B2).
One-to-many cardinality
(Diagram) EER model: Entity A (A1, A2) 1:n R Entity B (B1, B2). Schema translation yields a DTD graph in which Element A (A1, A2) has a repeating ("*") child Element B (B1, B2).
Many-to-many cardinality
(Diagram) EER model: Entity A (A1, A2) m:n R Entity B (B1, B2). Schema translation yields elements A, R and B linked by ID/IDREF attributes (A_id and B_id referenced by A_idref and B_idref).

Relational Schema
Relation A(A1, A2)
Relation B(B1, B2)
Relation R(*A1, *B1)

DTD
<!ELEMENT A EMPTY>
<!ATTLIST A A1 CDATA #REQUIRED>
<!ATTLIST A A2 CDATA #REQUIRED>
<!ATTLIST A A_id ID #REQUIRED>
<!ELEMENT R EMPTY>
<!ATTLIST R A_idref IDREF #REQUIRED>
<!ATTLIST R B_idref IDREF #REQUIRED>
<!ELEMENT B EMPTY>
<!ATTLIST B B1 CDATA #REQUIRED>
<!ATTLIST B B2 CDATA #REQUIRED>
<!ATTLIST B B_id ID #REQUIRED>
Many-to-many cardinality
Relation A (A1, A2): a11 a21 / a12 a22
Relation B (B1, B2): b11 b21 / b12 b22

Data conversion produces the XML document:
<A A1="a11" A2="a21" A_id="1"></A>
<B B1="b11" B2="b21" B_id="2"></B>
<R A_idref="1" B_idref="2"></R>
<A A1="a12" A2="a22" A_id="3"></A>
<B B1="b12" B2="b22" B_id="4"></B>
<R A_idref="3" B_idref="4"></R>
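The many-to-many conversion can be sketched with the standard library. A hypothetical Python sketch (not from the slides): rows of A, B and R become empty elements with ID/IDREF attributes following the DTD above; note the ID numbering here is sequential per table, which differs slightly from the slide's interleaved numbering.

```python
# Sketch: relational rows -> empty XML elements with ID/IDREF attributes.
import xml.etree.ElementTree as ET

A = [("a11", "a21"), ("a12", "a22")]
B = [("b11", "b21"), ("b12", "b22")]
R = [("a11", "b11"), ("a12", "b12")]   # (*A1, *B1) pairs

root = ET.Element("doc")
ids = {}                               # (table, key) -> generated ID value
for n, (a1, a2) in enumerate(A, 1):
    ids["A", a1] = str(n)
    ET.SubElement(root, "A", A1=a1, A2=a2, A_id=str(n))
for n, (b1, b2) in enumerate(B, len(A) + 1):
    ids["B", b1] = str(n)
    ET.SubElement(root, "B", B1=b1, B2=b2, B_id=str(n))
for a1, b1 in R:
    ET.SubElement(root, "R", A_idref=ids["A", a1], B_idref=ids["B", b1])

xml_text = ET.tostring(root, encoding="unicode")
```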
Case Study
Consider a case study of a Hospital Database
System. In this system, a patient can have
many record folders. Each record folder can
contain many different medical records of the
patient. A country has many patients. Once a
record folder is borrowed, a loan history is
created to record the details about it.
Hospital Relational Schema
Relation Patient (HK_ID, Patient_Name)
Relation Record_Folder (Folder_No, Location, *HK_ID)
Relation Medical_Record (Medical_Rec_No,
Create_Date, Sub_Type, *Folder_No)
Relation Borrower (Borrower_No, Borrower_Name)
Relation Borrow (*Borrower_No, *Folder_No)
2008/3/1 43
Table Borrower:
Borrower_no  Borrower_name
B1           Johnson
B11          Choy
B21          Fung
B22          Lok
Step 1 Reverse engineer relational database into an EER model
(Diagram) Medical Record n:1 contain Record Folder; Record Folder n:1 belong Patient; Record Folder m:n Borrow (has/by) Borrower.
Step 2 Translate EER model into
DTD Graph and DTD
In this case study, suppose we are concerned with the patient medical records, so the entity Patient is selected. Then we define a meaningful name for the root element, called Patient_Records. We start from the entity Patient in the EER model and then find the relevant entities for it. The relevant entities include the related entities that are navigable from the parent entity.
Translated Document Type Definition Graph
(Diagram) Patient_Records → Patient → Record_Folder ("*"), which contains Medical_Record ("*") and Borrow ("*") → Borrower.
Translated Document Type Definition
Transformed XML document
<Patient_Records>
<Patient Country_No="C0001" HKID="E3766849" Patient_Name="Smith">
<Record_Folder Folder_No="F_21" Location="Hong Kong">
<Borrow Borrower_No="B1">
<Borrower Borrower_name="Johnson" />
</Borrow>
<Borrow Borrower_No="B11">
<Borrower Borrower_name="Choy" />
</Borrow>
<Borrow Borrower_No="B21">
<Borrower Borrower_name="Fung" />
</Borrow>
<Borrow Borrower_No="B22">
<Borrower Borrower_name="Lok" />
</Borrow>
<Medical_Record Medical_Rec_No="M_311999" Create_Date="Jan-1-1999" Sub_Type="W"></Medical_Record>
<Medical_Record Medical_Rec_No="M_322000" Create_Date="Nov-12-1998" Sub_Type="W"></Medical_Record>
<Medical_Record Medical_Rec_No="M_352001" Create_Date="Jan-15-2001" Sub_Type="A"></Medical_Record>
<Medical_Record Medical_Rec_No="M_362001" Create_Date="Feb-01-2001" Sub_Type="A"></Medical_Record>
</Record_Folder>
<Record_Folder Folder_No="F_24" Location="New Territories">
<Borrow Borrower_No="B22">
<Borrower Borrower_name="Lok" />
</Borrow>
<Medical_Record Medical_Rec_No="M_333333" Create_Date="Mar-03-01" Sub_Type="A"></Medical_Record>
</Record_Folder>
</Patient>
</Patient_Records>
Reading Assignment
Chapter 4 Data Conversion of “Information
Systems Reengineering and Integration”
by Joseph Fong, Springer Verlag, pp.160-
198.
Lecture review question 6
How do you compare the pros and cons of
using “Logical Level Translation Approach”
with “Customized Program Approach” in
data conversion?
CS5483 Tutorial Question 6
Convert the following relational database into an XML document:
Relation Car_rental
Car_model Staff_ID *Trip_ID
MZ-18 A002 T0001
MZ-18 B001 T0002
R-023 B004 T0001
R-023 C001 T0004
SA-38 A001 T0003
SA-38 A002 T0001
Relation Trip
Trip_ID *Department_ID
T0001 AA001
T0002 AA001
T0003 AB001
T0004 BA001
Relation Department
Department_ID Salary
AA001 35670
AB001 30010
BA001 22500
Fall 2004, CIS, Temple University
Lecture 2
Chapter 2: Data Warehousing and
OLAP Technology for Data Mining
What is Data Warehouse?
Data Warehouse—Integrated
◼ Constructed by integrating multiple, heterogeneous
data sources
◼ relational databases, flat files, on-line transaction
records
◼ Data cleaning and data integration techniques are
applied.
◼ Ensure consistency in naming conventions, encoding
structures, attribute measures, etc. among different
data sources
◼ E.g., Hotel price: currency, tax, breakfast covered, etc.
◼ When data is moved to the warehouse, it is
converted.
Data Warehouse—Time Variant
Data Warehouse—Non-Volatile
Data Warehouse vs. Heterogeneous DBMS
Data Warehouse vs. Operational DBMS
◼ OLTP (on-line transaction processing)
◼ Major task of traditional relational DBMS
◼ Day-to-day operations: purchasing, inventory, banking,
manufacturing, payroll, registration, accounting, etc.
◼ OLAP (on-line analytical processing)
◼ Major task of data warehouse system
◼ Data analysis and decision making
◼ Distinct features (OLTP vs. OLAP):
◼ User and system orientation: customer vs. market
◼ Data contents: current, detailed vs. historical, consolidated
◼ Database design: ER + application vs. star + subject
◼ View: current, local vs. evolutionary, integrated
◼ Access patterns: update vs. read-only but complex queries
Why Separate Data Warehouse?
◼ High performance for both systems
◼ DBMS— tuned for OLTP: access methods, indexing,
concurrency control, recovery
◼ Warehouse—tuned for OLAP: complex OLAP queries,
multidimensional view, consolidation.
◼ Different functions and different data:
◼ Decision support requires historical data which
operational DBs do not typically maintain
◼ Decision Support requires consolidation (aggregation,
summarization) of data from heterogeneous sources
◼ Different sources typically use inconsistent data
representations, codes and formats which have to be
reconciled
Chapter 2: Data Warehousing and
OLAP Technology for Data Mining
A Multi-Dimensional Data Model
(Figure: a 3-D data cube with a Country dimension (Canada, Mexico, ...) and "sum" cells along each dimension.)
4-D Data Cube
(Figure: a 4-D data cube shown as a series of 3-D cubes, one per supplier: Supplier 1, Supplier 2, Supplier 3.)
Cube: A Lattice of Cuboids
(Figure: lattice of cuboids for dimensions time, item, location, supplier)
◼ 0-D (apex) cuboid: all
◼ 3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
◼ 4-D (base) cuboid: time, item, location, supplier
Conceptual Modeling of Data Warehouses
(Figure: schema for a sales data warehouse)
Sales Fact Table: time_key, item_key, branch_key, location_key, units_sold, dollars_sold, avg_sales (Measures)
Dimension tables:
◼ time (time_key, day, day_of_the_week, month, quarter, year)
◼ item (item_key, item_name, brand, type, supplier_key)
◼ supplier (supplier_key, supplier_type)
◼ branch (branch_key, branch_name, branch_type)
◼ location (location_key, street, city_key)
◼ city (city_key, city, province, country)
Example of Fact Constellation
(Figure: fact constellation) A Shipping Fact Table (time_key, item_key, shipper_key, from_location, ...) shares the time and item dimension tables with the Sales Fact Table.
(DMQL fragment) ... <dimension_name_first_time> in cube <cube_name_first_time>
Defining a Star Schema in DMQL
Defining a Snowflake Schema in DMQL
Measures: Three Categories
Measure: a function evaluated on aggregated data
corresponding to given dimension-value pairs.
Measures can be:
◼ distributive: if the measure can be calculated in a
distributive manner.
◼ E.g., count(), sum(), min(), max().
◼ algebraic: if it can be computed from arguments obtained
by applying distributive aggregate functions.
◼ E.g., avg()=sum()/count(), min_N(), standard_deviation().
◼ holistic: if it is not algebraic.
◼ E.g., median(), mode(), rank().
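The distinction between distributive and algebraic measures can be shown numerically. A small Python sketch (not from the slides; the partition data is made up): avg() cannot be combined from per-partition averages, but it can be computed from the distributive measures sum() and count() kept per partition.

```python
# Sketch: distributive vs. algebraic measures over partitioned data.
partitions = [[3, 5], [10, 20, 30]]            # made-up partitions

# distributive: combine partial results with the same function
total = sum(sum(p) for p in partitions)        # 8 + 60 = 68

# algebraic: avg from per-partition (sum, count) pairs
pairs = [(sum(p), len(p)) for p in partitions]
avg = sum(s for s, _ in pairs) / sum(c for _, c in pairs)   # 68 / 5 = 13.6

# averaging the per-partition averages gives the WRONG answer
wrong = sum(sum(p) / len(p) for p in partitions) / len(partitions)  # (4+20)/2 = 12.0
```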
Measures: Three Categories
Browsing a Data Cube
◼ Visualization
◼ OLAP capabilities
◼ Interactive manipulation
A Concept Hierarchy
• Concept hierarchies allow data to be handled
at varying levels of abstraction
(Figure: example concept hierarchies, e.g. Office for the location dimension and Day → Month for the time dimension.)
Typical OLAP Operations (Fig 2.10)
◼ Roll up (drill-up): summarize data
◼ by climbing up concept hierarchy or by dimension reduction
◼ Drill down (roll down): reverse of roll-up
◼ from higher level summary to lower level summary or detailed
data, or introducing new dimensions
◼ Slice and dice:
◼ project and select
◼ Pivot (rotate):
◼ reorient the cube, visualization, 3D to series of 2D planes.
◼ Other operations
◼ drill across: involving (across) more than one fact table
◼ drill through: through the bottom level of the cube to its back-
end relational tables (using SQL)
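Roll-up can be sketched as re-aggregation along a concept hierarchy. A hypothetical Python sketch (not from the slides; the city-to-country mapping and sales figures are made up):

```python
# Sketch of roll-up: climb the hierarchy city -> country by re-aggregating.
from collections import defaultdict

hierarchy = {"Vancouver": "Canada", "Toronto": "Canada", "Chicago": "USA"}
sales_by_city = {"Vancouver": 100, "Toronto": 150, "Chicago": 200}

sales_by_country = defaultdict(int)
for city, amount in sales_by_city.items():
    sales_by_country[hierarchy[city]] += amount   # summarize one level up
```

Drill-down is the reverse: it needs the finer-grained data (or a new dimension) and cannot be derived from the rolled-up totals alone.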
Querying Using a Star-Net Model
(Figure: star-net query model) Radial lines, one per dimension, meet at a center; each circle on a line is called a footprint (an abstraction level). The footprints shown include: Customer Orders (CONTRACTS, ORDER), Shipping Method (AIR-EXPRESS, TRUCK), Time (ANNUALLY, QTRLY, DAILY), Product (PRODUCT LINE, PRODUCT ITEM, PRODUCT GROUP), Location (CITY, COUNTRY, REGION), Organization (SALES PERSON, DISTRICT, DIVISION), and Promotion.
Chapter 2: Data Warehousing and
OLAP Technology for Data Mining
Data Warehouse Design Process
◼ Choose the dimensions that will apply to each fact table record
◼ Choose the measure that will populate each fact table record
Multi-Tiered Architecture
(Figure: multi-tiered data warehouse architecture) Operational DBs and other sources are Extracted, Transformed, Loaded and Refreshed, via a Monitor & Integrator and Metadata, into the Data Warehouse and Data Marts (with materialized views); an OLAP Server then serves the front-end tools for Analysis, Query, Reports and Data mining.
OLAP Server Architectures
◼ Relational OLAP (ROLAP)
◼ Use relational or extended-relational DBMS to store and
manage warehouse data, with OLAP middleware, tools
and services
◼ greater scalability
◼ Multidimensional OLAP (MOLAP)
◼ Array-based multidimensional storage engine (sparse
matrix techniques)
◼ fast indexing to pre-computed summarized data
◼ Specialized SQL servers: specialized support for SQL queries
over star/snowflake schemas
Chapter 2: Data Warehousing and
OLAP Technology for Data Mining
Efficient Data Cube Computation
Multiway Array Aggregation for MOLAP
◼ Partition arrays into chunks (a small subcube which fits in memory).
◼ Compressed sparse array addressing: (chunk_id, offset)
◼ Compute aggregates in “multiway” by visiting cube cells in the order
which minimizes the # of times to visit each cell, and reduces
memory access and storage cost.
(Figure: a 3-D array over dimensions A (a0-a3), B (b0-b3) and C (c0-c3), partitioned into 64 chunks numbered 1-64.) What is the best traversing order to do multi-way aggregation?
Multiway Array Aggregation for MOLAP
(Figure: the same chunked array.) After scanning chunks {1,2,3,4}:
◼ the b0c0 chunk is computed
◼ a0c0 and a0b0 are not computed
Multiway Array Aggregation for MOLAP
(Figure: the same chunked array.) With this traversal we need to keep a single b-c chunk, 4 a-c chunks, and 16 a-b chunks in memory. After scanning chunks 1-13:
◼ the a0c0 and b0c0 chunks are computed
◼ a0b0 is not computed (we will need to scan chunks 1-49)
Multiway Array Aggregation for MOLAP
Indexing OLAP Data: Bitmap Index
◼ Suitable for low cardinality domains
◼ Index on a particular column
◼ Each value in the column has a bit vector: bit-op is fast
◼ The length of the bit vector: # of records in the base table
◼ The i-th bit is set if the i-th row of the base table has the value
for the indexed column
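The bitmap index idea can be sketched with Python integers as bit vectors (a hypothetical sketch, not from the slides; the column data is made up):

```python
# Sketch: a bitmap index over a low-cardinality column.
# Bit i of a value's vector is set iff row i of the base table holds it.
def bitmap_index(column):
    index = {}
    for i, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << i)
    return index

region = ["Asia", "Europe", "Asia", "America", "Asia"]
type_  = ["Retail", "Retail", "Dealer", "Dealer", "Retail"]

by_region, by_type = bitmap_index(region), bitmap_index(type_)
hits = by_region["Asia"] & by_type["Retail"]   # AND is a single fast bit-op
rows = [i for i in range(len(region)) if hits >> i & 1]
```

Combining predicates on several indexed columns is just a bitwise AND/OR, which is why bitmap indexes suit OLAP-style selections.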
Online Aggregation
Efficient Processing of OLAP Queries
Discovery-Driven Exploration of Data Cubes
◼ Hypothesis-driven: exploration by user, huge search space
◼ Discovery-driven (Sarawagi et al.’98)
◼ pre-compute measures indicating exceptions, guide user in the
data analysis, at all levels of aggregation
◼ Exception: significantly different from the value anticipated,
based on a statistical model
◼ Visual cues such as background color are used to reflect the
degree of exception of each cell
◼ Computation of exception indicator can be overlapped with cube
construction
Examples: Discovery-Driven Data Cubes
Chapter 2: Data Warehousing and
OLAP Technology for Data Mining
Data Warehouse Usage
◼ Three kinds of data warehouse applications
◼ Information processing
◼ supports querying, basic statistical analysis, and reporting
using crosstabs, tables, charts and graphs
◼ Analytical processing
◼ multidimensional analysis of data warehouse data
◼ supports basic OLAP operations, slice-dice, drilling, pivoting
◼ Data mining
◼ knowledge discovery from hidden patterns
◼ supports associations, constructing analytical models,
performing classification and prediction, and presenting the
mining results using visualization tools.
◼ Differences among the three tasks
From On-Line Analytical Processing
to On Line Analytical Mining (OLAM)
◼ Why online analytical mining?
◼ High quality of data in data warehouses
◼ DW contains integrated, consistent, cleaned data
Fall 2004, CIS, Temple University
Lecture 3
• An attribute is a property or characteristic of an object
– Examples: eye color of a person, temperature, etc.
– Attribute is also known as variable, field, characteristic, or feature
• A collection of attributes describe an object
– Object is also known as record, point, case, sample, entity, or instance

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Attribute Values
• Attribute values are numbers or symbols assigned
to an attribute
Properties of Attribute Values
• The type of an attribute depends on which of the following properties it possesses:
– Distinctness: = and ≠
– Order: < and >
– Addition: + and -
– Multiplication: * and /

(Table fragment: attribute type, description, examples, statistics)
Nominal: the values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another (=, ≠). Examples: zip codes, employee ID numbers, eye color, sex: {male, female}. Statistics: mode, entropy, contingency correlation, χ² test.
Ratio: for ratio variables, both differences and ratios are meaningful (*, /). Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current. Statistics: geometric mean, harmonic mean, percent variation.
(Table fragment: attribute level, allowed transformation, comments)
Interval: new_value = a * old_value + b, where a and b are constants. Thus, the Fahrenheit and Celsius temperature scales differ in terms of where their zero value is and the size of a unit (degree).
Discrete and Continuous Attributes
• Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a collection
of documents
– Often represented as integer variables.
– Note: binary attributes are a special case of discrete attributes
• Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight.
– Practically, real values can only be measured and represented
using a finite number of digits.
– Continuous attributes are typically represented as floating-point
variables.
Types of Data Sets
• Record
– Data Matrix
– Document Data
– Transaction Data
• Multi-Relational
– Star or snowflake schema
• Graph
– World Wide Web
– Molecular Structures
• Ordered
– Spatial Data
– Temporal Data
– Sequential Data
Important Characteristics of Structured Data
– Dimensionality
• Number of attributes each object is described with
• Challenge: high dimensionality (curse of dimensionality)
– Sparsity
• Sparse data: values of most attributes are zero
• Challenge: sparse data call for special handling
– Resolution
• Data properties often could be measured with different
resolutions
• Challenge: decide on the most appropriate resolution (e.g.
“Can’t See the Forest for the Trees”)
Record Data
• Data that consists of a collection of records, each of which consists of a fixed set of attributes
(Table: the Tid / Refund / Marital Status / Taxable Income / Cheat table shown earlier.)
Document Data
• Each document becomes a 'term' vector,
– each term is a component (attribute) of the vector,
– the value of each component is the number of times the corresponding term occurs in the document.

             team  coach  play  ball  score  game  win  lost  timeout  season
Document 1    3     0      5     0     2      6     0    2     0        2
Document 2    0     7      0     2     1      0     0    3     0        0
Document 3    0     1      0     0     1      2     2    0     3        0
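Building such term vectors takes only a few lines. A minimal Python sketch (not from the slides; the documents and vocabulary are made up):

```python
# Sketch: documents -> term-count vectors over a shared vocabulary.
docs = ["team play team win", "coach lost game"]
vocab = sorted({w for d in docs for w in d.split()})   # vector components

def term_vector(doc):
    words = doc.split()
    return [words.count(term) for term in vocab]       # occurrences per term

vectors = [term_vector(d) for d in docs]
```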
Transaction Data
• A special type of record data, where
– each record (transaction) involves a set of items.
– E.g., consider a grocery store. The set of products
purchased by a customer during one shopping trip
constitute a transaction, while the individual products
that were purchased are the items.
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
Multi-Relational Data
• Attributes are objects themselves
Graph Data
• Examples: Generic graph and HTML Links
(Figure: a generic graph with weighted edges (2, 5, 1, ...) and a hyperlinked HTML fragment:)
<a href="papers/papers.html#bbbb">
Data Mining </a>
<li>
<a href="papers/papers.html#aaaa">
Graph Partitioning </a>
<li>
<a href="papers/papers.html#aaaa">
Parallel Solution of Sparse Linear System of Equations </a>
<li>
<a href="papers/papers.html#ffff">
N-Body Computation and Dense Linear System Solvers
Chemical Data
• Benzene Molecule: C6H6
Ordered Data
• Sequences of transactions
(Figure: a sequence of transactions; each element of the sequence is a set of items/events.)
Ordered Data
• Genomic sequence data
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Ordered Data
• Spatial-Temporal Data
(Figure: average monthly temperature of land and ocean.)
Data Quality
• What kinds of data quality problems?
• How can we detect problems with the data?
• What can we do about these problems?
Noise
• Noise refers to modification of original values
– Examples: distortion of a person’s voice when talking
on a poor phone and “snow” on television screen
Missing Values
• Reasons for missing values
– Information is not collected
(e.g., people decline to give their age and weight)
– Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)
Duplicate Data
• Examples:
– Same person with multiple email addresses
• Data cleaning
– Process of dealing with duplicate data issues
Data Preprocessing
• Aggregation
• Sampling
• Dimensionality Reduction
• Feature subset selection
• Feature creation
• Discretization and Binarization
• Attribute Transformation
Aggregation
• Combining two or more attributes (or objects)
into a single attribute (or object)
• Purpose
– Data reduction
• Reduce the number of attributes or objects
– Change of scale
• Cities aggregated into regions, states, countries, etc
– More “stable” data
• Aggregated data tends to have less variability
Sampling
• Sampling is the main technique employed for data
selection.
– It is often used for both the preliminary investigation
of the data and the final data analysis.
Sampling …
• The key principle for effective sampling is the following:
– using a sample will work almost as well as using the
entire data set, if the sample is representative
Types of Sampling
• Simple Random Sampling
– There is an equal probability of selecting any particular item
• Stratified sampling
– Split the data into several partitions; then draw random samples
from each partition
Sample Size
Sample Size
• What sample size is necessary to get at least one
object from each of 10 groups?
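This question can be explored by simulation (a hypothetical Python sketch, not from the slides; it draws uniformly from 10 equal-sized groups, which is the coupon-collector setting):

```python
# Sketch: how many uniform draws until every one of 10 groups is seen?
import random

def draws_to_cover(groups=10, rng=random):
    seen, n = set(), 0
    while len(seen) < groups:
        seen.add(rng.randrange(groups))   # draw one object's group
        n += 1
    return n

random.seed(1)
trials = [draws_to_cover() for _ in range(2000)]
average = sum(trials) / len(trials)       # close to 10 * (1 + 1/2 + ... + 1/10)
```

The expected value is 10 · H₁₀ ≈ 29.3 draws, noticeably more than the 10 one might naively guess.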
Curse of Dimensionality
• When dimensionality
increases, data becomes
increasingly sparse in the
space that it occupies
• Techniques
– Principal Component Analysis
– Singular Value Decomposition
– Others: supervised and non-linear techniques
Dimensionality Reduction: PCA
• Goal is to find a projection that captures the largest amount of variation in data
(Figure: data points in the x1-x2 plane with the principal direction drawn through them.)
Dimensionality Reduction: PCA
• Find the eigenvectors of the covariance matrix
• The eigenvectors define the new space
(Figure: the same point cloud with the eigenvector axes drawn.)
Dimensionality Reduction: ISOMAP
By: Tenenbaum, de Silva, Langford (2000)
Feature Subset Selection
• Redundant features
– duplicate much or all of the information contained in
one or more other attributes
– Example: purchase price of a product and the amount
of sales tax paid
• Irrelevant features
– contain no information that is useful for the data
mining task at hand
– Example: students' ID is often irrelevant to the task of
predicting students' GPA
Feature Subset Selection
• Techniques:
– Brute-force approach:
• Try all possible feature subsets as input to data mining algorithm
– Embedded approaches:
• Feature selection occurs naturally as part of the data mining
algorithm
– Filter approaches:
• Features are selected before data mining algorithm is run
– Wrapper approaches:
• Use the data mining algorithm as a black box to find best
subset of attributes
Feature Creation
• Create new attributes that can capture the
important information in a data set much more
efficiently than the original attributes
Example: Mapping Data to a New
Space
• Fourier transform
• Wavelet transform
Discretization Using Class Labels
• Entropy based approach
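The entropy-based approach picks the split point minimizing the weighted class entropy of the two sides. A hypothetical Python sketch (not from the slides; the toy values/labels are made up):

```python
# Sketch: entropy-based discretization -- score candidate split points.
import math

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in (labels.count(l) for l in set(labels)) if c)

def split_score(values, labels, threshold):
    left  = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    n = len(labels)
    return len(left) / n * entropy(left) + len(right) / n * entropy(right)

values = [1, 2, 3, 10, 11, 12]
labels = ["no", "no", "no", "yes", "yes", "yes"]
best = min(values, key=lambda v: split_score(values, labels, v))
```

Here the classes separate perfectly at 3, so that threshold scores zero entropy.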
Fall 2004, CIS, Temple University
Lecture 4
Market-Basket transactions
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Association Rules
{Diaper} → {Beer},
{Milk, Bread} → {Eggs, Coke},
{Beer, Bread} → {Milk},

Implication means co-occurrence, not causality!
Applications: Association Rule Mining
• * Maintenance Agreement
– What the store should do to boost Maintenance
Agreement sales
• Home Electronics *
– What other products should the store stock up on?
• Attached mailing in direct marketing
• Detecting “ping-ponging” of patients
• Marketing and Sales Promotion
• Supermarket shelf management
Definition: Frequent Itemset
• Itemset
– A collection of one or more items
• Example: {Milk, Bread, Diaper}
– k-itemset
• An itemset that contains k items
• Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf
thresholds
Computationally prohibitive!
Computational Complexity
• Given d unique items:
– Total number of itemsets = 2^d
– Total number of possible association rules:

  R = \sum_{k=1}^{d-1} \binom{d}{k} \sum_{j=1}^{d-k} \binom{d-k}{j}
    = 3^d - 2^{d+1} + 1
Observations:
• All the above rules are binary partitions of the same itemset:
{Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but
can have different confidence
• Thus, we may decouple the support and confidence requirements
Mining Association Rules
• Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup
2. Rule Generation
– Generate high confidence rules from each frequent itemset,
where each rule is a binary partitioning of a frequent itemset
[Figure: the itemset lattice over {A, B, C, D, E}. If {A, B} is found
to be infrequent, all of its supersets are pruned.]
Illustrating Apriori Principle
• Method:
– Let k=1
– Generate frequent itemsets of length 1
– Repeat until no new frequent itemsets are identified
• Generate length (k+1) candidate itemsets from length k
frequent itemsets
• Prune candidate itemsets containing subsets of length k that
are infrequent
• Count the support of each candidate by scanning the DB
• Eliminate candidates that are infrequent, leaving only those
that are frequent
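The loop above can be sketched in Python; this is a minimal illustration (not an optimized implementation), using the market-basket transactions from the earlier slide:

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return all frequent itemsets (as frozensets) with support >= minsup."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Length-1 frequent itemsets
    items = {i for t in transactions for i in t}
    frequent = {frozenset([i]) for i in items if support(frozenset([i])) >= minsup}
    result = set(frequent)

    k = 1
    while frequent:
        # Generate length-(k+1) candidates by joining length-k frequent itemsets
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k + 1}
        # Prune candidates that have an infrequent length-k subset (Apriori principle)
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k))}
        # Count support by scanning the transactions; eliminate infrequent candidates
        frequent = {c for c in candidates if support(c) >= minsup}
        result |= frequent
        k += 1
    return result

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
frequent = apriori(transactions, minsup=0.6)
```

With minsup = 0.6 this yields four frequent 1-itemsets and four frequent 2-itemsets; no 3-itemset survives.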
Apriori: Reducing Number of Comparisons
• Candidate counting:
– Scan the database of transactions to determine the support of
each candidate itemset
– To reduce the number of comparisons, store the candidates in a
hash structure
• Instead of matching each transaction against every candidate,
match it against candidates contained in the hashed buckets
[Figure: a hash tree holding 15 candidate 3-itemsets. The transaction
{1, 2, 3, 5, 6} is hashed on successive items (1+2356, 2+356, 3+56,
12+356, 13+56, 15+6, ...) down to the leaf buckets, so it is matched
against only 11 out of 15 candidates.]
Apriori: Alternative Search Methods
[Figure: three ways to traverse the itemset lattice built over
{a1, a2, ..., an}, relative to the frequent-itemset border:
(a) general-to-specific, (b) specific-to-general, (c) bidirectional]
Apriori: Alternative Search Methods
Horizontal Data Layout        Vertical Data Layout (TID-lists)
  TID  Items                    A: 1, 4, 5, 6, 7, 8, 9
  1    A,B,E                    B: 1, 2, 5, 7, 8, 10
  2    B,C,D                    C: 2, 3, 4, 5, 8, 9
  3    C,E                      D: 2, 4, 5, 9
  4    A,C,D                    E: 1, 3, 6
  5    A,B,C,D
  6    A,E
  7    A,B
  8    A,B,C
  9    A,C,D
  10   B
ECLAT: Another Method for Frequent Itemset
Generation
• Determine support of any k-itemset by intersecting tid-
lists of two of its (k-1) subsets.
    TID-list(A) = {1, 4, 5, 6, 7, 8, 9}
    TID-list(B) = {1, 2, 5, 7, 8, 10}

    TID-list(A) ∩ TID-list(B) = TID-list(AB) = {1, 5, 7, 8}
• 3 traversal approaches:
– top-down, bottom-up and hybrid
• Advantage: very fast support counting
• Disadvantage: intermediate tid-lists may become too
large for memory
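A minimal sketch of the tid-list idea in Python, using the ten transactions from the vertical-layout slide:

```python
def vertical_layout(transactions):
    """Build TID-lists: item -> set of transaction ids containing it."""
    tidlists = {}
    for tid, items in transactions.items():
        for item in items:
            tidlists.setdefault(item, set()).add(tid)
    return tidlists

transactions = {
    1: {"A", "B", "E"}, 2: {"B", "C", "D"}, 3: {"C", "E"},
    4: {"A", "C", "D"}, 5: {"A", "B", "C", "D"}, 6: {"A", "E"},
    7: {"A", "B"}, 8: {"A", "B", "C"}, 9: {"A", "C", "D"}, 10: {"B"},
}
tidlists = vertical_layout(transactions)
# Support of the 2-itemset {A, B} is the size of the TID-list intersection
ab = tidlists["A"] & tidlists["B"]
```

`ab` comes out as {1, 5, 7, 8}, i.e. support(AB) = 4, matching the worked example above.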
FP-growth: Another Method for Frequent
Itemset Generation
FP-Tree Construction

Transaction Database:
  TID  Items
  1    {A,B}
  2    {B,C,D}
  3    {A,C,D,E}
  4    {A,D,E}
  5    {A,B,C}
  6    {A,B,C,D}
  7    {B,C}
  8    {A,B,C}
  9    {A,B,D}
  10   {B,C,E}

[Figure: the FP-tree built from this database — each transaction is
inserted as a path from the null root, merging shared prefixes and
incrementing node counts (e.g., A:7 and B:3 as children of the root)]
FP-growth
• Conditional tree for A within D within E: the count for A is 2, so
  {A,D,E} is a frequent itemset
• Next step: construct the conditional tree for C within the
  conditional tree for E
• Continue until reaching the conditional tree for A (which contains
  only the node A)
Benefits of the FP-tree Structure
• Performance study shows
– FP-growth is an order of magnitude faster than Apriori
  [Figure: run time (sec.) vs. support threshold (%), 0–3%; FP-growth
  stays flat while Apriori's run time grows sharply as the threshold
  decreases]
• Reasoning
– No candidate generation, no candidate test
– Eliminates repeated database scans
• Number of frequent itemsets = 3 × Σ_{k=1}^{10} C(10, k)
Maximal Itemsets
[Figure: itemset lattice over {A, B, C, D, E}; a border separates the
frequent itemsets from the infrequent ones, and the maximal frequent
itemsets lie just inside the border]
Closed Itemset
  TID  Items
  1    ABC
  2    ABCD
  3    BCE
  4    ACDE
  5    DE

[Figure: itemset lattice annotated with the TIDs supporting each
itemset; for this database # Closed = 9 and # Maximal = 4]
Maximal vs Closed Itemsets
[Figure: Venn diagram — maximal frequent itemsets ⊆ closed frequent
itemsets ⊆ frequent itemsets]
Rule Generation
Lecture 6
• Clustering
1
General Applications of Clustering
• Pattern Recognition
• Spatial Data Analysis
– create thematic maps in GIS by clustering feature
spaces
– detect spatial clusters and explain them in spatial data
mining
• Image Processing
• Economic Science (especially market research)
• WWW
– Document classification
– Cluster Weblog data to discover groups of similar
access patterns
3
Examples of Clustering Applications
• Marketing: Help marketers discover distinct groups in their
customer bases, and then use this knowledge to develop
targeted marketing programs
• Land use: Identification of areas of similar land use in an
earth observation database
• Insurance: Identifying groups of motor insurance policy
holders with a high average claim cost
• City-planning: Identifying groups of houses according to
their house type, value, and geographical location
• Earth-quake studies: Observed earth quake epicenters
should be clustered along continent faults
4
What Is Good Clustering?
5
Requirements of Clustering in Data
Mining
• Scalability
• Ability to deal with different types of attributes
• Discovery of clusters with arbitrary shape
• Minimal requirements for domain knowledge to
determine input parameters
• Able to deal with noise and outliers
• Insensitive to order of input records
• High dimensionality
• Incorporation of user-specified constraints
• Interpretability and usability
6
Data Structures in Clustering
• Dissimilarity matrix (one-mode):

    |   0                              |
    | d(2,1)   0                       |
    | d(3,1)  d(3,2)   0               |
    |   :       :       :              |
    | d(n,1)  d(n,2)   ...   ...   0   |
7
Measuring Similarity
• Dissimilarity/Similarity metric: Similarity is expressed in
terms of a distance function, which is typically metric:
d(i, j)
• There is a separate “quality” function that measures the
“goodness” of a cluster.
• The definitions of distance functions are usually very
different for interval-scaled, boolean, categorical, ordinal
and ratio variables.
• Weights should be associated with different variables
based on applications and data semantics.
• It is hard to define “similar enough” or “good enough”
– the answer is typically highly subjective.
8
Interval-valued variables
• Standardize data
– Calculate the mean squared deviation:

    s_f = (1/n) ( |x_1f − m_f|² + |x_2f − m_f|² + ... + |x_nf − m_f|² )
9
Similarity and Dissimilarity Between
Objects
• Distances are normally used to measure the similarity or
dissimilarity between two data objects
• Some popular ones include: Minkowski distance:
    d(i, j) = ( |x_i1 − x_j1|^q + |x_i2 − x_j2|^q + ... + |x_ip − x_jp|^q )^(1/q)

  where i = (x_i1, x_i2, ..., x_ip) and j = (x_j1, x_j2, ..., x_jp) are
  two p-dimensional data objects, and q is a positive integer
• If q = 1, d is the Manhattan distance:

    d(i, j) = |x_i1 − x_j1| + |x_i2 − x_j2| + ... + |x_ip − x_jp|
10
Similarity and Dissimilarity Between
Objects
• If q = 2, d is the Euclidean distance:

    d(i, j) = sqrt( |x_i1 − x_j1|² + |x_i2 − x_j2|² + ... + |x_ip − x_jp|² )

– Properties
• d(i, j) ≥ 0
• d(i, i) = 0
• d(i, j) = d(j, i)
• d(i, j) ≤ d(i, k) + d(k, j)
• Also one can use weighted distance, parametric Pearson
product-moment correlation, or other dissimilarity measures.
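The Minkowski family is easy to sketch directly; q = 1 gives Manhattan and q = 2 gives Euclidean distance (the example points are made up):

```python
def minkowski(x, y, q):
    """Minkowski distance between two equal-length numeric vectors."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)

i = (1.0, 3.0)
j = (4.0, 7.0)
manhattan = minkowski(i, j, 1)  # |1-4| + |3-7| = 7
euclidean = minkowski(i, j, 2)  # sqrt(9 + 16) = 5
```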
11
Mahalanobis Distance
    mahalanobis(p, q) = (p − q) Σ⁻¹ (p − q)ᵀ

  where Σ is the covariance matrix of the input data:

    Σ_jk = (1 / (n − 1)) Σ_{i=1}^{n} (X_ij − X̄_j)(X_ik − X̄_k)

• Example, with

    Σ = | 0.3  0.2 |
        | 0.2  0.3 |

  and points A = (0.5, 0.5), B = (0, 1), C = (1.5, 1.5):

    Mahal(A, B) = 5
    Mahal(A, C) = 4
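These values can be checked directly; a stdlib-only sketch for 2-D points (note the slide's numbers match the squared form of the distance):

```python
def mahalanobis_sq(p, q, cov):
    """Squared Mahalanobis distance (p − q) Σ⁻¹ (p − q)ᵀ for 2-D points."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = ((d / det, -b / det), (-c / det, a / det))  # inverse of a 2x2 Σ
    dx, dy = p[0] - q[0], p[1] - q[1]
    # (dx, dy) · Σ⁻¹ · (dx, dy)ᵀ
    return ((dx * inv[0][0] + dy * inv[1][0]) * dx
            + (dx * inv[0][1] + dy * inv[1][1]) * dy)

cov = ((0.3, 0.2),
       (0.2, 0.3))
A, B, C = (0.5, 0.5), (0.0, 1.0), (1.5, 1.5)
mab = mahalanobis_sq(A, B, cov)  # ≈ 5
mac = mahalanobis_sq(A, C, cov)  # ≈ 4
```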
13
Cosine Similarity
• Example:
    d1 = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
    d2 = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)

    d1 • d2 = 3·1 + 2·0 + 0·0 + 5·0 + 0·0 + 0·0 + 0·0 + 2·1 + 0·0 + 0·2 = 5
    ||d1|| = (3² + 2² + 5² + 2²)^0.5 = 42^0.5 ≈ 6.481
    ||d2|| = (1² + 1² + 2²)^0.5 = 6^0.5 ≈ 2.449

    cos(d1, d2) = 5 / (6.481 × 2.449) ≈ 0.315
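The same computation in Python:

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
sim = cosine_similarity(d1, d2)  # 5 / (sqrt(42) * sqrt(6)) ≈ 0.315
```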
14
Correlation Measure
Scatter plots
showing the
similarity from
–1 to 1.
15
Binary Variables
• A contingency table for binary data:

                     Object j
                  1        0        sum
  Object i   1    a        b        a + b
             0    c        d        c + d
           sum    a + c    b + d    p

• Simple matching coefficient (invariant, if the binary variable is
  symmetric):

    d(i, j) = (b + c) / (a + b + c + d)

• Jaccard coefficient (noninvariant if the binary variable is
  asymmetric):

    d(i, j) = (b + c) / (a + b + c)
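A small sketch computing both coefficients from two 0/1 vectors (the vectors themselves are made up):

```python
def binary_dissimilarities(x, y):
    """SMC- and Jaccard-based dissimilarities for two 0/1 vectors.

    a = both 1, b = 1 in x only, c = 1 in y only, d = both 0.
    """
    a = sum(1 for p, q in zip(x, y) if p == 1 and q == 1)
    b = sum(1 for p, q in zip(x, y) if p == 1 and q == 0)
    c = sum(1 for p, q in zip(x, y) if p == 0 and q == 1)
    d = sum(1 for p, q in zip(x, y) if p == 0 and q == 0)
    smc = (b + c) / (a + b + c + d)   # for symmetric binary variables
    jaccard = (b + c) / (a + b + c)   # for asymmetric binary variables
    return smc, jaccard

x = [1, 0, 0, 0, 0, 0]
y = [0, 1, 0, 0, 0, 0]
smc, jac = binary_dissimilarities(x, y)  # 2/6 ≈ 0.333 and 2/2 = 1.0
```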
Dissimilarity between Binary Variables
• Example
    d(i, j) = (p − m) / p

  where m is the number of matches and p is the total number of
  variables (simple matching)
18
Ordinal Variables
• An ordinal variable can be discrete or continuous
• order is important, e.g., rank
• Can be treated like interval-scaled
– replace x_if by its rank r_if ∈ {1, ..., M_f}
– map the range of each variable onto [0, 1] by replacing the i-th
object in the f-th variable by

    z_if = (r_if − 1) / (M_f − 1)

– compute the dissimilarity using methods for interval-scaled
variables
19
Ratio-Scaled Variables
20
Variables of Mixed Types
• A database may contain all the six types of variables
– symmetric binary, asymmetric binary, nominal, ordinal, interval
and ratio.
• One may use a weighted formula to combine their
effects:

    d(i, j) = ( Σ_{f=1}^{p} δ_ij^(f) d_ij^(f) ) / ( Σ_{f=1}^{p} δ_ij^(f) )

– f is binary or nominal:
  d_ij^(f) = 0 if x_if = x_jf, otherwise d_ij^(f) = 1
– f is interval-based: use the normalized distance
– f is ordinal or ratio-scaled:
• compute the ranks r_if and z_if = (r_if − 1) / (M_f − 1)
• treat z_if as interval-scaled
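A minimal sketch of the weighted formula; attribute names and values are made up, and numeric attributes are assumed to be pre-normalized to [0, 1] so |x − y| is a valid contribution:

```python
def mixed_dissimilarity(x, y, types):
    """Weighted dissimilarity over mixed-type attribute vectors.

    types[f] is "nominal" or "numeric"; a None value means the
    attribute is missing (its delta is 0, so it is skipped).
    """
    num, den = 0.0, 0.0
    for xf, yf, t in zip(x, y, types):
        if xf is None or yf is None:   # delta_ij^(f) = 0: skip this attribute
            continue
        d = (0.0 if xf == yf else 1.0) if t == "nominal" else abs(xf - yf)
        num += d
        den += 1.0                     # delta_ij^(f) = 1
    return num / den

x = ["red", 0.2, 0.9]
y = ["blue", 0.2, 0.4]
types = ["nominal", "numeric", "numeric"]
d = mixed_dissimilarity(x, y, types)  # (1 + 0 + 0.5) / 3 ≈ 0.5
```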
21
Notion of a Cluster can be Ambiguous
22
Other Distinctions Between Sets of
Clusters
• Exclusive versus non-exclusive
– In non-exclusive clusterings, points may belong to multiple
clusters.
– Can represent multiple classes or ‘border’ points
• Fuzzy versus non-fuzzy
– In fuzzy clustering, a point belongs to every cluster with some
weight between 0 and 1
– Weights must sum to 1
– Probabilistic clustering has similar characteristics
• Partial versus complete
– In some cases, we only want to cluster some of the data
• Heterogeneous versus homogeneous
– Cluster of widely different sizes, shapes, and densities
23
Types of Clusters
• Well-separated clusters
• Center-based clusters
• Contiguous clusters
• Density-based clusters
• Property or Conceptual
3 well-separated clusters
25
Types of Clusters: Center-Based
• Center-based
– A cluster is a set of objects such that an object in a cluster is
closer (more similar) to the “center” of a cluster, than to the
center of any other cluster
– The center of a cluster is often a centroid, the average of all
the points in the cluster, or a medoid, the most
“representative” point of a cluster
4 center-based clusters
26
Types of Clusters: Contiguity-Based
8 contiguous clusters
27
Types of Clusters: Density-Based
• Density-based
– A cluster is a dense region of points, which is separated by
low-density regions, from other regions of high density.
– Used when the clusters are irregular or intertwined, and when
noise and outliers are present.
6 density-based clusters
28
Types of Clusters: Conceptual Clusters
2 Overlapping Circles
29
Major Clustering Approaches
31
K-means Clustering – Details
• Initial centroids are often chosen randomly.
– Clusters produced vary from one run to another.
• The centroid is (typically) the mean of the points in the
cluster.
• ‘Closeness’ is measured by Euclidean distance, cosine
similarity, correlation, etc.
• K-means will converge for common similarity measures
mentioned above.
• Most of the convergence happens in the first few
iterations.
– Often the stopping condition is changed to ‘Until relatively few
points change clusters’
• Complexity is O( n * K * I * d )
– n = number of points, K = number of clusters,
I = number of iterations, d = number of attributes
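The algorithm can be sketched in a few lines; this version uses a deterministic initialization (first k points) so the run is reproducible, whereas in practice the initial centroids are often chosen randomly as noted above. The sample points are made up:

```python
def kmeans(points, k, iters=100):
    """Plain K-means on 2-D points with Euclidean distance (minimal sketch)."""
    centroids = points[:k]  # deterministic init for this sketch; usually random
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:     # assignment step: closest centroid (squared distance)
            i = min(range(k),
                    key=lambda c: (p[0] - centroids[c][0]) ** 2
                                + (p[1] - centroids[c][1]) ** 2)
            clusters[i].append(p)
        # update step: centroid = mean of assigned points (keep old if empty)
        new = [(sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
               if cl else centroids[i] for i, cl in enumerate(clusters)]
        if new == centroids:  # converged: centroids stopped moving
            break
        centroids = new
    return centroids, clusters

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, 2)
```

On these two obvious blobs the loop converges in a few iterations, splitting the six points 3/3.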
32
Two different K-means Clusterings
[Figure: a set of original points and two K-means results on it — an
optimal clustering and a sub-optimal clustering produced by different
initial centroids]
35
Handling Empty Clusters
• Several strategies for choosing a replacement centroid:
– Choose the point that contributes most to SSE
– Choose a point from the cluster with the highest SSE
– If there are several empty clusters, the above can be
repeated several times.
36
Pre-processing and Post-processing
• Pre-processing
– Normalize the data
– Eliminate outliers
• Post-processing
– Eliminate small clusters that may represent outliers
– Split ‘loose’ clusters, i.e., clusters with relatively high
SSE
– Merge clusters that are ‘close’ and that have relatively
low SSE
– Can use these steps during the clustering process
• ISODATA 37
Bisecting K-means
38
Bisecting K-means Example
39
Limitations of K-means
40
Limitations of K-means: Differing Sizes
41
Limitations of K-means: Differing Density
42
Limitations of K-means: Non-globular
Shapes
43
Overcoming K-means Limitations
45
Variations of the K-Means Method
• A few variants of the k-means which differ in
– Selection of the initial k means
– Dissimilarity calculations
– Strategies to calculate cluster means
• Handling categorical data: k-modes (Huang’98)
– Replacing means of clusters with modes
– Using new dissimilarity measures to deal with categorical objects
– Using a frequency-based method to update modes of clusters
• Handling a mixture of categorical and numerical data: k-
prototype method
46
The K-Medoids Clustering Method
[Figure: six labeled points in the plane and the dendrogram showing
the order in which they are merged, with merge heights between 0 and
0.2]
48
Strengths of Hierarchical Clustering
49
Hierarchical Clustering
– Divisive:
• Start with one, all-inclusive cluster
• At each step, split a cluster until each cluster contains a point (or
there are k clusters)
[Figure: starting situation for agglomerative clustering — each point
p1, p2, ..., p12 is its own cluster, with the full proximity matrix]
52
Intermediate Situation
• After some merging steps, we have some clusters
[Figure: clusters C1–C5 and their proximity matrix]
53
Intermediate Situation
• We want to merge the two closest clusters (C2 and C5) and
update the proximity matrix.
[Figure: clusters C1–C5 and their proximity matrix, with C2 and C5
marked as the pair to merge]
After Merging
• The question is “How do we update the proximity matrix?”
[Figure: the proximity matrix after merging C2 and C5 into a single
cluster C2 ∪ C5; the entries between C2 ∪ C5 and the remaining
clusters are marked "?"]
55
How to Define Inter-Cluster Similarity
[Figure: two clusters of points p1–p5 and their proximity matrix;
which entries define the similarity between the two clusters?]

• MIN (single link)
• MAX (complete link)
• Group Average
• Distance Between Centroids
• Other methods driven by an objective function
– Ward's Method uses squared error
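A naive agglomerative sketch showing how the linkage choice (MIN, i.e. single link, vs. MAX, i.e. complete link) plugs into the merge loop; the point coordinates are made up:

```python
def agglomerative(points, k, linkage="min"):
    """Naive agglomerative clustering down to k clusters.

    linkage is "min" (single link) or "max" (complete link).
    """
    clusters = [[p] for p in points]  # start: every point is its own cluster

    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    def cluster_dist(c1, c2):
        pairs = [dist(p, q) for p in c1 for q in c2]
        return min(pairs) if linkage == "min" else max(pairs)

    while len(clusters) > k:
        # find the two closest clusters under the chosen linkage and merge them
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: cluster_dist(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters.pop(j)
    return clusters

points = [(0, 0), (0, 1), (5, 5), (5, 6)]
result = agglomerative(points, 2)
```

On this toy data both linkages merge the two nearby pairs first, leaving two clusters of two points each.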
Hierarchical Clustering: Comparison
[Figure: the same six points clustered hierarchically with MIN, MAX,
Group Average, and Ward's Method; the four linkage choices produce
different cluster assignments]
Hierarchical Clustering: Time and Space
requirements
• O(N2) space since it uses the proximity matrix.
– N is the number of points.
62
Hierarchical Clustering: Problems and
Limitations
• Once a decision is made to combine two
clusters, it cannot be undone
64
MST: Divisive Hierarchical Clustering
65
More on Hierarchical Clustering Methods
66
One Alternative: BIRCH
68
DBSCAN
69
DBSCAN: Core, Border, and Noise Points
70
DBSCAN Algorithm
• Eliminate noise points
• Perform clustering on the remaining points
71
DBSCAN: Core, Border and Noise Points
• Resistant to Noise
• Can handle clusters of different shapes and sizes
73
When DBSCAN Does NOT Work Well
[Figure: the original points and two DBSCAN runs on them, one with
(MinPts=4, Eps=9.75) and one with (MinPts=4, Eps=9.92)]
• Varying densities
• High-dimensional data
74
DBSCAN: Determining EPS and MinPts
• Idea is that for points in a cluster, their kth nearest
neighbors are at roughly the same distance
• Noise points have the kth nearest neighbor at farther
distance
• So, plot sorted distance of every point to its kth
nearest neighbor
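The k-distance idea can be sketched as follows (brute force, made-up points); in the sorted output the noise point stands out as the sharp jump at the end:

```python
def kth_nearest_distances(points, k):
    """For each point, the distance to its k-th nearest neighbor.

    Sorting these values gives the "knee" plot used to pick Eps.
    """
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    out = []
    for p in points:
        ds = sorted(dist(p, q) for q in points if q is not p)
        out.append(ds[k - 1])
    return sorted(out)

# Two dense pairs plus one far-away noise point
points = [(0, 0), (0, 1), (5, 5), (5, 6), (50, 50)]
dists = kth_nearest_distances(points, k=1)
```

The first four values are 1.0 (points inside the dense pairs), while the noise point's nearest-neighbor distance is over 60 — the knee of the sorted curve suggests the Eps threshold.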
75
Graph-Based Clustering
78
Limitations of Current Merging
Schemes
(a)
(b)
(c)
(d)
79
Model-Based Clustering Methods
80
Cluster Validity
• For supervised classification we have a variety of
measures to evaluate how good our model is
– Accuracy, precision, recall
[Figure: scatter plots over the unit square (x, y ∈ [0, 1]) comparing
the original points with the clusterings that K-means and Complete
Link find in them]
Measures of Cluster Validity
• Numerical measures applied to judge various aspects of cluster
validity are classified into the following three types.
– External Index: Used to measure the extent to which cluster labels
match externally supplied class labels.
• Entropy
– Internal Index: Used to measure the goodness of a clustering
structure without respect to external information.
• Sum of Squared Error (SSE)
– Relative Index: Used to compare two different clusterings or
clusters.
• Often an external or internal index is used for this function, e.g., SSE or
entropy
• Sometimes these are referred to as criteria instead of indices
– However, sometimes criterion is the general strategy and index is the
numerical measure that implements the criterion.
83
Internal Measures: Cohesion and
Separation
• Cluster Cohesion: Measures how closely related are
objects in a cluster
– Example: SSE
• Cluster Separation: Measures how distinct or well-
separated a cluster is from other clusters
• Example: Squared Error
– Cohesion is measured by the within-cluster sum of squares (SSE):

    WSS = Σ_i Σ_{x ∈ C_i} (x − m_i)²

– Separation is measured by the between-cluster sum of squares:

    BSS = Σ_i |C_i| (m − m_i)²

  where |C_i| is the size of cluster i, m_i its centroid, and m the
  overall mean
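A small 1-D sketch (the cluster values are made up) that also illustrates the identity total SS = WSS + BSS:

```python
def wss_bss(clusters):
    """Within- and between-cluster sums of squares for 1-D clusters."""
    all_points = [x for c in clusters for x in c]
    m = sum(all_points) / len(all_points)  # overall mean
    wss = bss = 0.0
    for c in clusters:
        mi = sum(c) / len(c)               # cluster centroid
        wss += sum((x - mi) ** 2 for x in c)
        bss += len(c) * (m - mi) ** 2
    return wss, bss

clusters = [[1, 2], [5, 6]]
wss, bss = wss_bss(clusters)  # WSS = 0.5 + 0.5 = 1, BSS = 2*2^2 + 2*2^2 = 16
```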
84
External Measures of Cluster Validity:
Entropy and Purity
85
Final Comment on Cluster Validity
86
What Is Outlier Discovery?
87
Outlier Discovery:
Statistical Approach
89
Outlier Discovery: Deviation-Based
Approach
• Identifies outliers by examining the main
characteristics of objects in a group
• Objects that “deviate” from this description are
considered outliers
• sequential exception technique
– simulates the way in which humans can distinguish unusual
objects from among a series of supposedly like objects
• OLAP data cube technique
– uses data cubes to identify regions of anomalies in large
multidimensional data
90
Fall 2004, CIS, Temple University
Lecture 7
Decision Trees
[Figure: induction workflow — a model is learned from the Training
Set (rows such as Tid 3: No, Small, 70K → No and Tid 6: No, Medium,
60K → No) and then applied to the Test Set (Tid 11: No, Small, 55K →
? and Tid 15: No, Large, 67K → ?)]
Splitting Attributes

  Tid  Refund  Marital Status  Taxable Income  Cheat
  1    Yes     Single          125K            No
  2    No      Married         100K            No
  3    No      Single          70K             No
  4    Yes     Married         120K            No
  5    No      Divorced        95K             Yes
  6    No      Married         60K             No
  7    Yes     Divorced        220K            No
  8    No      Single          85K             Yes
  9    No      Married         75K             No
  10   No      Single          90K             Yes

[Figure: two decision trees fitting this data. One splits on Refund
first (Yes → NO; No → split on MarSt: Married → NO; Single/Divorced
→ split on TaxInc: < 80K → NO, > 80K → YES). The other splits on
MarSt first, then Refund and TaxInc.]

There could be more than one tree that fits the same data!
Test Data

Start from the root of the tree.

  Refund  Marital Status  Taxable Income  Cheat
  No      Married         80K             ?

[Figure: the record is routed down the tree step by step — Refund =
No takes the branch to the MarSt test; Marital Status = Married
takes the branch to the NO leaf.]

Assign Cheat to "No"
Many Algorithms:
– Hunt’s Algorithm (one of the earliest)
– CART
– ID3, C4.5
– SLIQ,SPRINT
© Vipin Kumar CSci 5980 Spring 2004 ‹#›
Tree Induction
Greedy strategy.
– Split the records based on an attribute test
that optimizes certain criterion.
Issues
– Determine how to split the records
◆How to specify the attribute test condition?
◆How to determine the best split?
[Figure: what about this split? A binary grouping on attribute Size,
{Small, Large} vs. {Medium}. For continuous attributes: a binary
test such as Taxable Income > 80K? (Yes/No), or a multi-way split
into ranges such as < 10K, ..., > 80K.]
Greedy approach:
– Nodes with homogeneous class distribution
are preferred
Need a measure of node impurity:
C0: 5 C0: 9
C1: 5 C1: 1
Non-homogeneous, Homogeneous,
High degree of impurity Low degree of impurity
Gini Index
Entropy
Misclassification error
    GINI(t) = 1 − Σ_j [ p(j | t) ]²
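As a quick sketch, the Gini index of a node can be computed from its class counts (the counts below are made up):

```python
def gini(counts):
    """Gini index of a node: GINI(t) = 1 - sum_j p(j|t)^2."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

g_mixed = gini([5, 5])   # maximum impurity for 2 classes: 0.5
g_skew = gini([9, 1])    # nearly pure node: 1 - 0.81 - 0.01 ≈ 0.18
g_pure = gini([10, 0])   # pure node: 0.0
```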
[Table: candidate splits on a sorted continuous attribute — the
Yes/No class counts on each side of every candidate split position
and the resulting Gini values: 0.420, 0.400, 0.375, 0.343, 0.417,
0.400, 0.300, 0.343, 0.375, 0.400, 0.420. The best split has
Gini = 0.300.]
    Entropy(t) = − Σ_j p(j | t) · log₂ p(j | t)
Information Gain:

    GAIN_split = Entropy(p) − Σ_{i=1}^{k} (n_i / n) · Entropy(i)

  where parent node p is split into k partitions and n_i is the
  number of records in partition i

Gain Ratio:

    GainRATIO_split = GAIN_split / SplitINFO

    SplitINFO = − Σ_{i=1}^{k} (n_i / n) · log(n_i / n)
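The two measures can be sketched together; the class counts below for a parent node and its two partitions are made up:

```python
import math

def entropy(counts):
    """Entropy(t) = -sum_j p(j|t) * log2 p(j|t) over a node's class counts."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def information_gain(parent, children):
    """GAIN_split = Entropy(parent) - weighted entropy of the partitions."""
    n = sum(parent)
    return entropy(parent) - sum(sum(ch) / n * entropy(ch) for ch in children)

# Parent node with 10 Yes / 10 No, split into two fairly pure partitions
gain = information_gain([10, 10], [[9, 1], [1, 9]])  # ≈ 0.531
```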
Advantages:
– Inexpensive to construct
– Extremely fast at classifying unknown records
– Easy to interpret for small-sized trees
– Accuracy is comparable to other classification
techniques for many simple data sets
Missing Values
Costs of Classification
Circular points:
    0.5 ≤ sqrt(x1² + x2²) ≤ 1

Triangular points:
    sqrt(x1² + x2²) < 0.5 or
    sqrt(x1² + x2²) > 1
Overfitting
Underfitting: when model is too simple, both training and test errors are large
Lack of data points in the lower half of the diagram makes it difficult
to predict correctly the class labels of that region
- Insufficient number of training records in the region causes the
decision tree to predict the test examples using other training
records that are irrelevant to the classification task
Notes on Overfitting
Post-pruning
– Grow decision tree to its entirety
– Trim the nodes of the decision tree in a
bottom-up fashion
– If generalization error improves after trimming,
replace sub-tree by a leaf node.
– Class label of leaf node is determined from
majority class of instances in the sub-tree
– Can use MDL for post-pruning
[Example: candidate subtrees rooted at attributes A1–A4 with leaf
class counts such as C0: 11 / C1: 3 and C0: 2 / C1: 4 (case 1), and
C0: 14 / C1: 3 and C0: 2 / C1: 2 (case 2). Using the pessimistic
error estimate: don't prune case 1, prune case 2.]
[Figure: points in the unit square classified by a tree that first
tests x < 0.43 and then y thresholds in each branch; the leaf class
counts are shown (e.g., 4:0, 0:4, 0:4, 3:0)]
• The border line between two neighboring regions of different
classes is known as the decision boundary
• The decision boundary is parallel to the axes because each test
condition involves a single attribute at a time
[Figure: an oblique decision boundary x + y < 1 separating
Class = + from Class = −]
                 PREDICTED CLASS
                 Class=Yes   Class=No
ACTUAL Class=Yes    a (TP)     b (FN)
CLASS  Class=No     c (FP)     d (TN)

a: TP (true positive)
b: FN (false negative)
c: FP (false positive)
d: TN (true negative)

                  a + d             TP + TN
    Accuracy = ───────────── = ─────────────────
               a + b + c + d   TP + TN + FP + FN
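The accuracy formula in Python (the confusion-matrix counts are made up):

```python
def accuracy(tp, fn, fp, tn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)

acc = accuracy(tp=40, fn=10, fp=5, tn=45)  # (40 + 45) / 100 = 0.85
```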
Cost-sensitive measure:

    Weighted accuracy = (w₁·a + w₄·d) / (w₁·a + w₂·b + w₃·c + w₄·d)

Confidence interval for accuracy:

    P( Z_{α/2} ≤ (acc − p) / sqrt(p(1 − p)/N) ≤ Z_{1−α/2} ) = 1 − α

    p = ( 2·N·acc + Z²_{α/2} ± Z_{α/2}·sqrt(Z²_{α/2} + 4·N·acc − 4·N·acc²) )
        / ( 2·(N + Z²_{α/2}) )
• Given two models with error rates e1 and e2, assume the error
  rates are normally distributed:

    e1 ~ N(μ₁, σ₁)
    e2 ~ N(μ₂, σ₂)

– Approximate each variance:

    σ̂ᵢ² = eᵢ(1 − eᵢ) / nᵢ

• The variance of the difference d = e1 − e2 is then

    σ_t² ≈ σ̂₁² + σ̂₂²
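The confidence bounds for the true accuracy p can be computed directly; this is a sketch of the interval formula above, with `z = 1.96` for a 95% interval and made-up values of acc and N:

```python
import math

def accuracy_confidence_interval(acc, n, z=1.96):
    """Lower and upper bounds for the true accuracy p (z = 1.96 for 95%)."""
    center = 2 * n * acc + z * z
    spread = z * math.sqrt(z * z + 4 * n * acc - 4 * n * acc * acc)
    denom = 2 * (n + z * z)
    return (center - spread) / denom, (center + spread) / denom

lo, hi = accuracy_confidence_interval(acc=0.8, n=100)  # ≈ (0.711, 0.867)
```

Note the interval is not symmetric around the observed accuracy and narrows as N grows.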