Paper - XVII Data Mining and Warehousing
(PRIDE)
PERIYAR UNIVERSITY
SALEM - 636 011.
Prepared By
S. Aranganayagi, M.C.A.,
Associate Professor in Computer Science,
J.K.K. Nataraja College of Arts & Science,
Komarapalayam – 638 183.
CONTENTS
UNIT – I
1. INTRODUCTION
1.1 Basic Data Mining Tasks
1.1.1 Classification
1.1.2 Regression
1.1.3 Time Series Analysis
1.1.4 Prediction
1.1.5 Clustering
1.1.6 Summarization
1.1.7 Association Rules
1.1.8 Sequence Discovery
1.2 Data Mining Versus Knowledge Discovery In
Databases
1.3 Data Mining Issues
1.4 Data Mining Metrics
2. DATA PREPROCESSING
2.1 Data Cleaning
2.1.1 Missing values
2.1.2 Noisy Data
2.1.3 Inconsistent Data
2.2 Data Integration and Transformation
2.2.1 Data Integration
2.2.2 Data Transformation
2.3 Data Reduction
2.3.1 Data Cube Aggregation
2.3.2 Dimensionality Reduction
2.3.3 Data Compression
2.3.4 Numerosity Reduction
3. DATA MINING TECHNIQUES
3.1 Introduction
3.2 A Statistical Perspective on Data Mining
3.2.1 Point Estimation
3.2.2 Models Based on Summarization
3.2.3 Bayes Theorem
3.2.4 Hypothesis Testing
3.2.5 Regression and Correlation
3.3 Similarity Measures
UNIT – II
4. CLASSIFICATION
4.1 Introduction
4.1.1 Issues in Classification
4.2 Statistical-Based Algorithms
4.2.1 Regression
4.2.2 Bayesian Classification
4.3 Distance-Based Algorithms
4.3.1 Simple Approach
4.3.2 K Nearest Neighbors
4.4 Decision Tree-Based Algorithms
4.4.1 ID3
4.5 Neural Network-Based Algorithms
4.5.1 Propagation
4.6 Rule-Based Algorithms
4.6.1 Generating Rules from A Decision Tree
4.7 Combining Techniques
UNIT - III
5. CLUSTERING
5.1 Introduction
5.2 Similarity and Distance Measures
5.3 Outliers
5.4 Hierarchical Algorithms
5.4.1 Agglomerative Algorithms
5.4.2 Divisive Clustering
5.5 Partitional Algorithms
5.5.1 K-Means Clustering
5.6 Clustering Large Databases
5.6.1 Birch
5.7 Clustering With Categorical Attributes
UNIT - IV
6. ASSOCIATION RULES
6.1 Introduction
6.2 Large Itemsets
6.3 Basic Algorithms
6.3.1 Apriori Algorithm
6.3.2 Partitioning
6.4 Advanced Association Rule Techniques
6.4.1 Generalized Association Rules
6.4.2 Multiple-Level Association Rules
6.4.3 Quantitative Association Rules
6.4.4 Using Multiple Minimum Supports
6.4.5 Correlation Rules
6.5 Measuring the Quality of Rules
7. WEB MINING
7.1 Web Content Mining
7.1.1 Pattern Discovery
8. SPATIAL MINING
8.1 Introduction
8.1.1 Spatial Data Mining Primitives
8.2 Text Mining
UNIT - V
9. DATA WAREHOUSE
9.1 Introduction
9.2 What is A Data Warehouse?
9.3 Multidimensional Data Model
9.4 Data Cube
9.5 Dimension Modelling
9.6 Lattice of Cuboids
9.7 Summary Measures
9.8 OLAP Operations
9.9 Warehouse Schema
9.10 Data Warehousing Architecture
9.11 Warehouse Server
9.12 Metadata
9.13 OLAP Engine
9.14 Data Warehouse Backend Process
UNIT – I
CHAPTER – 1
INTRODUCTION
[Figure: Data mining tasks. Predictive tasks include classification, regression, time series analysis, and prediction; descriptive tasks include clustering, summarization, association rules, and sequence discovery.]
1.1.3 Time Series Analysis
With time series analysis, the value of an attribute is examined as it
varies over time. The values usually are obtained at evenly spaced time points
(daily, weekly, hourly, etc.). A time series plot is used to visualize the time
series. There are three basic functions performed in time series analysis. In one
case, distance measures are used to determine the similarity between different
time series. In the second case, the structure of the line is examined to
determine its behavior. A third application would be to use the historical time
series plot to predict future values.
1.1.4 Prediction
Many real-world data mining applications can be seen as predicting
future data states based on past and current data. Prediction can be viewed as a
type of classification. (Note: this data mining task is different from the
prediction model, although the prediction task is a type of prediction model.)
The difference is that prediction is predicting a future state rather than a current
state. Here we are referring to a type of application rather than to a type of data
mining modeling approach, as discussed earlier. Prediction applications include
flooding, speech recognition, machine learning, and pattern recognition.
Although future values may be predicted using time series analysis or
regression techniques, other approaches may be used as well.
Example 1.2:
Predicting flooding is a complicated problem. One approach uses
monitors placed at various points along the river. These monitors collect
data relevant to water level, rainfall, time, humidity, and so on. Then the
water level at a potential flooding point can be predicted based on the past data.
1.1.5 Clustering
Clustering is similar to classification except that the groups are not
predefined, but rather defined by the data alone. Clustering is alternatively
referred to as unsupervised learning or segmentation. It can be thought of as
partitioning or segmenting the data into groups that might or might not be
disjoint. The clustering is usually accomplished by determining the similarity
among the data on predefined attributes. The most similar data are grouped into
clusters. Since the clusters are not predefined, a domain expert is often required
to interpret the meaning of the created clusters.
A special type of clustering is called segmentation. With segmentation a
database is partitioned into disjoint groupings of similar objects called
segments. Segmentation is often viewed as being identical to clustering.
1.1.6 Summarization
Summarization maps data into subsets with associated simple
descriptions. Summarization is also called characterization or generalization. It
extracts or derives representative information about the database. This may be
accomplished by actually retrieving portions of the data. Alternatively,
summary-type information can be derived from the data. The summarization
succinctly characterizes the contents of the database.
1.1.7 Association Rules
Link analysis, alternatively referred to as affinity analysis or
association, refers to the data mining task of uncovering relationships among
data. The best example of this type of application is to determine association
rules. An association rule is a model that identifies specific types of data
associations. These associations are often used in the retail sales community to
identify items that are frequently purchased together. Example 1.8 illustrates
the use of association rules in market basket analysis. Here the data analyzed
consist of information about what items a customer purchases. Associations are
also used in many other applications such as predicting the failure of
telecommunication switches.
Users of association rules must be cautioned that these are not causal
relationships. They do not represent any relationship inherent in the actual data
or in the real world. There probably is no relationship between bread and
pretzels that causes them to be purchased together. And there is no guarantee
that this association will apply in the future. However, association rules can be
used to assist retail store management in effective advertising, and inventory
control.
1.1.8 Sequence Discovery
Sequential analysis or sequence discovery is used to determine
sequential patterns in data. These patterns are based on a time sequence of
actions. These patterns are similar to associations, in that data (or events) are
found to be related, but here the relationship is based on time. Unlike a market
basket analysis, which requires the items to be purchased at the same time, in
sequence discovery the items are purchased over time in some order. Example
1.9 illustrates the discovery of some simple patterns. A similar type of
discovery can be seen in the sequence within which data are purchased. For
example, most people who purchase CD players may be found to purchase CDs
within one week.
Example 1.3: Web log analysis at XYZ Corp. The Web master periodically
analyzes the Web log data to find the most frequently accessed pages.
1.2 DATA MINING VERSUS KNOWLEDGE DISCOVERY IN
DATABASES
In fact, there have been many other names given to this process of
discovering useful (hidden) patterns in data: knowledge extraction, information
discovery, exploratory data analysis, information harvesting, and unsupervised
pattern recognition.
Definition1.1. Knowledge discovery in databases (KDD) is the
process of finding useful information and patterns in data.
Definition1.2. Data mining is the use of algorithms to extract the
information and patterns derived by the KDD process.
The KDD process is often said to be nontrivial; however, we take the
larger view that KDD is an all-encompassing concept. A traditional SQL
database query can be viewed as the data mining part of a KDD process.
Indeed, this may be viewed as somewhat simple and trivial. However, this was
not the case 30 years ago. If we were to advance 30 years into the future, we
might find that processes thought of today as nontrivial and complex will be
viewed as equally simple.
KDD is a process that involves many different steps. The input to this
process is the data, and the output is the useful information desired by the
users. However, the objective may be unclear or inexact. The process itself is
interactive and may require much elapsed time. To ensure the usefulness and
accuracy of the results of the process, interaction throughout the process with
both domain experts and technical experts might be needed. Figure 1.3
illustrates the overall KDD process.
The KDD process consists of the following five steps:
Selection: The data needed for the data mining process may be
obtained from many different and heterogeneous data sources. This first step
obtains the data from various databases, files, and nonelectronic sources.
Preprocessing: The data to be used by the process may have
incorrect or missing data. There may be anomalous data from multiple sources
involving different data types and metrics. Erroneous data may be corrected or
removed, whereas missing data must be supplied or predicted (often using data
mining tools.)
Transformation: Data from different sources must be
converted into a common format for processing. Some data may be encoded or
transformed into more usable formats. Data reduction may be used to reduce
the number of possible data values being considered.
Data mining: Based on the data mining tasks being performed,
this step applies algorithms to the transformed data to generate the desired
results.
Interpretation/evaluation: How the data mining results are
presented to the user is extremely important because the usefulness of the result
is dependent on it. Various visualization and GUI strategies are used at this last
step.
Figure 1.3 – The KDD process: selection, preprocessing, transformation, data mining, and interpretation/evaluation.
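To make the flow of these five steps concrete, here is a minimal Python sketch; the data source, the cleaning rule, and the trivial summarization step are illustrative assumptions, not part of the text above.

```python
# A hypothetical, minimal walk through the five KDD steps.

def selection():
    # Step 1: obtain data from (possibly heterogeneous) sources.
    return [("Q1", 224000), ("Q2", 408000), ("Q3", None), ("Q4", 586000)]

def preprocessing(rows):
    # Step 2: correct or remove erroneous data; here, drop rows with missing sales.
    return [(q, s) for q, s in rows if s is not None]

def transformation(rows):
    # Step 3: convert to a common, more usable format (sales in thousands).
    return [(q, s / 1000) for q, s in rows]

def data_mining(rows):
    # Step 4: apply an algorithm; here, a trivial summarization (total sales).
    return sum(s for _, s in rows)

def interpretation(result):
    # Step 5: present the result to the user.
    print(f"Total sales (thousands): {result}")

interpretation(data_mining(transformation(preprocessing(selection()))))
```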
Pixel-based: With these techniques each data value is shown as a
uniquely colored pixel.
Hierarchical: These techniques hierarchically divide the display area
(screen) into regions based on data values.
Hybrid: The preceding approaches can be combined into one display.
Any of these approaches may be two-dimensional or three-dimensional.
Visualization tools can be used to summarize data as a data mining technique
itself. In addition, visualization can be used to show the complex results of data
mining tasks.
The data mining process itself is complex. Discovered patterns must be
correctly interpreted and properly evaluated to ensure that the resulting
information is meaningful and accurate.
[Figure: Data mining draws on databases, statistics, information retrieval, machine learning, and algorithms.]
1.3 DATA MINING ISSUES
Overfitting: When a model is built for a training data set, it is desirable
that the model also fits future data. Overfitting occurs when the model does not
fit future states. This may be caused by assumptions that are made about the
data or may simply be caused by the small size of the training database. For
example, if a classification model for an employee database is built from a
training set that is quite small, the model might erroneously indicate that a
short person is anyone under five feet eight inches because there is only
one entry in the training database under five feet eight. In this case, many
future employees would be erroneously classified as short. Overfitting can arise
under other circumstances as well, even though the data are not changing.
Outliers: There are often many data entries that do not fit nicely into
the derived model. This becomes even more of an issue with very large
databases. If a model is developed that includes these outliers, then the model
may not behave well for data that are not outliers.
Interpretation of results: Currently, data mining output may require
experts to correctly interpret the results, which might otherwise be meaningless
to the average database user.
Visualization of results: To easily view and understand the output of
data mining algorithms, visualization of the results is helpful.
Large datasets: The massive datasets associated with data mining
create problems when applying algorithms designed for small datasets. Many
modeling applications grow exponentially with the dataset size and thus are too
inefficient for larger datasets. Sampling and parallelization are effective tools to
attack this scalability problem.
High dimensionality: A conventional database schema may be
composed of many different attributes. The problem here is that not all
attributes may be needed to solve a given data mining problem. In fact, the use
of some attributes may interfere with the correct completion of a data mining
task. The use of other attributes may simply increase the overall complexity
and decrease the efficiency of an algorithm. This problem is sometimes
referred to as the dimensionality curse, meaning that there are many attributes
(dimensions) involved and it is difficult to determine which ones should be
used. One solution to this high dimensionality problem is to reduce the number
of attributes, which is known as dimensionality reduction. However,
determining which attributes are not needed is not always easy to do.
Multimedia data: Most previous data mining algorithms are targeted to
traditional data types (numeric, character, text, etc.). The use of multimedia
data such as is found in GIS databases complicates or invalidates many
proposed algorithms.
Missing data: During the preprocessing phase of KDD, missing data
may be replaced with estimates. This and other approaches to handling missing
data can lead to invalid results in the data mining step.
Irrelevant data: Some attributes in the database might not be of
interest to the data mining task being developed.
Noisy data: Some attribute values might be invalid or incorrect. These
values are often corrected before running data mining applications.
Changing data: Databases cannot be assumed to be static. However,
most data mining algorithms do assume a static database. This requires that the
algorithm is completely rerun anytime the database changes.
Integration: The KDD process is not currently integrated into normal
data processing activities. KDD requests may be treated as special, unusual, or
one-time needs. This makes them inefficient, ineffective, and not general
enough to be used on an ongoing basis. Integration of data mining functions
into traditional DBMS systems is certainly a desirable goal.
Application: Determining the intended use for the information obtained
from the data mining function is a challenge. Indeed, how business executives
can effectively use the output is sometimes considered the more difficult part,
not the running of the algorithms themselves. Because the data may be of a type
that has not been mined before, it may not be clear how to effectively use the
information uncovered.
These issues should be addressed by data mining algorithms and
products.
1.4 DATA MINING METRICS
Measuring the effectiveness or usefulness of a data mining approach is
not always straightforward. In fact, different metrics could be used for different
techniques and also based on the interest level. From an overall business or
usefulness perspective, a measure such as return on investment (ROI) could be
used. ROI examines the difference between what the data mining technique
costs and what the savings or benefits from its use are. Our objective is to
compare different alternatives to implementing a specific data mining task. The
metrics used include the traditional metrics of space and time based on
complexity analysis. In some cases, such as accuracy in classification, more
specific metrics targeted to a data mining task may be used.
CHAPTER – 2
DATA PREPROCESSING
There are a number of data preprocessing techniques. Data cleaning
can be applied to remove noise and correct inconsistencies in the data. Data
integration merges data from multiple sources into a coherent data store,
such as a data warehouse or a data cube. Data transformations, such as
normalization, may be applied. For example, normalization may improve the
accuracy and efficiency of mining algorithms involving distance
measurements. Data reduction can reduce the data size by aggregating,
eliminating redundant features, or clustering, for instance.
2.1 Data Cleaning
Real-world data tend to be incomplete, noisy and inconsistent. Data
cleaning attempts to fill in missing values, smooth out noise while identifying
outliers, and correct inconsistencies in the data.
2.1.1 Missing values
Consider the data set of an electronics store, where customers are
not willing to fill in all the values; for example, the attribute Income may
contain NULL values for many tuples. There are many methods to fill in a
missing value; some of them are given below.
Ignore the tuple: If several attributes contain missing values, then
ignore the whole tuple. This is usually done when the class label is missing,
provided the task is classification. This method is not very effective. It is
especially poor when the percentage of missing values per attribute varies
considerably.
Fill in the missing value manually: This is a time consuming
approach. It is not feasible for a large data set with many missing values.
Use a global constant to fill in the missing value: Replace all missing
attribute values by the same constant, such as a label like “Unknown”. The
drawback of this method is that the mining program may mistakenly treat the
tuples that share the label “Unknown” as an interesting group.
Use the attribute mean to fill in the missing value: Average of an
attribute is used to fill the missing value. If the average of the Income attribute
is Rs. 20,000, then replace the missing income values with this mean.
Use the attribute mean for all samples belonging to the same class
as the given tuple: Instead of filling in with the average of the whole data set,
the mean of the attribute within the respective class can be used to fill the
missing value. If the target class relates to credit risk (such as yes or no), then
the mean of the Income attribute among the tuples of the yes class could be
used to fill the missing value.
Use the most probable value to fill in the missing value: The
statistical tools such as regression, Bayesian inference, or decision tree
induction can be used to predict the missing value.
Methods based on the mean bias the data, whereas statistics-based
methods estimate the missing values based on the values of the other
attributes.
2.1.2 Noisy Data
Noise is a random error or variance in a measured variable. Given are
some of the smoothing methods to remove the noise.
Binning: Binning methods smooth a sorted data value by
consulting its “neighborhood”, that is, the values around it. The sorted values
are distributed into a number of “buckets,” or bins. Smoothing using binning is
carried out in three ways:
Smoothing by bin means – Each value in a bin is replaced by bin mean
Smoothing by bin medians – Each value in bin is replaced by bin
medians
Smoothing by bin boundaries – Each value in bin is replaced by the
closest boundary value.
E.g., sorted data: 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into equidepth bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
Smoothing by bin means:
Mean of bin 1 = (4 + 8 + 15)/3 = 9
Mean of bin 2 = (21 + 21 + 24)/3 = 22
Mean of bin 3 = (25 + 28 + 34)/3 = 29
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29
Smoothing by bin medians:
Median of bin 1 = 8
Median of bin 2 = 21
Median of bin 3 = 28
Bin 1: 8, 8, 8
Bin 2: 21, 21, 21
Bin 3: 28, 28, 28
Smoothing by bin boundaries:
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34
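A minimal Python sketch of equidepth binning with smoothing by bin means and by bin boundaries, reproducing the numbers above (the helper names are our own):

```python
def equidepth_bins(values, depth):
    """Split sorted values into bins holding `depth` values each."""
    values = sorted(values)
    return [values[i:i + depth] for i in range(0, len(values), depth)]

def smooth_by_means(bins):
    # Replace each value by the mean of its bin (medians work the same way).
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    # Replace each value by the closer of the bin's minimum or maximum.
    return [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

data = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equidepth_bins(data, 3)       # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smooth_by_means(bins))         # [[9.0, 9.0, 9.0], [22.0, ...], [29.0, ...]]
print(smooth_by_boundaries(bins))    # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```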
Clustering: Outliers may be detected by clustering, where similar
values are organized into groups, or clusters. Values that fall outside of the set
of clusters may be considered outliers.
Combined computer and human inspection: Outliers may be
identified through a combination of computer and human inspection. Using
some measure, the outlier patterns can be identified, from which manually the
user can find out the real garbage entries or outliers. For example, consider a
handwritten character recognition problem: using an information-theoretic
measure, outlier patterns can be identified in a handwritten character database
used for classification. Patterns
whose surprise content is above a threshold are output to a list. A human can
then sort through the patterns in the list to identify the actual garbage ones.
This is much faster than having to manually search through the entire database.
The garbage patterns can then be excluded from use in subsequent data mining.
Regression: Data can be smoothed by fitting the data to a function,
such as with regression. Linear regression involves finding the “best” line to fit
two variables, so that one variable can be used to predict the other. Multiple
linear regression is an extension of linear regression, where more than two
variables are involved and the data are fit to a multidimensional surface. Using
regression to find a mathematical equation to fit the data helps smooth out the
noise.
The data smoothing methods also can be used as a data reduction
technique.
2.1.3 Inconsistent Data
There may be inconsistencies in the data recorded for some
transactions. Some data inconsistencies may be corrected manually using
external references. For example, errors made at data entry may be corrected
by performing a paper trace. Knowledge engineering tools may also be used to
detect the violation of known data constraints. For example, known functional
dependencies between attributes can be used to find values contradicting the
functional constraints.
There may also be inconsistencies due to data integration, where a
given attribute can have different names in different databases. Redundancies
may also exist.
2.2 Data Integration and Transformation
Data mining often requires data integration—the merging of data from
multiple data stores. The data may also need to be transformed into forms
appropriate for mining.
2.2.1 Data Integration
Data integration combines data from multiple data
stores into a coherent data store, as in data warehousing. These sources may
include multiple databases, data cubes, or flat files.
There are a number of issues to consider during data integration.
Schema integration
Entity identification problem
For example, it is not possible to automatically determine that the
attributes customer_id and cust_number in different databases represent
the same entity. Databases and data warehouses typically have metadata—that
is, data about the data. Such metadata can be used to help avoid errors
in schema integration.
Redundancy
Redundancy is another important issue. An attribute may be redundant
if it can be “derived” from another table, such as annual revenues.
Inconsistencies in attribute or dimension naming can also cause redundancies
in the resulting data set.
Some redundancies can be detected by correlation analysis. For
example, given two attributes, such analysis can measure how strongly one
attribute implies the other, based on the available data. The correlation between
attributes A and B can be measured by

r(A,B) = Σ (A − Ā)(B − B̄) / (n σA σB)

where n is the number of tuples, Ā and B̄ are the means of the values of A
and B, and σA and σB are the respective standard deviations of A and B. If the
value of r is greater than 0, then A and B are positively correlated, meaning that
the values of A increase as the values of B increase. The higher the value, the
more each attribute implies the other. Hence, a high value may indicate that A
(or B) may be removed as a redundancy. If it is equal to 0, then A and B are
independent and there is no correlation between them. If ‘r’ is less than 0, then
A and B are negatively correlated, means that the values of one attribute
increases as the values of the other attribute decrease. This means that each
attribute discourages the other.
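A small Python sketch of the correlation coefficient described above; it uses the sum-based form of the formula, which is equivalent to dividing by n·σA·σB because the factors of n cancel:

```python
import math

def correlation(a, b):
    """Pearson correlation coefficient r between two attributes."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    sd_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return cov / (sd_a * sd_b)

# r = 1.0: the second attribute is fully implied by the first (a redundancy).
print(correlation([1, 2, 3, 4], [2, 4, 6, 8]))
```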
In addition to detecting redundancies between attributes, duplication
should also be detected at the tuple level. (e.g., where there are two or more
identical tuples for a given unique data entry case).
A third important issue in data integration is the detection and
resolution of data value conflicts. For example, for the same real-world entity,
attribute values from different sources may differ. This may be due to
differences in representation, scaling, or encoding. For instance, a weight
attribute may be stored in metric units in one system and British imperial units
in another. Careful integration of the data from multiple sources can help
reduce and avoid redundancies and inconsistencies in the resulting data set.
This can help improve the accuracy and speed of the subsequent mining
process.
2.2.2 Data Transformation
In data transformation, the data are transformed or consolidated into
forms appropriate for mining. Data transformations can involve the following:
Smoothing: Removes the noise from data. Such techniques
include binning, clustering, and regression.
Aggregation: Where summary or aggregation operations are
applied to the data. For example, the daily sales data may be aggregated
so as to compute monthly and annual total amounts.
Generalization of the data: where low-level or “primitive”
(raw) data are replaced by higher-level concepts through the use of
concept hierarchies. For example, categorical attributes, like street, can
be generalized to higher-level concepts, like city or country. Similarly,
values for numeric attributes, like age, may be mapped to higher-level
concepts, like young, middle-aged, and senior.
Normalization: where the attribute data are scaled so as to fall
within a small specified range, such as – 1.0 to 1.0 or 0.0 to 1.0.
Attribute construction (or feature construction), where new
attributes are constructed and added from the given set of attributes to
help the mining process.
Smoothing is a form of data cleaning and was discussed earlier in this
section. Aggregation and generalization are also data reduction techniques;
their details are discussed under data reduction.
An attribute is normalized by scaling its values so that they fall within a
small specified range, such as 0.0 to 1.0. Normalization is particularly useful
for classification algorithms involving neural networks, or distance
measurements such as nearest neighbor classification and clustering. If using
the neural network back-propagation algorithm for classification mining,
normalizing the input values for each attribute measured in the training samples
will help to speed up the learning phase. For distance-based methods,
normalization helps prevent attributes with initially large range (e.g., income)
from outweighing attributes with initially smaller ranges (E.g., binary
attributes). There are many methods for data normalization such as, min-max
normalization, z-score normalization, and normalization by decimal scaling.
Min-max normalization performs a linear transformation on the original
data. Let minA and maxA be the minimum and maximum values of the attribute
A. Min-max normalization maps a value v to v′ in the range [newminA,
newmaxA] by computing
v′ = ((v − minA) / (maxA − minA)) (newmaxA − newminA) + newminA

In z-score normalization (or zero-mean normalization), the values of
attribute A are normalized based on the mean and standard deviation of A. A
value v of A is normalized to v′ by computing v′ = (v − Ā)/σA, where Ā and σA
are the mean and standard deviation of A. This method is useful when the actual
minimum and maximum of attribute A are unknown or when there are outliers
that dominate the min-max normalization.
Normalization by decimal scaling normalizes by moving the decimal
point of values of attribute A. The number of decimal points moved depends on
the maximum absolute value of A. A value v of A is normalized to v′ by
computing
v′ = v / 10^j, where j is the smallest integer such that max(|v′|) < 1.
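A minimal Python sketch of the three normalization methods, applied to an illustrative income list (the helper names and data are our own):

```python
def min_max(values, new_min=0.0, new_max=1.0):
    """Min-max normalization into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Zero-mean normalization using the mean and standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / sd for v in values]

def decimal_scaling(values):
    """Divide by the smallest power of 10 that brings all values below 1."""
    j = 0
    while max(abs(v) for v in values) / (10 ** j) >= 1:
        j += 1
    return [v / 10 ** j for v in values]

income = [12000, 30000, 54000, 98000]
print(min_max(income))          # scaled into [0.0, 1.0]
print(z_score(income))
print(decimal_scaling(income))  # divides by 10^5, since the maximum is 98000
```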
2.3.1 Data Cube Aggregation
Consider the sales data of an Electronics store for the year 1997 quarter
wise. Compute the annual sales for analysis. Thus the data can be aggregated
so that the resulting data summarize the total sales per year instead of per
quarter. This aggregation is illustrated in Figure 2.1. The resulting data set is smaller in
volume, without loss of information necessary for the analysis tasks.
Sales data for the year 1997, by quarter:

Quarter   Sales
Q1        224000
Q2        408000
Q3        350000
Q4        586000

Aggregated annual sales:

Year   Sales
1997   1568000
1998   2356000
1999   3594000

Figure 2.1 – Sales data of an electronics store for the years 1997 – 1999.
Data cubes store multidimensional aggregated information. For
example, Figure 2.2 shows a data cube for multidimensional analysis of sales
data with respect to annual sales per item type for each AllElectronics branch.
Each cell holds an aggregate data value, corresponding to the data point in
multidimensional space. Concept hierarchies may exist for each attribute,
allowing the analysis of data at multiple levels of abstraction. For example, a
hierarchy for branch could allow branches to be grouped into regions, based on
their address. Data cubes provide fast access to precomputed, summarized data,
thereby benefiting online analytical processing as well as data mining.
The cube created at the lowest level of abstraction is referred to as the
base cuboid. The cube at the highest level of abstraction is the apex cuboid. For
the sales data of Figure 2.2, the apex cuboid would give one total—the total
sales for all three years, for all item types, and for all branches. Data cubes
created for varying levels of abstraction are often referred to as cuboids, so that
a data cube may instead refer to a lattice of cuboids. Each higher level of
abstraction further reduces the resulting data size.
The base cuboid should correspond to an individual entity of interest,
such as sales or customer. In other words, the lowest level should be usable, or
useful for the analysis. Since data cubes provide fast access to precomputed,
summarized data, they should be used when possible to reply to queries
regarding aggregated information. When replying to such OLAP queries or
data mining requests, the smallest available cuboids relevant to the given task
should be used.
[Figure 2.2 – A data cube for AllElectronics sales: dimensions are item type (home entertainment, computer, phone, home security), branch (A, B, C, D), and year; each cell holds an aggregate sales value.]
2.3.2 Dimensionality Reduction
Dimensionality reduction reduces the data set size by removing
attributes (or dimensions) that are irrelevant or redundant for the mining task;
this is also known as attribute subset selection. Since an exhaustive search over
all attribute subsets is usually infeasible, heuristic methods are commonly used
for attribute subset
selection. These methods are typically greedy in that, while searching through
attribute space, they always make what looks to be that best choice at the time.
Their strategy is to make a locally optimal choice in the hope that this will lead
to a globally optimal solution. Such greedy methods are effective in practice
and may come close to estimating an optimal solution.
The “best” (and “worst”) attributes are typically determined using tests
of statistical significance, which assume that the attributes are independent of
one another. Many other attribute evaluation measures can be used, such as the
information gain measure used in building decision trees for classification.
Following are the common heuristic techniques used in attribute subset
selection.
1. Stepwise forward selection: The procedure starts with an
empty set of attributes. The best of the original attributes is determined
and added to the set. At each subsequent iteration or step, the best of the
remaining original attributes is added to the set.
2. Stepwise backward elimination: The procedure starts with
the full set of attributes. At each step, it removes the worst attribute
remaining in the set.
3. Combination of forward selection and backward
elimination: The stepwise forward selection and backward elimination
methods can be combined so that, at each step, the procedure selects the
best attribute and removes the worst from among the remaining
attributes.
The stopping criteria for methods 1 to 3 may vary. The procedure
may employ a threshold on the measure used to determine when to stop the
attribute selection process.
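A minimal sketch of stepwise forward selection with a threshold-based stopping criterion; score() is a hypothetical evaluation function supplied by the caller (for example, classifier accuracy or an information-gain-style measure):

```python
def forward_selection(attributes, score, threshold=0.01):
    """Greedily add the attribute that most improves score(subset);
    stop when the improvement drops below the threshold."""
    selected, best_score = [], float("-inf")
    remaining = list(attributes)
    while remaining:
        candidate = max(remaining, key=lambda a: score(selected + [a]))
        new_score = score(selected + [candidate])
        if new_score - best_score < threshold:   # stopping criterion
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = new_score
    return selected
```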
Decision tree induction: Decision tree algorithms, such as ID3 and
C4.5, were originally intended for classification. Decision tree induction
constructs a flow-chart-like structure where each internal (non leaf) node
denotes a test on an attribute, each branch corresponds to an outcome of the
test, and each external (leaf) node denotes a class prediction. At each node, the
algorithm chooses the “best” attribute to partition the data into individual
classes.
When decision tree induction is used for attribute subset selection, a
tree is constructed from the given data. All attributes that do not appear in the
tree are assumed to be irrelevant. The set of attributes appearing in the tree
form the reduced subset of attributes.
2.3.3 Data Compression
In data compression, data encoding or transformations are applied so as
to obtain a reduced or “compressed” representation of the original data. If the
original data can be reconstructed from the compressed data without any loss of
information, the data compression technique used is called lossless. If instead,
we can reconstruct only an approximation of the original data, then the data
compression technique is called lossy. There are several well-tuned algorithms
for string compression. Although they are typically lossless, they allow only
limited manipulation of the data. Two effective lossy compression methods are
wavelet transforms and principal components analysis.
Wavelet Transforms
The discrete wavelet transform (DWT) is a linear signal processing
technique that, when applied to a data vector D, transforms it to a numerically
different vector, D′, of wavelet coefficients. The two vectors are of the same
length. A compressed approximation of the data can be retained by storing only
a small fraction of the strongest of the wavelet coefficients. All wavelet
coefficients larger than the threshold level can be retained. This technique can
also be used to remove the noise.
The DWT is closely related to the Discrete Fourier transform (DFT), a
signal processing technique involving sines and cosines. In general, however,
the DWT achieves better lossy compression. That is, if the same number of
coefficients is retained for a DWT and a DFT of given data vector, the DWT
version will provide a more accurate approximation of the original data. Hence,
for an equivalent approximation, the DWT requires less space than the DFT.
Unlike the DFT, wavelets are quite localized in space, contributing to the
conservation of local detail.
There is only one DFT, yet there are several families of DWTs. The
general procedure for applying a discrete wavelet transform uses a hierarchical
pyramid algorithm that halves the data at each iteration, resulting in fast
computational speed. The method is as follows:
1. The length, L, of the input data vector must be an integer
power of 2. This condition can be met by padding the data vector with
zeros, as necessary.
2. Each transform involves applying two functions. The
first applies some data smoothing, such as a sum or weighted average.
The second performs a weighted difference, which acts to bring out the
detailed features of the data.
3. The two functions are applied to pairs of the input data,
resulting in two sets of data of length L/2. In general, these represent
a smoothed or low frequency version of the input data, and the high
frequency content of it, respectively.
4. The two functions are recursively applied to the sets of
data obtained in the previous loop, until the resulting data sets obtained
are of length 2.
5. A selection of values from the data set obtained in the
above iterations are designated the wavelet coefficient of the
transformed data.
Equivalently, a matrix multiplication can be applied to the input data in
order to obtain the wavelet coefficients, where the matrix used depends on the
given DWT. The matrix must be orthonormal, meaning that the columns are
unit vectors and are mutually orthogonal, so that the matrix inverse is just its
transpose. By factoring the matrix used into a product of a few sparse matrices,
the resulting “fast DWT” algorithm has complexity of O(n) for an input vector
of length n.
Wavelet transforms can be applied to multidimensional data, such as a
data cube. This is done by first applying the transform to the first dimension,
then to the second, and so on. The computational complexity involved is linear
with respect to the number of cells in the cube. Wavelet transforms give good
results on sparse or skewed data and on data with ordered attributes. Lossy
compression by wavelets is reportedly better than JPEG compression, the
current commercial standard. Wavelet transforms have many real-world
applications, including the compression of fingerprint images, computer vision,
analysis of time-series data, and data cleaning.
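As an illustration of the pyramid algorithm described above, here is a minimal sketch of a Haar-style DWT, using pairwise averages as the smoothing function and pairwise differences as the detail function; this is one simple member of the wavelet family, not the only possible choice:

```python
def haar_dwt(data):
    """Pyramid algorithm: repeatedly split the signal into pairwise averages
    (low-frequency part) and pairwise differences (high-frequency detail),
    then recurse on the averages until a single smoothed value remains."""
    n = len(data)
    assert n > 0 and (n & (n - 1)) == 0, "length must be a power of 2 (pad with zeros)"
    output = list(data)
    length = n
    while length > 1:
        averages = [(output[2 * i] + output[2 * i + 1]) / 2 for i in range(length // 2)]
        details = [(output[2 * i] - output[2 * i + 1]) / 2 for i in range(length // 2)]
        output[:length] = averages + details
        length //= 2
    return output  # first entry: overall average; the rest: detail coefficients

print(haar_dwt([4, 8, 15, 21, 24, 25, 28, 34]))
```

Dropping the smallest detail coefficients (setting them to zero) gives the lossy, compressed approximation discussed above.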
Principal Component analysis
PCA searches for c k-dimensional orthogonal vectors that can best be
used to represent the data, where c ≤ k. The original data are projected onto a
much smaller space, resulting in data compression. PCA is computationally
inexpensive, can be applied to ordered and unordered attributes, and can handle
sparse data and skewed data.
2.3.4 Numerosity Reduction
Numerosity reduction deals with reducing the size or volume of the
data. These techniques may be parametric or nonparametric. For parametric
methods, a model is used to estimate the data, so that typically only the data
parameters need be stored, instead of the actual data. Log-linear models, which
estimate discrete multidimensional probability distributions, are an example.
Nonparametric methods for storing reduced representations of the data include
histograms, clustering, and sampling.
Regression and Log-Linear Models
Regression and log-linear models can be used to approximate the given
data. In linear regression, the data are modeled to fit a straight line. For
example, a random variable, Y (called a response variable), can be modeled as
a linear function of another random variable, X (called a predictor variable),
with the equation
Y = α + β X
where α and β are regression coefficients that can be estimated from the data,
for example by the method of least squares.
Log-linear models approximate discrete multidimensional probability
distributions. The method can be used to estimate the probability of each cell in
a base cuboid for a set of discretized attributes, based on the smaller cuboids
making up the data cube lattice. This allows higher-order data cubes to be
constructed from lower-order ones. Log-linear models are therefore also useful
for data compression (since the smaller-order cuboids together typically occupy
less space than the base cuboid) and data smoothing.
Regression and log-linear models can both be used on sparse data
although their application may be limited. While both methods can handle
skewed data, regression does exceptionally well. Regression can be
computationally intensive when applied to high-dimensional data, while log-
linear models show good scalability for up to 10 or so dimensions.
Histograms
Histograms use binning to approximate data distributions and are a
popular form of data reduction. A histogram for an attribute A partitions the
data distribution of A into disjoint subsets, or buckets. The buckets are
displayed on a horizontal axis, while the height (and area) of a bucket typically
reflects the average frequency of the values represented by the bucket. If each
bucket represents only a single attribute-value/frequency pair, the buckets are
called singleton buckets. Often, buckets instead represent continuous ranges for
the given attribute.
There are several partitioning rules, including the following:
Equiwidth: In an equiwidth histogram, the width of
each bucket range is uniform.
Equidepth (or equiheight): In an equidepth histogram,
the buckets are created so that, roughly, the frequency of each
bucket is constant.
V-Optimal: If we consider all of the possible histograms
for a given number of buckets, the V-Optimal histogram is the one
with the least variance. Histogram variance is a weighted sum of the
original values that each bucket represents, where bucket weight is
equal to the number of values in the bucket.
MaxDiff: In a MaxDiff histogram, we consider the
difference between each pair of adjacent values. A bucket boundary
is established between the pairs having the β − 1 largest
differences, where β is the user-specified number of buckets.
V-Optimal and MaxDiff histograms tend to be the most accurate and
practical. Histograms are highly effective at approximating both sparse and
dense data, as well as highly skewed and uniform data. The histograms
described above for single attributes can be extended for multiple attributes.
Multidimensional histograms can capture dependencies between attributes.
Such histograms have been found effective in approximating data with up to
five attributes. More studies are needed regarding the effectiveness of
multidimensional histograms for very high dimensions. Singleton buckets are
useful for storing outliers with high frequency.
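A small Python sketch contrasting equiwidth and equidepth bucketing on an illustrative price list (the data values are made up for the example):

```python
def equiwidth_histogram(values, buckets):
    """Equiwidth: every bucket covers a value range of the same width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / buckets
    counts = [0] * buckets
    for v in values:
        index = min(int((v - lo) / width), buckets - 1)  # top value goes in last bucket
        counts[index] += 1
    return counts

def equidepth_histogram(values, buckets):
    """Equidepth: every bucket holds (roughly) the same number of values."""
    values = sorted(values)
    depth = len(values) // buckets
    return [values[i:i + depth] for i in range(0, len(values), depth)]

prices = [1, 1, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 15, 15, 15, 18, 18, 20]
print(equiwidth_histogram(prices, 4))   # frequency per equal-width range
print(equidepth_histogram(prices, 4))   # four buckets of five values each
```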
Clustering
Clustering techniques consider data tuples as objects. They partition the
objects into groups or clusters, so that objects within a cluster are “similar” to
one another and “dissimilar” to objects in other clusters. Similarity is
commonly defined in terms of how “close” the objects are in space, based on a
distance function. The “quality” of a cluster may be represented by its
diameter, the maximum distance between any two objects in the cluster.
Centroid distance is an alternative measure of cluster quality and is defined as
the average distance of each cluster object from the cluster centroid. In data
reduction, the cluster representations of the data are used to replace the actual
data. The effectiveness of this technique depends on the nature of the data. It is
much more effective for data that can be organized into distinct clusters than
for smeared data.
Sampling
Sampling can be used as a data reduction technique since it allows a
large data set to be represented by a much smaller random sample or subset of
the data. Suppose that a large data set, D, contains N tuples.
Simple random sample without replacement (SRSWOR) of
size n: This is created by drawing n of the N tuples from D (n < N),
where the probability of drawing any tuple in D is 1/N, that is, all
tuples are equally likely.
Simple random sample with replacement (SRSWR) of size
n: This is similar to SRSWOR, except that each time a tuple is drawn
from D, it is recorded and then replaced. That is, after a tuple is drawn,
it is placed back in D so that it may be drawn again.
Cluster sample: If the tuples in D are grouped into M mutually
disjoint “clusters,” then an SRS of m clusters can be obtained, where m
< M. For example, tuples in a database are usually retrieved a page at a
time, so that each page can be considered a cluster. A reduced data
representation can be obtained by applying, say, SRSWOR to the pages,
resulting in a cluster sample of the tuples.
Stratified sample: If D is divided into mutually disjoint parts
called strata, a stratified sample of D is generated by obtaining an SRS
at each stratum. This helps to ensure a representative sample, especially
when the data are skewed. For example, a stratified sample may be
obtained from customer data, where a stratum is created for each
customer age group. In this way, the age group having the smallest
number of customers will be sure to be represented.
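A minimal sketch of these sampling schemes using Python's standard random module; the toy data, strata, and page size are assumptions for illustration:

```python
import random

data = list(range(1, 101))           # a toy "database" D of N = 100 tuples

# SRSWOR: simple random sample without replacement.
srswor = random.sample(data, 10)

# SRSWR: each draw is replaced, so a tuple may be drawn more than once.
srswr = [random.choice(data) for _ in range(10)]

# Cluster sample: group tuples into "pages" of 20 and sample whole pages.
pages = [data[i:i + 20] for i in range(0, len(data), 20)]
cluster_sample = random.sample(pages, 2)

# Stratified sample: an SRS drawn from each stratum (here, two value ranges).
strata = {"low": [t for t in data if t <= 50], "high": [t for t in data if t > 50]}
stratified = {name: random.sample(tuples, 5) for name, tuples in strata.items()}

print(srswor, srswr, cluster_sample, stratified, sep="\n")
```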
CHAPTER - 3
DATA MINING TECHNIQUES
3.1 INTRODUCTION
There are many different methods used to perform data mining tasks.
These techniques not only require specific types of data structures, but also
imply certain types of algorithmic approaches. In this chapter we briefly
introduce some of the common data mining techniques.
Parametric models describe the relationship between input and output
through the use of algebraic equations where some parameters are not
specified. These unspecified parameters are determined by providing input
examples. Parametric modeling is either too simplistic or requires more
knowledge about the data involved than is available. Thus, for real-world
problems, these parametric models may not be useful.
Nonparametric techniques are more appropriate for data mining
applications. A nonparametric model is one that is data-driven. No explicit
equations are used to determine the model. This means that the modeling
process adapts to the data at hand. Unlike parametric modeling, where a
specific model is assumed ahead of time, the nonparametric technique creates a
model based on the input. While the parametric methods require more
knowledge about the data before the modeling process, the nonparametric
technique requires a large amount of data as input to the modeling process
itself. The modeling process then creates the model by sifting through the data.
Recent nonparametric methods have employed machine learning techniques to
be able to learn dynamically as data are added to the input. Thus, the more data,
the better the model created. Also, this dynamic learning process allows the
model to be created continuously as the data is input. These features make
nonparametric techniques particularly suitable to database applications with
large amounts of dynamically changing data. Nonparametric techniques include
neural networks, decision trees, and genetic algorithms.
3.2 A STATISTICAL PERSPECTIVE ON DATA MINING
There have been many statistical concepts that are the basis for data
mining techniques.
3.2.1 Point Estimation
Point estimation refers to the process of estimating a population
parameter, θ, by an estimate of the parameter, θ̂. This can be done to estimate
mean, variance, standard deviation, or any other statistical parameter. The
estimate of the population parameter is calculated from a sample of the
population. An estimator technique may also be used to estimate
(predict) the value of missing data. The bias of an estimator is the difference
between the expected value of the estimator and the actual value:

Bias = E(θ̂) − θ
An unbiased estimator is one whose bias is 0. While point estimators
for small data sets may actually be unbiased, for larger database applications
we would expect that most estimators are biased.
One measure of the effectiveness of an estimate is the mean squared
error (MSE), which is defined as the expected value of the squared difference
between the estimate and the actual value:

MSE(θ̂) = E[(θ̂ − θ)²]
The squared error is often examined for a specific prediction to measure
accuracy rather than to look at the average difference. For example, if the true
value for an attribute was 10 and the prediction was 5, the squared error would
be (5 − 10)² = 25. The squaring is performed to ensure that the measure is always
positive and to give a higher weighting to the estimates that are grossly
inaccurate. The MSE is commonly used in evaluating the effectiveness of data
mining prediction techniques. It is also important in machine learning. At
times, instead of predicting a simple point estimate for a parameter, one may
determine a range of values within which the true parameter value should fall.
This range is called a confidence interval.
The root mean square (RMS) may also be used to estimate error or as
another statistic to describe a distribution. Calculating the mean does not
indicate the magnitude of the values. The RMS can be used for this purpose.
Given a set of n values X = {x1, x2, ..., xn}, the RMS is defined by

RMS = √( ( Σ_{j=1}^{n} xj² ) / n )
An alternative use is to estimate the magnitude of the error. The root
mean square error (RMSE) is found by taking the square root of the MSE.
A popular estimating technique is the jackknife estimate. With this
approach, the estimate of a parameter, θ̂, is obtained by omitting one value
from the set of observed values. Suppose that there is a set of n values
X = {x1, x2, ..., xn}. An estimate for the mean obtained by omitting the ith
value is

θ̂(i) = ( Σ_{j=1}^{i−1} xj + Σ_{j=i+1}^{n} xj ) / (n − 1)

Here the subscript (i) indicates that this estimate is obtained by omitting
the ith value. Given the set of jackknife estimates θ̂(i), these can in turn be used to
obtain an overall estimate

θ̂(·) = ( Σ_{i=1}^{n} θ̂(i) ) / n
Another technique for point estimation is called the maximum
likelihood estimate (MLE). Likelihood can be defined as a value proportional
to the actual probability that with a specific distribution the given sample
exists. So the sample gives us an estimate for a parameter from the distribution.
The higher the likelihood value, the more likely the underlying distribution will
produce the results observed. Given a sample set of values X = {x1, x2, ..., xn}
from a known distribution with density f(xi | θ), the MLE can estimate parameters
for the population from which the sample is drawn. The approach obtains
parameter estimates that maximize the probability that the sample data occur
for the specific model. It looks at the joint probability for observing the sample
data by multiplying the individual probabilities. The likelihood function, L, is
thus defined as
n
L( | x1 , x2 ,... xn ) f ( xi | )
i 1
The expectation-maximization (EM) algorithm is an approach that
solves the estimation problem with incomplete data. The EM algorithm finds
an MLE for a parameter (such as a mean) using a two-step process: estimation
and maximization. The basic EM algorithm is shown in Algorithm 3.1. An
initial set of estimates for the parameters is obtained. Given these estimates and
the training data as input, the algorithm then calculates a value for the missing
data. For example, it might use the estimated mean to predict a missing value.
These data (with the new value added) are then used to determine an estimate
for the mean that maximizes the likelihood. These steps are applied iteratively
until successive parameter estimates converge. Any approach can be used to
find the initial parameter estimates.
The expectation part of the algorithm estimates the missing values using
the current estimates of the parameters. This can initially be done by finding a weighted
average of the observed data. The maximization step then finds the new
estimates for the parameters that maximize the likelihood by using those
estimates of the missing data.
Example 3.1:
Find the mean μ for data that follow the normal distribution.
Let the observed values be X = {1, 5, 10, 4}, with two data items missing.
Hence n = 6 and k = 4. Assume the initial guess μ̂0 = 3. Each missing value is
estimated by the current mean, so the MLE update for the mean is

μ̂_{j+1} = ( Σ_{i=1}^{k} xi + Σ_{i=k+1}^{n} μ̂_j ) / n

μ̂1 = (1 + 5 + 10 + 4)/6 + (3 + 3)/6 = 3.33 + 1 = 4.33
μ̂2 = 3.33 + (4.33 + 4.33)/6 = 3.33 + 1.44 = 4.77

Repeat this until the difference between successive estimates is less
than a small threshold such as 0.05 or 0.005.
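A small Python sketch of this iteration; the E-step fills each missing value with the current mean and the M-step recomputes the mean, exactly as in the hand calculation above:

```python
def em_mean(observed, n_total, mu0, tol=0.05):
    """EM-style estimate of the mean when n_total - len(observed) values are missing."""
    mu = mu0
    while True:
        filled = observed + [mu] * (n_total - len(observed))  # E-step: impute with current mean
        new_mu = sum(filled) / n_total                        # M-step: MLE of a normal mean
        if abs(new_mu - mu) < tol:
            return new_mu
        mu = new_mu

# 3 -> 4.33 -> 4.77 -> ... approaching the fixed point 5.0
print(em_mean([1, 5, 10, 4], 6, 3.0))
```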
3.2.2 Models Based on Summarization
There are many basic concepts that provide an abstraction and
summarization of the data as a whole. The basic well-known statistical
concepts such as mean, variance, standard deviation, median, and mode are
simple models of the underlying population. Fitting a population to a specific
frequency distribution provides an even better model of the data. But this is not
suitable for large and complex data.
There are also many well-known techniques to display the structure of
the data graphically. For example, a histogram shows the distribution of the
data. A box plot is a more sophisticated technique that illustrates several
different features of the population at once. Figure 3.1 shows a sample box
plot. The total range of the data values is divided into four equal parts called
quartiles. The box in the center of the figure shows the range between the first,
second, and third quartiles. The line in the box shows the median. The lines
extending from either end of the box are the values that are a distance of 1.5 of
the interquartile range from the first and third quartiles, respectively. Outliers
are shown as points beyond these values.
[Figure 3.1 – A sample box plot: the box spans the first to third quartiles with a line at the median, the whiskers extend 1.5 times the interquartile range, and outliers appear as individual points beyond the whiskers.]
3.2.3 Bayes Theorem
Definition 3.1. Bayes Rule or Bayes Theorem is

P(h1 | xi) = P(xi | h1) P(h1) / P(xi)

where
m – the number of classes or hypotheses
P(h1 | xi) – the posterior probability
P(h1) – the prior probability associated with hypothesis h1
P(xi | h1) – the conditional probability that, given hypothesis h1, the tuple
satisfies it
P(xi) – the probability of the occurrence of data value xi
TABLE 3.1: Training Data for Example 3.2
______________________________________________
ID Income Credit Class xi
______________________________________________
1 4 Excellent h1 x4
2 3 Good h1 x7
3 2 Excellent h1 x2
4 3 Good h1 x7
5 4 Good h1 x8
6 2 Excellent h1 x2
7 3 Bad h2 x11
8 2 Bad h2 x10
9 3 Bad h3 x11
10 1 Bad h4 x9
______________________________________________
Example 3.2
Suppose that a credit loan authorization problem can be associated with
four hypotheses: H = {h1, h2, h3, h4} where h1 = authorize purchase, h2 =
authorize after further identification, h3 = do not authorize, and h4 = do not
authorize but contact police. The training data for this example are shown in
Table 3.1. From training data, we find that P(h1) = 60%, P(h2) = 20%, P(h3) =
10%, P(h4) = 10%. To make our predictions, a domain expert has determined
that the attributes we should be looking at are income and credit category.
Assume that income, I, has been categorized by the ranges [0, $10,000),
[$10,000, $50,000), [$50,000, $100,000), and [$100,000, ∞). These ranges are encoded
and are shown in Table 3.1 as 1, 2, 3, and 4, respectively. Suppose that credit is
categorized as excellent, good, or bad. By combining these, we then have 12
values in the data space: D = {x1, …, x12}.
TABLE 3.2: xi Assignments for Example 3.2
___________________________________
1 2 3 4
___________________________________
Excellent x1 x2 x3 x4
Good x5 x6 x7 x8
Bad x9 x10 x11 x12
___________________________________
Example 3.2 uses the training data in Table 3.1 to illustrate the use of
Bayes rule. Example 3.2 also illustrates that we may take advantage of other
probability laws to determine combinations of probabilities. For example, we
may find P(I < $10,000 ∧ Good) by instead finding P(I < $10,000) ·
P(Good | I < $10,000).
Number of tuples under hypothesis h1 = 6
Number of tuples under hypothesis h2 = 2
Number of tuples under hypothesis h3 = 1
Number of tuples under hypothesis h4 = 1
P(x1 | h1) = 0/6 = 0
P(x7 | h1) = 2/6, P(x4 | h1) = 1/6, P(x2 | h1) = 2/6, P(x8 | h1) = 1/6
For all other values of xi the conditional probability is zero.
Estimate the hypothesis for the tuple with values (4, Excellent), that is, x4.
Thus compute P(hj | x4) for all j and select the class with the maximum
posterior probability.

P(h1 | x4) = P(x4 | h1) P(h1) / P(x4)

P(x4) = Σ_{j=1}^{4} P(x4 | hj) P(hj) = (1/6)(6/10) + 0 + 0 + 0

P(h1 | x4) = (1/6)(6/10) / ((1/6)(6/10)) = 1

Similarly compute for all other hypotheses. In this example P(h1 | x4) has
the maximum value of 1, so the given tuple is assigned to hypothesis h1.
Here a sample of the data set is used to build the model. The problem
with sampling is that the results depend on the sample chosen.
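A minimal Python sketch of this calculation over the training data of Table 3.1; tuple values are represented here as (income category, credit) pairs rather than the x1..x12 codes:

```python
from collections import Counter, defaultdict

training = [
    ((4, "Excellent"), "h1"), ((3, "Good"), "h1"), ((2, "Excellent"), "h1"),
    ((3, "Good"), "h1"), ((4, "Good"), "h1"), ((2, "Excellent"), "h1"),
    ((3, "Bad"), "h2"), ((2, "Bad"), "h2"), ((3, "Bad"), "h3"), ((1, "Bad"), "h4"),
]

prior = Counter(h for _, h in training)     # tuples per hypothesis (priors)
cond = defaultdict(Counter)                 # counts of each x within each hypothesis
for x, h in training:
    cond[h][x] += 1

def posterior(x):
    """P(h | x) for every hypothesis, computed with Bayes rule."""
    n = len(training)
    joint = {h: (cond[h][x] / prior[h]) * (prior[h] / n) for h in prior}  # P(x|h) P(h)
    evidence = sum(joint.values())          # P(x) = sum over h of P(x|h) P(h)
    return {h: p / evidence for h, p in joint.items()} if evidence else joint

print(posterior((4, "Excellent")))          # h1 receives posterior probability 1.0
```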
3.2.4 Hypothesis Testing
Hypothesis testing attempts to find a model that explains the observed
data by first creating a hypothesis and then testing the hypothesis against the
data. This is in contrast to most data mining approaches, which create the
model from the actual data without guessing what it is first. The actual data
itself drives the model creation. The hypothesis usually is verified by
examining a data sample. If the hypothesis holds for the sample, it is assumed
to hold for the population in general. Given a population, the initial (assumed)
hypothesis to be tested, H0, is called the null hypothesis. Rejection of the null
hypothesis causes another hypothesis, H1, called the alternative hypothesis, to
be made.
One technique to perform hypothesis testing is based on the use of the
chi-squared statistic. Actually, there is a set of procedures referred to as chi
squared. These procedures can be used to test the association between two
observed variable values and to determine if a set of observed variable values is
statistically significant (i.e., if it differs from the expected case). A hypothesis
is first made, and then the observed values are compared based on this
hypothesis. Assuming that O represents the observed data and E is the expected
values based on the hypothesis, the chi-squared statistic, χ², is defined as:

χ² = Σ (O − E)² / E
When comparing a set of observed variable values to determine its
statistical significance, the values are compared to those of the expected case.
This may be the uniform distribution.
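A small sketch of the chi-squared statistic against a uniform expected distribution; the observed counts are made up for illustration:

```python
observed = [48, 35, 15, 2]                              # hypothetical observed frequencies
expected = [sum(observed) / len(observed)] * len(observed)  # uniform expectation

chi_squared = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
# Compare against a chi-squared table with len(observed) - 1 degrees of freedom
# to decide whether the deviation from the hypothesis is statistically significant.
print(chi_squared)
```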
3.2.5 Regression and Correlation
Both bivariate regression and correlation can be used to evaluate the
strength of a relationship between two variables. Regression is generally used
to predict future values based on past values by fitting a set of points to a curve.
Correlation, however, is used to examine the degree to which the values for two
variables behave similarly or to check the association between variables.
Linear regression assumes that a linear relationship exists between the
input data and the output data. The common formula for a linear relationship is
used in this model:
y = c0 + c1x1 + … + cnxn
Here there are n input variables, which are called predictors or
regressors; one output variable (the variable being predicted), which is called
the response; and n+1 constants, which are chosen during the modeling process
to match the input examples (or sample). This is sometimes called multiple linear regression because there is more than one predictor.
The line generated by the linear regression technique is shown in the
figure. Note, however, that the actual data points usually do not fit the linear
model exactly. Thus, this model is an estimate of what the actual input-output
relationship is. We can use the generated linear model to predict an output
value given an input value.
Two different data variables, X and Y, may behave very similarly. Correlation is the problem of determining how much alike the two variables actually are. One standard formula to measure linear correlation is the correlation coefficient r. Given two variables, X and Y, the correlation coefficient is a real value r ∈ [−1, 1]. A positive number indicates a positive
correlation, whereas a negative number indicates a negative correlation. Here
negative correlation indicates that one variable increases while the other
decreases in value. The closer the value of r is to 0, the smaller the correlation.
A perfect relationship exists with a value of 1 or -1, whereas no correlation
exists with a value of 0. When looking at a scatter plot of the two variables, the
closer the values are to a straight line, the closer the r value is to 1 or -1. The
value for r is defined as
r = Σi (xi − X̄)(yi − Ȳ) / √( Σi (xi − X̄)² · Σi (yi − Ȳ)² )
where X̄ and Ȳ are the means of the X and Y values.
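A small Python sketch of the correlation coefficient as defined above; the two example series are made up to show a perfect positive and a perfect negative correlation.

import math

def correlation(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mean_x) ** 2 for x in xs) *
                    sum((y - mean_y) ** 2 for y in ys))
    return num / den

print(correlation([1, 2, 3, 4], [2, 4, 6, 8]))    # 1.0
print(correlation([1, 2, 3, 4], [8, 6, 4, 2]))    # -1.0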
3.3 SIMILARITY MEASURES
A similarity measure sim(ti, tj), defined between any two tuples ti, tj ∈ D, should satisfy the following:
∀ ti ∈ D, sim(ti, ti) = 1
∀ ti, tj ∈ D, sim(ti, tj) = 0 if ti and tj are not alike at all
∀ ti, tj, tk ∈ D, sim(ti, tj) < sim(ti, tk) if ti is more like tk than it is like tj
Defining similarity is a complicated task. When the idea of similarity
measure is used in classification where the classes are predefined, this problem
is somewhat simpler than when it is used for clustering where the classes are
not known in advance. Consider the IR(Information Retrieval) example. Each
IR query provides the class definition in the form of the IR query itself. So the
classification problem then becomes one of determining similarity not among
all tuples in the database but between each tuple and the query. This makes the
problem an O(n) problem rather than an O(n²) problem.
Here are some of the more common similarity measures used in traditional IR systems and, more recently, in Internet search engines:
Dice: sim(ti, tj) = 2 Σh=1..k tih tjh / ( Σh=1..k tih² + Σh=1..k tjh² )
Jaccard: sim(ti, tj) = Σh=1..k tih tjh / ( Σh=1..k tih² + Σh=1..k tjh² − Σh=1..k tih tjh )
Cosine: sim(ti, tj) = Σh=1..k tih tjh / √( Σh=1..k tih² · Σh=1..k tjh² )
Overlap: sim(ti, tj) = Σh=1..k tih tjh / min( Σh=1..k tih², Σh=1..k tjh² )
The Jaccard coefficient measures the overlap of two sets relative to the whole set formed by their union. The cosine coefficient relates the overlap to the geometric average of the two sets. The overlap metric determines the degree to which the two sets overlap.
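The four coefficients can be written directly from the formulas above; in this sketch each item is a k-dimensional numeric vector, and the query and document vectors are hypothetical.

import math

def dice(t1, t2):
    dot = sum(a * b for a, b in zip(t1, t2))
    return 2 * dot / (sum(a * a for a in t1) + sum(b * b for b in t2))

def jaccard(t1, t2):
    dot = sum(a * b for a, b in zip(t1, t2))
    return dot / (sum(a * a for a in t1) + sum(b * b for b in t2) - dot)

def cosine(t1, t2):
    dot = sum(a * b for a, b in zip(t1, t2))
    return dot / math.sqrt(sum(a * a for a in t1) * sum(b * b for b in t2))

def overlap(t1, t2):
    dot = sum(a * b for a, b in zip(t1, t2))
    return dot / min(sum(a * a for a in t1), sum(b * b for b in t2))

q = [1, 1, 0, 1]     # hypothetical query vector
d = [1, 0, 1, 1]     # hypothetical document vector
print(dice(q, d), jaccard(q, d), cosine(q, d), overlap(q, d))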
Distance or dissimilarity measures are often used instead of similarity
measures. As implied, these measure how “unlike” items are. Traditional
distance measures may be used in a two-dimensional space. These include
Euclidean: dis(ti, tj) = √( Σh=1..k (tih − tjh)² )
Manhattan: dis(ti, tj) = Σh=1..k | tih − tjh |
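Both distances follow directly from the formulas; the sample points below are arbitrary.

import math

def euclidean(t1, t2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(t1, t2)))

def manhattan(t1, t2):
    return sum(abs(a - b) for a, b in zip(t1, t2))

print(euclidean([0, 0], [3, 4]))   # 5.0
print(manhattan([0, 0], [3, 4]))   # 7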
UNIT – II
CHAPTER 4
CLASSIFICATION
4.1 INTRODUCTION
Classification is the most familiar and most popular data mining
technique. Examples of classification applications include image and pattern
recognition, medical diagnosis, loan approval, detecting faults in industry
applications, and classifying financial market trends. Estimation and prediction
may be viewed as types of classification. When someone estimates your age or
guesses the number of marbles in a jar, these are actually classification
problems. Prediction can be thought of as classifying an attribute value into one
of a set of possible classes. It is often viewed as forecasting a continuous value,
while classification forecasts a discrete value.
Example 4.1 : Classifying the students based on their grades (A, B, C, D or F)
such as
Average Grade
< 60 F
≥ 60 AND < 70 D
≥ 70 AND < 80 C
≥ 80 AND < 90 B
≥ 90 A
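The grade mapping of Example 4.1 can be written as a tiny classifier; the sketch below simply encodes the ranges above.

def grade_class(average):
    if average < 60:
        return "F"
    elif average < 70:
        return "D"
    elif average < 80:
        return "C"
    elif average < 90:
        return "B"
    return "A"

print([grade_class(a) for a in (55, 65, 75, 85, 95)])   # ['F', 'D', 'C', 'B', 'A']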
The classification techniques assume some knowledge about the data; classification is supervised learning. Classification techniques build the model from a sample of the data used as a training set, and the model is evaluated on a test data set. A domain expert may also be used in the training phase.
Definition 4.1. Given a database D = {t1, t2, …, tn} of tuples (items, records) and a set of classes C = {C1, …, Cm}, the classification problem is to define a mapping f : D → C where each ti is assigned to one class. A class, Cj, contains precisely those tuples mapped to it; that is, Cj = {ti | f(ti) = Cj, 1 ≤ i ≤ n, and ti ∈ D}.
Classification is a mapping from the database to the set of classes. The classes are predefined, non-overlapping, and partition the entire database. Each tuple in the database is assigned to exactly one class. There are three basic methods
used to solve the classification problem:
Specifying boundaries. Here classification is performed by
dividing the input space of potential database tuples into regions where each
region is associated with one class.
Using probability distributions. For any given class, Cj, P(ti |
Cj) is the PDF for the class evaluated at one point, ti. If a probability of
occurrence for each class, P(Cj) is known, then P(Cj) P(ti | Cj) is used to
estimate the probability that ti is in class Cj.
Using posterior probabilities. Given a data value ti, we would
like to determine the probability that ti is in a class Cj. This is denoted by P (Cj |
ti) and is called the posterior probability. One classification approach would be
to determine the posterior probability for each class and then assign ti to the
class with the highest probability.
A major issue in classification is overfitting. If the classification strategy fits the training data too closely, it may not be applicable to the general population of data. Consider the example given in Table 4.1, which classifies adults as short, medium or tall. The last two columns represent two classifications, Output1 and Output2. Output1 uses the simple division:
2 m ≤ height: tall
1.7 m < height < 2 m: medium
height ≤ 1.7 m: short
The output2 requires more complicated divisions using both height and
gender attributes.
The classification algorithms are categorized into
Statistical based methods
Distance based methods
Decision tree
Neural network based methods
Rule based methods
Table 4.1 Data for Height Classification
Id Gender Height (In m) Output1 Output2
1 F 1.6 Short Medium
2 M 2 Tall Medium
3 F 1.9 Medium Tall
4 F 1.88 Medium Tall
5 F 1.7 Short Medium
6 M 1.85 Medium Medium
7 F 1.6 Short Medium
8 M 1.7 Short Medium
9 M 2.2 Tall Tall
10 M 2.1 Tall Tall
11 F 1.8 Medium Medium
12 M 1.95 Medium Medium
13 F 1.9 Medium Tall
14 F 1.8 Medium Medium
15 F 1.75 Medium Medium
4.1.1 Issues in Classification
Missing Data: Missing data values cause problems during both the
training phase and the classification process itself. Missing values in the
training data must be handled and may produce an inaccurate result. Missing
data in a tuple to be classified must be able to be handled by the resulting
classification scheme. There are many approaches to handling missing data:
Ignore the missing data.
Assume a value for the missing data. This may be determined by
using some method to predict what the value could be. Statistical measures like
mean, median or mode can be used to replace the missing value.
Assume a special value for the missing data. This means that the
value of missing data is taken to be a specific value all of its own such as
NULL etc.
Measuring Performance: The performance of classification algorithms is
usually examined by evaluating the accuracy of the classification. However,
since classification is often a fuzzy problem, the correct answer may depend on
the user. Traditional algorithm evaluation approaches such as determining the
space and time overhead can be used, but these approaches are usually
secondary.
Classification accuracy is usually calculated by determining the
percentage of tuples placed in the correct class. This ignores the fact that there
also may be a cost associated with an incorrect assignment to the wrong class.
This perhaps should also be determined.
The performance of classification can be evaluated in a way similar to an information retrieval system. With only two classes, there are four possible outcomes with the classification, as shown in Figure 4.1. The upper left and lower right quadrants of both Figure 4.1(a) and (b) represent correct actions. The remaining two quadrants are incorrect actions. The performance of a classification could be determined by associating costs with each of the quadrants. However, this would be difficult because the total number of costs needed is m², where m is the number of classes.
Given a specific class, Cj, and a database tuple, ti, that tuple may or may not be assigned to that class while its actual membership may or may not be in that class. This again gives us the four quadrants shown in Figure 4.1(c), which can be described in the following ways:
True positive (TP): ti predicted to be in Cj and is actually in it.
False positive (FP): ti predicted to be in Cj but is not actually in
it.
True negative (TN): ti not predicted to be in Cj and is not
actually in it.
False negative (FN): ti not predicted to be in Cj but is actually
in it.
An OC (operating characteristic) curve, also called an ROC (receiver operating characteristic or relative operating characteristic) curve, shows the relationship between false positives and true positives. An OC curve was
originally used in the communications area to examine false alarm rate. It has
also been used in information retrieval to examine fallout (percentage of
retrieved that are not relevant) versus recall (percentage of retrieved that are
relevant). In the OC curve the horizontal axis has the percentage of false
positives and the vertical axis has the percentage of true positives for a database
sample. At the beginning of evaluating a sample, there are none of either
category, while at the end there are 100 percent of each. When evaluating the
results for a specific sample, the curve looks like a jagged stair-step, as seen in
Figure 4.2, as each new tuple is either a false positive or a true positive. A more
smoothed version of the OC curve can also be obtained.
A confusion matrix illustrates the accuracy of the solution to a classification problem. Given m classes, a confusion matrix is an m × m matrix
where entry ci,j indicates the number of tuples from D that were assigned to
class Cj but where the correct class is Ci. Obviously, the best solutions will
have only zero values outside the diagonal. Table 4.2 shows a confusion matrix
for the height example in Table 4.1 where the Output1 assignment is assumed
to be correct and the Output2 assignment is what is actually made.
[Figure 4.1: (a) retrieved vs. relevant documents (information retrieval), (b) assigned vs. actual class membership, (c) class prediction outcomes (true positive, false positive, true negative, false negative). The operating characteristic plot shows the percentage of true positives on the vertical axis against the percentage of false positives on the horizontal axis.]
Figure 4.2 Operating Characteristic Curve
Table 4.2 Confusion matrix
Actual Membership        Assignment
                         Short   Medium   Tall
Short 0 4 0
Medium 0 5 3
Tall 0 1 2
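The matrix in Table 4.2 can be reproduced by counting (actual, assigned) pairs from Table 4.1, treating Output1 as the correct class and Output2 as the assignment; the sketch below hard-codes those two columns.

actual   = ["Short", "Tall", "Medium", "Medium", "Short", "Medium", "Short", "Short",
            "Tall", "Tall", "Medium", "Medium", "Medium", "Medium", "Medium"]
assigned = ["Medium", "Medium", "Tall", "Tall", "Medium", "Medium", "Medium", "Medium",
            "Tall", "Tall", "Medium", "Medium", "Tall", "Medium", "Medium"]

classes = ["Short", "Medium", "Tall"]
matrix = {a: {p: 0 for p in classes} for a in classes}
for a, p in zip(actual, assigned):
    matrix[a][p] += 1
for a in classes:
    print(a, [matrix[a][p] for p in classes])
# Short  [0, 4, 0]
# Medium [0, 5, 3]
# Tall   [0, 1, 2]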
4.2 STATISTICAL-BASED ALGORITHMS
4.2.1 Regression
Regression problems deal with estimation of an output value based on
input values. When used for classification, the input values are values from the
database D and the output values represent the classes. Regression can be used
to solve classification problems, but it can also be used for other applications,
such as forecasting. Regression takes a set of data and fits the data to a formula.
Simple linear regression problem can be thought of as estimating the
formula for a straight line (in a two-dimensional space). This can be equated to
partitioning the data into two classes. With the banking example, these would
be to approve or reject a loan application. The straight line is the break-even
point or the division between the two classes.
Given two points in the XY plane, the formula for a straight line is y = mx + b, where m and b are the regression coefficients. This can be extended to n variables as y = c0 + c1x1 + … + cnxn, where c0, c1, …, cn are the regression coefficients. When fitting the model, the line generated by the linear equation does not fit the data exactly; the model is an estimate of the actual input-output relationship. The fitted model can be used to predict the output value for a given input value, but the prediction is an estimate rather than the actual value. If we attempt to fit data that are not linear to a linear model, the result will be a poor model of the data (Figure 4.3).
[Figure 4.3: Fitting nonlinear data with a linear model]
A straight line will rarely fit the data exactly, since the data may be noisy and may contain outliers. Hence a random error ε with mean 0 is added to the fitted linear regression equation. With k points there are k formulas of the form
yi = c0 + c1x1i + εi ,  i = 1, …, k
The method of least squares can be used to estimate the coefficients and reduce the error.
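A minimal least-squares sketch for the simple case y = c0 + c1x + ε, using the closed-form estimates for the two coefficients; the data points are made up and roughly follow y = 2x.

def fit_line(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    c1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) /
          sum((x - mean_x) ** 2 for x in xs))
    c0 = mean_y - c1 * mean_x
    return c0, c1

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 8.0, 9.8]
c0, c1 = fit_line(xs, ys)
print(c0, c1)              # intercept near 0, slope near 2
print(c0 + c1 * 6)         # predicted output for x = 6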
Regression can be used to perform classification using two different
approaches:
Division: The data are divided into regions based on class.
Prediction: Formulas are generated to predict the output class
value.
The first case views the data as plotted in an n-dimensional space
without any explicit class values shown. Through regression, the space is
divided into regions—one per class. With the second approach, a value for each
class is included in the graph. Using regression, the formula for a line to predict
class values is generated.
Linear regression techniques are easy to understand but are not applicable to complex data mining applications, and they do not handle nonnumeric data well. Instead, logistic or nonlinear regression can be used. The logistic curve gives a value between 0 and 1 that can be interpreted as the probability of class membership: p is the probability of being in the class and 1 − p is the probability of not being in it.
4.2.2 Bayesian Classification
Assuming that the contributions of all attributes are independent and that each contributes equally to the classification problem, a simple classification scheme called naïve Bayes classification has been proposed, based on Bayes rule of conditional probability. By analyzing the contribution of each “independent” attribute, a conditional probability is determined. A classification is made by combining the impact that the different attributes have on the prediction to be made. The approach is called “naïve” because it assumes independence between the various attribute values. Given a data value xi, the probability that a related tuple, ti, is in class Cj is described by P(Cj | xi). Training data can be used to determine P(xi), P(xi | Cj) and P(Cj). From these values, Bayes theorem allows us to estimate the posterior probability P(Cj | xi) and then P(Cj | ti).
Given a training set, the naïve Bayes algorithm first estimates the prior
probability P(Cj) for each class by counting how often each class occurs in the training data. For each attribute, xi, the number of occurrences of each attribute value can be counted to determine P(xi). Similarly, the probability P(xi | Cj)
can be estimated by counting how often each value occurs in the class in the
training data. A tuple in the training data may have many different attributes,
each with many values. This must be done for all attributes and all values of
attributes. We then use these derived probabilities when a new tuple must be
classified. This is why naïve Bayes classification can be viewed as both a
descriptive and a predictive type of algorithm. The probabilities are descriptive
and are then used to predict the class membership for a target tuple.
When classifying a target tuple, the conditional and prior probabilities
generated from the training set are used to make the prediction. This is done by
combining the effects of the different attribute values from the tuple. Suppose
that tuple ti has p independent attribute values {xi1, xi2, …, xip}. From the
descriptive phase, we know P(xik | Cj), for each class Cj and attribute xik. We
then estimate P(ti | Cj) by
P(ti | Cj) = ∏k=1..p P(xik | Cj)
Now, we have the needed prior probabilities P(Cj) for each class and the
conditional probability P(ti | Cj). To calculate P(ti), we can find the likelihood that the tuple is in each class and then add all these values. The probability that
ti is in a class is the product of the conditional probabilities for each attribute
value. The posterior probability P(Cj | ti) is then found for each class. The class
with the highest probability is the one chosen for the tuple. Example 4.2
illustrates the use of naïve Bayes classification.
Example 4.2:
Consider Table 4.1 with Output1 as the decision class. There are four tuples classified as short, eight as medium and three as tall. To apply Bayes theorem the height attribute is divided into ranges such as (0, 1.6], (1.6, 1.7], (1.7, 1.8], (1.8, 1.9], (1.9, 2.0], (2.0, ∞). Classify the tuple t = <16, M, 1.95 m>.
P(gender = male /short) = ¼ and
P(height = 1.95 /short) = 0 , so
P(t/short) = ¼ x 0 = 0
Similarly
P(t/medium) = 2/8 x 1/8 = 0.031
P(t/ tall) = 3/3 x 1/3 = .333
Likelihood of being short = 0 x 4/15 = 0 x 0.267 = 0
Likelihood of being Medium = 0.031 x 8/15 = 0.031 x 0.533
= 0.0166
Likelihood of being tall = 0.333 x 3/15 = 0.333 x 0.2
= 0.066
Probability of t being either short or medium or tall, P(t) is
P(t) = 0+0.0166 + 0.066 = 0.0826
Actual probabilities of each event
P(short/t) = (0 x 0.267) / 0.0826 = 0
P(medium/t) = (0.031 x 0.533) /0.0826 = 0.2
P(tall/t) = (0.333 x 0.2 )/ 0.0826 = 0.799
As P(tall/t) is with the highest probability, the given tuple is classified
as tall.
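The arithmetic of Example 4.2 can be reproduced with a few lines of Python; the conditional probabilities below are the counts read from Table 4.1 for the tuple <M, 1.95 m> (height range (1.9, 2.0]).

classes = {"short": 4, "medium": 8, "tall": 3}
total = sum(classes.values())
p_gender_m = {"short": 1/4, "medium": 2/8, "tall": 3/3}
p_height   = {"short": 0/4, "medium": 1/8, "tall": 1/3}   # range (1.9, 2.0]

likelihood = {c: p_gender_m[c] * p_height[c] * classes[c] / total for c in classes}
evidence = sum(likelihood.values())
posterior = {c: likelihood[c] / evidence for c in classes}
print(posterior)                          # tall has the largest posterior (about 0.8)
print(max(posterior, key=posterior.get))  # tall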
Table 4.3 Probabilities associated with attributes
Attribute Value Count Probabilities
short medium tall short medium Tall
4.3 DISTANCE-BASED ALGORITHMS
Similarity (or distance) measures may be used to perform classification. Consider, for example, an information retrieval query against a collection of web pages. The retrieved pages are divided into two classes: those that answer your query and those
that do not. Those that answer your query should be more alike than those that
do not answer your query. The similarity in this case is defined by the query
you state, usually a keyword list. Thus, the retrieved pages are similar because
they all contain (to some degree) the keyword list you have specified.
The idea of similarity measures can be abstracted and applied to more general classification problems. The difficulty lies in how the similarity measures are defined and applied to the items in the database. Since most similarity measures assume numeric (and often discrete) values, they might be difficult to use for more general data types; a mapping from the attribute domain to a subset of the integers may be needed.
Using a similarity measure for classification where the classes are
predefined is somewhat simpler than using a similarity measure for clustering
where the classes are not known in advance. Again, think of the IR example.
Each IR query provides the class definition in the form of the IR query itself.
So the classification problem then becomes one of determining similarity not among all tuples in the database but between each tuple and the query. This makes the problem an O(n) problem rather than an O(n²) problem.
Using the IR approach, if we have a representative of each class, we can
perform classification by assigning each tuple to the class to which it is most
similar. We assume here that each tuple, ti, in the database is defined as a
vector <ti1, ti2, …., tik>of numeric values. Likewise, we assume that each class
Cj is defined by a tuple <Cj1, Cj2, …, Cjk> of numeric values. The classification
problem is then restated as,
Definition 4.2. Given a database D = {t1, t2, …, tn} of tuples where each tuple ti = <ti1, ti2, …, tik> contains numeric values and a set of classes C = {C1, …, Cm} where each class Cj = <Cj1, Cj2, …, Cjk> has numeric values, the classification problem is to assign each ti to the class Cj such that sim(ti, Cj) ≥ sim(ti, Cl) ∀ Cl ∈ C where Cl ≠ Cj.
To calculate the similarities, a representative of each class must be determined. The representative may be the mean or mode of the class. In pattern recognition, a predefined pattern may represent each class. Once a similarity measure is defined, each item to be classified is compared with the predefined patterns and placed in the class with the highest similarity. A simple classification algorithm of this kind is the distance-based algorithm (Algorithm 4.1). Here each class is represented by its centroid, and Ci denotes the center of class Ci. Each item is compared with the representative or center of the classes; since the number of classes is fixed and small, the complexity to classify a database of n tuples is O(n).
4.3.1 Simple approach
Algorithm 4.1
Input:
C1, C2, . . . Cm - Centers of each class
t - Input tuple to classify
Output:
C – Class to which t is assigned
Simple distance-based algorithm:
dist = ∞;
for i = 1 to m do
    if dis(Ci, t) < dist then
        C = i;
        dist = dis(Ci, t);
    end if
end for
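The algorithm above translates almost line for line into Python; the class centres used here are hypothetical two-dimensional points.

import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify(centres, t):
    best_class, best_dist = None, float("inf")
    for label, centre in centres.items():
        d = euclidean(centre, t)
        if d < best_dist:                 # keep the closest centre seen so far
            best_class, best_dist = label, d
    return best_class

centres = {"C1": [1.0, 1.0], "C2": [5.0, 5.0], "C3": [9.0, 1.0]}
print(classify(centres, [4.5, 4.0]))      # C2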
4.3.2 K Nearest Neighbors
One common classification scheme based on the use of distance
measures is that of the K nearest neighbors (KNN). The KNN technique
assumes that the entire training set includes not only the data in the set but also
the desired classification for each item. In effect, the training data become the
model. When a classification is to be made for a new item, its distance to each
item in the training set must be determined. Only the K closest entries in the
training set are considered further. The new item is then placed in the class that contains the most items from this set of K closest items.
ALGORITHM 4.2
Input:
T / / Training data
K / / Number of neighbors
t / / Input tuple to classify
Output:
C / / Class to which t is assigned
KNN algorithm:
/ / Algorithm to classify tuple using KNN
N = ∅;
// Find set of neighbors, N, for t
for each d ∈ T do
    if |N| ≤ K then
        N = N ∪ {d};
    else
        if ∃ u ∈ N such that sim(t, u) ≤ sim(t, d) then
            begin
                N = N − {u};
                N = N ∪ {d};
            end
// Find class for classification
c = class to which the most u ∈ N are classified;
Example 4.3:
Using the sample data from Table 4.1 and the Output1 classification as
the training set output value, we classify the tuple < F, 1.6>. Only the height is
used for distance calculation so that both the Euclidean and Manhattan distance
measures yield the same results; that is, the distance is simply the absolute
value of the difference between the values. Suppose that K = 5, then the K
nearest neighbors to the input tuple are {< 1, F, 1.6>, <7, F, 1.6>, <5, F, 1.7>,
<8, M, 1.7>, <15, F, 1.75>}. Of these five items, four are classified as short and
one as medium. Thus, the KNN will classify the given tuple as short.
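The classification in Example 4.3 can be checked with a short KNN sketch; only the height attribute is used, so the distance is the absolute height difference, and the training pairs are the (height, Output1) values from Table 4.1.

from collections import Counter

train = [(1.6, "Short"), (2.0, "Tall"), (1.9, "Medium"), (1.88, "Medium"),
         (1.7, "Short"), (1.85, "Medium"), (1.6, "Short"), (1.7, "Short"),
         (2.2, "Tall"), (2.1, "Tall"), (1.8, "Medium"), (1.95, "Medium"),
         (1.9, "Medium"), (1.8, "Medium"), (1.75, "Medium")]

def knn(train, height, k=5):
    neighbours = sorted(train, key=lambda item: abs(item[0] - height))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

print(knn(train, 1.6))    # Short: four of the five nearest neighbours are Short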
4.4. DECISION TREE-BASED ALGORITHMS
The decision tree approach is most useful in classification problems.
With this technique a tree is constructed, to model the classification process.
Once the tree is built, it is applied to each tuple in the database and results in a
classification for that tuple. There are two basic steps in the technique: building
the tree and applying the tree to the database. Most research has focused on
how to build effective trees as the application process is straightforward.
The decision tree approach to classification is to divide the search space
into rectangular regions. A tuple is classified based on the region into which it
falls. A definition for a decision tree used in classification is given in Definition
4.3. There are alternative definitions; for example, in a binary DT the nodes could be labeled with the predicates themselves and each arc would be labeled with yes or no.
Definition 4.3. Given a database D = {t1, t2, …, tn} where ti = <ti1, …, tih>, and the database schema contains the attributes {A1, A2, …, Ah}. Also given is a set of classes C = {C1, …, Cm}. A decision tree (DT) or classification tree is a tree associated with D that has the following properties:
Each internal node is labeled with an attribute, Ai.
Each arc is labeled with a predicate that can be applied to the attribute associated with the parent.
Each leaf node is labeled with a class, Cj.
Solving the classification problem using decision trees is a two-step
process:
1. Decision tree induction: Construct a Decision tree using training
data.
2. For each ti Є D, apply the Decision Tree to determine its class.
Based on our definition of the classification problem (Definition 4.1), the constructed decision tree represents the logic needed to perform the mapping. Thus, it implicitly defines the mapping.
There are many advantages to the use of decision trees for
classification. Decision trees certainly are easy to use and efficient. Rules can
be generated that are easy to interpret and understand. They scale well for large
databases because the tree size is independent of the database size. Each tuple
in the database must be filtered through the tree. This takes time proportional to
the height of the tree, which is fixed. Trees can be constructed for data with
many attributes.
Disadvantages also exist for decision tree algorithms.
* Do not handle continuous data easily.
* Handling missing data is difficult.
* Overfitting may occur, because of the training data used to construct a
Decision tree.
* Correlations among the attributes in the database are ignored by the
decision tree process.
The problem of overfitting can be corrected via tree pruning.
ALGORITHM 4.3
Input:
D / / Training data
Output:
T
Decision tree build algorithm:
// Simplistic algorithm to illustrate naïve approach to building a decision tree
T = ∅;
Determine best splitting criterion;
T = Create root node and label with splitting attribute;
T = Add arc to root node for each split predicate and label;
for each arc do
    D = Database created by applying splitting predicate to D;
    if stopping point reached for this path then
        T′ = Create leaf node and label with appropriate class;
    else
        T′ = DTBuild(D);
    T = Add T′ to arc;
There are many decision tree algorithms. Splitting attributes are the attributes used to label the nodes in the tree, and the branches are grown based on the values of these attributes. The predicates by which the arcs in the tree are labeled are called the splitting predicates. Consider the figure given below. The trees are built with gender and height as the splitting attributes. If gender is chosen as the splitting attribute, then male and female are the splitting predicates. When height is the splitting attribute, the splitting predicates are < 1.3 m, < 1.5 m, > 1.8 m and > 2.0 m; the predicates used differ based on the gender (male or female). The efficiency of the decision tree depends on choosing the splitting attribute. The algorithm recursively adds new subtrees to each branch of the tree and terminates when the stopping criterion is reached. One simple approach is to stop the process when all the tuples belong to the same class; this class is used to label the leaf node.
[Figure 4.4 – Comparing decision trees: (a) balanced tree (split on gender, then height); (b) deep tree (split on height, then gender, then height again); (c) bushy tree (multiway split on height, with gender below); (d) tree using only the height attribute]
The following issues are faced by most DT algorithms:
Choosing splitting attributes: Which attributes to use as splitting attributes impacts the performance of applying the built DT. Some attributes are better than others. In the data shown in Table 4.1, the id attribute definitely should not be used, while gender may or may not be used. The choice of
attribute involves not only an examination of the data in the training set but
also the informed input of domain experts.
Ordering of splitting attributes: The order in which the
attributes are chosen is also important. In Figure 4.4(a) the gender attribute is
chosen first. Alternatively, the height attribute could be chosen first. As seen in
Figure 4.4(b), in this case the height attribute must be examined a second time,
requiring unnecessary comparisons.
Splits: Associated with the ordering of the attributes is the
number of splits to take. With some attributes, the domain is small, so the
number of splits is obvious based on the domain (as with the gender attribute).
However, if the domain is continuous or has a large number of values, the
number of splits to use is not easily determined.
Tree structure: To improve the performance of applying the
tree for classification a balanced tree with the fewest levels is desirable.
However, in this case, more complicated comparisons with multiway branching
[see Figure 4.4(c)] may be needed. Some algorithms build only binary trees.
Stopping criteria: The creation of the tree definitely stops
when the training data are perfectly classified. There may be situations when
stopping earlier would be desirable to prevent the creation of larger trees. This
is a trade-off between accuracy of classification and performance. In addition,
stopping earlier may be performed to prevent overfitting. It is even conceivable
that more levels than needed would be created in a tree if it is known that there
are data distributions not represented in the training data.
Training data: The structure of the DT created depends on the
training data. If the training data set is too small, then the generated tree might
not be specific enough to work properly with the more general data. If the
training data set is too large then the created tree may overfit.
Pruning: Once a tree is constructed, some modifications to the
tree might be needed to improve the performance of the tree during the
classification phase. The pruning phase might remove redundant comparisons
or remove subtrees to achieve better performance.
Figure 4.4 shows different decision trees used to classify a person according to height. Figure 4.4(a) shows a balanced tree: the tree has the same depth for every path from root to leaf. Figures 4.4(b) and (c) are not balanced. In Figure 4.4(b) the height of the tree is greater than in the others, so it performs somewhat worse. Figure 4.4(d) is built using only the height attribute. The training data and the tree induction algorithm determine the tree shape; the best-shaped tree that performs perfectly on the training set is desirable. Some algorithms create only binary trees. Binary trees are deeper, and their performance may be worse because more comparisons are needed. Pruning of trees is performed to improve the efficiency of the classification. With pruning techniques, portions of the tree may be removed or combined to reduce the overall size of the tree. A portion of the tree relating to classification using an unimportant attribute may be removed. If pruning is done during the creation of the tree, it prevents a large tree from being built; a second approach prunes the tree after it is built. The time and space complexity of DT algorithms depends on the size of the training data, q; the number of attributes, h; and the shape of the resulting tree. The time complexity to build the tree is O(h q log q). The time to classify a database of size n is based on the height of the tree, O(n log q).
4.4.1 ID3
The ID3 technique for building a decision tree is based on information
theory and attempts to minimize the expected number of comparisons. The
basic idea of the induction algorithm is to ask questions whose answers provide
the most information. The basic strategy used by ID3 is to choose splitting
attributes with the highest information gain first. The amount of information
associated with an attribute value is related to the probability of occurrence.
The concept used to quantify information is called entropy. Entropy is used to
measure the amount of uncertainty or surprise or randomness in a set of data.
Certainly, when all data in a set belong to a single class, there is no uncertainty.
In this case the entropy is zero. The objective of decision tree classification is
to iteratively partition the given data set into subsets where all elements in each
final subset belong to the same class. If an event has a probability of 1 and the event occurs, there is no surprise. As p → 0, the surprise increases. When we deal with a divide and conquer approach such as that used
with decision trees, the division results in multiple probabilities whose sum is
1. The formal definition of entropy is shown in the definition given below. The
value for entropy is between 0 and 1 and reaches a maximum when the
probabilities are all the same.
Definition 4.4. Given probabilities p1, p2, …, ps where Σi=1..s pi = 1, entropy is defined as
H(p1, p2, …, ps) = Σi=1..s pi log(1/pi)
Given a database state, D, H (D) finds the amount of order in that state.
When that state is split into s new states S = {D1, D2, …., Ds}, we can again
look at the entropy of those states. Each step in ID3 chooses the state that orders splitting the most. A database state is completely ordered if all tuples in it are in the same class. ID3 chooses the splitting attribute with the highest gain in information, where gain is defined as the difference between how much information is needed to make a correct classification before the split and how much information is needed after the split. Certainly, the split should reduce the information needed by the largest amount. This is calculated by determining the difference between the entropy of the original dataset and the weighted sum of the entropies from each of the subdivided datasets. The entropies of the split datasets are weighted by the fraction of the dataset being placed in that division. The ID3 algorithm calculates the gain of a particular split S = {D1, D2, …, Ds} of D by the following formula:
Gain(D, S) = H(D) − Σi=1..s P(Di) H(Di)
Example 4.4 and the associated Figure 4.5 illustrate this process using the height example. In this example, six divisions of the possible ranges of heights are used. This division into ranges is needed when the domain of an attribute is continuous or consists of many possible values. While the choice of these divisions is somewhat arbitrary, a domain expert should be able to perform the task.
Figure 4.5(a) illustrates a problem in that the tree has multiple splits with identical results. In addition, there is a subdivision of the range (1.9, 2.0]. Figure 4.5(b) shows an optimized version of the tree.
[Figure 4.5 – Classification problem: (a) original tree splitting height into the six ranges, with the range (1.9, 2.0] further split at 1.95 m; (b) optimized tree with splits at 1.7 m and 1.95 m giving short, medium and tall]
Example 4.4:
Consider Table 4.1 with the Output1 classification.
Decision Classes = { short, medium, tall}
No. of tuples under the category short = 4
No. of tuples under the category medium = 8
No. of tuples under the category tall = 3
Total number of tuples = 15
The entropy of the set (using base-10 logarithms) is
4/15 log(15/4) + 8/15 log(15/8) + 3/15 log(15/3) = 0.4384
Consider gender as the splitting attribute. There are nine tuples with gender F and six with gender M.
The entropy of the subset that is F is
3/9 log(9/3) + 6/9 log(9/6) = 0.2764
The entropy of the subset that are M is
1/6 log(6/1) + 2/6 log(6/2) + 3/6 log(6/3) = 0.4392
Weighted sum of the entropies = 9/15 * 0.2764 + 6/15 * 0.4392 =
0.34152
Gain in entropy for the gender attribute is
0.4384 – 0.34152 = 0.09688
When the height attribute is considered, we have 11 distinct values, but height is continuous data, so we divide it into ranges such as
(0, 1.6], (1.6, 1.7], (1.7, 1.8], (1.8, 1.9], (1.9, 2.0], (2.0, ∞)
The entropy for each range is (2/2)log(2/2) = 0, (2/2)log(2/2) = 0, (3/3)log(3/3) = 0, (4/4)log(4/4) = 0, (1/2)log(2) + (1/2)log(2) = 0.301 and (2/2)log(2/2) = 0, respectively; only the range (1.9, 2.0] mixes classes (one medium and one tall).
The weighted sum of the entropies for the height attribute is
(2/15)(0.301) = 0.0401
Gain(height) = 0.4384 − 0.0401 = 0.3983
As height has the higher information gain, it is selected as the splitting attribute for the DT. Since the attribute has many values, the number of branches, and subsequently the number of rules, would also be large. To optimize the tree construction, the number of values is reduced by dividing the continuous data into ranges based on the number of occurrences.
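The entropy and gain figures above can be verified with a short sketch using base-10 logarithms, as in the worked example.

import math

def entropy(counts):
    total = sum(counts)
    return sum((c / total) * math.log10(total / c) for c in counts if c > 0)

h_d = entropy([4, 8, 3])                                  # about 0.4384 for the whole set
h_gender = (9/15) * entropy([3, 6]) + (6/15) * entropy([1, 2, 3])
print(h_d - h_gender)                                     # gain(gender), about 0.097
h_height = (2/15) * entropy([1, 1])                       # only (1.9, 2.0] is mixed
print(h_d - h_height)                                     # gain(height), about 0.398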
4.5 NEURAL NETWORK-BASED ALGORITHMS
Neural networks, like decision trees, can be used to classify tuples. The model is built using the training data. The activation functions used are sigmoidal. To classify a tuple, some of the attribute values are given as input
to the directed graph at the corresponding source nodes. There is one sink node
for each class. The output value that is generated indicates the probability that
the corresponding input tuple belongs to that class. The tuple will then be
assigned to the class with the highest probability of membership. The learning
process modifies the labeling of the arcs to better classify tuples. Given a
starting structure and value for all the labels in the graph as each tuple in the
training set is sent through the network, the projected classification made by
the graph can be compared with the actual classification. Based on the accuracy
of the prediction, various labeling in the graph can change. This learning
process continues with all the training data or until the classification accuracy
is adequate.
Solving a classification problem using NNs involves several steps:
1. Determine the number of output nodes as well as what attributes
should be used as input. The number of hidden layers (between the source and
the sink nodes) also must be decided. This step is performed by a domain
expert.
2. Determine weights (labels) and functions to be used for the
graph.
3. For each tuple in the training set, propagate it through the network and compare the output prediction with the actual result. If the prediction is accurate, adjust the labels to ensure that this prediction has a higher output weight the next time. If the prediction is not correct, adjust the weights to provide a lower output value for this class.
4. For each tuple ti Є D, propagate ti through the network and make
the appropriate classification.
Issues:
Attributes (number of source nodes) - Determining which attributes to use as inputs; this is similar to choosing the splitting attributes in a decision tree.
Number of hidden layers- in the simplest case, there is only
one hidden layer.
Number of hidden nodes- Choosing the best number of hidden
nodes per hidden layer is one of the most difficult problems when using NNs.
The number of hidden nodes depends on the structure of the NN, types of
activation functions, training algorithm, and problem being solved. If too few
hidden nodes are used, the target function may not be learned (under fitting). If
too many nodes are used, overfitting may occur. Rules of thumb are often
given that are based on the size of the training set.
Training data- As with Decision Trees, with too much training
data, the NN may suffer from overfitting, while too little and it may not be able
to classify accurately enough.
Number of sinks- Although it is usually assumed that the
number of output nodes is the same as the number of classes, this is not always
the case. For example, with two classes there could only be one output node,
with the resulting value being the probability of being in the associated class.
Subtracting this value from one would give the probability of being in the
second class.
Interconnections- In the simplest case, each node is connected
to all nodes in the next level.
Weights- The weight assigned to an arc indicates the relative
weight between those two nodes. Initial weights are usually assumed to be
small positive number and are assigned randomly.
Activation functions- Many different types of activation
functions can be used.
Learning technique: The technique for adjusting the weights is
called the learning technique. Although many approaches can be used, the most
common approach is some form of back propagation.
Stop- The learning may stop when all the training tuples have
propagated through the network or may be based on time or error rate.
Advantages of using NNs in classification:
NNs are more robust than decision trees because of the weights.
The NN improves its performance by learning. This may continue even after the training set has been applied.
The use of NNs can be parallelized for better performance.
NNs have a low error rate and thus a high degree of accuracy once the appropriate training has been performed.
NNs are more robust than decision trees in noisy environments.
Disadvantages of using NN in classification:
NN’s are difficult to understand.
Generating rules from NN’s is not straightforward.
Input attributes should be numeric.
If the training data is small, overfitting may occur.
The learning phase may fail to converge.
It is quite expensive to use.
4.5.1 Propagation
Given a tuple of values input to the NN, X = <x1, x2, . . . xh>, then each
node accepts one input. Then the summation and activation functions are
applied at each node, with an output value created for each output arc from that
node. These values are in turn sent to the subsequent nodes. This process
continues until a tuple of output values, Y = <y1, …, ym>, is produced from the
nodes in the output layer. The process of propagation is shown in Algorithm
4.4 using a neural network with one hidden layer. Here a hyperbolic tangent
activation function is used for the nodes in the hidden layer, while a sigmoid
function is used for nodes in the output layer. We assume that the constant c in
the activation function has been provided. We also use k to be the number of
edges coming into a node.
Algorithm 4.4
Input:
N / / neural network
x = <x1, …, xh> / / Input tuple consisting of values for input
attributes only
Output:
Y - <Y1, …, Ym> / / Tuple consisting of output values from NN
Propagation algorithm:
/ / Algorithm illustrates propagation of a tuple through a NN
for each node i in the input layer do
    Output xi on each output arc from i;
for each hidden layer do
    for each node i do
        Si = Σj=1..k (wji xji);
        for each output arc from i do
            Output (1 − e^(−c·Si)) / (1 + e^(−c·Si));
for each node i in the output layer do
    Si = Σj=1..k (wji xji);
    Output yi = 1 / (1 + e^(−c·Si));
A simple application of propagation is shown for the height data; here the classification performed is the same as that seen with the decision tree in Figure 4.4(d).
[Figure: a simple NN for the height data, with input nodes for gender (arc weight 0) and height (arc weight 1) and output nodes f1 (small), f2 (medium) and f3 (tall)]
The output nodes f1, f2 and f3 are associated with the classes. The weight on the arcs from the gender node is 0, whereas the weight from the height node is 1.
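A sketch of Algorithm 4.4 in Python, with one hidden layer using the hyperbolic tangent activation and sigmoid output nodes; the weight matrices, the constant c and the input tuple are made-up values for illustration.

import math

def tanh_act(s, c=1.0):
    return (1 - math.exp(-c * s)) / (1 + math.exp(-c * s))

def sigmoid(s, c=1.0):
    return 1 / (1 + math.exp(-c * s))

def propagate(x, w_hidden, w_output):
    # hidden layer: each node sums its weighted inputs, then applies tanh
    hidden = [tanh_act(sum(w * xi for w, xi in zip(weights, x))) for weights in w_hidden]
    # output layer: each sink node sums the hidden outputs, then applies the sigmoid
    return [sigmoid(sum(w * h for w, h in zip(weights, hidden))) for weights in w_output]

x = [0.0, 1.95]                                   # <gender, height> input tuple
w_hidden = [[0.0, 1.0], [0.5, -0.5]]              # hypothetical hidden-layer weights
w_output = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # one weight row per output class
print(propagate(x, w_hidden, w_output))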
4.6 RULE-BASED ALGORITHMS
One straightforward way to perform classification is to generate if-then
rules that cover all cases. For example, we could have the following rules to
determine classification of grades:
If 90 ≤ grade, then class = A
If 80 ≤ grade and grade < 90, then class = B
If 70 ≤ grade and grade < 80, then class = C
If 60 ≤ grade and grade < 70, then class = D
If grade < 60, then class = F
A classification rule, r = <a, c>, consists of the if or antecedent part, a, and the then or consequent portion, c. The antecedent contains a predicate that can be evaluated as true or false against each tuple in the database (and obviously in the training data). These rules relate directly to the corresponding
DT that could be created. A DT can always be used to generate rules, but they
are not equivalent. There are differences between rules and trees:
The tree has an implied order in which the splitting is
performed. Rules have no order.
A tree is created based on looking at all classes. When
generating rules, only one class must be examined at a time.
There are algorithms that generate a rule from trees as well as
algorithms that generate rules without first creating Decision Trees.
4.6.1 Generating Rules from a Decision tree
The process to generate a rule from a decision tree is straightforward and is outlined in Algorithm 4.5. This algorithm will generate a rule for each leaf node in the decision tree. All rules with the same consequent
could be combined together by ORing the antecedents of the simpler rules.
ALGORITHM 4.5.
Input:
T / / Decision tree
Output:
R / / Rules
General algorithm:
// Illustrate simple approach to generating classification rules from a
DT
R = ∅
for each path from root to a leaf in T do
    a = True
    for each non-leaf node on the path do
        a = a ∧ (label of node combined with label of incident outgoing arc)
    c = label of leaf node
    R = R ∪ {r = <a, c>}
Using this algorithm, the following rules are generated for the Decision
tree in Figure 4.5(a):
{<(Height ≤ 1.6 m), Short>
<((Height > 1.6 m) ^ (Height ≤ 1.7 m)), Short>
<((Height > 1.7 m) ^ (Height ≤ 1.8 m)), Medium>
<((Height > 1.8 m) ^ (Height ≤ 1.9 m)), Medium>
<((Height > 1.9 m) ^ (Height ≤ 2 m) ^ (Height ≤ 1.95 m)), Medium>
<((Height > 1.9 m) ^ (Height ≤ 2 m) ^ (Height > 1.95 m)), Tall>
<(Height > 2 m), Tall> }
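The same kind of rules can be produced programmatically by walking every root-to-leaf path and ANDing the arc predicates, as Algorithm 4.5 describes; the nested tree structure below is a hypothetical encoding of the optimized tree of Figure 4.5(b).

tree = ("Height", [("<= 1.7 m", "Short"),
                   ("> 1.7 m and <= 1.95 m", "Medium"),
                   ("> 1.95 m", "Tall")])

def generate_rules(node, antecedent=()):
    if isinstance(node, str):                     # leaf node: emit one rule <a, c>
        return [(" AND ".join(antecedent), node)]
    attribute, branches = node
    rules = []
    for predicate, child in branches:
        rules += generate_rules(child, antecedent + (attribute + " " + predicate,))
    return rules

for a, c in generate_rules(tree):
    print("IF " + a + " THEN class = " + c)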
4.7 COMBINING TECHNIQUES
[Figure 4.7 – Combination of multiple classifiers: tuples in Class 1 and Class 2, shown as correctly or incorrectly classified by each classifier]
Here the weights, wk, can be assigned by a user or learned based on the past accuracy of each classifier. Another technique is to choose the classifier that has the best accuracy on a database sample. This is referred to as dynamic classifier selection (DCS). Another variation is simple voting: assign the tuple to the class to which a majority of the classifiers have assigned it. This may have to be modified slightly in case there are many classes and no majority is found.
Given a tuple to classify, the neighborhood around it is first determined, then the tuples in that neighborhood are classified by each classifier, and finally the accuracy for each class is measured. By examining the accuracy across all classifiers for each class, the tuple is placed in the class that has the highest local accuracy. In effect, the class chosen is that to which most of its neighbors are accurately classified, independent of classifier.
UNIT - III
CHAPTER 5
CLUSTERING
5.1 INTRODUCTION
Clustering is similar to classification in that data are grouped. However,
unlike classification, the groups are not predefined. Instead, the grouping is
accomplished by finding similarities between data according to characteristics
found in the actual data. The groups are called clusters. Some authors view
clustering as a special type of classification. Clustering is defined as,
Set of like elements. Elements from different clusters are not
alike.
The distance between points in a cluster is less than the distance
between a point in the cluster and any point outside it.
A term similar to clustering is database segmentation, where
like tuples (records) in a database are grouped together. This is done to
partition or segment the database into components that then give the user a
more general view of the data. A simple example of clustering is found in
Example 5.1.
Example 5.1:
An international catalog company wanted to group its customers based on common features. Company management does not have any predefined labels for these groups. Based on the resultant groups, management can change the marketing strategies for the groups of customers. The information available in the database is Income, Age, Children, Marital Status and Education. Table 5.1 shows the sample data. Not all attributes are always important; for example, if the advertisement is for a special sale of children’s clothes, then the targeted group will be the persons with children.
One possible grouping or clustering that could be obtained from the sample data is shown in the table. The first group of people have children and a high school degree. The second group contains persons with no children and a high school degree. The third group has both children and a college degree. The last two groups have higher incomes and at least a college degree. Based on the attributes chosen, it is possible to get different clusterings.
Table -5.1 Sample data for Example 5.1
Income Age Children Marital Status Education
25000 35 3 Single High School
15000 25 1 Married High School
20000 40 0 Single High School
30000 20 0 Divorced High School
20000 25 3 Divorced College
70000 60 0 Married College
90000 30 0 Married Graduate School
200000 45 5 Married Graduate School
100000 50 2 Divorced College
However, with clustering, this may not be the case. Thus, when the clustering
process finishes creating a set of clusters, the exact meaning of each cluster
may not be obvious. Here is where a domain expert is needed to assign a label
or interpretation for each cluster.
There is no one correct answer to a clustering problem. In fact,
many answers may be found. The exact number of clusters required is not easy
to determine. Again, a domain expert may be required. For example, suppose
we have a set of data about plants that have been collected during a field trip.
Without any prior knowledge of plant classification, if we attempt to divide this
set of data into similar groupings, it would not be clear how many groups
should be created.
Another related issue is what data should be used for clustering.
Unlike learning during a classification process, where there is some a priori
knowledge concerning what the attributes of each classification should be, in
clustering we have no supervised learning to aid the process. Indeed, clustering
can be viewed as similar to unsupervised learning.
Basic features of clustering (As opposed to classification):
The (best) number of clusters is not known.
There may not be any a priori knowledge concerning the clusters.
Cluster results are dynamic.
The clustering problem is stated as shown in Definition 5.1. Here we
assume that the number of clusters to be created is an input value, k. The actual
content (and interpretation) of each cluster, Kj, 1 ≤ j ≤ k, is determined as a
result of the function definition. Without loss of generality, we will view that
the result of solving a clustering problem is that a set of clusters is created: K =
{K1, K2, …., Kk}.
Definition 5.1. Given a database D = {t1, t2, …, tn} of tuples and an integer value k, the clustering problem is to define a mapping f : D → {1, …, k} where each ti is assigned to one cluster Kj, 1 ≤ j ≤ k. A cluster, Kj, contains precisely those tuples mapped to it; that is, Kj = {ti | f(ti) = Kj, 1 ≤ i ≤ n, and ti ∈ D}.
A classification of the different types of clustering algorithms is shown
in Figure 5.2. Clustering algorithms themselves may be viewed as hierarchical
or partitional. With hierarchical clustering, a nested set of clusters is created.
Each level in the hierarchy has a separate set of clusters. At the lowest level,
each item is in its own unique cluster. At the highest level, all items belong to
the same cluster. With hierarchical clustering, the desired number of clusters is
not input. With partitional clustering, the algorithm creates only one set of
clusters. These approaches use the desired number of clusters to drive how the
final set is created. Traditional clustering algorithms tend to be targeted to
small numeric databases that fit into memory. There are, however, more recent
clustering algorithms that look at categorical data and are targeted to larger,
perhaps dynamic databases. Algorithms targeted to larger databases may adapt
to memory constraints by either sampling the database or using data structures,
which can be compressed or pruned to fit into memory regardless of the size of
the database. Clustering algorithms may also differ based on whether they
produce overlapping or non-overlapping clusters. Even though we consider only non-overlapping clusters, it is possible to place an item in multiple clusters. In turn, overlapping clusters can be viewed as extrinsic or intrinsic. Extrinsic techniques use labeling of the items to assist in the classification
process. These algorithms are the traditional classification supervised learning
algorithms in which a special input training set is used. Intrinsic algorithms do
not use any a priori category labels, but depend only on the adjacency matrix
containing the distance between objects. All algorithms we examine in this
chapter fall into the intrinsic class.
[Figure 5.2: Classification of clustering algorithms]
5.2 SIMILARITY AND DISTANCE MEASURES
A tuple
within one cluster is more like tuples within that cluster than it is similar to
tuples outside it. As with classification, then, we assume the definition of a
similarity measure, sim(ti, tj), defined between any two tuples, ti, tj Є D. Thus
similarity measure is defined as given below.
Definition 5.2. Given a database D = {t1, t2, …, tn} of tuples, a similarity measure, sim(ti, tj), defined between any two tuples, ti, tj ∈ D, and an integer value k, the clustering problem is to define a mapping f : D → {1, …, k} where each ti is assigned to one cluster Kj, 1 ≤ j ≤ k. Given a cluster Kj, ∀ tjl, tjm ∈ Kj and ti ∉ Kj, sim(tjl, tjm) > sim(tjl, ti).
A distance measure, dis(ti, tj), as opposed to similarity, is often used in clustering. The clustering problem then has the desirable property that, given a cluster Kj, ∀ tjl, tjm ∈ Kj and ti ∉ Kj, dis(tjl, tjm) ≤ dis(tjl, ti).
Some clustering algorithms look only at numeric data. Numeric data assumes metric data points; metric attributes satisfy the triangular inequality. Given a cluster Km of N points {tm1, tm2, …, tmN}, the centroid, radius and diameter are defined as
Centroid: Cm = ( Σi=1..N tmi ) / N
Radius: Rm = √( Σi=1..N (tmi − Cm)² / N )
Diameter: Dm = √( Σi=1..N Σj=1..N (tmi − tmj)² / (N (N − 1)) )
Here the centroid is the “middle” of the cluster; it need not be an actual point in the cluster. Some clustering algorithms alternatively assume that the cluster is represented by one centrally located object in the cluster called the medoid. The radius is the square root of the average mean squared distance from any point in the cluster to the centroid, and the diameter is the square root of the average mean squared distance between all pairs of points in the cluster. Mm indicates the medoid for cluster Km.
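These three quantities are easy to compute for a small numeric cluster; the points below are made-up two-dimensional examples.

import math

def centroid(points):
    n, dims = len(points), len(points[0])
    return [sum(p[d] for p in points) / n for d in range(dims)]

def radius(points):
    c, n = centroid(points), len(points)
    return math.sqrt(sum(sum((p[d] - c[d]) ** 2 for d in range(len(c)))
                         for p in points) / n)

def diameter(points):
    n = len(points)
    total = sum(sum((p[d] - q[d]) ** 2 for d in range(len(p)))
                for i, p in enumerate(points) for j, q in enumerate(points) if i != j)
    return math.sqrt(total / (n * (n - 1)))

cluster = [[1.0, 1.0], [2.0, 1.0], [1.5, 2.0]]
print(centroid(cluster), radius(cluster), diameter(cluster))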
Many clustering algorithms require that the distance between clusters
(rather than elements) be determined. This is not an easy task given that there
are many interpretations for distance between clusters. Given clusters Ki and
Kj, there are several standard alternatives to calculate the distance between
clusters. A representative list is:
Single link: Smallest distance between an element in one
cluster and an element in the other, dis(Ki, Kj) = min(dis(til, tjm)) ∀ til Є Ki,
til ∉ Kj and ∀ tjm Є Kj, tjm ∉ Ki.
Complete link: Largest distance between an element in one
cluster and an element in the other, dis(Ki, Kj) = max(dis(til, tjm)) ∀ til Є Ki,
til ∉ Kj and ∀ tjm Є Kj, tjm ∉ Ki.
Average: Average distance between an element in one cluster
and an element in the other, dis(Ki, Kj) = mean(dis(til, tjm)) ∀ til Є Ki,
til ∉ Kj and ∀ tjm Є Kj, tjm ∉ Ki.
Centroid: if clusters have a representative centroid, then the
centroid distance is defined as the distance between the centroids, dis (Ki, Kj)
= dis (Ci, Cj), where Ci is the centroid for Ki and Cj is the centroid for Kj.
Medoid: Using a medoid to represent each cluster, the distance
between the clusters can be defined by the distance between the medoids:
dis (Ki, Kj) = dis(Mi, Mj).
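As an illustration, the following minimal Python sketch (not part of the original text; the two small clusters are made-up 2-D points) computes the single link, complete link, average, and centroid distances defined above.

    # Minimal sketch of the inter-cluster distance alternatives listed above.
    from itertools import product
    from math import dist  # Euclidean distance between two points (Python 3.8+)

    def single_link(Ki, Kj):
        return min(dist(a, b) for a, b in product(Ki, Kj))

    def complete_link(Ki, Kj):
        return max(dist(a, b) for a, b in product(Ki, Kj))

    def average_link(Ki, Kj):
        pairs = list(product(Ki, Kj))
        return sum(dist(a, b) for a, b in pairs) / len(pairs)

    def centroid_link(Ki, Kj):
        ci = [sum(x) / len(Ki) for x in zip(*Ki)]   # centroid of Ki
        cj = [sum(x) / len(Kj) for x in zip(*Kj)]   # centroid of Kj
        return dist(ci, cj)

    Ki = [(1, 1), (2, 1)]     # illustrative clusters
    Kj = [(5, 4), (6, 5)]
    print(single_link(Ki, Kj), complete_link(Ki, Kj),
          average_link(Ki, Kj), centroid_link(Ki, Kj))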
5.3 OUTLIERS
Outliers are sample points with values much different from those of the
remaining set of data. Outliers may represent errors in the data (perhaps a
malfunctioning sensor recorded an incorrect data value) or could be correct
data values that are simply much different from the remaining data. A person
who is 2.5 meters tall is much taller than most people. In analyzing the height
of individuals, this value probably would be viewed as an outlier.
Some clustering techniques do not perform well with the presence of
outliers. Clustering algorithms may actually find and remove outliers to ensure
that they perform better. However, care must be taken in actually removing
outliers. For example, suppose that the data mining problem is to predict
flooding. Extremely high water level values occur very infrequently, and when
compared with the normal water level values may seem to be outliers.
However, removing these values may not allow the data mining algorithms to
work effectively because there would be no data that showed that floods ever
actually occurred.
Outlier detection, or outlier mining, is the process of identifying outliers
in a set of data. Clustering, or other data mining, algorithms may then choose to
remove or treat these values differently. Some outlier detection techniques are
based on statistical techniques. These usually assume that the set of data
follows a known distribution and that outliers can be detected by well-known
tests such as discordance tests. However, these tests are not very realistic for
real-world data because real-world data values may not follow well-defined
data distributions. Also, most of these tests assume a single attribute value,
while many attributes are involved in real-world datasets. Alternative detection
techniques may be based on distance measures.
5.4 HIERARCHICAL ALGORITHMS
Hierarchical clustering algorithms actually create sets of clusters.
Example 5.2 illustrates the concept. Hierarchical algorithms differ in how the
sets are created. A tree data structure, called a dendrogram, can be used to
illustrate the hierarchical clustering technique and the sets of different clusters.
The root in a dendrogram tree contains one cluster where all elements are
together. The leaves in the dendrogram each consist of a single element cluster.
Internal nodes in the dendrogram represent new clusters formed by merging the
clusters that appear as its children in the tree. Each level in the tree is
associated with the distance measure that was used to merge the clusters. All
clusters created at a particular level were combined because the children
clusters had a distance between them less than the distance value associated
with this level in the tree. A dendrogram for example 5.2 is seen in Figure 5.3.
Example 5.2
[Figure: XY scatter plot of the six elements A, B, C, D, E, and F used in
Example 5.2; F lies far from the other points.]
The above figure shows six elements A, B, C, D, E, and F. Depending on the
desired number of clusters k, the objects are grouped. If k = 1, all objects are
grouped into one cluster. If k = 2, cluster 1 contains the objects {A, B, C, D, E}
and cluster 2 contains {F}; as F is at the farthest distance from the others, it
cannot be combined with them. The dendrogram for this problem is given in Figure 5.3.
[Figure 5.3: Dendrogram for Example 5.2, with leaves A, B, C, D, E, and F.]
The space complexity for hierarchical algorithms is O(n²), because this
is the space required for the adjacency matrix. The space required for the
dendrogram is O(kn), which is much less than O(n²). The time complexity for
hierarchical algorithms is O(kn²) because there is one iteration for each level in
the dendrogram. Depending on the specific algorithm, however, this could
actually be O(maxd n²), where maxd is the maximum distance between points.
Different algorithms may actually merge the closest clusters from the next
lowest level or simply create new clusters at each level with progressively
larger distance.
Hierarchical techniques are well suited for many clustering applications
that naturally exhibit a nesting relationship between clusters.
5.4.1 Agglomerative Algorithms
Agglomerative algorithms start with each individual item in its own
cluster and iteratively merge clusters until all items belong in one cluster.
Different agglomerative algorithms differ in how the clusters are merged at
each level. Algorithm 5.1 illustrates the typical agglomerative clustering
algorithm. It assumes that a set of elements and distances between them is
given as input. We use an n x n vertex adjacency matrix, A, as input. Here the
adjacency matrix, A, contains a distance value rather than a simple Boolean
value: A[i, j] = dis(ti, tj). The output of the algorithm is a dendrogram, DE,
which we represent as a set of ordered triples <d, k, K> where d is the threshold
distance, k is the number of clusters, and K is the set of clusters. The
dendrogram in Figure 5.4(a) would be represented by the following:
{<0, 5, {{A}, {B}, {C}, {D}, {E}}>,
<1, 3, {{A, B}, {C, D}, {E}}>,
<2, 2, {{A, B, C, D}, {E}}>,
<3, 1, {{A, B, C, D, E}}>}
Outputting the dendrogram produces a set of clusters rather than just
one clustering. The user can determine which of the clusters (based on distance
threshold) he or she wishes to use.
ALGORITHM 5.1
Input:
D = {t1, t2, …, tn} / / Set of elements
A // Adjacency matrix showing distance between elements
Output:
DE / / Dendrogram represented as a set of ordered triples
Agglomerative algorithm:
d = 0;
k = n;
K = {{t1}, …, {tn}};
DE = {<d, k, K>}; // Initially dendrogram contains each element in
its own cluster.
repeat
oldk = k;
d = d+1;
Ad = Vertex adjacency matrix for graph with threshold distance of d;
<k, K> = NewClusters(Ad, D);
if oldk ≠ k then
DE = DE U <d, k, K>
// New set of clusters added to a dendrogram.
Until k = 1
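The following minimal Python sketch (illustrative only; the 5 x 5 distance matrix is a toy example chosen to reproduce the dendrogram triples shown above) implements the loop of Algorithm 5.1 with a single-link NewClusters that merges clusters whenever some pair of their elements is within the threshold distance d.

    # Sketch of Algorithm 5.1 with a single-link NewClusters: at threshold d,
    # clusters are merged whenever some pair of their elements is within
    # distance d (connected components of the threshold graph).
    def new_clusters(A, clusters, d):
        merged = [set(c) for c in clusters]
        changed = True
        while changed:
            changed = False
            for i in range(len(merged)):
                for j in range(i + 1, len(merged)):
                    if any(A[a][b] <= d for a in merged[i] for b in merged[j]):
                        merged[i] |= merged[j]   # merge cluster j into cluster i
                        del merged[j]
                        changed = True
                        break
                if changed:
                    break
        return merged

    def agglomerative(items, A):
        clusters = [{t} for t in items]
        DE = [(0, len(clusters), [set(c) for c in clusters])]  # dendrogram triples
        d = 0
        while len(clusters) > 1:
            d += 1
            clusters = new_clusters(A, clusters, d)
            if len(clusters) != DE[-1][1]:       # number of clusters changed
                DE.append((d, len(clusters), [set(c) for c in clusters]))
        return DE

    # Toy symmetric distance matrix over items 0..4 (think A..E).
    A = [[0, 1, 2, 2, 3],
         [1, 0, 2, 2, 3],
         [2, 2, 0, 1, 3],
         [2, 2, 1, 0, 3],
         [3, 3, 3, 3, 0]]
    for triple in agglomerative(range(5), A):
        print(triple)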
NewClusters is a procedure used to determine the creation of next level
of clusters from the previous level. This is where the different types of
agglomerative algorithms differ. It is possible that only two clusters from the
prior level are merged or that multiple clusters are merged. Algorithms also
differ in terms of which clusters are merged when there are several clusters
with identical distance. In addition, the technique used to determine the
distance between clusters may vary. Single link, complete link, and average
link techniques are perhaps the most well known agglomerative techniques
based on graph theory concepts.
All agglomerative approaches experience excessive time and space
constraints. The space required for the adjacency matrix is O(n²), where there
are n items to cluster. Because of the iterative nature of the algorithm the
matrix (or a subset of it) must be accessed multiple times. The simplistic
algorithm provided in algorithm 5.1 performs at most maxd examinations of
this matrix, where maxd is the largest distance between any two points. In
addition, the complexity of the NewClusters procedure could be expensive.
This is a potentially severe problem in large databases. Another issue with the
agglomerative approach is that it is not incremental. Thus, when new elements
are added or old ones are removed or changed, the entire algorithm must be
rerun.
Single Link Technique: The single link technique is based on the idea of
finding maximal connected components in a graph. A connected component is
a graph in which there exists a path between any two vertices. With the single
link approach, two clusters are merged if there is at least one edge that connects
the two clusters, that is, if the minimum distance between any two points is less
than or equal to the threshold distance being considered. For this reason, it is
often called the nearest neighbor clustering technique. Example 5.3 illustrates
this process.
The single link algorithm is obtained by replacing the NewClusters
procedure in the agglomerative algorithm with a procedure to find connected
components of a graph. We assume that this connected components procedure
has as input a graph (actually represented by a vertex adjacency matrix and set
of vertices) and as outputs a set of connected components defined by a number
(indicating the number of components) and an array containing the membership
of each component. Note that this is exactly what the last two entries in the
ordered triple are used for by the dendrogram data structure.
The single link approach is quite simple, but it suffers from several
problems. This algorithm is not very efficient because the connected
components procedure, which is an O(n²) space and time algorithm, is called at
each iteration. A more efficient algorithm could be developed by looking at
which clusters from an earlier level can be merged at each step. Another
problem is that the clustering creates clusters with long chains.
An alternative view to merging clusters in the single link approach is
that two clusters are merged at a stage where the threshold distance is d if the
minimum distance between any vertex in one cluster and any vertex in the
other cluster is at most d.
There have been other variations of the single link algorithm. One
variation, based on the use of a minimum spanning tree (MST), is shown in
Algorithm 5.2. Here we assume that a procedure, MST, produces a minimum
spanning tree given an adjacency matrix as input. The clusters are merged in
increasing order of the distance found in the MST. In the algorithm we show
that once two clusters are merged, the distance between them in the tree
becomes ∞ (so that the same pair is never selected again). Alternatively, we
could have replaced the two nodes and edge with one node.
ALGORITHM 5.2
Input:
D = {t1, t2, …, tn} / / Set of elements
A // Adjacency matrix showing distance between elements
Output:
DE / / Dendrogram represented as a set of ordered triples
MST single link algorithm:
d = 0;
k = n;
k = {{t1}, …, {tn}};
DE = {<d, k, K>}; / / Initially dendrogram contains each element in
its own cluster.
repeat
oldk = k;
Ki, Kj = two clusters closest together in MST;
K = K – {Ki} – {Kj} U {Ki U Kj};
k = oldk – 1;
d = dis(Ki, Kj);
DE = DE U <d, k, K>; / / New set of clusters added to the dendrogram.
dis(Ki, Kj) = ∞; / / The merged pair is not considered again.
until k = 1
We illustrate this algorithm using the data in Example 5.3. Figure 5.5
shows one MST for the example. The algorithm will merge A and B and then C
and D (or the reverse). These two clusters will then be merged at a threshold of
2. Finally, E will be merged at a threshold of 3. Note that we get exactly the
same dendrogram as in Figure 5.4(a).
The time complexity to create the MST is O(n²), and this dominates the
time of the algorithm. Once the MST is created, having n − 1 edges, the repeat
loop will be repeated only n − 1 times.
The single linkage approach is infamous for its chain effect; that is, two
clusters are merged if only two of their points are close to each other. There
may be points in the respective clusters to be merged that are far apart, but this
has no impact on the algorithm. Thus, resulting clusters may have points that
are not related to each other at all, but simply happen to be near (perhaps via a
transitive relationship) points that are close to each other.
Complete Link Algorithm: The complete link algorithm is
similar to the single link algorithm; it looks for cliques rather than connected
components. A clique is a maximal graph in which there is an edge between
any two vertices. Here a procedure is used to find the maximum distance
between any clusters so that two clusters are merged if the maximum distance
is less than or equal to the distance threshold. In this algorithm, we assume the
existence of a procedure, clique, which finds all cliques in a graph. As with the
single link algorithm, this is expensive because it is an O(n²) algorithm.
Clusters found with the complete link method tend to be more compact
than those found using the single link technique. Using the data found in
Example 5.3., Figure 5.4(b) shows the dendrogram created. A variation of the
complete link algorithm is called the farthest neighbor algorithm. Here the
closest clusters are merged where the distance is the smallest measured by
looking at the maximum distance between any two points.
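Assuming SciPy is available, the following short sketch (the five 2-D points are illustrative, not the data of Example 5.3) contrasts single link and complete link clustering: single link chains the first four points into one cluster, while complete link keeps them apart, which illustrates the chain effect discussed above.

    # Single link vs. complete link with SciPy on illustrative points.
    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    points = np.array([[1.0, 1.0], [2.0, 1.0], [4.0, 1.0],
                       [5.0, 1.0], [9.0, 1.0]])

    # Each row of a linkage matrix records one merge and its distance,
    # i.e. the information carried by one dendrogram level.
    Z_single = linkage(points, method='single')
    Z_complete = linkage(points, method='complete')

    # Cut both dendrograms at distance threshold 2.5 and compare the labels.
    print(fcluster(Z_single, t=2.5, criterion='distance'))
    print(fcluster(Z_complete, t=2.5, criterion='distance'))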
Example 5.3:
Average Link: The average link technique merges two clusters if the
average distance between any two points in the two target clusters is below the
distance threshold. The algorithm used here is slightly different from that found
in single and complete link algorithms because we must examine the complete
graph (not just the threshold graph) at each stage. Thus, we restate this
algorithm in Algorithm 5.3.
k = oldk -1;
DE = DE U <d, k, K>;
// New set of clusters added to a dendrogram.
until k = 1
Note that in this algorithm we increment d by 0.5 rather than by 1. This
is a rather arbitrary decision based on understanding of the data. Certainly, we
could have used an increment of 1, but we would have had a dendrogram
different from that seen in Figure 5.4( c).
The solution with the best value for the criterion function is the clustering
solution used. One common measure is a squared error metric, which measures
the squared distance from each point to the centroid of the associated cluster:

squared error = Σ j=1..k Σ ti Є Kj (ti − Cj)²

This definition assumes that each tuple has only one numeric value as
opposed to a tuple with many attribute values. The algorithm assumes that the
desired number of clusters, k, is given as an input parameter. Note that the initial values
for the means are arbitrarily assigned. These could be assigned randomly or
perhaps could use the values from the first k input items themselves. The
convergence criteria could be based on the squared error, but they need not be.
For example, the algorithm could stop when no (or a very small) number of
tuples are assigned to different clusters. Other termination techniques have
simply looked at a fixed number of iterations. A maximum number of iterations
may be included to ensure stopping even without convergence.
ALGORITHM 5.4
Input:
D = {t1, t2, …, tn} / / Set of elements
k / / Number of desired clusters
Output:
K / / Set of clusters
K-means algorithm:
assign initial values for means m1, m2, …, mk;
repeat
assign each item ti to the cluster which has the closest mean;
calculate new mean for each cluster;
until convergence criteria are met;
The K-means algorithm is illustrated in Example 5.4.
The time complexity of K-means is O(tkn), where t is the number of
iterations, k is the number of clusters, and n is the number of objects. K-means
finds a local optimum and may actually miss the global optimum. K-means
does not work on categorical data because the mean cannot be computed for
nominal data. Only convex-shaped clusters are found. It also does not handle
outliers well. A few variations of K-means exist to cluster categorical data.
K-modes is a categorical clustering algorithm, similar to K-means, in which
means are replaced by modes. A typical value for k is 2 to 10.
Although the K-means algorithm often produces good results, it is not
time-efficient and does not scale well. By saving distance information from one
iteration to the next, the actual number of distance calculations that must be
made can be reduced.
Some K-means variations examine ways to improve the chances of
finding the global optimum. This often involves careful selection of the initial
clusters and means. Another variation is to allow clusters to be split and
merged. The variance within a cluster is examined, and if it is too large, a
cluster is split. Similarly, if the distance between two cluster centroids is less
than a predefined threshold, they will be combined.
Example 5.4:
Consider the items to cluster as
{ 2,4,10,12,3,20,30,11,25}
And assume k= 2. Initially assign the means of two clusters, m1 and m2
as 2 and 4 respectively.
Let t = 10. The distance between m1 and t is |10 − 2| = 8, and the distance
between m2 and t is |10 − 4| = 6, so 10 is assigned to the second cluster. The
remaining items are assigned in the same way and the means are recomputed
until they no longer change, as sketched below.
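A minimal, illustrative Python sketch of the K-means loop for this example (it assumes no cluster ever becomes empty, which holds for this data):

    # Example 5.4 continued: 1-D K-means with k = 2 and the initial means
    # m1 = 2 and m2 = 4 given in the text.
    items = [2, 4, 10, 12, 3, 20, 30, 11, 25]
    means = [2.0, 4.0]

    while True:
        # Assign each item to the cluster with the closest mean.
        clusters = [[], []]
        for t in items:
            j = min(range(2), key=lambda i: abs(t - means[i]))
            clusters[j].append(t)
        # Recompute the mean of each cluster (assumes no cluster is empty).
        new_means = [sum(c) / len(c) for c in clusters]
        if new_means == means:   # convergence: no mean changed
            break
        means = new_means

    print(clusters, means)   # e.g. [[2, 4, 10, 12, 3, 11], [20, 30, 25]]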
5.6 CLUSTERING LARGE DATABASES
Recent research at Microsoft has examined how to efficiently perform
the clustering algorithms with large databases. The basic idea of this scaling
approach is as follows:
1. Read a subset of the database into main memory.
2. Apply clustering technique to data in memory.
3. Combine results with those from prior samples.
4. The in-memory data are then divided into three different types:
those items that will always be needed even when the next sample is brought
in, those that can be discarded with appropriate updates to data being kept in
order to answer the problem, and those that will be saved in a compressed
format. Based on the type, each data item is then kept, deleted, or compressed
in memory.
5. If termination criteria are not met, then repeat from step 1.
This approach has been applied to the K-means algorithm and has been
shown to be effective.
5.6.1 BIRCH
BIRCH (balanced iterative reducing and clustering using hierarchies) is
designed for clustering a large amount of metric data. It assumes that there
may be a limited amount of main memory and achieves linear I/O time,
requiring only one database scan. It is incremental and hierarchical, and it uses
an outlier handling technique: points that are found in sparsely populated
regions are removed. The basic idea of the algorithm is that a tree is built that captures
needed information to perform clustering. The clustering is then performed on
the tree itself, where labeling of nodes in the tree contain the needed
information to calculate distance values. A major characteristic of the BIRCH
algorithm is the use of the clustering feature, which is a triple that contains
information about a cluster (see Definition 5.3). The clustering feature provides
a summary of the information about one cluster. By this definition it is clear
that BIRCH applies only to numeric data. This algorithm uses a tree called a
CF tree as defined in Definition 5.4. The size of the tree is determined by a
threshold value, T, associated with each leaf node. This is the maximum
diameter allowed for any leaf. Here diameter is the average of the pair wise
distance between all points in the cluster. Each internal node corresponds to a
cluster that is composed of the subclusters represented by its children.
Definition 5.3. A clustering feature (CF) is a triple (N, LS, SS), Where
the number of the points in the cluster is N, LS is the sum of the points in
the cluster, and SS is the sum of the squares of the points in the cluster.
Definition 5.4. A CF tree is a balanced tree with a branching factor
(maximum number of children a node may have) B. Each leaf node also
represents a cluster and contains a CF entry for each subcluster in it. A
subcluster in a leaf node must have a diameter no greater than a given
threshold value T.
Unlike a dendrogram, a CF tree is searched in a top-down fashion. Each
node in the CF tree contains clustering feature information about its
subclusters. As points are added to the clustering problem, the CF tree is built.
A point is inserted into the cluster (represented by a leaf node) to which it is
closest. If the diameter for the leaf node is greater than T, then a splitting and
balancing of the tree is performed (similar to that used in a B-tree). The
algorithm adapts to main memory size by changing the threshold value. A
larger threshold T yields a smaller CF tree. This process can be performed
without rereading the data. The clustering feature data provides enough
information to perform this condensation. The complexity of the algorithm is O
(n).
ALGORITHM 5.5
Input:
D = {t1, t2, …, tn} / / Set of elements
T / / Threshold for CF tree construction
Output:
K / / Set of clusters
When an item is inserted into a cluster at the leaf node of the tree, the cluster
must satisfy the threshold value. If it does, then the CF entry for that cluster is
modified. If it does not, then that item is added to that node as a single-item
cluster.
Node splits occur if no space exists in a given node. This is based on the
size of the physical page, because each node's size is determined by the page size.
An attractive feature of the CF values is that they are additive; that is, if two
clusters are merged, the resulting CF is the addition of the CF values for the
starting clusters. Once the tree is built, the leaf nodes of the CF tree represent
the current clusters.
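A minimal Python sketch (1-D points, illustrative names) of the clustering feature and its additive merge is given below; the centroid and radius of a merged cluster are computed from the CF values alone, without revisiting the points.

    # Clustering feature CF = (N, LS, SS) for 1-D points and its additive merge.
    from math import sqrt

    def cf(points):
        return (len(points), sum(points), sum(p * p for p in points))

    def merge(cf1, cf2):
        # CF values are additive: component-wise sum.
        return tuple(a + b for a, b in zip(cf1, cf2))

    def centroid(cf_):
        N, LS, SS = cf_
        return LS / N

    def radius(cf_):
        # sqrt of the average squared distance of the points to the centroid.
        N, LS, SS = cf_
        return sqrt(SS / N - (LS / N) ** 2)

    c1 = cf([1.0, 2.0, 3.0])
    c2 = cf([10.0, 12.0])
    c12 = merge(c1, c2)              # CF of the merged cluster
    print(centroid(c12), radius(c12))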
In reality, this algorithm, Algorithm 5.5, is only the first of several steps
proposed for the use of BIRCH with large databases. The complete outline of
steps is:
1. Create the initial CF tree using a modified version of Algorithm 5.5.
This in effect "loads" the database into memory. If there is insufficient
memory to construct the CF tree with a given threshold, the threshold value is
increased and a new, smaller CF tree is constructed. This can be done by
inserting the leaf nodes of the previous tree into the new, smaller tree.
2. The clustering represented by the CF tree may not be natural
because each entry has a limited size. In addition, the input order can
negatively impact the results. These problems can be overcome by another
global clustering approach applied to the leaf nodes in the CF tree. Here each
leaf node is treated as a single point for clustering. Although the original work
proposes a centroid-based agglomerative hierarchical clustering algorithm to
cluster these clusters, other clustering algorithms could be used.
3. The last phase (which is optional) reclusters all points by placing
them in the cluster that has the closest centroid. Outliers, points that are too far
from any centroid, can be removed during this phase.
BIRCH is linear in both space and I/O time. The choice of the threshold
value is crucial to an efficient execution of the algorithm; otherwise, the
tree may have to be rebuilt many times to ensure that it can be memory-
resident, which gives a worst-case time complexity of O(n²).
5.7 CLUSTERING WITH CATEGORICAL ATTRIBUTES
Traditional algorithms do not always work with categorical data.
Example 5.5 illustrates some problems that exist when clustering categorical
data. This example uses a hierarchical-based centroid algorithm to illustrate the
problems. The problem illustrated here is that the centroid tends to weaken the
relationship between the associated cluster and other clusters. The problem
worsens as more and more clusters are merged. The number of attributes
appearing in the mean increases, while the individual values actually decrease.
This makes the centroid representations very similar and makes distinguishing
between clusters difficult.
The ROCK (Robust Clustering using links) clustering algorithm is
targeted to both Boolean data and categorical data. A novel approach to
identifying similarity is based on the number of links between items. A pair of
items are said to be neighbors if their similarity exceeds some threshold. This
need not be defined based on a precise metric, but rather a more intuitive
approach using domain experts could be used. The number of links between
two items is defined as the number of common neighbors they have. The
objective of the clustering algorithm is to group together points that have more
links. The algorithm is a hierarchical agglomerative algorithm using the
number of links as the similarity measure rather than a measure based on
distance.
Instead of using a Euclidean distance, a different measure, such as the
Jaccard coefficient, has been proposed. One proposed similarity measure based
on the Jaccard coefficient is defined as

sim(ti, tj) = |ti ∩ tj| / |ti ∪ tj|

To decide which clusters to merge, a goodness measure based on links is used:

g(Ki, Kj) = link(Ki, Kj) / [ (ni + nj)^(1+2f(θ)) − ni^(1+2f(θ)) − nj^(1+2f(θ)) ]

Here link(Ki, Kj) is the number of links between the two clusters, and
ni and nj are the number of points in each cluster. The denominator is used to
normalize the number of links because larger clusters would be expected to
have more links simply because they are larger. ni^(1+2f(θ)) is an estimate for the
number of links between pairs of points in Ki when the threshold used for the
similarity measure is θ. The function f(θ) depends on the data, but it is found
to satisfy the property that each item in Ki has approximately ni^f(θ) neighbors in
the cluster. If all points in the cluster are connected, f(θ) = 1, and then ni³ is the
number of links between points in Ki.
The first step in the algorithm converts the adjacency matrix into a
Boolean matrix where an entry is 1 if the two corresponding points are
neighbors. As the adjacency matrix is of size n², this is an O(n²) step. The next
step converts this into a matrix indicating the links. This can be found by
calculating S x S, which can be done in O(n^2.37). The hierarchical clustering
portion of the algorithm then starts by placing each point in the sample in a
separate cluster. It then successively merges clusters until k clusters are found.
To facilitate this processing, both local and global heaps are used. A local heap,
q, is created to represent each cluster. Here q contains every cluster that has a
nonzero link to the cluster that corresponds to this cluster. Initially, a cluster is
created for each point, ti. The heap for ti, q[ti], contains every cluster that has a
nonzero link to {ti}. The global heap contains information about each cluster.
All information in the heap is ordered based on the goodness measure.
Example 5.5
Consider an information retrieval system where documents may
contain keywords { book, water, sun , sand, swim, read}. Suppose there are
four documents where the first contains the word { book}, the second contains
{ water, sun, sand, swim}, the third contains { water, sun swim, read} and the
fourth contains {read, sand}. These values can be represented using Boolean
points such as if the value is present in the document assign 1 else assign 0.
The first document contains only the word book thus the respective Boolean
value is (1,0,0,0,0,0). Similarly the Boolean values for other documents are
(0,1,1,1,1,0), (0,1,1,0,1,1) and (0,0,0,1,0,1). The adjacency matrix is computed
by Euclidean distance.
1 2 3 4
1 0 2.24 2.24 1.73
2 2.24 0 1.41 2
3 2.24 1.41 0 2
4 1.73 2 2 0
The distance between the points 2 and 3 is the smallest(1.41) and thus
they are merged. We get a cluster containing {(0,1,1,1,1,0),(0,1,1,0,1,1)} with a
centroid of (0,1,1,0.5,1,0.5). Now the distance between the new centroid to the
original points 1 and 4 are 2.24 and 1.73 respectively, while the distance
between original points 1 and 4 is 1.73. Next we can merge these points, but 1
and 4 have no word in common. So with k= 2, we get the clusters {1,4} and
{2,3}.
Example 5.6
Instead of using Euclidean distance, similarity measure is used for
the above example and it is given in the table.
1 2 3 4
1 1 0 0 0
2 0 1 0.6 0.2
3 0 0.6 1 0.2
4 0 0.2 0.2 1
The similarity between the first and second documents is 0, since they have
nothing in common. The similarity between documents 2 and 3 is
|{water, sun, swim}| / |{water, sun, swim, sand, read}| = 3/5, i.e., 0.6. Let the
threshold value be 0.2. Then the neighbor pairs are {(2,3), (2,4), (3,4)}; in
addition, each object is a neighbor of itself, giving {(1,1), (2,2), (3,3), (4,4)}.
The number of common neighbors between objects 1 and 1 is 1, between 1 and
2 is 0, between 2 and 3 is 3, and so on.
1 2 3 4
1 1 0 0 0
2 0 3 3 3
3 0 3 3 3
4 0 3 3 3
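The link computation of this example can be reproduced with the following minimal Python sketch (the similarity table and threshold are those given above; variable names are illustrative):

    # Links in Example 5.6: build the Boolean neighbor matrix from the
    # similarity table (threshold 0.2, every object a neighbor of itself)
    # and count common neighbors by multiplying the matrix with itself.
    sim = [[1.0, 0.0, 0.0, 0.0],
           [0.0, 1.0, 0.6, 0.2],
           [0.0, 0.6, 1.0, 0.2],
           [0.0, 0.2, 0.2, 1.0]]
    theta = 0.2
    n = len(sim)

    # Neighbor (Boolean) matrix S: S[i][j] = 1 if the similarity reaches theta.
    S = [[1 if sim[i][j] >= theta else 0 for j in range(n)] for i in range(n)]

    # Link matrix: (S x S)[i][j] = number of common neighbors of i and j.
    link = [[sum(S[i][k] * S[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]
    for row in link:
        print(row)   # reproduces the table above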
UNIT - IV
CHAPTER – 6
ASSOCIATION RULES
6.1 INTRODUCTION
The purchasing of one product when another product is purchased
represents an association rule. Association rules are frequently used by retail
stores to assist in marketing, advertising, floor placement, and inventory
control. Although they have direct applicability to retail businesses, they have
been used for other purposes as well, including predicting faults in
telecommunication networks. Association rules are used to show the
relationships between data items. These uncovered relationships are not
inherent in the data, as with functional dependencies, and they do not represent
any sort of causality or correlation. Instead, association rules detect common
usage of items.
A database in which an association rule is to be found is viewed as a set
of tuples, where each tuple contains a set of items. For example, a tuple could
be {PeanutButter, Bread, Jelly}, which consists of the three items
PeanutButter, Bread, and Jelly. Keeping grocery store cash register transactions
in mind, each item represents an item purchased, while each tuple is the list of
items purchased at one time. In the simplest case, we are not interested in
quantity or cost, so these may be removed from the records before processing.
Table 6.1 lists five transactions and five items: {Beer, Bread, Jelly, Milk,
PeanutButter}.
The support of an item (or set of items) is the percentage of transactions
in which that item (or items) occurs. Table 6.2 shows the support for all subsets
of items from total set. As seen, there is an exponential growth in the set of
items. In this case we could have 31(25 -1) sets of items from the original set of
five items (ignoring the empty set). This explosive growth in potential sets of
items is an issue that most association rule algorithms must contend with, as the
conventional approach to generating association rules is counting the
occurrences of sets of items in the transaction database.
TABLE 6.1: Sample Data to Illustrate Association Rules
Transaction    Items
t1             Bread, Jelly, PeanutButter
t2             Bread, PeanutButter
t3             Bread, Milk, PeanutButter
t4             Beer, Bread
t5             Beer, Milk
The selection of association rules is based on these two values as
described in the definition of the association rule problem in Definition 6.4.
Confidence measures the strength of the rule, whereas support measures how
often it should occur in the database. Typically, large confidence values and a
smaller support are used.
Consider the rule R1: Bread => PeanutButter. From the data in Table 6.1, this
rule has a support of 60% and a confidence of 75%.
Definition 6.4. Given a set of items I = {I1, I2, …….,Im} and a database
of transaction D = {t1, t2, …, tn} where ti = {Ii1, Ii2, …….,Iik}and Iij Є I, an
association rule problem is to identify all association rules X => Y with a
minimum support and confidence. These values (s, α) are given as input to the
problem.
The efficiency of association rule algorithms usually is discussed with
respect to the number of scans of the database that are required and the
maximum number of itemsets that must be counted.
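As an illustration, the following minimal Python sketch computes support and confidence over the Table 6.1 transactions listed above (the function names are illustrative, not from the text):

    # Support and confidence for a rule X => Y over the Table 6.1 transactions.
    transactions = [
        {"Bread", "Jelly", "PeanutButter"},   # t1
        {"Bread", "PeanutButter"},            # t2
        {"Bread", "Milk", "PeanutButter"},    # t3
        {"Beer", "Bread"},                    # t4
        {"Beer", "Milk"},                     # t5
    ]

    def support(itemset):
        """Fraction of transactions containing every item in the itemset."""
        return sum(itemset <= t for t in transactions) / len(transactions)

    def confidence(X, Y):
        """support(X U Y) / support(X)."""
        return support(X | Y) / support(X)

    print(support({"Bread", "PeanutButter"}))        # 0.6
    print(confidence({"Bread"}, {"PeanutButter"}))   # 0.75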
6.2 LARGE ITEMSETS
The most common approach to finding association rules is to break up
the problem into two parts:
1. Find large itemsets as defined in Definition 6.5.
2. Generate rules from the frequent itemsets.
An itemset is any subset of the set of all items, I.
Definition 6.5 A large (frequent) itemset is an itemset whose number of
occurrences is above a threshold, s. We use the notation L to indicate the
complete set of large itemsets and l to indicate a specific large itemset.
Once the large itemsets have been found, we know that any interesting
association rule, X => Y, must have X U Y in this set of frequent itemsets.
Note that any subset of a large itemset is also large. Because of the large
number of notations used in association rule algorithms, we summarize them in
Table 6.4. When a specific term has a subscript, this indicates the size of the set
being considered. For example, lk is a large itemset of size k. Some algorithms
divide the set of transactions into partitions. In this case, we use p to indicate
the number of partitions and a superscript to indicate which partition. For
example, Di is the ith partition of D.
Finding large itemsets generally is quite easy but very costly. The naïve
approach would be to count all itemsets that appear in any transaction. Given a
set of items of size m, there are 2^m subsets. Since we are not interested in the
empty set, the potential number of large itemsets is then 2^m − 1. Because of the
explosive growth of this number, the challenge of solving the association rule
problem is often viewed as how to efficiently determine all large itemsets.
(When m = 5, there are potentially 31 itemsets. When m = 30, this becomes
1073741823.) Most association rule algorithms are based on smart ways to
reduce the number of itemsets to be counted. These potentially large itemsets
are called candidates, and the set of all counted (potentially large) itemsets is
the candidate itemset, C. One performance measure used for association rule
algorithms is the size of C. Another problem to be solved by association rule
algorithms is what data structure is to be used during the counting process. A
tree or hash trees are common.
When all large itemsets are found, generating the association rules is
straightforward. Algorithm 6.1, outlines this technique. In this algorithm we
use a function support, which returns the support for the input itemset.
TABLE 6.4: Association Rule Notation
___________________________________________
Term Description
___________________________________________
D Database of transactions
ti Transaction in D
s Support
α Confidence
X, Y Itemsets
X => Y Association rule
L Set of large itemsets
l Large itemset in L
C Set of candidate itemsets
p Number of partitions
_____________________________________________
ALGORITHM 6.1
Input:
D / / Database of transactions
I / / Items
L / / Large itemsets
s / / Support
α / / Confidence
Output:
R // Association Rules satisfying s and α
ARGen algorithm:
R = Ø;
for each l Є L do
for each x ⊂ l such that x ≠ Ø and x ≠ l do
if support(l) / support(x) ≥ α then
R = R U {x => (l − x)};
For example, suppose the large itemsets found for the data in Table 6.1 are
L = {{Beer}, {Bread}, {Milk}, {PeanutButter}, {Bread,
PeanutButter}}
To generate the association rule, consider the itemset {Bread,
PeanutButter}.
Support(Bread, PeanutButter) = 60%,
Support(Bread) = 80%,
and
Support(PeanutButter) = 60%.
The confidence of Bread => PeanutButter is therefore 60/80 = 75%, and the
confidence of PeanutButter => Bread is 60/60 = 100%.
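A minimal Python sketch of this rule-generation step is given below; the support values are the percentages derivable from Table 6.1, and the names are illustrative:

    # Generate rules x => (l - x) from each large itemset l, keeping those
    # whose confidence meets the threshold alpha.
    from itertools import combinations

    supports = {   # support percentages derived from Table 6.1
        frozenset({"Beer"}): 40, frozenset({"Bread"}): 80,
        frozenset({"Milk"}): 40, frozenset({"PeanutButter"}): 60,
        frozenset({"Bread", "PeanutButter"}): 60,
    }
    alpha = 50     # minimum confidence, in percent

    rules = []
    for l in supports:
        for r in range(1, len(l)):                     # proper, nonempty subsets
            for x in map(frozenset, combinations(l, r)):
                if supports[l] / supports[x] * 100 >= alpha:
                    rules.append((set(x), set(l - x)))
    print(rules)   # Bread => PeanutButter and PeanutButter => Bread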
[Figure 6.1: (a) the lattice of itemsets for {A, B, C, D}; (b) the subsets of ACD.
The figure illustrates downward closure: every subset of a large itemset must
also be large.]
used to generate Ci+1. An itemset is considered as a candidate only if all its
subsets also are large. To generate candidates of size i + 1, joins are made of
large itemsets found in the previous pass. Table 6.5 shows the process using
the data found in Table 6.1 with s = 30% and α = 50%. There are no candidates
of size three because there is only one large itemset of size two.
An algorithm called Apriori-Gen is used to generate the candidate
itemsets for each pass after the first. All singleton itemsets are used as
candidates in the first pass. Here the set of large itemsets of the previous pass,
Li-1, is joined with itself to determine the candidates. Individual itemsets must
have all but one item in common in order to be combined.
Consider the table 6.1 again.
When k= 1,
The 1-candidate itemsets are {Bread}, {Jelly},{Peanutbutter}, {Milk},
{Beer}. Assume s = 30% and α = 50%. Out of the five transactions, at least two
transactions should contain an itemset for it to reach the support of 30%.
ItemSet Supportcount
Bread 4
Jelly 1
Peanutbutter 3
Milk 2
Beer 2
1- Candidate itemset
ItemSet Supportcount
Bread 4
Peanutbutter 3
Milk 2
Beer 2
Itemset with the specified support s = 30%
Generate the 2-item candidate set from this table. The possible
combinations are {Bread, PeanutButter}, {Bread, Milk}, { Bread, Beer},
{PeanutButter, Milk}, {PeanutButter, Beer} and {Milk,Beer}.
ItemSet                    Support count
{Bread, PeanutButter}      3
{Bread, Milk}              1
{Bread, Beer}              1
{PeanutButter, Milk}       1
{PeanutButter, Beer}       0
{Milk, Beer}               1
2-candidate itemsets

ItemSet                    Support count
{Bread, PeanutButter}      3
Itemsets with the specified support s = 30%
Since there is only one large itemset of size two, we cannot
generate candidate sets with three items. The largest itemset possible is
{Bread, PeanutButter}.
TABLE 6.5: Using Apriori with Transactions in Table 6.1
Pass   Candidates                                            Large Itemsets
1      {Beer}, {Bread}, {Jelly}, {Milk}, {PeanutButter}      {Beer}, {Bread}, {Milk}, {PeanutButter}
2      {Beer, Bread}, {Beer, Milk}, {Beer, PeanutButter},    {Bread, PeanutButter}
       {Bread, Milk}, {Bread, PeanutButter},
       {Milk, PeanutButter}
for each I Є Li-1 do
for each J ≠ I Є Li-1 do
if i − 2 of the elements in I and J are equal then
Ck = Ck U {I U J};
Given the large itemset property and Apriori-Gen, the Apriori algorithm
itself (see Algorithm 6.3) is rather straightforward. In this algorithm we use ci
to be the count for candidate itemset Ii Є Ck.
ALGORITHM 6.3
Input:
I / / Itemsets
D / / Database of transactions
s / / Support
Output:
L / / Large itemsets
Apriori algorithm:
k=0; / / k is used as the scan number.
L = Ø;
C1 = I; / / Initial candidates are set to be the items.
repeat
k = k+1
Lk = Ø;
for each Ii Є Ck do
Ci = 0; / / Initial counts for each itemset are 0.
for each tj Є D do
for each Ii Є Ck do
if Ii Є tj then
Ci = Ci + 1;
for each Ii Є Ck do
if ci ≥ (s x | D |) then
Lk = Lk U Ii;
L = L U Lk;
Ck+1 = Apriori-Gen(Lk)
until Ck+1 = Ø
The Apriori algorithm assumes that the database is memory-
resident. The maximum number of database scans is one more than the
cardinality of the largest large itemset. The large number of database scans is a
weakness of the Apriori approach.
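The following compact Python sketch (illustrative, not the book's code) runs the same level-wise process on the Table 6.1 transactions with s = 30% and reproduces the large itemsets listed earlier:

    # Level-wise Apriori on the Table 6.1 transactions, s = 30% (count >= 2).
    from itertools import combinations

    transactions = [
        {"Bread", "Jelly", "PeanutButter"}, {"Bread", "PeanutButter"},
        {"Bread", "Milk", "PeanutButter"}, {"Beer", "Bread"}, {"Beer", "Milk"},
    ]
    min_count = 2

    def count(itemset):
        return sum(itemset <= t for t in transactions)

    items = sorted({i for t in transactions for i in t})
    Ck = [frozenset({i}) for i in items]          # C1: singleton candidates
    L, k = [], 1
    while Ck:
        Lk = [c for c in Ck if count(c) >= min_count]
        L.extend(Lk)
        k += 1
        # Apriori-Gen: join large (k-1)-itemsets sharing k-2 items, then
        # keep only candidates all of whose (k-1)-subsets are large.
        Ck = []
        for i, a in enumerate(Lk):
            for b in Lk[i + 1:]:
                c = a | b
                if (len(c) == k and c not in Ck and
                        all(frozenset(s) in Lk for s in combinations(c, k - 1))):
                    Ck.append(c)
    print([set(l) for l in L])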
6.3.2 PARTITIONING
Various approaches to generating large itemsets have been proposed
based on a partitioning of the set of transactions. In this case, D is divided into p
partitions D1, D2, . . . , Dp. Partitioning may improve the performance of
finding large itemsets in several ways:
By taking advantage of the large itemset property, we know that a
large itemset must be large in at least one of the partitions. This idea can help to
design algorithms more efficiently than those based on looking at the entire
database.
Partitioning algorithms may be able to adapt better to limited main
memory. Each partition can be created such that it fits into main memory. In
addition, it would be expected that the number of itemsets to be counted per
partition would be smaller than those needed for the entire database.
By using partitioning, parallel and/or distributed algorithms can be
easily created, where each partition could be handled by a separate machine.
Incremental generation of association rules may be easier to perform
by treating the current state of the database as one partition and treating the new
entries as a second partition.
The basic partition algorithm reduces the number of database scans to
two and divides the database into partitions such that each can be placed into
main memory. When it scans the database, it brings that partition of the
database into main memory and counts the items in that partition alone. During
the first database scan, the algorithm finds all large itemsets in each partition.
Although any algorithm could be used for this purpose, the original proposal
assumes that some level-wise approach, such as Apriori, is used. Here Li
represents the large itemsets from partition Di. During the second scan, only
those itemsets that are large in at least one partition are used as candidates and
counted to determine if they are large across the entire database. Algorithm 6.4
shows this basic partition algorithm.
ALGORITHM 6.4
Input:
I / / Itemsets
D {D1, D2, …., Dp} / / Database of transactions divided into
partitions
s // Support
Output:
L // Large itemsets
Partition algorithm:
C = Ø;
for i = 1 to p do // Find large itemset in each partition
Li = Apriori(I, Di, S);
C = C U Li;
L = Ø;
for each Ii Є C do
Ci = 0; //Initial counts for each itemset are 0.
for each tj Є D do // count candidates during second scan,
for each Ii Є C do
if Ii Є tj then
Ci = Ci + 1;
for each Ii Є C do
if ci ≥ (s x | D |) then
L = L U Ii;
Figure 6.2 illustrates the use of the partition algorithm using the market
basket data. Here the database is partitioned into two parts, the first containing
two transactions and the second with three transactions. Using a support of
10%, the resulting large itemsets L1 and L2 are shown. If the items are
uniformly distributed across the partitions, then a large fraction of the itemsets
will be large. However, if the data are not uniform, there may be a large
percentage of false candidates.
[Figure 6.2 (partial): partition D1 contains t1 = {Bread, Jelly, PeanutButter}
and t2 = {Bread, PeanutButter}; its locally large itemsets are
L1 = {{Bread}, {Jelly}, {PeanutButter}, {Bread, Jelly}, {Bread, PeanutButter},
{Jelly, PeanutButter}, {Bread, Jelly, PeanutButter}}.]
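A minimal Python sketch of the two-scan partition idea is shown below (the brute-force helper that enumerates subsets of each transaction is only practical for toy data such as this):

    # Partition algorithm sketch: locally large itemsets become global
    # candidates and are recounted over all of D in a second scan.
    from itertools import combinations

    def local_large(partition, s):
        counts = {}
        for t in partition:
            for r in range(1, len(t) + 1):
                for sub in combinations(sorted(t), r):
                    counts[frozenset(sub)] = counts.get(frozenset(sub), 0) + 1
        need = s * len(partition)
        return {i for i, c in counts.items() if c >= need}

    D1 = [{"Bread", "Jelly", "PeanutButter"}, {"Bread", "PeanutButter"}]
    D2 = [{"Bread", "Milk", "PeanutButter"}, {"Beer", "Bread"}, {"Beer", "Milk"}]
    D, s = D1 + D2, 0.10   # with such a low support, many local itemsets qualify

    candidates = local_large(D1, s) | local_large(D2, s)   # first scan
    L = {c for c in candidates
         if sum(c <= t for t in D) >= s * len(D)}          # second scan
    print(sorted(map(sorted, L)))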
[Figure: a concept hierarchy rooted at Food, whose lower levels include Wheat,
White, Rye, Whole, 2%, and Skim.]
A customer buys wine for between $30 and $50 a bottle => she also
buys caviar
This differs from a traditional association rule such as:
A customer buys wine=> she also buys caviar.
Here the price has been divided into intervals. In these cases, the
items are not simple literals. For example, instead of having the items {Bread,
Jelly}, we might have the items {(Bread: [0…1]), (Bread: (1…2]), (Bread:
(2…∞)), (Jelly: [0…1.5]), (Jelly: (1.5…3]), (Jelly: (3…∞))}.
The basic approach to finding quantitative association rules is found in
algorithm 6.5. Here we show the Apriori algorithm being used to generate the
large itemsets, but any such algorithm could be used.
ALGORITHM 6.5
Input:
I / / Itemsets
P1, P2, …, Pp / / Processors
D = {D1, D2, …, Dp} / / Database divided into partitions
S / / Support
Output:
L / / Large itemsets
Quantitative association rule algorithm:
for each Ij Є I do / / Partition items
if Ij is to be partitioned, then
determine the number of partitions;
map attribute values into the new partitions, creating new items;
replace Ij in I with the new items Ij1, …, Ijm;
Apriori (I, D, S);
Since one item is divided into many items, the minimum support and
confidence used for quantitative rules should be lowered. The minimum
support problem obviously is worse with a large number of intervals. Thus, an
alternative solution would be to combine adjacent intervals when calculating
support. Similarly, when there are a small number of intervals, the confidence
threshold may need to be lowered. For example, look at X => Y. Suppose there
are only two intervals for X. Then the count for those transactions containing X
will be quite high when compared to those containing Y (if this is a more
typical item with many intervals).
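A minimal Python sketch of this mapping step (the interval boundaries are made-up for illustration) is:

    # Map a quantitative attribute value into an interval item before Apriori.
    def to_interval_item(name, value, boundaries):
        """Replace (name, value) by an item such as 'Bread:(1..2]'."""
        lo = 0
        for hi in boundaries:
            if value <= hi:
                return f"{name}:({lo}..{hi}]"
            lo = hi
        return f"{name}:({lo}..)"     # last, open-ended interval

    # e.g. Bread quantities partitioned at 1 and 2, Jelly at 1.5 and 3.
    boundaries = {"Bread": [1, 2], "Jelly": [1.5, 3]}
    basket = {"Bread": 2, "Jelly": 1.0}
    items = {to_interval_item(k, v, boundaries[k]) for k, v in basket.items()}
    print(items)   # e.g. {'Bread:(1..2]', 'Jelly:(0..1.5]'}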
6.4.4 Using Multiple Minimum Supports
When looking at large databases with many types of data, using one
minimum support value can be a problem. Different items behave differently.
It certainly is easier to obtain a given support threshold with an attribute that
has only two values than it is with an attribute that has hundreds of values. It
might be more meaningful to find a rule of the form
SkimMilk => WheatBread
with a support of 3% than it is to find
Milk => Bread
with a support of 6%. Thus, having only one support value for all
association rules may not work well. Some useful rules could be missed. This
is particularly of interest when looking at generalized association rules, but it
may arise in other situations as well. Think of generating association rules from
a non-market basket database. As was seen with quantitative rules, we may
partition attribute values into ranges. Partitions that have a small number of
values obviously will produce lower supports than those with a large number of
values. If a larger support is used, we might miss out on generating meaningful
association rules.
This problem is called the rare item problem. If the minimum support is
too high, then rules involving items that rarely occur will not be generated. If it
is set too low, then too many rules may be generated, many of which
(particularly for the frequently occurring items) are not important. Different
approaches have been proposed to handle this. One approach is to partition the
data based on support and generate association rules for each partition
separately. Alternatively, we could group rare items together and generate
association rules for these groupings. A more recent approach to handling this
problem is to combine clustering and association rules. First we cluster the data
together based on some clustering criteria, and then we generate rules for each
cluster separately. This is a generalization of the partitioning of the data
solution.
One approach, MISapriori, allows a different support threshold to be
indicated for each item. Here MIS stands for minimum item support. The
minimum support for a rule is the minimum of all the minimum supports for
each item in the rule. An interesting problem occurs when multiple minimum
supports are used. The minimum support requirement for an itemset may be
met even though it is not met for some of its subsets. This seems to violate the
large itemset property.
6.4.5 Correlation Rules
A correlation rule is defined as a set of itemsets that are correlated.
The motivation for developing these correlation rules is that negative
correlations may be useful.
Example 6.3, illustrates this concept. In this example, even though the
probability of purchasing two items together seems high, it is much higher if
each item is purchased without the other item. Correlation satisfies upward
closure in the itemset lattice. Thus, if a set is correlated, so is every superset of
it.
Example 6.3:
Consider two items A and B, where A => B has a support of 15% and a
confidence of 60%. The probability of purchasing product B is 0.7. The
correlation can be expressed as

correlation(A => B) = P(A, B) / (P(A) P(B)) = P(B | A) / P(B) = 0.6 / 0.7 = 0.857

Since this value is less than 1, it indicates a negative correlation
between A and B.
6.5 MEASURING THE QUALITY OF RULES
Support and confidence are the normal methods used to measure the
quality of an association rule:
s(A=> B) = P(A, B)
and
α(A => B) = P (B | A)
However, there are some problems associated with these metrics. For
example, confidence totally ignores P(B). A rule may have high support and
confidence but may be an obvious rule. For example, if someone purchases
potato chips, there may be a high likelihood that he or she would also buy a
cola. This rule is not really of interest because it is not surprising. Various
concepts such as surprise and interest have been used to evaluate the quality or
usefulness of rules. With correlation rules, we saw that correlation may be used
to measure the relationship between items in a rule. This may also be expressed
as the lift or interest:

interest(A => B) = P(A, B) / (P(A) P(B))

This measure takes into account both P(A) and P(B). A problem with
this measure is that it is symmetric: there is no difference between the
value of interest(A => B) and the value of interest(B => A).
As with lift, conviction takes into account both P(A) and P(B). From
logic we know that the implication A → B is equivalent to ¬(A ∧ ¬B). A
measure of the independence of the negation of the implication is
P(A, ¬B) / (P(A) P(¬B)). The formula for conviction inverts this ratio:

conviction(A => B) = P(A) P(¬B) / P(A, ¬B)
The chi-squared test for independence has been proposed to measure the
significance of rules. The chi-squared significance test takes into account both
the presence and the absence of items in sets. Here it is used to measure how
much an itemset (potential correlation rule) count differs from its expected
value. The chi-squared statistic can be calculated in the following manner.
Suppose the set of items is I = {I1, I2, …, Im}. Because we are interested in
both the occurrence and the nonoccurrence of an item, a transaction tj can be
viewed as

tj Є {I1, ¬I1} x {I2, ¬I2} x … x {Im, ¬Im}

Given any possible itemset X, it also is viewed as a subset of this
Cartesian product. The chi-squared statistic is then calculated for X as

χ² = Σ (O(X) − E[X])² / E[X]

where the sum ranges over the cells of the contingency table for X. Here O(X)
is the count of the number of transactions that contain the items in X, and the
expected value E[X] is calculated assuming independence: it is the product of
the individual item (or negated item) frequencies times the total number of
transactions.
Using the data of Example 6.3 with 100 transactions, the observed counts are:

              B        ¬B       Total
A             15       10       25
¬A            55       20       75
Total         70       30       100

E[A, B] = (25 x 70) / 100 = 17.5        E[A, ¬B] = (25 x 30) / 100 = 7.5
E[¬A, B] = (75 x 70) / 100 = 52.5       E[¬A, ¬B] = (75 x 30) / 100 = 22.5

χ² = (15 − 17.5)²/17.5 + (10 − 7.5)²/7.5 + (55 − 52.5)²/52.5 + (20 − 22.5)²/22.5 = 1.587
If all values were independent, then the chi-squared statistic should be
0. From the statistical table it is found that, at 95% confidence, the chi-squared
cutoff value is 3.84. The calculated value is compared with the tabulated value;
if it is less than the tabulated value, the hypothesis of independence is accepted,
otherwise it is rejected. Here 1.587 < 3.84, so the hypothesis of independence is
accepted.
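The computation above can be checked with a minimal Python sketch (the observed counts are those of the contingency table):

    # Chi-squared statistic for the 2 x 2 contingency table above.
    observed = [[15, 10],   # A:      B, not B
                [55, 20]]   # not A:  B, not B
    n = sum(sum(row) for row in observed)
    row_tot = [sum(row) for row in observed]
    col_tot = [sum(col) for col in zip(*observed)]

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_tot[i] * col_tot[j] / n   # independence assumption
            chi2 += (observed[i][j] - expected) ** 2 / expected
    print(round(chi2, 3))   # 1.587, below the 3.84 cutoff at 95% confidence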
CHAPTER 7
WEB MINING
Determining the size of the World Wide Web is extremely difficult. In
1999 it was estimated to contain over 350 million pages with growth at the rate
of about 1 million pages a day. The Web can be viewed as the largest database
available and presents a challenging task for effective design and access. There
is no real structure or schema to the Web. Web mining is mining of data related
to the World Wide Web. This may be the data actually present in Web pages or
data related to Web activity. Web data can be classified into the following types:
Content of actual Web pages.
Intrapage structure, which includes the HTML or XML code for the page.
Interpage structure, which is the actual linkage structure between Web pages.
Usage data that describe how Web pages are accessed by visitors.
User profiles, which include demographic and registration information
obtained about users. This could also include information found in cookies.
Web mining can be divided into three different types, which are Web
usage mining, Web content mining and Web structure mining.
Web content mining is the process to discover useful information from
text, image, audio or video data in the web. Web content mining sometimes is
called Web text mining. Web content mining examines the content of Web
pages as well as the results of Web searching. The content includes text as well as
graphics data. Web content mining is further divided into web page content
mining and search results mining.
Web structure mining is the process of using graph theory to analyse
the node and connection structure of a web site. According to the type of web
structural data, web structure mining can be divided into two kinds. The first
kind of web structure mining is extracting patterns from hyperlinks in the web.
A hyperlink is a structural component that connects the web page to a different
location. The other kind of the web structure mining is mining the document
structure. With web structure mining, information is obtained from the actual
organization of pages on the web. For example, clustering may be applied to
web pages to identify similar pages. The intrapage structure includes links
within the page as well as the code (HTML, XML) for the page.
Web usage mining is the process of finding out what users are looking
for on the Internet. Some users might be looking at only textual data, whereas some
other might want to get multimedia data. Web usage mining also helps finding
the search pattern for a particular group of people belonging to a particular
region. Web usage mining looks at logs of web access. General access pattern
tracking is a type of usage mining that looks at a history of web pages visited.
This usage may be general or may be targeted to specific use or users. For
example, patterns can be clustered based on their similarity. This in turn can be
used to cluster users into groups based on similar access behavior.
There are many applications for web mining. One application is
targeted advertising. Based on the user access pattern the advertising
frames/pages can be reorganized. Targeting attempts to send advertisements to
people who have not been to a Web site in order to entice them to visit it; thus a
targeted ad is found on a different Web site. On the Web, targeting can be used to
display advertising at Web sites visited by persons who fit into the business's
target demographic area.
7.1 WEB CONTENT MINING
Web content mining can be thought of as extending the work performed
by basic search engines. Most search engines are keyword-based. Web content
mining can improve on traditional search engines through such techniques as
concept hierarchies and synonyms, user profiles, and analyzing the links between
pages. Traditional search engines must have crawlers to search the web and
gather information, indexing techniques to store the information, and query
processing support to provide fast and accurate information to users. Data
mining techniques can be used to help search engines provide the efficiency,
effectiveness, and scalability needed.
Web content mining can be divided into agent-based approaches and
database-based approaches. Agent-based approaches have software systems
(Agents) that perform the content mining. In the simplest case, search engines
belong to this class, as do intelligent search agents, information filtering, and
personalized web agents. Intelligent search agents go beyond the simple search
engines and use other techniques besides keyword searching to accomplish a
search. For example, they may use user profiles or knowledge concerning
specified domains. Information filtering utilizes IR techniques, knowledge of
the link structures, and other approaches to retrieve and categorize documents.
Personalized web agents use information about user preferences to direct their
search. The database approaches view the Web data as belonging to a database.
There are many query languages that target the web.
Basic content mining is a type of text mining. Text Mining is the
discovery by computer of new, previously unknown information, by
automatically extracting information from different written resources. A key
element is the linking together of the extracted information to form
new facts or new hypotheses to be explored further by more conventional
means of experimentation.
Text mining functions can be viewed in a hierarchy with the simplest
functions at the top and the more complex functions at the bottom. Much
research is currently under way that investigates the use of natural language
processing techniques in text mining to uncover hidden semantics, such as
question and answer systems. More traditional mining operations involve
keyword searching, similarity measures, clustering, and classification.
[Figure: the hierarchy of text mining functions, from the simplest (keyword
search, term association) to the more complex (similarity search,
classification, clustering).]
Web usage patterns can be used together with business intelligence to
improve sales and advertisement.
Gathering statistics concerning how users actually access web pages
may or may not be viewed as part of mining.
Web usage mining actually consists of three separate types of activities
Preprocessing activities center on reformatting the web log data
before processing.
Pattern discovery activities form the major portion of the mining
activities because these activities look to find hidden patterns within the log
data.
Pattern analysis is the process of looking at and interpreting the
results of the discovery activities.
There are many issues associated with using the Web log for mining
purposes:
Identification of the exact user is not possible from the log alone.
With a web client cache, the exact sequence of pages a user actually
visits is difficult to uncover from the server site. Pages that are referenced
may be found in the cache.
There are many security, privacy, and legal issues yet to be solved.
For example, is the set of pages a person visits actually private information?
Should a web browser actually divulge information to other companies about
the habits of its users? After all, this information could be valuable to potential
advertisers.
7.1.1 Pattern Discovery
The most common data mining technique used on click stream data is
that of uncovering traversal patterns. A traversal pattern is a set of pages visited
by a user in a session. Other types of patterns may be uncovered by web usage
mining. For example, association rules can look at pages accessed together in
one session independent of ordering. Similar traversal patterns may be
clustered together to provide a clustering of the users. This is different from
clustering of pages, which tends to identify similar pages, not users.
Several different types of traversal patterns have been examined. These
patterns differ in how the patterns are defined. The differences between the
different types of patterns can be described by the following features:
Duplicate page references (backward traversals and
refreshes/reloads) may or may not be allowed.
A pattern may be composed only of contiguous page references, or
alternatively of any pages referenced in the same session.
The pattern of references may or may not be maximal in the
session. A frequent pattern is maximal if it is not contained in any other
frequent pattern.
Patterns found using different combinations of these three properties
may be used to discover different features and thus may be used for different
purposes. Knowledge of contiguous page references frequently made can be
useful to predict future references.
Association Rules: Association rules can be used to find what pages
are accessed together. Here we are really finding large item sets. A page is
regarded as an item, and a session is regarded as a transaction with both
duplicates and ordering ignored. The support is defined to be the number of
occurrences of the itemset divided by the number of transactions (sessions).
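As an illustration, the following short Python sketch counts the support of page pairs over a handful of made-up sessions; the session contents and the minimum support threshold are assumptions chosen only for the example.

from itertools import combinations

# Hypothetical click-stream sessions; each session is the set of pages a user
# visited, with ordering and duplicates ignored (as described above).
sessions = [
    {"home", "catalog", "cart"},
    {"home", "catalog"},
    {"home", "help"},
    {"home", "catalog", "cart", "checkout"},
]

def support(itemset, sessions):
    """Fraction of sessions (transactions) that contain every page in itemset."""
    itemset = set(itemset)
    return sum(itemset <= s for s in sessions) / len(sessions)

# Report all page pairs whose support meets an illustrative threshold.
min_support = 0.5
pages = sorted(set().union(*sessions))
for pair in combinations(pages, 2):
    s = support(pair, sessions)
    if s >= min_support:
        print(pair, round(s, 2))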
Sequential Patterns: Sequential patterns have also been applied to
web access logs. A sequential pattern is defined as an ordered set of pages that
satisfies a given support and is maximal (i.e., it has no subsequence that is also
frequent). Support is defined not as the percentage of sessions containing the
pattern, but rather as the percentage of customers who have the pattern. Since a user may have
many sessions, it is possible that a sequential pattern could span sessions. It
also need not be contiguously accessed pages. A k-sequence is a sequence of
length k (i.e., it has k pages in it).
Frequent Episodes: Episodes, which originally were proposed for
telecommunication alarm analysis, can also be applied to web logs. All pages
(corresponding to events) are ordered by their access time, and the users
usually need not be identified (i.e., no sessions). By definition, an episode is a
partially ordered set of pages. In addition, the individual page accesses must
occur within a particular time frame. A serial episode is an episode in which
these events are totally ordered. Note that they need not be contiguous. A
parallel episode is a set of events where there need not be any particular
ordering.
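The following Python sketch illustrates the idea of a serial episode on a toy, time-ordered log: it counts how often one page is followed by another within a time window. The log entries and the window length are illustrative assumptions.

# Hypothetical web log: (timestamp in seconds, page), already ordered by time.
log = [(0, "A"), (5, "B"), (9, "C"), (40, "A"), (47, "C"), (55, "B")]

def serial_episode_count(log, first, second, window):
    """Count occurrences of page `first` followed by page `second`
    within `window` seconds (a serial episode; pages need not be contiguous)."""
    count = 0
    for i, (t1, p1) in enumerate(log):
        if p1 != first:
            continue
        for t2, p2 in log[i + 1:]:
            if t2 - t1 > window:
                break
            if p2 == second:
                count += 1
                break
    return count

print(serial_episode_count(log, "A", "C", window=10))  # 2 with this toy log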
Maximal Frequent Forward Sequences: One approach to mining log
traversal patterns is to remove any backward traversals. Each raw session is
transformed into a set of maximal forward references (i.e., the backward
traversals and reloads/refreshes are removed), from which the traversal patterns
are then mined using improved level-wise algorithms.
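The transformation into maximal forward references can be sketched as below; this is a minimal illustration of the idea described above, with a made-up session, not the exact algorithm of any particular system.

def maximal_forward_references(session):
    """Split a raw page sequence into maximal forward references by
    removing backward traversals (a revisit cuts the path back to that page)."""
    path, results, extending = [], [], False
    for page in session:
        if page in path:
            # Backward traversal: emit the forward path built so far (once),
            # then cut the path back to the revisited page.
            if extending:
                results.append(list(path))
            path = path[: path.index(page) + 1]
            extending = False
        else:
            path.append(page)
            extending = True
    if extending:
        results.append(list(path))
    return results

# A raw session where the user backs up from D to B before going on to E.
print(maximal_forward_references(["A", "B", "C", "D", "B", "E"]))
# [['A', 'B', 'C', 'D'], ['A', 'B', 'E']]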
CHAPTER - 8
SPATIAL MINING
8.1 INTRODUCTION
Spatial data are data that have a spatial or location component. Spatial
data can be viewed as data about objects that themselves are located in a
physical space. This may be implemented with a specific location attribute(s)
such as address or latitude/longitude or may be more implicitly included such
as by a partitioning of the database based on location. In addition, spatial data
may be accessed using queries containing spatial operators such as near, north,
south, adjacent, and contained in. Spatial data are stored in spatial databases
that contain both the spatial and the nonspatial data about objects. Because of the
inherent distance information associated with spatial data, spatial databases are
often stored using special data structures or indices built using distance or
topological information.
Spatial data are required for many current information technology
systems. Geographic information systems (GIS) are used to store information
related to geographic locations on the surface of the Earth. This includes
applications related to weather, community infrastructure needs, disaster
management, and hazardous waste. Data mining activities include prediction of
environmental catastrophes. Biomedical applications, including medical
imaging and illness diagnosis, also require spatial systems.
Spatial mining, often called spatial data mining or knowledge discovery
in spatial databases, is data mining as applied to spatial databases or spatial
data. Some of the applications for spatial data mining are in the areas of GIS
systems, geology, environmental science, resource management, agriculture,
medicine, and robotics.
8.1.1 SPATIAL DATA MINING PRIMITIVES
Operations needed to support spatial data mining involve those required
for spatial databases. There are several topological relationships that can exist
between two spatial objects. These relationships are based on the ways in
which two objects are placed in the geographic domain:
Disjoint: A is disjoint from B if there are no points in A that are
contained in B.
Overlaps or intersects: A overlaps with B if there is at least
one point in A that is also in B.
Equals: A equals B if all points in the two objects are in
common.
Covered by or inside or contained in: A is contained in B if
all points in A are in B. There may be points in B that are not in A.
Covers or contains: A contains B if and only if B is contained
in A.
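Treating each spatial object as a finite set of points, the relationships above reduce to simple set operations, as the following minimal Python sketch shows; the two objects are illustrative.

# Treat each spatial object as a finite set of (x, y) points for illustration.
A = {(0, 0), (1, 0), (1, 1)}
B = {(1, 1), (2, 1), (2, 2)}

def disjoint(a, b):     return not (a & b)        # no common points
def overlaps(a, b):     return bool(a & b)        # at least one common point
def equals(a, b):       return a == b             # all points in common
def contained_in(a, b): return a <= b             # every point of a lies in b
def contains(a, b):     return contained_in(b, a) # contains iff b is contained in a

print(disjoint(A, B), overlaps(A, B), contains(A, B))  # False True False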
The Euclidean and Manhattan measures are often used to measure the
distance between two points. The distance between two spatial objects can be
defined as extensions to these two traditional definitions:
Minimum:
dis(A, B) = min { dis((xa, ya), (xb, yb)) : (xa, ya) ∈ A, (xb, yb) ∈ B }
Maximum:
dis(A, B) = max { dis((xa, ya), (xb, yb)) : (xa, ya) ∈ A, (xb, yb) ∈ B }
Average:
dis(A, B) = average { dis((xa, ya), (xb, yb)) : (xa, ya) ∈ A, (xb, yb) ∈ B }
Center:
dis(A, B) = dis((xca, yca), (xcb, ycb)),
where (xca, yca) is a center point for object A and (xcb, ycb) for B.
Note the similarity to distance measures used in clustering. In fact, you
can think of the spatial object as a cluster of the points within it. The center
points used for the last distance formula can be identified by finding the
geometric center of the object. For example, if an MBR (minimum bounding
rectangle) is used, the distance between objects could be found using the
Euclidean distance between the centers of the MBRs of the two objects.
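A minimal Python sketch of the four object-to-object distances, again viewing each object as a finite set of points and using the Euclidean distance between points, is given below; the point sets are illustrative.

from math import dist, fsum

A = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
B = [(3.0, 3.0), (4.0, 3.0)]

pairs = [(p, q) for p in A for q in B]

min_dist = min(dist(p, q) for p, q in pairs)
max_dist = max(dist(p, q) for p, q in pairs)
avg_dist = fsum(dist(p, q) for p, q in pairs) / len(pairs)

def center(obj):
    """Geometric center (centroid) of a finite point set."""
    xs, ys = zip(*obj)
    return (fsum(xs) / len(obj), fsum(ys) / len(obj))

center_dist = dist(center(A), center(B))
print(round(min_dist, 2), round(max_dist, 2),
      round(avg_dist, 2), round(center_dist, 2))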
Spatial objects may be retrieved based on selection, aggregation, or
join-type operations. A selection may be performed based on the spatial or
nonspatial attributes. Retrieving based on spatial attributes could be performed
using one of the spatial operators. A spatial join retrieves based on the
relationship between two spatial objects.
8.2 TEXT MINING
Text mining, sometimes alternately referred to as text data mining,
refers to the process of deriving high-quality information from text. High-
quality information is typically derived through the devising of patterns and
trends through means such as statistical pattern learning.
'High quality' in text mining usually refers to some combination
of relevance, novelty, and interestingness. Typical text mining tasks
include text categorization, text clustering, concept/entity extraction,
production of granular taxonomies, sentiment analysis, document
summarization, and entity relation modeling.
Text mining is the discovery by computer of new, previously unknown
information through the automatic extraction of information from different written
resources. A key element is the linking together of the extracted information
to form new facts or new hypotheses that can be explored further by more
conventional means of experimentation.
Text mining is different from what we're familiar with in web search. In
search, the user is typically looking for something that is already known and
has been written by someone else.
The difference between regular data mining and text mining is that in
text mining the patterns are extracted from natural language text rather than
from structured databases of facts.
Text mining in this sense discovers new pieces of knowledge, in contrast to
approaches that merely find overall trends in textual data.
UNIT - V
CHAPTER – 9
DATA WAREHOUSE
9.1 INTRODUCTION
The advent of Information Technology has affected all walks of life. In
the context of business processes, this field has evolved from the good old term
EDP to MIS, and then to DSS. When we talk of the use of information
technology in a business environment, we cannot ignore the presence of a
database system at its core. Database technology, meanwhile, has also grown
from a simple file system, to a data navigation system, then to a query-based
retrieval system, and to an interactive information extraction from distributed
sources to support decision-making tasks. Normally, the operational databases
record transaction-like information, and do not explicitly record the
summarized information as required by the top-level decision-makers at
different points of time. Moreover, corporate managers may like to work in an
environment that need not access the operational database. There is another
aspect to such enterprise-wide computing. Different departments of the
enterprise design their operational databases taking into account their own local
requirements and the availability of their resources. They also typically collect
diverse kinds of data and maintain large databases from multiple,
heterogeneous, autonomous, and distributed information sources. The
integration of such data so as to provide easy and efficient access is highly
desirable, and yet challenging. State-of-the-art database technology provides us
with two different types of tools to extract information from a distributed,
heterogeneous environment.
The traditional database approach to heterogeneous database integration
is to build wrappers and integrators (or mediators) on top of multiple
heterogeneous databases. These tools facilitate the following tasks. As and
when a query (a requirement of the top-level decision-maker) is posed, a
metadata dictionary is used to translate the query into queries appropriate for
the individual heterogeneous sites involved. These queries are then mapped and
sent to local query processors. The results returned from the different sites are
integrated into a global answer set. We may also call this a query-driven
approach (some term it a lazy approach, or an on-demand approach). This
query-driven approach requires complex information filtering and integration
processes, and competes for resources with processing at local sources. It is
inefficient and potentially expensive for frequent queries, especially for queries
requiring aggregation.
On the other hand, there exists an interesting alternative to this
traditional approach. We term it the update-driven approach (also called the
eager or in-advance approach). In the update-driven approach, information from
multiple, heterogeneous sources is integrated in advance and stored separately
for direct querying and analysis. Though it does not contain the most current
information, it brings high performance. Furthermore, query processing in this
model does not interfere with the processing at local sources.
Data warehousing is essentially the process that facilitates the update-
driven approach. It is the new technology that provides us with the tools to
store the summarized information from multiple, heterogeneous databases in a
single repository. Data warehousing is the process of integrating enterprise-
wide corporate data into a single repository, from which end-users can easily
run queries, make reports and perform analysis. A data warehouse is a
decision-support environment that leverages data stored in different sources,
organizing it and delivering it to decision makers across the enterprise,
regardless of their platform or technical skill level. In a nutshell data
warehousing is the data management and analysis technology adopting an
update-driven principle.
The data warehouse is a new approach to enterprise-wide computing at
the strategic or architectural level. A data warehouse can provide a central
repository for large amounts of diverse and valuable information. By filing the
data into a central point of storage, the data warehouse provides an integrated
representation of the multiple sources of information dispersed across the
enterprise. It ensures the consistency of management rules and conventions
applied to the data. It also provides the appropriate tools to
extract specific data, convert it into business information, and monitor for
changes and, hence, it is possible to use this information to make insightful
decisions. A data warehouse is a competitive tool that gives every end user the
ability to access quality enterprise-wide data.
9.2 WHAT IS A DATA WAREHOUSE?
A data warehouse supports business analysis and decision making by
creating an enterprise-wide integrated database of summarized, historical
information. It integrates data from multiple, incompatible sources. By
transforming data into meaningful information, a data warehouse allows the
business manager to perform more substantive, accurate and consistent
analysis. Data warehousing improves the productivity of corporate decision
makers through consolidation, conversion, transformation and integration of
operational data and provides a consistent view of the enterprise. A data
warehouse is supposed to be a place where data gets stored so that applications
can access and share it easily. But a database does that already. So then, what
makes a warehouse so different? The main difference is that usual databases
hold operational-type data and that many of the decision-support-type
applications would put too much strain on them, interfering with the day-to-day
operation. A data warehouse is of course a database, but it contains
summarized information. In general, our database is not a data warehouse
unless we also
collect and summarize information from disparate sources and
use it as the place where this disparity can be reconciled, and
place the data into a warehouse because we intend to allow
several different applications to make use of the same
information.
A data warehouse refers to a database that is maintained separately
from an organization’s operational databases. An operational database is
designed and tuned for known tasks and workloads, such as indexing and
hashing using primary keys, searching for particular records and optimized
“canned” queries. On the other hand, data warehouse queries are often very
complex. They involve the computation of large groups of data at the
summarized level, and may require the use of special data organizations and
access and implementation methods based on multidimensional views.
DEFINITION 9.1
A data warehouse is a subject-oriented, integrated, time-varying, non-
volatile collection of data in support of the management’s decision-making
process.
SUBJECT-ORIENTED
A data warehouse is organized around major subjects such as
customer, products, sales, etc. Data are organized according to subject instead
of application. For example, an insurance company using a data warehouse
would organize its data by customer, premium, and claim instead of by
different products (auto, life, etc.). The data organized by subject contain only
the information necessary for decision-support processing.
NON-VOLATILE
A data warehouse is always a physically separate store of data, which is
transformed from the application data found in the operational environment.
Due to this separation, data warehouses do not require transaction processing,
recovery, concurrency control, etc. The data are not updated or changed in any
way once they enter the data warehouse, but are only loaded, refreshed and
accessed for queries.
TIME-VARYING
Data are stored in a data warehouse to provide a historical perspective.
Every key structure in the data warehouse contains, implicitly or explicitly, an
element of time. The data warehouse contains a place for storing data that are 5
to 10 years old, or older, to be used for comparisons, trends and forecasting.
INTEGRATED
A data warehouse is usually constructed by integrating multiple,
heterogeneous sources such as relational databases, flat files, and OLTP files.
When data resides in many separate applications in the operational
environment, the encoding of data is often inconsistent. When data are moved
from operational environment into the data warehouse, they assume a
consistent coding convention. Data cleaning and data-integration techniques
are applied to maintain consistency in naming convention, measures of
variables, encoding structure, and physical attributes.
9.3 MULTIDIMENSIONAL DATA MODEL
Consider the data set given in Table 9.1; it shows the employment in
California by sex, by year and by profession. This form of representing
multidimensional tables is very popular in Statistical Data Analyses, because in
the early days it was only possible to represent information on paper and thus
the 2-D restriction. We observe that the rows and the columns represent more
than one dimension if the data set contains more than two dimensions; this is
done by selecting an arbitrary order of the dimensions for the rows and the
columns. The rows in Table 9.1 represent two dimensions, sex and year, which are
ordered as sex first then year. The columns, however, do not really represent 2
distinct dimensions but they do represent some sort of taxonomy of a
dimension. The professional class and profession represent a hierarchical
relationship between the instances of the professional class and the instances of the
profession. We also observe that there is summary information, which is the
main theme of the table. In this case the summary function is sum.
[Table 9.1 Employment in California by sex and year (rows) and by professional class and profession (columns); the summary function is sum.]
The same information can be organized as a data cube whose cells hold the employment counts along
dimensions such as sex, profession and year. Each dimension can be divided
into subdimensions.
[Figure 9.1 Data cube of employment by sex (female, male), profession (classes 1-6) and year (1991-1995).]
9.5 DIMENSION MODELLING
The notion of a dimension provides a lot of semantic information,
especially about the hierarchical relationship between its elements. It is
important to note that dimension modeling is a special technique for structuring data
around business concepts. Unlike ER modeling, which describes entities and
relationships, dimension modeling structures the numeric measures and the
dimensions. The dimension schema can represent the details of the dimensional
modeling. The following figures show the dimension modeling for our
example.
[Figure 9.2 Dimension modelling for the employment example, with the dimensions sex, year and profession.]
Thus, we observe that a multidimensional data model has the
following conceptual basic components:
1. Summary measure: e.g., employment, sales, etc.
2. Summary function: e.g., sum
3. Dimension: e.g., sex, year, profession, state
4. Dimension hierarchy: e.g., professional class → profession.
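A minimal Python sketch of these four components for the employment example is given below; the dimension values and cell counts are made up for illustration.

from itertools import product

# Dimensions and their values (illustrative).
sexes = ["female", "male"]
years = [1991, 1992]
professions = ["engineer", "teacher"]

# Summary measure: employment count per cell; summary function: sum.
# Cells are keyed by one value from each dimension.
cube = {
    ("female", 1991, "engineer"): 120, ("female", 1991, "teacher"): 300,
    ("female", 1992, "engineer"): 140, ("female", 1992, "teacher"): 310,
    ("male",   1991, "engineer"): 400, ("male",   1991, "teacher"): 150,
    ("male",   1992, "engineer"): 420, ("male",   1992, "teacher"): 160,
}

# Aggregate employment by year (summing out sex and profession).
for year in years:
    total = sum(cube[s, year, p] for s, p in product(sexes, professions))
    print(year, total)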
9.7 SUMMARY MEASURES
The summary measure is essentially the main theme of the analysis in a
multidimensional model. A measure value is computed for a given cell by
aggregating the data corresponding to the respective dimension-value sets
defining the cell. The measures can be categorized into 3 groups based on the
kind of aggregate function used.
Distributive: A numeric measure is distributive if it can be computed
in a distributed manner as follows. Suppose the data are partitioned into a few
subsets. The measure for the whole data set is then simply the aggregation of
the measures computed on the individual portions. For example, count, sum,
min and max are distributive measures.
Algebraic: An aggregate function is algebraic if it can be computed
by an algebraic function with some set of arguments, each of which may be
obtained by applying a distributive aggregate function. For example, average is
obtained as sum/count.
Other examples of algebraic functions are standard deviation and
center-of-mass.
Holistic: An aggregate function is holistic if there is no constant bound
on the storage size needed to describe a subaggregate. That is, there does not
exist an algebraic function that can be used to compute this function. Examples
of such functions are median, mode, most-frequent, etc.
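The contrast between the three kinds of measures can be sketched as follows on partitioned data; the partitions are illustrative.

import statistics

# The same data split into partitions, e.g., one partition per data mart.
partitions = [[4, 8, 15], [16, 23], [42]]

# Distributive: the sum (and count) of the whole data set is obtained by
# aggregating the per-partition sums (and counts).
partial_sums = [sum(p) for p in partitions]
partial_counts = [len(p) for p in partitions]
total_sum, total_count = sum(partial_sums), sum(partial_counts)

# Algebraic: average is computed from the two distributive measures sum/count.
average = total_sum / total_count

# Holistic: the median needs the full data set; per-partition medians
# cannot be combined with any fixed-size summary.
median = statistics.median(x for p in partitions for x in p)

print(total_sum, total_count, round(average, 2), median)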
EXAMPLE 9.2
A store called Deccan Electronics may create a sales data warehouse in
order to keep records of the store’s sales with respect to time, product and
location. Thus, the dimensions are time, product and location. These
dimensions allow the store to keep track of things like monthly sales of items
and the locations at which the items were sold. A dimension table for product
contains the attributes item name, brand, and type. The attributes shop,
manager, city, region, state, and country describe the dimension location.
These attributes are related by a total order forming a hierarchy such as
shop<city<state<country. This hierarchy is shown in Figure 9.3. An example of
a partial order for the time dimension is the attributes week, month, quarter and
year. The sales data warehouse includes the sales amount in rupees and the
total number of units sold. Note that we can have more than one numeric
measure. Figure 9.4 shows the multidimensional model for such situations. The
dimension hierarchies considered for the data cube are time:
(month<quarter<year); location: (city<province<country); and product.
[Figure 9.3 Dimension hierarchies for the product dimension (article < brand < family < group < branch < total), the location dimension (shop < city < region/state < country < total, with manager as an attribute of shop) and the time dimension (week, month < quarter < year < total).]
[Figure 9.4 Sales data cube for the dimensions location (Mumbai, Pune, Vizag, Cochin, Chennai), time (Q1-Q4) and product, with cities rolling up to states such as Andhra Pradesh, Kerala and Tamilnadu.]
9.8 OLAP OPERATIONS
Once the model for data warehouse is built then the next step is to
explore the different analytical tools with which to perform the complex
analysis of data. These data analysis tools are called OLAP (On-Line
Analytical Processing). OLAP is mainly used to access the live data online and
to analyze it. OLAP tools are designed in order to accomplish such analyses on
very large databases. Hence, OLAP provides a user-friendly environment for
interactive data analysis. Whilst data warehousing is primarily targeted at
providing consolidated data for further analysis, OLAP provides the means to
analyze those data in an application-oriented manner.
In the multidimensional model, the data are organized into multiple
dimensions and each dimension contains multiple levels of abstraction. Such
an organization provides the users with the flexibility to view data from
different perspectives. There exist a number of OLAP operations on data cubes
which allow interactive querying and analysis of the data. According to the
underlying multidimensional view with classification hierarchies defined upon
the dimensions, OLAP systems provide specialized data analysis methods.
Slicing: This operation and the following one, Dicing, are used for
reducing the data cube by one or more dimensions. The slice operation
performs a selection on one dimension of the given cube, resulting in a
subcube. Figure 9.5 shows a slice operation where the sales data are selected
from the central cube for the dimension time, using the criterion time = ’Q2’:
slice_time=’Q2’ (C[quarter, city, product]) = C[city, product]
Dicing: This operation is for selecting a smaller data cube and
analyzing it from different perspectives. The dice operation defines a subcube
by performing a selection on two or more dimensions. Figure 9.6 shows a dice
operation on the central cube based on the following selection criteria, which
involve three dimensions: (location = “Mumbai” or “Pune”) and (time = “Q1” or
“Q2”), where ‘quarter’ and ‘city’ have the truncated domains {Q1, Q2}
and {Mumbai, Pune} respectively.
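The slice and dice operations can be sketched on a small dictionary-based sales cube as follows; the cube contents and the selection criteria are illustrative.

# A small sales cube: (quarter, city, product) -> rupees sold (illustrative).
cube = {
    ("Q1", "Mumbai", "computer"): 100, ("Q1", "Pune", "computer"): 80,
    ("Q2", "Mumbai", "computer"): 120, ("Q2", "Pune", "computer"): 90,
    ("Q2", "Vizag",  "phone"):     60, ("Q3", "Mumbai", "phone"):   70,
}

def slice_cube(cube, quarter):
    """slice(time = quarter): drop the time dimension, keep (city, product)."""
    return {(city, prod): v
            for (q, city, prod), v in cube.items() if q == quarter}

def dice_cube(cube, quarters, cities):
    """dice: keep all three dimensions but truncate their domains."""
    return {k: v for k, v in cube.items()
            if k[0] in quarters and k[1] in cities}

print(slice_cube(cube, "Q2"))
print(dice_cube(cube, {"Q1", "Q2"}, {"Mumbai", "Pune"}))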
Roll-Up: This operation performs aggregation on the data cube, either by
climbing up a classification hierarchy of a dimension or by dimension reduction.
For the central cube, rolling the location dimension up from city to province gives:
roll-up_location (C[quarter, city, product]) = C[quarter, province, product]
Here, each data cell of the resulting cuboid is the aggregation of the
data cells that are merged due to the roll-up operation. In other words, the
measures stored in the data cells, C[Q1, Mumbai, computer] and C[Q1, Pune,
computer], are added to determine the measure to be stored at C[Q1,
Maharashtra, computer] (Maharashtra is a province or state in India and the
cities Mumbai and Pune are in this province).
When roll-up is performed by dimension reduction, one or more
dimensions are removed from the given cube. For example, consider the
employment data cube with the dimensions sex, profession and time. Roll-up
may be performed by removing, say, the sex dimension, resulting
in an aggregation of the total employment by profession and by year.
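A roll-up along the location hierarchy can be sketched in the same style as the slice and dice example above; the cube contents and the city-to-province mapping are illustrative.

from collections import defaultdict

# (quarter, city, product) -> rupees sold, and a city -> province hierarchy
# (both illustrative).
cube = {
    ("Q1", "Mumbai", "computer"): 100, ("Q1", "Pune", "computer"): 80,
    ("Q1", "Chennai", "computer"): 50, ("Q2", "Mumbai", "computer"): 120,
}
province_of = {"Mumbai": "Maharashtra", "Pune": "Maharashtra",
               "Chennai": "Tamilnadu"}

def roll_up_location(cube, province_of):
    """Climb the location hierarchy: merge city cells into province cells."""
    rolled = defaultdict(int)
    for (quarter, city, product), value in cube.items():
        rolled[(quarter, province_of[city], product)] += value
    return dict(rolled)

print(roll_up_location(cube, province_of))
# {('Q1', 'Maharashtra', 'computer'): 180, ('Q1', 'Tamilnadu', 'computer'): 50,
#  ('Q2', 'Maharashtra', 'computer'): 120}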
Drill-Down: This operation is concerned with switching from an aggregated
to a more detailed level within the same classification hierarchy. Drill-down is
the reverse of roll-up. It navigates from less detailed data to more detailed data.
Drill-down can be realized by either stepping-down a dimension hierarchy or
introducing additional dimensions.
Figure 9.8 shows the result of a drill-down operation performed on the
central cube by stepping down a dimension hierarchy for a time period defined
as day < month < quarter < year. Drill-down occurs by descending in the time
hierarchy from the level of a quarter to the level of a month. The resulting data
cube details the total sales per month rather than summarized by quarter. Drill-
down can also be performed by adding new dimensions to a cube. For
example, a drill-down on the given cube can occur by introducing an additional
dimension, such as customer-type.
Drill-within: This is a switch from one classification to a different one
within the same dimension.
Drill-across: It means switching from a classification in one
dimension to a different classification in a different dimension.
Pivot (rotate): Pivot (also called “rotate”) is a visualization operation
which rotates the data axes in order to provide an alternative presentation of the
same data. Other examples include rotating the axes in a 3-D cube, or
transforming a 3-D cube into a series of 2-D planes.
Other OLAP operations may include ranking the top-N or
bottom-N items in lists, as well as computing moving averages, growth rates,
interest, internal rates of return, depreciation, currency conversions, and
statistical functions. The results of these operations are typically visualized in a
cross-tabular form, i.e., mapped to a grid-oriented, two-dimensional layout
structure.
[Figure 9.8 Result of drilling down on the time dimension from quarters to months (Jan-Dec) for the cities Mumbai, Pune, Vizag, Cochin and Chennai.]
9.9 WAREHOUSE SCHEMA
STAR SCHEMA
In the star schema, a central fact table holds the summary measures together
with keys referring to a set of dimension tables, one per dimension. This structure
reduces the number of physical joins, and requires low maintenance and very
simple metadata.
EXAMPLE 9.3
Let us consider the “Employment” data warehouse. We have three
Dimension Tables and one Fact Table. The star Schema for this example is
shown in Figure 9.9.
[Figure 9.9 Star schema for the Employment warehouse: a fact table with the keys sex-key, time-key and profession-key joined to the dimension tables sex (sex-key, sex), time (time-key, year, quarter, month, day) and profession (profession-key, profession class, title, level, discipline).]
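The star schema idea can be sketched with plain Python dictionaries standing in for the fact and dimension tables; the keys and measure values below are illustrative, and the join simply totals employment by sex and year.

from collections import defaultdict

# Dimension tables keyed by their surrogate keys (illustrative contents).
dim_sex = {1: "female", 2: "male"}
dim_time = {10: {"year": 1991, "quarter": "Q1"},
            11: {"year": 1991, "quarter": "Q2"}}
dim_profession = {100: "engineer", 101: "teacher"}

# Fact table: foreign keys into the dimensions plus the employment measure.
fact = [
    {"sex_key": 1, "time_key": 10, "profession_key": 100, "employment": 120},
    {"sex_key": 2, "time_key": 10, "profession_key": 100, "employment": 400},
    {"sex_key": 1, "time_key": 11, "profession_key": 101, "employment": 310},
]

# Join fact rows to the dimensions and total employment by sex and year.
totals = defaultdict(int)
for row in fact:
    sex = dim_sex[row["sex_key"]]
    year = dim_time[row["time_key"]]["year"]
    totals[(sex, year)] += row["employment"]

print(dict(totals))  # {('female', 1991): 430, ('male', 1991): 400}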
SNOWFLAKE SCHEMA
The snowflake schema is a variant of the star schema in which some of the
dimension tables are normalized into additional tables.
EXAMPLE 9.4
An example of a snowflake schema for the company Deccan Electronics
is given in Figure 9.10. It can be seen that the dimension table for the items is
normalized, resulting in two tables, namely the item and supplier tables.
FACT CONSTELLATION
Most often, there may be a need to have more than one Fact Table and
these are called Fact constellations. A fact Constellation is a kind of schema
where we have more than one fact table sharing among them some Dimension
Tables. It is also called Galaxy Schema. For example, let us assume that decant
Electronics would like to have another fact table for supply and delivery. It
may contain five dimensions, or keys: time, item, delivery-agent, origin,
destination along with the numeric measure as the number of units supplied
and the cost of delivery. It can be seen that both fact Tables can share the same
item-Dimension table as well as time-Dimension table.
[Figure 9.10 Snowflake schema for Deccan Electronics: a sales fact table (time-key, item-key, location-key, rupees-sold, units-sold) with dimension tables for time (time-key, year, quarter, month, day) and item (item-key, item-name, brand, type, supplier-key), where the item dimension is further normalized through the supplier-key.]
[Figure 9.11 Fact constellation: the fact tables Fact1 and Fact2 share the dimension tables Dimension2 and Dimension3.]
9.10 DATA WAREHOUSING ARCHITECTURE
Figure 9.12 shows a typical data warehousing architecture. Very
often, this structure can be visualized as 3-tier architecture. Tier 1 is essentially
the warehouse server, Tier 2 is the OLAP engine for analytical processing, and
Tier 3 is a client containing reporting tools, visualization tools, data mining
tools, querying tools, etc. There is also the backend process which is concerned
with extracting data from multiple operational databases and from external
sources; with cleaning, transforming and integrating this data for loading into
the data warehouse server; and of course with periodically refreshing the
warehouse. Tier 1 contains the main data warehouse. It can follow one of three
models or some combination of these. It can be a single enterprise warehouse, or
may contain several departmental marts. The third model is to have a virtual
warehouse. Tier 2 follows three different ways of designing the OLAP engine,
namely ROLAP, MOLAP and extended SQL OLAP.
[Figure 9.12 Data warehousing architecture: background processes (cleaning, integration, metadata) feed the data warehouse server and data marts, which are served to analysis and mining tools through the OLAP engine.]
9.11 WAREHOUSE SERVER
The warehouse server holds the integrated data, for example sales data combined
with data from inventory and payroll. There are several ways to partition the data,
such as by business function or geographic region.
Historically, the implementation of a data warehouse has been limited
by the resource constraints and priorities of the MIS organization. The task of
implementing a data warehouse can be a very big effort, taking a significant
amount of time. And, depending on the implementation alternatives chosen, this
could dramatically impact the time it takes to see a payback or return on
investment. There are many alternatives for designing a data warehouse. One can
have a stand-alone data mart or a dependent data mart.
The current trend is to define the data warehouse as a conceptual
environment. The industry is moving away from a single, physical data
warehouse toward a set of smaller, more manageable databases called data
marts. The physical data marts together serve as the conceptual data
warehouse. These marts must provide the easiest possible access to the
information required by their user communities.
STAND-ALONE DATA MART
This approach enables a department or work-group to implement a data
mart with minimal or no impact on the enterprise’s operational database.
DEPENDENT DATA MART
This approach is similar to the stand-alone data mart, except that
management of the data sources by the enterprise database is required. These
data sources include operational databases and external sources of data.
VIRTUAL DATA WAREHOUSE
This model creates a virtual view of databases, allowing the creation of
a “virtual warehouse” as opposed to a physical warehouse. In a virtual
warehouse, you have a logical description of all the databases and their
structures, and individuals who want to get information from those databases
do not have to know anything about them.
This approach creates a single “virtual database” from all the data
resources. The data resources can be local or remote. In this type of a data
warehouse, the data is not moved from the sources. Instead, the users are given
direct access to the data. The direct access to the data is sometimes through
simple SQL queries, view definition, or data-access middleware. With this
approach, it is possible to access remote data sources including major
RDBMSs.
The virtual data warehouse scheme lets a client application access data
distributed across multiple data sources through a single SQL statement, a single
interface. All data sources are accessed as though they were local; users and their
applications do not even need to know the physical location of the data.
There is a great benefit in starting with a virtual warehouse, since many
organizations do not want to replicate information in the physical data
warehouse. Some organizations decide to provide both by creating a data
warehouse containing summary-level data with access to legacy data for
transaction details.
A virtual database is easy and fast, but it is not without problems. Since
the queries must compete with the production data transactions, its
performance can be considerably degraded. Since there is no metadata, no
summary data or history, all the queries must be repeated, creating an
additional burden on the system. Above all, there is no cleaning or refreshing
process involved, causing the queries to become very complex.
9.12 METADATA
Metadata is to the data warehouse what the card catalogue is to the
traditional library. It serves to identify the contents and location of data in the
warehouse. Metadata is a bridge between the data warehouse and the decision
support application. In addition to providing a logical linkage between data and
application, metadata can pinpoint access to information across the entire data
warehouse, and can enable the development of applications which
automatically update themselves to reflect data warehouse content changes.
In a traditional database a schema describes the conceptual or logical
data structures of all the objects or entities with which the database is
concerned, together with all relationships between them known to the database.
In such a well-defined setting the difference between metadata and data
disappears: metadata is simply data. However, in the context of the data
warehouse, metadata is needed to provide an unambiguous interpretation.
Metadata provides a catalogue of data in the data warehouse and the pointers to
this data. In addition to this, metadata may contain: data
extraction/transformation history, column aliases, data warehouse table sizes,
and data communication/modeling algorithms and data usage statistics.
Metadata is also used to describe many aspects of the applications, including
hierarchical relationships, stored formulae, whether calculations have to be
performed before or after consolidation, currency conversion information, time
series information, item description and notes for reporting, security and access
controls, data update status, formatting information, data sources, availability
of pre-calculated summary tables, and data storage parameters. In the absence
of this information, the actual data is not intelligible and it would not be wise to
attempt to view or update it.
A deeper look at what metadata is, expanding the basic “data about
data” definition, reveals the following: metadata, in its broadest sense, defines
and describes the entire application environment. It answers such questions as:
What does this field mean in business terms?
Which business processes does this set of queries support?
When did the job to update the customer data in our data mart last
run?
Which file contains the product data, where does it reside, and what
is its detailed structure?
A metadata repository should contain:
A description of the structure of the data warehouse. This includes the
warehouse schema, view, dimensions, hierarchies and derived data
definitions, data marts location and contents etc.
Operational metadata, such as data linkages, currency of data and
monitoring information (warehouse usage statistics, error reports, and
audit trails).
The summarization processes which include dimension definition, data
on granularity, partitions, summary measures, aggregation,
summarization, etc.
Details of data sources, which include source databases and their
contents, gateway descriptions, data partitions, data extraction,
cleaning and transformation rules, and defaults.
Data related to system performance, which include indices and profiles
that improve data access and retrieval performances, in addition to rules
for timing and scheduling of refresh, update, and replication cycles; and
Business metadata, which includes business terms and definitions, data
ownership information, and changing policies.
TYPES OF METADATA
Due to the variety of metadata, it is necessary to categorize it into
different types, based on how it is used. Thus, there are three broad categories
of metadata.
BUILD-TIME METADATA
Whenever we design and build a warehouse, the metadata that we
generate can be termed as build-time metadata. This metadata links business
and warehouse terminology and describes the data’s technical structure. It is
the most detailed and exact type of metadata and is used extensively by
warehouse designers, developers, and administrators. It is the primary source
of most of the metadata used in the warehouse.
USAGE METADATA
When the warehouse is in production, usage metadata, which is derived
from build-time metadata, is an important tool for users and data
administrators. This metadata is used differently from build-time metadata and
its structure must accommodate this fact.
CONTROL METADATA
The third way metadata is used is, of course, by the databases and other
tools to manage their own operations. For example, a DBMS builds an internal
representation of the database catalogue for use as a working copy from the
build-time catalogue. This representation functions as control metadata. Most
control metadata is of interest only to systems programmers. However, one
subset, which is generated and used by the tools that populate the warehouse, is
of considerable interest to users and data warehouse administrators. It provides
vital information about the timeliness of warehouse data and helps users track
the sequence and timing of warehouse events.
9.13 OLAP ENGINE
The main function of the OLAP engine is to present the user with a
multidimensional view of the data warehouse and to provide tools for OLAP
operations. If the warehouse server organizes the data warehouse in the form of
multidimensional arrays, then the implementation considerations of the OLAP
engine are different from those when the server keeps the warehouse in a
relational form.
SPECIALIZED SQL SERVER
This model assumes that the warehouse organizes data in a relational
structure and the engine provides an SQL-like environment for OLAP tools.
The main idea is to exploit the capabilities of SQL. We shall see that the
standard SQL is not suitable for OLAP operations.
RELATIONAL OLAP (ROLAP)
The ROLAP approach begins with the premise that data does not need
to be stored multidimensionally to be viewed multidimensionally. A scalable,
parallel, relational database provides the storage and high-speed access to the
underlying data. A middle analysis tier provides a multidimensional conceptual
view of the data and extended analytical functionality which is not
available in the underlying relational server. The presentation tier delivers the
results to the users. ROLAP systems provide the benefit of full analytical
functionality while maintaining the advantages of relational data.
ROLAP depends on a specialized schema design and its technology is
limited by its non-integrated, disparate tier architecture. The problem is that the
data is physically separated from the analytical processing. For many queries
this is not a major problem, but it limits the scope of analysis. Normally, the
ROLAP engine formulates optimized SQL statements that it sends to the
RDBMS server. It then takes the data back from the server, reintegrates it, and
performs further analysis and computation before delivering the finished
results to the user.
Two important features of ROLAP are
Data warehouse and relational database are inseparable
Any change in the dimensional structure requires a physical
reorganization of the database, which is too time consuming.
Certain applications are too fluid for this and the on-the-fly
dimensional view of a ROLAP tool is the only appropriate choice.
MULTIDIMENSIONAL OLAP (MOLAP)
The third option is to have a special purpose Multidimensional Data
Model for the data warehouse, with a Multidimensional OLAP (MOLAP)
server for analysis. The traditional ER model tends to be too complex and
difficult to navigate, as the most important data warehouse requirement is to
have fewer queries accessing a large number of records.
MOLAP servers support multidimensional views of data through array-
based data warehouse servers. They map multidimensional views of data cube
to array structures. The advantage of using a data cube is that it allows fast
indexing to precomputed summarized data. As with any multidimensional data
store, storage utilization may be low if the data set is sparse, so compression
techniques are often applied in such cases.
ROLAP VS MOLAP
The following arguments can be given in favour of MOLAP:
1. Relational tables are unnatural for multidimensional data.
2. Multidimensional arrays provide efficiency in storage and operations.
3. There is a mismatch between multidimensional operations and SQL.
4. For ROLAP to achieve efficiency, it has to perform outside current
relational systems, which is the same as what MOLAP does.
The following arguments can be given in favour of ROLAP:
1. ROLAP integrates naturally with existing technology and standards.
2. MOLAP does not support ad hoc queries effectively, because it is
optimized for multidimensional operations.
3. Since data has to be downloaded into MOLAP systems, updating is
difficult.
4. The efficiency of ROLAP can be achieved by using techniques such as
encoding and compression.
5. ROLAP can readily take advantage of parallel relational technology.
The claim that MOLAP performs better than ROLAP is intuitively
believable.
9.14 DATA WAREHOUSE BACKEND PROCESS
Data warehouse systems use backend tools and utilities to populate and
refresh their data. These tools and facilities include the following functions: (1)
data extraction, which gathers data from multiple, heterogeneous, and external
sources; (2) data cleaning, which detects errors in the data and rectifies them
when possible; (3) data transformation, which converts data from legacy or
host format to warehouse format; (4) load, which sorts, summarizes,
consolidates, computes views, checks integrity, and builds indices and
partitions; and (5) refresh, which propagates the updates from the data sources
to the warehouse.
DATA EXTRACTION
Data extraction is the process of extracting data for the warehouse from
various sources. The data may come from a variety of sources, such as
production data,
legacy data,
internal office systems,
external systems,
metadata.
DATA CLEANING
Since data warehouses are used for decision making, it is essential
that the data in the warehouse be correct. However, since large volumes of data
from heterogeneous sources are involved, there is a high probability of errors
in the data. Therefore, data cleaning is essential in the construction of quality
data warehouses. The data cleaning techniques include
using transformation rules, e.g., translating attribute names like
‘age’ to ‘DOB’;
using domain-specific knowledge;
performing parsing and fuzzy matching, e.g., for multiple data
sources, one can designate a preferred source as a matching
standard, and
auditing, i.e., discovering facts that flag unusual patterns.
It is difficult and costly, however, to clean the data that are entered as a
result of poor business practices, such as no clear naming conventions, no
consistent data standards, etc.
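Two of the techniques listed above, transformation rules for attribute names and fuzzy matching against a preferred source, can be sketched as follows; the rules, the record and the similarity threshold are illustrative.

from difflib import SequenceMatcher

# Transformation rules mapping source attribute names to warehouse names.
rename_rules = {"age": "DOB", "cust_nm": "customer_name"}

record = {"cust_nm": "Deccan Electroncs", "age": "1970-05-01"}
cleaned = {rename_rules.get(k, k): v for k, v in record.items()}

# Fuzzy matching against a preferred (standard) source of customer names.
standard_names = ["Deccan Electronics", "Periyar Traders"]

def best_match(value, candidates, threshold=0.8):
    """Return the closest standard value if it is similar enough."""
    scored = [(SequenceMatcher(None, value.lower(), c.lower()).ratio(), c)
              for c in candidates]
    score, candidate = max(scored)
    return candidate if score >= threshold else value

cleaned["customer_name"] = best_match(cleaned["customer_name"], standard_names)
print(cleaned)  # {'customer_name': 'Deccan Electronics', 'DOB': '1970-05-01'}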
DATA TRANSFORMATION
The sources of data for data warehouse are usually heterogeneous. Data
transformation is concerned with transforming heterogeneous data to a uniform
structure so that the data can be combined and integrated.
LOADING
Since a data warehouse integrates time-varying data from multiple
sources, the volumes of data to be loaded into a data warehouse can be huge.
Moreover, there is usually only an insufficient time interval when the warehouse
can be taken off-line, and when loading data, indices and summary tables need
to be rebuilt. A loading system should also allow system administrators to
monitor the status, cancel, suspend, resume loading or change the loading rate,
and restart loading after failures without any loss of data integrity.
There are different data loading strategies.
Batch loading.
Sequential loading.
Incremental loading.
REFRESH
When the source data is updated, we need to update the warehouse.
This process is called the refresh function. Determining how frequently to
refresh is an important issue. One extreme is to refresh on every update. This is
very expensive, however, and is normally only necessary when OLAP queries
need the most current data, as in an active data warehouse, for example, an
up-to-the-minute stock quotation. A more realistic choice is to perform refresh
periodically. Refresh policies should be set by the data administrator, based on
user needs and data traffic.