Introduction to Data Mining
Data mining is used by companies to learn customer preferences, set prices for their products and services, and analyze markets.
A data mining architecture has several components: the Data Warehouse, the Data Mining Engine, the Pattern Evaluation module, the User Interface, and the Knowledge Base.
Data Warehouse:
A data warehouse is a repository that stores information collected from multiple sources under a unified schema. Information stored in a data warehouse is critical to organizations for the process of decision-making.
Data Mining Engine:
The Data Mining Engine is the core component of the data mining process. It consists of various modules that perform tasks such as clustering, classification, prediction, and correlation analysis.
Pattern Evaluation:
The Pattern Evaluation module is responsible for identifying interesting patterns with the help of the Data Mining Engine.
User Interface:
The User Interface provides communication between the user and the data mining system. It allows users to work with the system easily even if they do not have detailed knowledge of it.
Knowledge Base:
The Knowledge Base consists of domain knowledge that is very important in the process of data mining. It provides input to the Data Mining Engine and guides it in the search for patterns.
The following points compare data mining and data warehousing.

Data Mining:
One of the most important benefits of data mining techniques is the detection and identification of errors in the system.
Data mining helps to create suggestive patterns of important factors, such as the buying habits of customers, products, and sales, so that companies can make the necessary adjustments in operation and production.
Data mining techniques are never 100% accurate and may cause serious consequences in certain conditions.
The information gathered through data mining can be misused by organizations against a group of people.
Organizations can benefit from this analytical tool by acquiring pertinent and usable knowledge-based information.
Organizations need to spend a lot of resources on training and implementation. Moreover, data mining tools work in different manners due to the different algorithms employed in their design.
Another critical benefit of data mining techniques is the identification of errors which can lead to losses. Generated data could be used to detect a drop in sales.

Data Warehouse:
One of the pros of a data warehouse is its ability to update consistently, which makes it ideal for the business owner who wants the best and latest features.
A data warehouse adds extra value to operational business systems, such as CRM systems, when the warehouse is integrated.
In a data warehouse, there is a great chance that data the organization requires for analysis may not be integrated into the warehouse, which can easily lead to loss of information.
Data warehouses are created as huge IT projects. They therefore involve high-maintenance systems, which can impact the revenue of small and medium-scale organizations.
A data warehouse stores a large amount of historical data, which helps users analyze different time periods and trends to make future predictions.
In a data warehouse, data is pooled from multiple sources. The data needs to be cleaned and transformed, which can be a challenge.
A data warehouse allows users to access critical data from a number of sources in a single place, saving the user's time in retrieving data from multiple sources.
Typical applications of data mining include the following:
Establish relevance and relationships amongst data, and use this information to generate profitable insights.
Help businesses make informed decisions quickly.
Help to find unusual shopping patterns in grocery stores.
Optimize website business by providing customized offers to each visitor.
Help to measure customers' response rates in business marketing.
Create and maintain new customer groups for marketing purposes.
Predict customer defections, such as which customers are more likely to switch to another supplier in the near future.
Differentiate between profitable and unprofitable customers.
Identify all kinds of suspicious behavior, as part of a fraud detection process.
Some people don’t differentiate data mining from knowledge discovery while others view
data mining as an essential step in the process of knowledge discovery. Here is the list of
steps involved in the knowledge discovery process −
Data Cleaning − In this step, noise and inconsistent data are removed.
Data Integration − In this step, multiple data sources are combined.
Data Selection − In this step, data relevant to the analysis task are retrieved from the
database.
Data Transformation − In this step, data is transformed or consolidated into forms
appropriate for mining by performing summary or aggregation operations.
Data Mining − In this step, intelligent methods are applied in order to extract data
patterns.
Pattern Evaluation − In this step, data patterns are evaluated.
Knowledge Presentation − In this step, knowledge is represented.
The process of finding and interpreting patterns from data involves the repeated application
of the following steps:
Choosing the data mining task:
Deciding whether the goal of the KDD process is classification, regression, clustering, etc.
Choosing the data mining algorithm(s):
Selecting method(s) to be used for searching for patterns in the data.
Deciding which models and parameters may be appropriate.
Matching a particular data mining method with the criteria of the KDD process.
Data Mining:
Searching for patterns of interest in a particular representational form, such as classification rules or trees, regression models, or clusters.
Interpreting mined patterns.
Consolidating discovered knowledge.
Most of the time, new tools and methodologies have to be developed to extract the relevant information.
Performance
The performance of the data mining system mainly depends on the efficiency of algorithms
and techniques used. If the algorithms and techniques designed are not up to the mark, then it
will affect the performance of the data mining process adversely.
Incorporation of Background Knowledge
If background knowledge can be incorporated, more reliable and accurate data mining
solutions can be found. Descriptive tasks can come up with more useful findings and
predictive tasks can make more accurate predictions. But collecting and incorporating
background knowledge is a complex process.
Data Visualization
Data visualization is a very important process in data mining because it is the main means of presenting the output to the user. The extracted information should convey exactly what it intends to convey. However, it is often difficult to represent the information in an accurate and easy-to-understand way for the end user. Because the input data and the output information can be very complex, effective data visualization techniques need to be applied to make the presentation successful.
Data Privacy and Security
Data mining normally leads to serious issues in terms of data security, privacy and
governance. For example, when a retailer analyzes the purchase details, it reveals information
about buying habits and preferences of customers without their permission.
1 Association
2 Classification
3 Clustering
4 Sequential patterns
5 Decision tree.
1 Association Technique
The association technique helps to find patterns in huge amounts of data based on a relationship between two or more items of the same transaction. It is used for market analysis; that is, it helps us analyze people's buying habits.
For example, you might identify that a customer always buys ice cream whenever he comes to watch a movie, so it is likely that the next time he comes to watch a movie he will also want to buy ice cream.
2 Classification Technique
Assume you have a set of records, where each record contains a set of attributes; based on these attributes, you can predict the class of unseen or unknown records. For example, given the records of all employees who left the company, the classification technique can be used to predict which current employees will probably leave the company in a future period.
3 Clustering Technique
Clustering is one of the oldest techniques used in the process of data mining. The main aim of the clustering technique is to form clusters (groups) from pieces of data that share common characteristics. Clustering helps to identify the differences and similarities between the data.
Take the example of a shop in which many items are for sale; the challenge is how to arrange those items so that a customer can easily find what he needs. Using the clustering technique, you can keep items that share certain similarities in one corner and items with different similarities in another corner, as in the sketch below.
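To make the idea concrete, here is a minimal k-means-style sketch in Python (not part of the original text): it groups made-up shop items, described by invented (price, shelf-life) features, into two clusters of similar items.

```python
import random

def kmeans(points, k, iterations=20):
    """Tiny k-means sketch: group points into k clusters of similar points."""
    centroids = random.sample(points, k)              # pick initial cluster centres
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                              # assign each point to its nearest centroid
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        for i, cluster in enumerate(clusters):        # move each centroid to its cluster's mean
            if cluster:
                centroids[i] = tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
    return clusters

# Hypothetical item features: (price, shelf life in days)
items = [(2.0, 7), (2.5, 5), (30.0, 365), (28.0, 400), (3.0, 6), (25.0, 300)]
print(kmeans(items, k=2))   # perishable low-priced items vs. durable higher-priced items
```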
4 Sequential patterns
Sequential patterns are a useful method for identifying trends and recurring patterns.
For example, if customer data shows that a customer buys a particular product at a particular time of year, you can use this information to suggest that product to the customer at that time of year.
5 Decision tree
A decision tree is one of the most commonly used data mining techniques because its model is easy for users to understand. In a decision tree, you start with a simple question that has two or more answers. Each answer leads to further questions that help us reach a final decision. The root node of the decision tree is that first simple question.
Decision tree example: first check the water level. If the water level is above 50 ft, an alert is sent; if it is below 50 ft, check again. If the level is above 30 ft, a warning is sent; if it is below 30 ft, the water level is in the normal range.
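The water-level rules described above translate directly into nested conditions. The sketch below is only an illustration of that decision tree; the function name and the use of returned strings instead of real alerts are assumptions made for the example.

```python
def water_level_status(level_ft: float) -> str:
    """Decision tree from the text: > 50 ft -> alert, > 30 ft -> warning, else normal."""
    if level_ft > 50:        # root question: is the water level above 50 ft?
        return "alert"
    elif level_ft > 30:      # next question on the lower branch
        return "warning"
    else:
        return "normal range"

print(water_level_status(55))   # alert
print(water_level_status(40))   # warning
print(water_level_status(20))   # normal range
```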
Data Summarization
Data summarization, in simple terms, is a short conclusion drawn from a large body of data. You process the data and, at the end, report the final result in summarized form. Data summarization is of great importance in data mining, since many programmers and developers now work with big data. Earlier it was difficult to produce such results, but today there are many relevant tools on the market that you can use in your programs or wherever you need them in your data.
You can filter the data you want when you retrieve it. The data summarization technique gives a good-quality summary of the data, and a customer or user can benefit from it in their research. Excel is a convenient tool for data summarization, and I will discuss it briefly.
Now, look at the table named Register, which contains six columns: Invoice Date, Customer, Type, Country, Amount, and Status. We have already inserted random data for convenience. Suppose we want the total amount for the 'Paid' and 'Future' statuses in the table. Let's understand this with an example of the SUMIF formula, as follows.
In the output, we used the SUMIF formula; its arguments are (Range, Criteria, Sum_Range).
ARGUMENTS
Range: The range whose string or number values are checked against the criteria.
Criteria: The particular criteria to match within the range.
Sum_Range: The range whose values are actually summed.
Once you type the formula, press the Enter key to get the result, as follows.
Now let me explain exactly what happened in the result. First, data is taken from the "Status" column as the Range, the string "Paid" is selected as the Criteria, and finally the "Amount" column is summed over all rows whose status is "Paid". This is a basic summarization step; next we will discuss some more complex data.
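Outside Excel, the same summarization can be sketched in a few lines of pandas. The miniature Register table below is invented for illustration and only mirrors the columns described above.

```python
import pandas as pd

# A tiny, made-up version of the Register table
register = pd.DataFrame({
    "Customer": ["A", "B", "C", "D"],
    "Status":   ["Paid", "Future", "Paid", "Paid"],
    "Amount":   [100.0, 250.0, 75.0, 40.0],
})

# Equivalent of =SUMIF(Status, "Paid", Amount): sum Amount where Status is "Paid"
paid_total = register.loc[register["Status"] == "Paid", "Amount"].sum()

# Totals for every status at once (Paid, Future, ...)
totals_by_status = register.groupby("Status")["Amount"].sum()

print(paid_total)          # 215.0
print(totals_by_status)
```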
Data Cleaning:
Real-world data tend to be incomplete, noisy, and inconsistent. Data cleaning (or data
cleansing) routines attempt to fill in missing values, smooth out noise while identifying
outliers, and correct inconsistencies in the data.
a) Missing Values
Imagine that you need to analyze AllElectronics sales and customer data. You note that many
tuples have no recorded value for several attributes such as customer income. How can you
go about filling in the missing values for this attribute? Let’s look at the following methods.
1. Ignore the tuple: This is usually done when the class label is missing (assuming the
mining task involves classification). This method is not very effective, unless the tuple
contains several attributes with missing values. It is especially poor when the percentage of
missing values per attribute varies considerably. By ignoring the tuple, we do not make use of
the remaining attributes’ values in the tuple. Such data could have been useful to the task at
hand.
2. Fill in the missing value manually: In general, this approach is time consuming and may
not be feasible given a large data set with many missing values.
3. Use a global constant to fill in the missing value: Replace all missing attribute values by
the same constant such as a label like “Unknown” or −∞. If missing values are replaced by,
say, “Unknown,” then the mining program may mistakenly think that they form an interesting
concept, since they all have a value in common—that of “Unknown.” Hence, although this
method is simple, it is not foolproof.
4. Use a measure of central tendency for the attribute (e.g., the mean or median) to fill
in the missing value: Chapter 2 discussed measures of central tendency, which indicate the
“middle” value of a data distribution. For normal (symmetric) data distributions, the mean
can be used, while skewed data distribution should employ the median. For example, suppose
that the data distribution regarding the income of AllElectronics customers is symmetric and
that the mean income is $56,000. Use this value to replace the missing value for income.
5. Use the attribute mean or median for all samples belonging to the same class as the
given tuple: For example, if classifying customers according to credit risk, we may replace
the missing value with the mean income value for customers in the same credit risk category
as that of the given tuple. If the data distribution for a given class is skewed, the median value
is a better choice.
6. Use the most probable value to fill in the missing value: This may be determined with
regression, inference-based tools using a Bayesian formalism, or decision tree induction.
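As a quick illustration of methods 4 and 5, the sketch below fills missing income values first with the overall median and then with the mean income of the same credit-risk class; the column names and values are invented for the example.

```python
import pandas as pd

# Hypothetical customer data with missing income values
df = pd.DataFrame({
    "credit_risk": ["low", "low", "high", "high", "low"],
    "income":      [56000, None, 31000, None, 60000],
})

# Method 4: fill with a global measure of central tendency (the median here)
global_fill = df["income"].fillna(df["income"].median())

# Method 5: fill with the mean income of tuples in the same credit-risk class
class_fill = df["income"].fillna(
    df.groupby("credit_risk")["income"].transform("mean")
)

print(global_fill.tolist())
print(class_fill.tolist())
```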
b) Noisy Data:
Binning: Binning methods smooth a sorted data value by consulting its “neighborhood,” that
is, the values around it. The sorted values are distributed into a number of “buckets,” or bins.
Because binning methods consult the neighborhood of values, they perform local smoothing.
Figure 3.2 illustrates some binning techniques. In this example, the data for price are first
sorted and then partitioned into equal-frequency bins of size 3 (i.e., each bin contains three
values). In smoothing by bin means, each value in a bin is replaced by the mean value of the
bin. For example, the mean of the values 4, 8, and 15 in Bin 1 is 9. Therefore, each original
value in this bin is replaced by the value 9.
Similarly, smoothing by bin medians can be employed, in which each bin value is replaced
by the bin median. In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin boundaries. Each bin value is then replaced by the closest boundary value. In general, the larger the width, the greater the effect of the smoothing.
Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into (equal-frequency) bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
Smoothing by bin means:
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29
Smoothing by bin boundaries:
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34
Figure 3.2 Binning methods for data smoothing.
Alternatively, bins may be equal width, where the interval range of values in each bin is
constant. Binning is also used as a discretization technique.
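The Figure 3.2 example can be reproduced with a short sketch: the sorted prices are partitioned into equal-frequency bins of size 3 and then smoothed by bin means and by bin boundaries.

```python
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]   # already sorted
bin_size = 3

# Partition into equal-frequency bins of three values each
bins = [prices[i:i + bin_size] for i in range(0, len(prices), bin_size)]

# Smoothing by bin means: replace every value by its bin's mean
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: replace every value by the nearer of the bin's min/max
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

print(bins)        # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(by_means)    # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(by_bounds)   # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```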
Regression: Data smoothing can also be done by regression, a technique that conforms data values to a function. Linear regression involves finding the "best" line to fit two attributes (or variables) so that one attribute can be used to predict the other. Multiple linear regression is an extension in which more than two attributes are involved and the data are fit to a multidimensional surface.
c) Data Cleaning as a Process
Missing values, noise, and inconsistencies contribute to inaccurate data. So far, we have
looked at techniques for handling missing data and for smoothing data. But data cleaning is a big job.
The first step in data cleaning as a process is discrepancy detection. Discrepancies can be
caused by several factors, including poorly designed data entry forms that have many
optional fields, human error in data entry, deliberate errors (e.g., respondents not wanting to
divulge information about themselves), and data decay (e.g., outdated addresses).
Discrepancies may also arise from inconsistent data representations and inconsistent use of
codes. Other sources of discrepancies include errors in instrumentation devices that record
data and system errors. Errors can also occur when the data are (inadequately) used for
purposes other than originally intended. There may also be inconsistencies due to data
integration (e.g., where a given attribute can have different names in different databases).
Data Integration:
Data mining often requires data integration—the merging of data from multiple data stores.
Careful integration can help reduce and avoid redundancies and inconsistencies in the
resulting data set. This can help improve the accuracy and speed of the subsequent data
mining process.
The semantic heterogeneity and structure of data pose great challenges in data integration.
It is likely that your data analysis task will involve data integration, which combines data
from multiple sources into a coherent data store, as in data warehousing. These sources may
include multiple databases, data cubes, or flat files.
There are a number of issues to consider during data integration. Schema integration and
object matching can be tricky. How can equivalent real-world entities from multiple data
sources be matched up? This is referred to as the entity identification problem.
For example, how can the data analyst or the computer be sure that customer id in one
database and cust number in another refer to the same attribute? Examples of metadata for
each attribute include the name, meaning, data type, and range of values permitted for the
attribute, and null rules for handling blank, zero, or null values (Section 3.2). Such metadata
can be used to help avoid errors in schema integration. The metadata may also be used to help
transform the data; hence, this step also relates to data cleaning, as described earlier. When
matching attributes from one database to another during integration, special attention must be
paid to the structure of the data. This is to ensure that any attribute functional dependencies
and referential constraints in the source system match those in the target system. For
example, in one system, a discount may be applied to the order, whereas in another system it
is applied to each individual line item within the order. If this is not caught before integration,
items in the target system may be improperly discounted.
2. Data Value Conflict Detection and Resolution
Data integration also involves the detection and resolution of data value conflicts. For
example, for the same real-world entity, attribute values from different sources may differ.
This may be due to differences in representation, scaling, or encoding. For instance, a weight
attribute may be stored in metric units in one system and British imperial units in another. For
a hotel chain, the price of rooms in different cities may involve not only different currencies
but also different services (e.g., free breakfast) and taxes.
When exchanging information between schools, for example, each school may have its own curriculum and grading scheme. One university may adopt a quarter system, offer three courses on database systems, and assign grades from A+ to F, whereas another may adopt a
semester system, offer two courses on databases, and assign grades from 1 to 10. It is difficult
to work out precise course-to-grade transformation rules between the two universities,
making information exchange difficult.
Data Transformation:
Data Transformation Strategies Overview
In data transformation, the data are transformed or consolidated into forms appropriate for
mining. Strategies for data transformation include the following:
1. Smoothing, which works to remove noise from the data. Techniques include binning,
regression, and clustering.
2. Attribute construction (or feature construction), where new attributes are constructed and
added from the given set of attributes to help the mining process.
3. Aggregation, where summary or aggregation operations are applied to the data. For
example, the daily sales data may be aggregated so as to compute monthly and annual total
amounts. This step is typically used in constructing a data cube for data analysis at multiple
abstraction levels.
4. Normalization, where the attribute data are scaled so as to fall within a smaller range, such as −1.0 to 1.0 or 0.0 to 1.0.
5. Discretization, where the raw values of a numeric attribute (e.g., age) are replaced by
interval labels (e.g., 0–10, 11–20, etc.) or conceptual labels (e.g., youth, adult, senior). The
labels, in turn, can be recursively organized into higher-level concepts, resulting in a concept
hierarchy for the numeric attribute. Figure 3.12 shows a concept hierarchy for the attribute
price. More than one concept hierarchy can be defined for the same attribute to accommodate
the needs of various users.
6. Concept hierarchy generation for nominal data, where attributes such as street can be
generalized to higher-level concepts, like city or country. Many hierarchies for nominal
attributes are implicit within the database schema and can be automatically defined at the
schema definition level.
Data Transformation by Normalization
The measurement unit used can affect the data analysis. For example, changing measurement units from meters to inches for height, or from kilograms to pounds for weight, may lead to
very different results. In general, expressing an attribute in smaller units will lead to a larger
range for that attribute, and thus tend to give such an attribute greater effect or “weight.” To
help avoid dependence on the choice of measurement units, the data should be normalized or
standardized. This involves transforming the data to fall within a smaller or common range
such as [−1, 1] or [0.0, 1.0]. (The terms standardize and normalize are used interchangeably
in data preprocessing, although in statistics, the latter term also has other connotations.)
Normalizing the data attempts to give all attributes an equal weight. Normalization is
particularly useful for classification algorithms involving neural networks or distance
measurements such as nearest-neighbor classification and clustering. If using the neural
network backpropagation algorithm for classification mining, normalizing the input values
for each attribute measured in the training tuples will help speed up the learning phase. For
distance-based methods, normalization helps prevent attributes with initially large ranges
(e.g., income) from outweighing attributes with initially smaller ranges (e.g., binary
attributes). It is also useful when given no prior knowledge of the data.
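A short sketch of two widely used normalization formulas (not spelled out in the passage above): min-max scaling into a chosen range such as [0.0, 1.0], and z-score standardization. The income values are invented.

```python
def min_max(values, new_min=0.0, new_max=1.0):
    """Scale values linearly so they fall within [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]

def z_score(values):
    """Standardize values to mean 0 and standard deviation 1."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values]

incomes = [12000, 35000, 56000, 98000]   # hypothetical income attribute
print(min_max(incomes))                  # values in [0.0, 1.0]
print(z_score(incomes))                  # values centred around 0
```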
Data Reduction
Imagine that you have selected data from the AllElectronics data warehouse for analysis. The
data set will likely be huge! Complex data analysis and mining on huge amounts of data can
take a long time, making such analysis impractical or infeasible.
Data reduction techniques can be applied to obtain a reduced representation of the data set
that is much smaller in volume, yet closely maintains the integrity of the original data. That is,
mining on the reduced data set should be more efficient yet produce the same (or almost the
same) analytical results.
Overview of Data Reduction Strategies
Data reduction strategies include dimensionality reduction, numerosity reduction, and data
compression.
Dimensionality reduction is the process of reducing the number of random variables or
attributes under consideration. Dimensionality reduction methods include wavelet transforms
and principal components analysis, which transform or project the original data onto a
smaller space. Attribute subset selection is a method of dimensionality reduction in which
irrelevant, weakly relevant, or redundant attributes or dimensions are detected and removed
(Section 3.4.4).
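As a small illustration of dimensionality reduction, the sketch below projects made-up three-attribute data onto its first two principal components using a plain SVD. This is only a toy version of principal components analysis, under invented data, not the wavelet-based methods mentioned above.

```python
import numpy as np

# Made-up data set: 6 tuples described by 3 numeric attributes
X = np.array([
    [2.5, 2.4, 0.5],
    [0.5, 0.7, 1.2],
    [2.2, 2.9, 0.3],
    [1.9, 2.2, 0.8],
    [3.1, 3.0, 0.1],
    [2.3, 2.7, 0.4],
])

Xc = X - X.mean(axis=0)                     # centre each attribute
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X_reduced = Xc @ Vt[:2].T                   # project onto the first 2 principal components

print(X_reduced.shape)                      # (6, 2): same tuples, fewer attributes
```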
Numerosity reduction techniques replace the original data volume by alternative, smaller
forms of data representation. These techniques may be parametric or nonparametric. For
parametric methods, a model is used to estimate the data, so that typically only the data
parameters need to be stored, instead of the actual data. (Outliers may also be stored.)
Regression and log-linear models (Section 3.4.5) are examples. Nonparametric methods for
storing reduced representations of the data include histograms (Section 3.4.6), clustering
(Section 3.4.7), sampling (Section 3.4.8), and data cube aggregation (Section 3.4.9).
In data compression, transformations are applied so as to obtain a reduced or “compressed”
representation of the original data. If the original data can be reconstructed from the
compressed data without any information loss, the data reduction is called lossless. If,
instead, we can reconstruct only an approximation of the original data, then the data reduction
is called lossy. There are several lossless algorithms for string compression; however, they
typically allow only limited data manipulation. Dimensionality reduction and numerosity
reduction techniques can also be considered forms of data compression.
Histograms
Histograms use binning to approximate data distributions and are a popular form of data
reduction. Histograms were introduced in Section 2.2.3. A histogram for an attribute, A,
partitions the data distribution of A into disjoint subsets, referred to as buckets or bins. If each
bucket represents only a single attribute–value/frequency pair, the buckets are called
singleton buckets. Often, buckets instead represent continuous ranges for the given attribute.
Example 3.3 Histograms. The following data are a list of AllElectronics prices for
commonly sold items (rounded to the nearest dollar). The numbers have been sorted: 1, 1, 5,
5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18,
18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30.
Figure 3.7 shows a histogram for the data using singleton buckets. To further reduce the data,
it is common to have each bucket denote a continuous value range for the given attribute. In
Figure 3.8, each bucket represents a different $10 range for price.
“How are the buckets determined and the attribute values partitioned?” There are several
partitioning rules, including the following:
Equal-width: In an equal-width histogram, the width of each bucket range is uniform (e.g.,
the width of $10 for the buckets in Figure 3.8).
Equal-frequency (or equal-depth): In an equal-frequency histogram, the buckets are created
so that, roughly, the frequency of each bucket is constant (i.e., each bucket contains roughly
the same number of contiguous data samples).
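Both bucketing rules can be sketched directly on the Example 3.3 price list: equal-width buckets cover fixed $10 ranges, while equal-frequency buckets each hold roughly the same number of values. The choice of $10-wide buckets follows the text; the choice of four equal-frequency buckets is made up for the illustration.

```python
prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15,
          15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20,
          20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30]

# Equal-width histogram: count the values falling in each $10-wide bucket
width = 10
equal_width = {}
for p in prices:
    low = (p - 1) // width * width + 1                      # bucket ranges 1-10, 11-20, 21-30
    equal_width[(low, low + width - 1)] = equal_width.get((low, low + width - 1), 0) + 1

# Equal-frequency histogram: split the sorted list into buckets of roughly equal size
n_buckets = 4
size = len(prices) // n_buckets
equal_freq = [prices[i * size:(i + 1) * size] for i in range(n_buckets - 1)]
equal_freq.append(prices[(n_buckets - 1) * size:])          # last bucket takes the remainder

print(equal_width)                   # {(1, 10): 13, (11, 20): 25, (21, 30): 14}
print([len(b) for b in equal_freq])  # [13, 13, 13, 13]
```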
Top-down discretization
If the process starts by first finding one or a few points (called split points or cut points) to
split the entire attribute range, and then repeats this recursively on the resulting intervals, then
it is called top-down discretization or splitting.
Bottom-up discretization
If the process starts by considering all of the continuous values as potential split points and removes some of them by merging neighboring values to form intervals, then it is called bottom-up discretization or merging.
Concept hierarchies
Concept hierarchies can be used to reduce the data by collecting and replacing low-level
concepts with higher-level concepts.
In the multidimensional model, data are organized into multiple dimensions, and each
dimension contains multiple levels of abstraction defined by concept hierarchies. This
organization provides users with the flexibility to view data from different perspectives.
Data mining on a reduced data set means fewer input/output operations and is more efficient
than mining on a larger data set.
Because of these benefits, discretization techniques and concept hierarchies are typically
applied before data mining, rather than during mining.
Discretization and Concept Hierarchy Generation for Numerical Data
Typical methods
1 Binning
2 Histogram Analysis
3 Cluster Analysis
Cluster analysis is a popular data discretization method. A clustering algorithm can be applied to discretize a numeric attribute, A, by partitioning the values of A into clusters or groups. Each initial cluster or partition may be further decomposed into several subclusters, forming a lower level of the hierarchy.
We now look at data transformation for nominal data. In particular, we study concept
hierarchy generation for nominal attributes. Nominal attributes have a finite (but possibly
large) number of distinct values, with no ordering among the values. Examples include
geographic location, job category, and item type.
Manual definition of concept hierarchies can be a tedious and time-consuming task for a user
or a domain expert. Fortunately, many hierarchies are implicit within the database schema
and can be automatically defined at the schema definition level. The concept hierarchies can
be used to transform the data into multiple levels of granularity.
For example, data mining patterns regarding sales may be found relating to specific regions
or countries, in addition to individual branch locations.
We study four methods for the generation of concept hierarchies for nominal data, as follows.
1. Specification of a partial ordering of attributes explicitly at the schema level by users
or experts: Concept hierarchies for nominal attributes or dimensions typically involve a
group of attributes. A user or expert can easily define a concept hierarchy by specifying a
partial or total ordering of the attributes at the schema level. For example, suppose that a
relational database contains the following group of attributes:
street, city, province or state, and country. Similarly, a data warehouse location dimension
may contain the same attributes. A hierarchy can be defined by specifying the total ordering
among these attributes at the schema level such as street < city < province or state < country.
2. Specification of a portion of a hierarchy by explicit data grouping: This is essentially
the manual definition of a portion of a concept hierarchy. In a large database, it is unrealistic
to define an entire concept hierarchy by explicit value enumeration. On the contrary, we can
easily specify explicit groupings for a small portion of intermediate-level data. For example,
after specifying that province and country form a hierarchy at the schema level, a user could
define some intermediate levels manually, such as "{Alberta, Saskatchewan, Manitoba} ⊂ prairies Canada" and "{British Columbia, prairies Canada} ⊂ Western Canada."
3. Specification of a set of attributes, but not of their partial ordering: A user may specify
a set of attributes forming a concept hierarchy, but omit to explicitly state their partial
ordering. The system can then try to automatically generate the attribute ordering so as to
construct a meaningful concept hierarchy.
“Without knowledge of data semantics, how can a hierarchical ordering for an arbitrary set
of nominal attributes be found?” Consider the observation that since higher-level concepts
generally cover several subordinate lower-level concepts, an attribute defining a high concept
level (e.g., country) will usually contain a smaller number of distinct values than an attribute
defining a lower concept level (e.g., street). Based on this observation, a concept hierarchy
can be automatically generated based on the number of distinct values per attribute in the
given attribute set.
The attribute with the most distinct values is placed at the lowest hierarchy level. The lower the number of distinct values an attribute has, the higher it is in the generated concept hierarchy. This heuristic rule works well in many cases. Some local-level swapping or
adjustments may be applied by users or experts, when necessary, after examination of the
generated hierarchy.
Let’s examine an example of this third method.
Example 3.7 Concept hierarchy generation based on the number of distinct values per
attribute.
Suppose a user selects a set of location-oriented attributes—street, country, province or state, and city—from the AllElectronics database, but does not specify the hierarchical ordering
among the attributes.
A concept hierarchy for location can be generated automatically, as illustrated in Figure 3.13.
First, sort the attributes in ascending order based on the number of distinct values in each
attribute. This results in the following (where the number of distinct values per attribute is
shown in parentheses): country (15), province or state (365), city (3567), and street
(674,339). Second, generate the hierarchy from the top down according to the sorted order,
with the first attribute at the top level and the last attribute at the bottom level. Finally, the
user can examine the generated hierarchy, and when necessary, modify it to reflect desired
semantic relationships among the attributes. In this example, it is obvious that there is no
need to modify the generated hierarchy.
Note that this heuristic rule is not foolproof. For example, a time dimension in a database
may contain 20 distinct years, 12 distinct months, and 7 distinct days of the week. However,
this does not suggest that the time hierarchy should be "year < month < days of the week,"
with days of the week at the top of the hierarchy.
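Method 3 can be sketched in a few lines: count the distinct values of each attribute and order the hierarchy from the fewest distinct values (top level) to the most (bottom level). The counts below are the ones quoted in Example 3.7.

```python
# Distinct-value counts per attribute, as quoted in Example 3.7
distinct_counts = {
    "country": 15,
    "province_or_state": 365,
    "city": 3567,
    "street": 674339,
}

# Fewest distinct values -> highest level; most distinct values -> lowest level
levels = sorted(distinct_counts, key=distinct_counts.get)      # top level first
print(" < ".join(reversed(levels)))   # street < city < province_or_state < country
```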
4. Specification of only a partial set of attributes: Sometimes a user can be careless when
defining a hierarchy, or have only a vague idea about what should be included in a hierarchy.
Consequently, the user may have included only a small subset of the
relevant attributes in the hierarchy specification. For example, instead of including all of the
hierarchically relevant attributes for location, the user may have specified only street and
city. To handle such partially specified hierarchies, it is important to embed data semantics in
the database schema so that attributes with tight semantic connections can be pinned together.
In this way, the specification of one attribute may trigger a whole group of semantically
tightly linked attributes to be "dragged in" to form a complete hierarchy. Users, however,
should have the option to override this feature, as necessary.
Example 3.8 Concept hierarchy generation using prespecified semantic connections.
Suppose that a data mining expert (serving as an administrator) has pinned together the five
attributes number, street, city, province or state, and country, because they are closely linked
semantically regarding the notion of location. If a user were to specify only the attribute city
for a hierarchy defining location, the system can automatically drag in all five semantically
related attributes to form a hierarchy. The user may choose to drop any of these attributes
(e.g., number and street) from the hierarchy, keeping city as the lowest conceptual level.
In summary, information at the schema level and on attribute–value counts can be used to
generate concept hierarchies for nominal data. Transforming nominal data with the use of
concept hierarchies allows higher-level knowledge patterns to be found. It allows mining at
multiple levels of abstraction, which is a common requirement for data mining applications.
Association Rule Generation:
History
While the concepts behind association rules can be traced back earlier, association rule
mining was defined in the 1990s, when computer scientists Rakesh Agrawal, Tomasz
Imieliński and Arun Swami developed an algorithm-based way to find relationships between
items using point-of-sale (POS) systems. Applying the algorithms to supermarkets, the
scientists were able to discover links between different items purchased, called association
rules, and ultimately use that information to predict the likelihood of different products being
purchased together.
Association rule mining, at a basic level, involves the use of machine learning models to
analyze data for patterns, or co-occurrence, in a database. It identifies frequent if-then
associations, which are called association rules.
An association rule has two parts: an antecedent (if) and a consequent (then). An antecedent
is an item found within the data. A consequent is an item found in combination with the
antecedent.
Association rules are created by searching data for frequent if-then patterns and using the
criteria support and confidence to identify the most important relationships. Support is an
indication of how frequently the items appear in the data. Confidence indicates the number of
times the if-then statements are found true. A third metric, called lift, can be used to compare
confidence with expected confidence.
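A minimal sketch of how support, confidence, and lift could be computed for one rule over a handful of made-up transactions; the item names and transactions are invented for the example.

```python
# Made-up market-basket transactions
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """How often the rule antecedent -> consequent is found to be true."""
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    """Confidence compared with the confidence expected if the items were independent."""
    return confidence(antecedent, consequent) / support(consequent)

rule = ({"diapers"}, {"beer"})
print(support(rule[0] | rule[1]))   # 0.6
print(confidence(*rule))            # 0.75
print(lift(*rule))                  # 1.25 (> 1: bought together more often than expected)
```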
Association rules are calculated from itemsets, which are made up of two or more items. If rules were built by analyzing all possible itemsets, there could be so many rules that they would hold little meaning. Therefore, association rules are typically created from itemsets that are well represented in the data.
In data mining, association rules are useful for analyzing and predicting customer behavior.
They play an important part in customer analytics, market basket analysis, product clustering,
catalog design and store layout.
Programmers use association rules to build programs capable of machine learning. Machine
learning is a type of artificial intelligence (AI) that seeks to build programs with the ability to
become more efficient without being explicitly programmed.
Binarization
What is Binarization?
Binarization is the process of transforming data features of any entity into vectors of binary
numbers to make classifier algorithms more efficient. In a simple example, transforming an
image’s gray-scale from the 0-255 spectrum to a 0-1 spectrum is binarization.
In machine learning, even the most complex concepts can be transformed into binary form.
For example, to binarize the sentence "The dog ate the cat," every word is assigned an ID (for example dog-1, ate-2, the-3, cat-4). Each word is then replaced with its ID to give the vector <3,1,2,3,4>. This can be refined by giving each word four possible slots and setting the slot that corresponds to the specific word: <0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,1>. This is commonly referred to as the bag-of-words method.
Since the ultimate goal is to make this data easier for the classifier to read while minimizing
memory usage, it’s not always necessary to encode the whole sentence or all the details of a
complex concept. In this case, only the current state of how the data is parsed is needed for
the classifier. For example, when the top word on the stack is used as the first word in the
input queue. Since order is quite important, a simpler binary vector is preferable.
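A small sketch of the word-ID and one-hot encodings described above, using the hypothetical IDs from the text. Note that in this sketch every one of the five words (including the repeated "the") gets its own group of four slots, so the binary vector has 20 bits rather than the 16 shown in the example.

```python
sentence = "the dog ate the cat"
vocab = {"dog": 1, "ate": 2, "the": 3, "cat": 4}     # IDs as in the text's example

# Step 1: replace each word with its ID
id_sequence = [vocab[w] for w in sentence.split()]   # [3, 1, 2, 3, 4]

# Step 2: binarize each ID into a vector with one slot per vocabulary word
def one_hot(word_id, size=len(vocab)):
    vec = [0] * size
    vec[word_id - 1] = 1          # switch on the slot for this word
    return vec

binary_vector = [bit for word_id in id_sequence for bit in one_hot(word_id)]
print(id_sequence)
print(binary_vector)              # 5 words x 4 slots = 20 bits
```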
Measures of Similarity and Dissimilarity
Distance or similarity measures are essential to solve many pattern recognition problems such
as classification and clustering. Various distance/similarity measures are available in
literature to compare two data distributions. As the names suggest, a similarity measure indicates how close two distributions are. For multivariate data, complex summary methods have been developed to answer this question.
Similarity Measure
Numerical measure of how alike two data objects are.
Often falls between 0 (no similarity) and 1 (complete similarity).
Dissimilarity Measure
Numerical measure of how different two data objects are.
Range from 0 (objects are alike) to ∞ (objects are different).
Proximity refers to a similarity or dissimilarity.
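As a short illustration, the sketch below computes one dissimilarity measure (Euclidean distance: 0 for identical objects, growing without bound as they differ) and one similarity measure (cosine similarity: close to 1 for very similar objects). The two data objects are invented.

```python
import math

def euclidean_distance(x, y):
    """Dissimilarity: 0 when the objects are identical, larger as they differ."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_similarity(x, y):
    """Similarity: close to 1 when the objects point in nearly the same direction."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

p = [3.0, 4.5, 1.0]   # made-up data objects
q = [2.5, 5.0, 1.5]
print(euclidean_distance(p, q))
print(cosine_similarity(p, q))
```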