
2. Introduction to Data Mining


What Is Data Mining?
Data mining refers to the extraction of useful information from large amounts of data. Data mining is important today because companies and organizations of many kinds hold enormous volumes of data. It is impossible for humans to extract information from such large data sets manually, so machine learning techniques are used to process the data fast enough to extract information from it.

Companies use data mining to understand customer preferences, determine the prices of their products and services, and analyze markets.

Data mining is also known as Knowledge Discovery in Databases (KDD).

Data Mining Architecture

A data mining architecture has several elements: the Data Warehouse, the Data Mining Engine, the Pattern Evaluation module, the User Interface, and the Knowledge Base.

Data Warehouse:

A data warehouse is a repository that stores information collected from multiple sources under a unified schema. Information stored in a data warehouse is critical to organizations for decision-making.

Data Mining Engine:

The data mining engine is the core component of the data mining process. It consists of modules that perform tasks such as clustering, classification, prediction, and correlation analysis.

Pattern Evaluation:

The pattern evaluation module is responsible for identifying interesting patterns with the help of the data mining engine.

User Interface:

The user interface provides communication between the user and the data mining system. It allows the user to work with the system easily even without detailed knowledge of its internals.

Knowledge Base:

The knowledge base consists of domain knowledge that is important to the data mining process. It provides input to the data mining engine and guides it in the search for patterns.

Data Mining Vs Data Warehouse: Key Differences

Data Mining: Data mining is the process of analyzing data to discover previously unknown patterns.
Data Warehouse: A data warehouse is a database system designed for analytical rather than transactional work.

Data Mining: Data mining is a method of examining large amounts of data to find the right patterns.
Data Warehouse: Data warehousing is a method of centralizing data from different sources into one common repository.

Data Mining: Data mining is usually done by business users with the assistance of engineers.
Data Warehouse: Data warehousing is a process that needs to occur before any data mining can take place.

Data Mining: Data mining is considered a process of extracting data from large data sets.
Data Warehouse: Data warehousing, on the other hand, is the process of pooling all relevant data together.

Data Mining: One of the most important benefits of data mining techniques is the detection and identification of errors in the system.
Data Warehouse: One of the pros of a data warehouse is its ability to be updated consistently, which makes it ideal for a business owner who wants the best and latest features.

Data Mining: Data mining helps to create suggestive patterns of important factors, such as the buying habits of customers, products, and sales, so that companies can make the necessary adjustments in operations and production.
Data Warehouse: A data warehouse adds extra value to operational business systems such as CRM systems when the warehouse is integrated.

Data Mining: Data mining techniques are never 100% accurate and may cause serious consequences in certain conditions.
Data Warehouse: In a data warehouse, there is a great chance that the data the organization requires for analysis may not be integrated into the warehouse, which can easily lead to loss of information.

Data Mining: The information gathered through data mining by organizations can be misused against a group of people.
Data Warehouse: Data warehouses are created as huge IT projects and involve high-maintenance systems, which can impact the revenue of small and medium-scale organizations.

Data Mining: After successful initial queries, users may ask more complicated queries, which increases the workload.
Data Warehouse: A data warehouse is complicated to implement and maintain.

Data Mining: Organizations can benefit from this analytical tool by equipping themselves with pertinent and usable knowledge-based information.
Data Warehouse: A data warehouse stores a large amount of historical data, which helps users analyze different time periods and trends to make future predictions.

Data Mining: Organizations need to spend a lot of resources for training and implementation. Moreover, data mining tools work in different manners due to the different algorithms employed in their design.
Data Warehouse: In a data warehouse, data is pooled from multiple sources. The data needs to be cleaned and transformed, which can be a challenge.

Data Mining: Data mining methods are cost-effective and efficient compared with other statistical data applications.
Data Warehouse: A data warehouse's responsibility is to simplify every type of business data. Most of the work on the user's part is inputting the raw data.

Data Mining: Another critical benefit of data mining techniques is the identification of errors that can lead to losses. The generated data could be used, for example, to detect a drop in sales.
Data Warehouse: A data warehouse allows users to access critical data from a number of sources in a single place, saving the user's time in retrieving data from multiple sources.

Why use Data mining?

Some of the most important reasons for using data mining are:

 Establish relevance and relationships amongst data, and use this information to generate profitable insights.
 Businesses can make informed decisions quickly.
 Helps to find unusual shopping patterns in grocery stores.
 Optimize website business by providing customized offers to each visitor.
 Helps to measure customers' response rates in business marketing.
 Create and maintain new customer groups for marketing purposes.
 Predict customer defections, i.e., which customers are more likely to switch to another supplier in the near future.
 Differentiate between profitable and unprofitable customers.
 Identify all kinds of suspicious behavior, as part of a fraud detection process.

What is Knowledge Discovery(KDD)?

Some people don’t differentiate data mining from knowledge discovery while others view
data mining as an essential step in the process of knowledge discovery. Here is the list of
steps involved in the knowledge discovery process −
 Data Cleaning − In this step, noise and inconsistent data are removed.
 Data Integration − In this step, multiple data sources are combined.
 Data Selection − In this step, data relevant to the analysis task are retrieved from the
database.
 Data Transformation − In this step, data is transformed or consolidated into forms
appropriate for mining by performing summary or aggregation operations.
 Data Mining − In this step, intelligent methods are applied in order to extract data
patterns.
 Pattern Evaluation − In this step, data patterns are evaluated.
 Knowledge Presentation − In this step, knowledge is represented.



The Knowledge Discovery in Databases (KDD) Process

The process of finding and interpreting patterns from data involves the repeated application
of the following steps:

Developing an understanding of:
the application domain, the relevant prior knowledge, and the goals of the end user.

Creating a target dataset:
Selecting a data set, or focusing on a subset of variables or data samples, on which discovery is to be performed.

Data cleaning and preprocessing:
Removal of noise or outliers; collecting the information necessary to model or account for noise; choosing strategies for handling missing data fields; accounting for time-sequence information and known changes.

Choosing the data mining task:
Deciding whether the goal of the KDD process is classification, regression, clustering, etc.

Choosing the data mining algorithm(s):
Selecting the method(s) to be used for searching for patterns in the data; deciding which models and parameters may be appropriate; matching a particular data mining method with the criteria of the KDD process.

Data mining:
Searching for patterns of interest in a particular representational form, such as classification rules or trees, regression models, or clusters.

Interpreting the mined patterns.

Consolidating the discovered knowledge.

Challenges in Data Mining:


Introduction
Though data mining is very powerful, it faces many challenges during its implementation. The challenges may relate to performance, to the data itself, or to the methods and techniques used. The data mining process succeeds only when these challenges are identified correctly and resolved properly.
Noisy and Incomplete Data
Data mining is the process of extracting information from large volumes of data. Real-world data are heterogeneous, incomplete, and noisy. Data in large quantities are often inaccurate or unreliable, whether because of errors in the instruments that measure the data or because of human error. Suppose a retail chain collects the email IDs of customers who spend more than $200 and the billing staff enter the details into their system. A staff member might make spelling mistakes while entering an email ID, which results in incorrect data, and some customers might not be willing to disclose their email ID at all, which results in incomplete data. The data could also be altered due to system or human errors. All of this produces noisy and incomplete data, which makes data mining challenging.
Distributed Data
Real world data is usually stored on different platforms in distributed computing
environments. It could be in databases, individual systems, or even on the Internet. It is
practically very difficult to bring all the data to a centralized data repository mainly due to
organizational and technical reasons. For example, different regional offices may have their own servers to store their data, and it may not be feasible to store all the data (millions of terabytes) from all the offices on one central server. Data mining therefore demands the development of tools and algorithms that enable mining of distributed data.
Complex Data
Real-world data is heterogeneous; it may include multimedia data such as images, audio, and video, as well as complex data, temporal data, spatial data, time series, and natural language text. It is difficult to handle these different kinds of data and extract the required information. Most of the time, new tools and methodologies have to be developed to extract the relevant information.
Performance
The performance of the data mining system mainly depends on the efficiency of algorithms
and techniques used. If the algorithms and techniques designed are not up to the mark, then it
will affect the performance of the data mining process adversely.
Incorporation of Background Knowledge
If background knowledge can be incorporated, more reliable and accurate data mining
solutions can be found. Descriptive tasks can come up with more useful findings and
predictive tasks can make more accurate predictions. But collecting and incorporating
background knowledge is a complex process.

Data Visualization
Data visualization is a very important part of data mining because it is the main process that presents the output to the user. The extracted information should convey exactly what it is intended to convey, but it is often difficult to represent the information in an accurate and easy-to-understand way for the end user. Because both the input data and the output information can be very complex, effective data visualization techniques need to be applied to make the presentation successful.
Data Privacy and Security
Data mining normally leads to serious issues in terms of data security, privacy and
governance. For example, when a retailer analyzes the purchase details, it reveals information
about buying habits and preferences of customers without their permission.

Data Mining Tasks/Data Mining Techniques:


Extracting important knowledge from a very large amount of data can be crucial to
organizations for the process of decision-making.

Some common data mining techniques are:

1 Association

2 Classification

3 Clustering

4 Sequential patterns

5 Decision tree.

1 Association Technique

The association technique helps to find patterns in huge amounts of data based on a relationship between two or more items of the same transaction. The association technique is used for market analysis; that is, it helps us analyze people's buying habits.

For example, you might identify that a customer always buys ice cream whenever he comes to watch a movie, so it is likely that when the customer comes to watch a movie again he will want to buy ice cream again.

2 Classification Technique

The classification technique is the most common data mining technique. In classification we use mathematical techniques such as decision trees, neural networks, and statistics to predict unknown records. This technique helps derive important information about the data.

Assume you have a set of records, where each record contains a set of attributes; based on these attributes you will be able to predict unseen or unknown records. For example, given all the records of employees who left the company, the classification technique lets you predict who will probably leave the company in a future period.

3 Clustering Technique

Clustering is one of the oldest techniques used in data mining. The main aim of the clustering technique is to form clusters (groups) from pieces of data that share common characteristics. Clustering helps identify the differences and similarities between the data.

Take the example of a shop in which many items are for sale; the challenge is how to arrange those items so that a customer can easily find the item he needs. Using the clustering technique, you can keep items that share some similarities in one corner and items that share a different set of similarities in another corner.

4 Sequential patterns

Sequential patterns are a useful method for identifying trends and similar patterns.

For example, if in customer data you identify that a customer buys a particular product at a particular time of year, you can use this information to suggest that product to the customer at that time of year.

5 Decision tree

The decision tree is one of the most commonly used data mining techniques because its model is easy for users to understand. In a decision tree you start with a simple question that has two or more answers. Each answer leads to a further question with two or more answers, which helps us reach a final decision. The root node of the decision tree is a simple question.

Take the example of a flood warning system.


Decision tree for the flood warning system: first check the water level. If the water level is above 50 ft, an alert is sent; if it is below 50 ft, check again: if the level is above 30 ft, a warning is sent, and if it is below 30 ft, the water is within the normal range.
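The same two-threshold logic can be sketched in a few lines of Python. This is only an illustrative sketch of the flood warning tree described above; the function name and the sample readings are made up for the example.

def flood_status(water_level_ft):
    """Toy decision tree for the flood warning example (50 ft and 30 ft thresholds from the text)."""
    if water_level_ft > 50:      # root question: is the water level above 50 ft?
        return "alert"
    elif water_level_ft > 30:    # next question: between 30 ft and 50 ft?
        return "warning"
    else:                        # 30 ft or below: nothing to report
        return "normal"

print(flood_status(55))  # alert
print(flood_status(40))  # warning
print(flood_status(12))  # normal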

Need for Pre-processing the Data


Today’s real-world databases are highly susceptible to noisy, missing, and inconsistent
data due to their typically huge size (often several gigabytes or more) and their likely origin
from multiple, heterogeneous sources. Low-quality data will lead to low-quality mining
results. “How can the data be preprocessed in order to help improve the quality of the data
and, consequently, of the mining results? How can the data be preprocessed so as to improve
the efficiency and ease of the mining process?”
There are several data preprocessing techniques. Data cleaning can be applied to remove
noise and correct inconsistencies in data. Data integration merges data from multiple sources
into a coherent data store such as a data warehouse. Data reduction can reduce data size by,
for instance, aggregating, eliminating redundant features, or clustering. Data transformations
(e.g., normalization) may be applied, where data are scaled to fall within a smaller range like
0.0 to 1.0. This can improve the accuracy and efficiency of mining algorithms involving
distance measurements. These techniques are not mutually exclusive; they may work
together. For example, data cleaning can involve transformations to correct wrong data, such
as by transforming all entries for a date field to a common format.

Data Summarization
Data summarization means condensing a large amount of data or a long piece of text into a short, informative summary. You process the data and, at the end, present the final result in summarized form. Data summarization is of great importance in data mining, since many programmers and developers now work with big data. It used to be difficult to produce such summaries, but there are now many tools on the market that can be used in programs or wherever else they are needed.

WHY DATA SUMMARIZATION?


Why do we need summarization of data in the mining process? We live in a digital world where data is transferred in seconds, far faster than any human can keep up with. In the corporate field, employees work with huge volumes of data derived from different sources such as social networks, media, newspapers, books, and cloud storage, and this can make it difficult to summarize the data. The data volume can also be unexpected: when you retrieve data from relational sources, you cannot predict how much data will be stored in the database.
As a result, the data becomes more complex and takes longer to summarize. One practical solution is to retrieve the data by category, in other words to apply filtering as you retrieve it, so that you only pull the type of data you want. Data summarization techniques then provide a good level of quality when summarizing the data, and a customer or user can benefit from the results in their research. Excel is a convenient tool for data summarization and is discussed briefly below.

DATA SUMMARIZATION IN EXCEL


You can summarize data in Excel easily and in little time. There are many ways to mine data in Excel, but here is a very simple formula for summarizing data. First, we need a table.

DATA SUMMARIZATION WITH SUMIF()

Consider a table named Register which contains six columns: Invoice Date, Customer, Type, Country, Amount, and Status, filled with random sample data. Suppose we want the total Amount of the 'Paid' and 'Future' rows in the table. Let's look at an example of the SUMIF formula.


In the output we used the SUMIF formula, which takes (Range, Criteria, Sum_Range) as its arguments.
ARGUMENTS
 Range: the range from which the string or number values to be tested are taken.
 Criteria: the particular criterion to match within that range.
 Sum_range: the range containing the values to be summed.
Once you have typed the formula, press the Enter key to get the result.


Here is what happened in that result: the formula took the "Status" column as the Range, matched the "Paid" string as the Criteria, and finally summed the "Amount" column over all rows whose status is "Paid". This is a basic data mining step; next we will discuss some more complex data.
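For readers who prefer to work outside Excel, a rough Python/pandas equivalent of the SUMIF step is sketched below. The Status and Amount column names follow the Register table described above, but the sample rows are invented purely for illustration.

import pandas as pd

# A tiny stand-in for the Register table from the text (the rows are invented).
register = pd.DataFrame({
    "Customer": ["A", "B", "C", "D"],
    "Amount":   [120,  80, 200,  50],
    "Status":   ["Paid", "Future", "Paid", "Future"],
})

# SUMIF(Range, Criteria, Sum_Range): sum Amount where Status equals "Paid".
paid_total = register.loc[register["Status"] == "Paid", "Amount"].sum()

# The same kind of summary for every status at once (a small grouped summarization).
totals_by_status = register.groupby("Status")["Amount"].sum()

print(paid_total)          # 320
print(totals_by_status)    # Future 130, Paid 320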

Data Cleaning:
Real-world data tend to be incomplete, noisy, and inconsistent. Data cleaning (or data
cleansing) routines attempt to fill in missing values, smooth out noise while identifying
outliers, and correct inconsistencies in the data.
a) Missing Values
Imagine that you need to analyze AllElectronics sales and customer data. You note that many
tuples have no recorded value for several attributes such as customer income. How can you
go about filling in the missing values for this attribute? Let’s look at the following methods.
1. Ignore the tuple: This is usually done when the class label is missing (assuming the
mining task involves classification). This method is not very effective, unless the tuple
contains several attributes with missing values. It is especially poor when the percentage of
missing values per attribute varies considerably. By ignoring the tuple, we do not make use of
the remaining attributes’ values in the tuple. Such data could have been useful to the task at
hand.
2. Fill in the missing value manually: In general, this approach is time consuming and may
not be feasible given a large data set with many missing values.

3. Use a global constant to fill in the missing value: Replace all missing attribute values by the same constant, such as a label like "Unknown" or −∞. If missing values are replaced by,
say, “Unknown,” then the mining program may mistakenly think that they form an interesting
concept, since they all have a value in common—that of “Unknown.” Hence, although this
method is simple, it is not foolproof.
4. Use a measure of central tendency for the attribute (e.g., the mean or median) to fill
in the missing value: Chapter 2 discussed measures of central tendency, which indicate the
“middle” value of a data distribution. For normal (symmetric) data distributions, the mean
can be used, while skewed data distribution should employ the median. For example, suppose
that the data distribution regarding the income of AllElectronics customers is symmetric and
that the mean income is $56,000. Use this value to replace the missing value for income.
5. Use the attribute mean or median for all samples belonging to the same class as the
given tuple: For example, if classifying customers according to credit risk, we may replace
the missing value with the mean income value for customers in the same credit risk category
as that of the given tuple. If the data distribution for a given class is skewed, the median value
is a better choice.
6. Use the most probable value to fill in the missing value: This may be determined with
regression, inference-based tools using a Bayesian formalism, or decision tree induction.
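A minimal Python/pandas sketch of strategies 3 to 5 above is shown below. The income and credit_risk columns and their values are hypothetical and are not taken from the AllElectronics data.

import pandas as pd

# Hypothetical customer tuples with some missing income values.
df = pd.DataFrame({
    "credit_risk": ["low", "low", "high", "high", "low"],
    "income":      [56000, None, 23000, None, 61000],
})

# Strategy 3: fill every missing value with the same global constant.
filled_constant = df["income"].fillna("Unknown")

# Strategy 4: fill with a measure of central tendency (the mean, for symmetric data).
filled_mean = df["income"].fillna(df["income"].mean())

# Strategy 5: fill with the mean income of tuples in the same class (credit_risk).
filled_by_class = df["income"].fillna(
    df.groupby("credit_risk")["income"].transform("mean"))

print(filled_mean)
print(filled_by_class)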

b) Noisy Data:

Noise is a random error or variance in a measured variable.

Binning: Binning methods smooth a sorted data value by consulting its “neighborhood,” that
is, the values around it. The sorted values are distributed into a number of “buckets,” or bins.
Because binning methods consult the neighborhood of values, they perform local smoothing.
Figure 3.2 illustrates some binning techniques. In this example, the data for price are first
sorted and then partitioned into equal-frequency bins of size 3 (i.e., each bin contains three
values). In smoothing by bin means, each value in a bin is replaced by the mean value of the
bin. For example, the mean of the values 4, 8, and 15 in Bin 1 is 9. Therefore, each original
value in this bin is replaced by the value 9.
Similarly, smoothing by bin medians can be employed, in which each bin value is replaced
by the bin median. In smoothing by bin boundaries, the minimum and maximum values in a
given bin are identified as the bin boundaries. Each bin value is then replaced by the closest
boundary value. In general, the larger the width, the greater the effect of the smoothing. Alternatively, bins may be equal width, where the interval range of values in each bin is constant. Binning is also used as a discretization technique.

Figure 3.2 Binning methods for data smoothing:

Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into (equal-frequency) bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
Smoothing by bin means:
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29
Smoothing by bin boundaries:
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34
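The smoothing results in Figure 3.2 can be reproduced with a short script. This is only a sketch of equal-frequency binning with the price values quoted above; a real implementation would also handle bins that do not divide the data evenly.

# Price data and bin size from Figure 3.2.
prices = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
bin_size = 3
bins = [prices[i:i + bin_size] for i in range(0, len(prices), bin_size)]

# Smoothing by bin means: every value in a bin becomes the mean of that bin.
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: every value becomes whichever of min(b), max(b) is closer.
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

print(by_means)   # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(by_bounds)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]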
Regression: Data smoothing can also be done by regression, a technique that conforms data values to a function. Linear regression involves finding the "best" line to fit two attributes (or variables) so that one attribute can be used to predict the other. Multiple linear regression is an extension in which more than two attributes are involved and the data are fit to a multidimensional surface.
c) Data Cleaning as a Process

Missing values, noise, and inconsistencies contribute to inaccurate data. So far, we have looked at techniques for handling missing data and for smoothing data. But data cleaning is a big job.
The first step in data cleaning as a process is discrepancy detection. Discrepancies can be
caused by several factors, including poorly designed data entry forms that have many
optional fields, human error in data entry, deliberate errors (e.g., respondents not wanting to
divulge information about themselves), and data decay (e.g., outdated addresses).
Discrepancies may also arise from inconsistent data representations and inconsistent use of
codes. Other sources of discrepancies include errors in instrumentation devices that record
data and system errors. Errors can also occur when the data are (inadequately) used for
purposes other than originally intended. There may also be inconsistencies due to data
integration (e.g., where a given attribute can have different names in different databases).

Data Integration:
Data mining often requires data integration—the merging of data from multiple data stores.
Careful integration can help reduce and avoid redundancies and inconsistencies in the
resulting data set. This can help improve the accuracy and speed of the subsequent data
mining process.
The semantic heterogeneity and structure of data pose great challenges in data integration.

1. Entity Identification Problem

It is likely that your data analysis task will involve data integration, which combines data
from multiple sources into a coherent data store, as in data warehousing. These sources may
include multiple databases, data cubes, or flat files.
There are a number of issues to consider during data integration. Schema integration and
object matching can be tricky. How can equivalent real-world entities from multiple data
sources be matched up? This is referred to as the entity identification problem.
For example, how can the data analyst or the computer be sure that customer id in one
database and cust number in another refer to the same attribute? Examples of metadata for
each attribute include the name, meaning, data type, and range of values permitted for the

attribute, and null rules for handling blank, zero, or null values (Section 3.2). Such metadata
can be used to help avoid errors in schema integration. The metadata may also be used to help
transform the data hence, this step also relates to data cleaning, as described earlier. When
matching attributes from one database to another during integration, special attention must be
paid to the structure of the data. This is to ensure that any attribute functional dependencies
and referential constraints in the source system match those in the target system. For
example, in one system, a discount may be applied to the order, whereas in another system it
is applied to each individual line item within the order. If this is not caught before integration,
items in the target system may be improperly discounted.
2. Data Value Conflict Detection and Resolution
Data integration also involves the detection and resolution of data value conflicts. For
example, for the same real-world entity, attribute values from different sources may differ.
This may be due to differences in representation, scaling, or encoding. For instance, a weight
attribute may be stored in metric units in one system and British imperial units in another. For
a hotel chain, the price of rooms in different cities may involve not only different currencies
but also different services (e.g., free breakfast) and taxes.
When exchanging information between schools, for example, each school may have its own
curriculum and grading scheme. One university may adopt a quarter system, offer three courses on database systems, and assign grades from A+ to F, whereas another may adopt a
semester system, offer two courses on databases, and assign grades from 1 to 10. It is difficult
to work out precise course-to-grade transformation rules between the two universities,
making information exchange difficult.

Data Transformation:
Data Transformation Strategies Overview
In data transformation, the data are transformed or consolidated into forms appropriate for
mining. Strategies for data transformation include the following:
1. Smoothing, which works to remove noise from the data. Techniques include binning,
regression, and clustering.
2. Attribute construction (or feature construction), where new attributes are constructed and
added from the given set of attributes to help the mining process.
3. Aggregation, where summary or aggregation operations are applied to the data. For
example, the daily sales data may be aggregated so as to compute monthly and annual total
amounts. This step is typically used in constructing a data cube for data analysis at multiple
abstraction levels.
4. Normalization, where the attribute data are scaled so as to fall within a smaller range,
such as −1.0 to 1.0, or 0.0 to 1.0.
5. Discretization, where the raw values of a numeric attribute (e.g., age) are replaced by
interval labels (e.g., 0–10, 11–20, etc.) or conceptual labels (e.g., youth, adult, senior). The
labels, in turn, can be recursively organized into higher-level concepts, resulting in a concept
hierarchy for the numeric attribute. Figure 3.12 shows a concept hierarchy for the attribute
price. More than one concept hierarchy can be defined for the same attribute to accommodate
the needs of various users.

6. Concept hierarchy generation for nominal data, where attributes such as street can be
generalized to higher-level concepts, like city or country. Many hierarchies for nominal
attributes are implicit within the database schema and can be automatically defined at the
schema definition level.
Data Transformation by Normalization
The measurement unit used can affect the data analysis. For example, changing measurement units from meters to inches for height, or from kilograms to pounds for weight, may lead to
very different results. In general, expressing an attribute in smaller units will lead to a larger
range for that attribute, and thus tend to give such an attribute greater effect or “weight.” To
help avoid dependence on the choice of measurement units, the data should be normalized or
standardized. This involves transforming the data to fall within a smaller or common range
such as [−1.0, 1.0] or [0.0, 1.0]. (The terms standardize and normalize are used interchangeably
in data preprocessing, although in statistics, the latter term also has other connotations.)
Normalizing the data attempts to give all attributes an equal weight. Normalization is
particularly useful for classification algorithms involving neural networks or distance
measurements such as nearest-neighbor classification and clustering. If using the neural
network backpropagation algorithm for classification mining, normalizing the input values
for each attribute measured in the training tuples will help speed up the learning phase. For
distance-based methods, normalization helps prevent attributes with initially large ranges
(e.g., income) from outweighing attributes with initially smaller ranges (e.g., binary
attributes). It is also useful when given no prior knowledge of the data.
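As a small sketch of min-max normalization into a common range such as [0.0, 1.0] or [−1.0, 1.0] (the income figures below are invented for illustration):

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Scale a list of numbers into [new_min, new_max] using min-max normalization."""
    old_min, old_max = min(values), max(values)
    span = old_max - old_min or 1.0          # avoid division by zero for constant attributes
    return [new_min + (v - old_min) / span * (new_max - new_min) for v in values]

incomes = [12000, 56000, 73600, 98000]        # hypothetical income values
print(min_max_normalize(incomes))             # values now lie within [0.0, 1.0]
print(min_max_normalize(incomes, -1.0, 1.0))  # or within [-1.0, 1.0]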

Data Reduction
Imagine that you have selected data from the AllElectronics data warehouse for analysis. The
data set will likely be huge! Complex data analysis and mining on huge amounts of data can
take a long time, making such analysis impractical or infeasible.
Data reduction techniques can be applied to obtain a reduced representation of the data set
that is much smaller in volume, yet closely maintains the integrity of the original data. That is,
mining on the reduced data set should be more efficient yet produce the same (or almost the
same) analytical results.
Overview of Data Reduction Strategies
Data reduction strategies include dimensionality reduction, numerosity reduction, and data
compression.
Dimensionality reduction is the process of reducing the number of random variables or
attributes under consideration. Dimensionality reduction methods include wavelet transforms
and principal components analysis, which transform or project the original data onto a
smaller space. Attribute subset selection is a method of dimensionality reduction in which
irrelevant, weakly relevant, or redundant attributes or dimensions are detected and removed
(Section 3.4.4).
Numerosity reduction techniques replace the original data volume by alternative, smaller
forms of data representation. These techniques may be parametric or nonparametric. For
parametric methods, a model is used to estimate the data, so that typically only the data
parameters need to be stored, instead of the actual data. (Outliers may also be stored.)
Regression and log-linear models (Section 3.4.5) are examples. Nonparametric methods for storing reduced representations of the data include histograms (Section 3.4.6), clustering (Section 3.4.7), sampling (Section 3.4.8), and data cube aggregation (Section 3.4.9).
In data compression, transformations are applied so as to obtain a reduced or “compressed”
representation of the original data. If the original data can be reconstructed from the
compressed data without any information loss, the data reduction is called lossless. If,
instead, we can reconstruct only an approximation of the original data, then the data reduction
is called lossy. There are several lossless algorithms for string compression; however, they
typically allow only limited data manipulation. Dimensionality reduction and numerosity
reduction techniques can also be considered forms of data compression.
Histograms
Histograms use binning to approximate data distributions and are a popular form of data
reduction. Histograms were introduced in Section 2.2.3. A histogram for an attribute, A,
partitions the data distribution of A into disjoint subsets, referred to as buckets or bins. If each
bucket represents only a single attribute–value/frequency pair, the buckets are called
singleton buckets. Often, buckets instead represent continuous ranges for the given attribute.
Example 3.3 Histograms. The following data are a list of AllElectronics prices for
commonly sold items (rounded to the nearest dollar). The numbers have been sorted: 1, 1, 5,
5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18,
18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30.
Figure 3.7 shows a histogram for the data using singleton buckets. To further reduce the data,
it is common to have each bucket denote a continuous value range for the given attribute. In
Figure 3.8, each bucket represents a different $10 range for price.

“How are the buckets determined and the attribute values partitioned?” There are several
partitioning rules, including the following:
Equal-width: In an equal-width histogram, the width of each bucket range is uniform (e.g.,
the width of $10 for the buckets in Figure 3.8).

Equal-frequency (or equal-depth): In an equal-frequency histogram, the buckets are created
so that, roughly, the frequency of each bucket is constant (i.e., each bucket contains roughly
the same number of contiguous data samples).
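A short Python sketch of the two partitioning rules, using the price list from Example 3.3; the choice of a $10 width and of four equal-frequency buckets mirrors the discussion above.

# Prices from Example 3.3 (already sorted).
prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15,
          15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20,
          21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30]

# Equal-width buckets: each bucket covers a fixed $10 range, as in Figure 3.8.
width = 10
equal_width = {}
for p in prices:
    lo = (p // width) * width                                # bucket start: 0, 10, 20, 30, ...
    equal_width[(lo, lo + width - 1)] = equal_width.get((lo, lo + width - 1), 0) + 1

# Equal-frequency (equal-depth) buckets: each bucket holds roughly the same count.
n_buckets = 4
size = len(prices) // n_buckets
equal_freq = [prices[i * size:(i + 1) * size] for i in range(n_buckets)]

print(equal_width)                                  # frequency per $10 range
print([(b[0], b[-1], len(b)) for b in equal_freq])  # (min, max, count) of each bucket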

Data Discretization and Concept Hierarchy Generation


Data discretization techniques can be used to divide the range of a continuous attribute into intervals. Numerous continuous attribute values are then replaced by a small number of interval labels.

This leads to a concise, easy-to-use, knowledge-level representation of mining results.

Top-down discretization

If the process starts by first finding one or a few points (called split points or cut points) to
split the entire attribute range, and then repeats this recursively on the resulting intervals, then
it is called top-down discretization or splitting.

Bottom-up discretization

If the process starts by considering all of the continuous values as potential split points and then removes some by merging neighboring values to form intervals, it is called bottom-up discretization or merging.

Discretization can be performed rapidly on an attribute to provide a hierarchical partitioning of the attribute values, known as a concept hierarchy.

Concept hierarchies

Concept hierarchies can be used to reduce the data by collecting and replacing low-level
concepts with higher-level concepts.

In the multidimensional model, data are organized into multiple dimensions, and each
dimension contains multiple levels of abstraction defined by concept hierarchies. This
organization provides users with the flexibility to view data from different perspectives.

Data mining on a reduced data set means fewer input/output operations and is more efficient
than mining on a larger data set.

Because of these benefits, discretization techniques and concept hierarchies are typically
applied before data mining, rather than during mining.

Discretization and Concept Hierarchy Generation for Numerical Data

Typical methods

1 Binning

Binning is a top-down splitting technique based on a specified number of bins. Binning is an unsupervised discretization technique.

2 Histogram Analysis

Because histogram analysis does not use class information, it is an unsupervised discretization technique. Histograms partition the values of an attribute into disjoint ranges called buckets.

3 Cluster Analysis

Cluster analysis is a popular data discretization method. A clustering algorithm can be applied to discretize a numerical attribute A by partitioning the values of A into clusters or groups.

Each initial cluster or partition may be further decomposed into several subclusters, forming a lower level of the hierarchy.

Discretization by Cluster, Decision Tree, and Correlation Analyses


Clustering, decision tree analysis, and correlation analysis can be used for data discretization.
We briefly study each of these approaches.
Cluster analysis is a popular data discretization method. A clustering algorithm can be
applied to discretize a numeric attribute, A, by partitioning the values of A into clusters or
groups. Clustering takes the distribution of A into consideration, as well as the closeness of
data points, and therefore is able to produce high-quality discretization results.
Clustering can be used to generate a concept hierarchy for A by following either a top-down
splitting strategy or a bottom-up merging strategy, where each cluster forms a node of the
concept hierarchy. In the former, each initial cluster or partition may be further decomposed
into several subclusters, forming a lower level of the hierarchy.
In the latter, clusters are formed by repeatedly grouping neighboring clusters in order to form
higher-level concepts.
Concept Hierarchy Generation for Nominal Data

We now look at data transformation for nominal data. In particular, we study concept
hierarchy generation for nominal attributes. Nominal attributes have a finite (but possibly
large) number of distinct values, with no ordering among the values. Examples include
geographic location, job category, and item type.
Manual definition of concept hierarchies can be a tedious and time-consuming task for a user
or a domain expert. Fortunately, many hierarchies are implicit within the database schema
and can be automatically defined at the schema definition level. The concept hierarchies can
be used to transform the data into multiple levels of granularity.
For example, data mining patterns regarding sales may be found relating to specific regions
or countries, in addition to individual branch locations.
We study four methods for the generation of concept hierarchies for nominal data, as follows.

1. Specification of a partial ordering of attributes explicitly at the schema level by users
or experts: Concept hierarchies for nominal attributes or dimensions typically involve a
group of attributes. A user or expert can easily define a concept hierarchy by specifying a
partial or total ordering of the attributes at the schema level. For example, suppose that a
relational database contains the following group of attributes:
street, city, province or state, and country. Similarly, a data warehouse location dimension
may contain the same attributes. A hierarchy can be defined by specifying the total ordering
among these attributes at the schema level such as street < city < province or state < country.
2. Specification of a portion of a hierarchy by explicit data grouping: This is essentially
the manual definition of a portion of a concept hierarchy. In a large database, it is unrealistic
to define an entire concept hierarchy by explicit value enumeration. On the contrary, we can
easily specify explicit groupings for a small portion of intermediate-level data. For example,
after specifying that province and country form a hierarchy at the schema level, a user could
define some intermediate levels manually, such as "{Alberta, Saskatchewan, Manitoba} ⊂ prairies_Canada" and "{British Columbia, prairies_Canada} ⊂ Western_Canada."
3. Specification of a set of attributes, but not of their partial ordering: A user may specify
a set of attributes forming a concept hierarchy, but omit to explicitly state their partial
ordering. The system can then try to automatically generate the attribute ordering so as to
construct a meaningful concept hierarchy.
“Without knowledge of data semantics, how can a hierarchical ordering for an arbitrary set
of nominal attributes be found?” Consider the observation that since higher-level concepts
generally cover several subordinate lower-level concepts, an attribute defining a high concept
level (e.g., country) will usually contain a smaller number of distinct values than an attribute
defining a lower concept level (e.g., street). Based on this observation, a concept hierarchy
can be automatically generated based on the number of distinct values per attribute in the
given attribute set.
The attribute with the most distinct values is placed at the lowest hierarchy level. The lower
the number of distinct values an attribute has, the higher it is in the generated concept
hierarchy. This heuristic rule works well in many cases. Some local-level swapping or
adjustments may be applied by users or experts, when necessary, after examination of the
generated hierarchy.
Let’s examine an example of this third method.

Example 3.7 Concept hierarchy generation based on the number of distinct values per
attribute.
Suppose a user selects a set of location-oriented attributes (street, country, province or state, and city) from the AllElectronics database, but does not specify the hierarchical ordering among the attributes.
A concept hierarchy for location can be generated automatically, as illustrated in Figure 3.13. First, sort the attributes in ascending order based on the number of distinct values in each attribute. This results in the following (where the number of distinct values per attribute is shown in parentheses): country (15), province or state (365), city (3567), and street (674,339). Second, generate the hierarchy from the top down according to the sorted order, with the first attribute at the top level and the last attribute at the bottom level. Finally, the user can examine the generated hierarchy, and when necessary, modify it to reflect desired semantic relationships among the attributes. In this example, it is obvious that there is no need to modify the generated hierarchy.
Note that this heuristic rule is not foolproof. For example, a time dimension in a database
may contain 20 distinct years, 12 distinct months, and 7 distinct days of the week. However,
this does not suggest that the time hierarchy should be "year < month < days_of_the_week,"
with days of the week at the top of the hierarchy.
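The distinct-value heuristic of this third method can be sketched in a few lines; the counts below are the ones quoted in Example 3.7, and the attribute names are written with underscores only for convenience.

# Distinct-value counts per attribute, as given in Example 3.7.
distinct_counts = {
    "country": 15,
    "province_or_state": 365,
    "city": 3567,
    "street": 674339,
}

# Heuristic: the fewer distinct values an attribute has, the higher it sits in the hierarchy,
# so sorting by count in ascending order puts the top level first.
top_down = sorted(distinct_counts, key=distinct_counts.get)

print(" < ".join(reversed(top_down)))
# street < city < province_or_state < country   (lowest level on the left, top level on the right)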
4. Specification of only a partial set of attributes: Sometimes a user can be careless when
defining a hierarchy, or have only a vague idea about what should be included in a hierarchy.
Consequently, the user may have included only a small subset of the relevant attributes in the hierarchy specification. For example, instead of including all of the
hierarchically relevant attributes for location, the user may have specified only street and
city. To handle such partially specified hierarchies, it is important to embed data semantics in
the database schema so that attributes with tight semantic connections can be pinned together.
In this way, the specification of one attribute may trigger a whole group of semantically
tightly linked attributes to be "dragged in" to form a complete hierarchy. Users, however,
should have the option to override this feature, as necessary.
Example 3.8 Concept hierarchy generation using prespecified semantic connections.
Suppose that a data mining expert (serving as an administrator) has pinned together the five
attributes number, street, city, province or state, and country, because they are closely linked
semantically regarding the notion of location. If a user were to specify only the attribute city
for a hierarchy defining location, the system can automatically drag in all five semantically
related attributes to form a hierarchy. The user may choose to drop any of these attributes
(e.g., number and street) from the hierarchy, keeping city as the lowest conceptual level.
In summary, information at the schema level and on attribute–value counts can be used to
generate concept hierarchies for nominal data. Transforming nominal data with the use of
concept hierarchies allows higher-level knowledge patterns to be found. It allows mining at
multiple levels of abstraction, which is a common requirement for data mining applications.

Association Rule Mining

History

While the concepts behind association rules can be traced back earlier, association rule
mining was defined in the 1990s, when computer scientists Rakesh Agrawal, Tomasz
Imieliński and Arun Swami developed an algorithm-based way to find relationships between
items using point-of-sale (POS) systems. Applying the algorithms to supermarkets, the
scientists were able to discover links between different items purchased, called association
rules, and ultimately use that information to predict the likelihood of different products being
purchased together.

Association Rules (in Data Mining)


Association rules are if-then statements that help to show the probability of relationships
between data items within large data sets in various types of databases. Association rule
mining has a number of applications and is widely used to help discover sales correlations
in transactional data or in medical data sets.

How association rules work

Association rule mining, at a basic level, involves the use of machine learning models to
analyze data for patterns, or co-occurrence, in a database. It identifies frequent if-then
associations, which are called association rules.

An association rule has two parts: an antecedent (if) and a consequent (then). An antecedent
is an item found within the data. A consequent is an item found in combination with the
antecedent.

Association rules are created by searching data for frequent if-then patterns and using the
criteria support and confidence to identify the most important relationships. Support is an
indication of how frequently the items appear in the data. Confidence indicates how often the if-then statement has been found to be true. A third metric, called lift, can be used to compare the observed confidence with the expected confidence.

Association rules are calculated from itemsets, which are made up of two or more items. If rules were built from all possible itemsets, there could be so many rules that they would hold little meaning. For that reason, association rules are typically created only from itemsets that are well represented in the data.
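A minimal sketch of how support, confidence, and lift might be computed for one if-then rule; the five transactions and the rule {bread} -> {butter} are invented purely to illustrate the three measures.

# Toy transactions (invented) used to score the rule {bread} -> {butter}.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]

antecedent, consequent = {"bread"}, {"butter"}
n = len(transactions)

both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
ante = sum(1 for t in transactions if antecedent <= t)
cons = sum(1 for t in transactions if consequent <= t)

support = both / n                # how frequently the items appear together in the data
confidence = both / ante          # how often the if-then statement is found to be true
lift = confidence / (cons / n)    # confidence compared with the expected confidence

print(support, confidence, lift)  # 0.6 0.75 0.9375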


Uses of association rules in data mining

In data mining, association rules are useful for analyzing and predicting customer behavior.
They play an important part in customer analytics, market basket analysis, product clustering,
catalog design and store layout.

Programmers use association rules to build programs capable of machine learning. Machine
learning is a type of artificial intelligence (AI) that seeks to build programs with the ability to
become more efficient without being explicitly programmed.

Binarization

What is Binarization?

Binarization is the process of transforming data features of any entity into vectors of binary
numbers to make classifier algorithms more efficient. In a simple example, transforming an
image’s gray-scale from the 0-255 spectrum to a 0-1 spectrum is binarization.

How is Binarization used?

In machine learning, even the most complex concepts can be transformed into binary form. For example, to binarize the sentence "The dog ate the cat," every word is assigned an ID (for example dog-1, ate-2, the-3, cat-4). Each word is then replaced by its ID, giving the vector <3,1,2,3,4>. This can be refined by giving each word four possible slots and setting the slot that corresponds to the specific word: <0,0,1,0, 1,0,0,0, 0,1,0,0, 0,0,1,0, 0,0,0,1>. This is commonly referred to as the bag-of-words method.
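The same example can be sketched in Python. The word IDs follow the text (dog-1, ate-2, the-3, cat-4); the variable names are arbitrary.

# Vocabulary IDs from the example in the text: dog-1, ate-2, the-3, cat-4.
vocab = {"dog": 1, "ate": 2, "the": 3, "cat": 4}
sentence = "the dog ate the cat".split()

# Step 1: replace each word with its ID.
ids = [vocab[w] for w in sentence]      # [3, 1, 2, 3, 4]

# Step 2: give each word four possible slots and set the slot matching its ID to 1.
binary = []
for i in ids:
    slots = [0, 0, 0, 0]
    slots[i - 1] = 1
    binary.extend(slots)

print(ids)
print(binary)   # the binarized (one-hot) representation of the sentence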

Since the ultimate goal is to make this data easier for the classifier to read while minimizing memory usage, it is not always necessary to encode the whole sentence or all the details of a complex concept. In such cases, only the current state of how the data is being parsed is needed by the classifier, for example when the top word on the stack is used as the first word in the input queue. Since order is quite important, a simpler binary vector is preferable.

Measures of Similarity and Dissimilarity
Distance or similarity measures are essential for solving many pattern recognition problems such as classification and clustering. Various distance and similarity measures are available in the literature for comparing two data distributions. As the names suggest, a similarity measure quantifies how alike two distributions are. For multivariate data, more elaborate summary methods have been developed to answer this question.

Similarity Measure
 Numerical measure of how alike two data objects are.
 Often falls between 0 (no similarity) and 1 (complete similarity).

Dissimilarity Measure
 Numerical measure of how different two data objects are.
 Ranges from 0 (objects are alike) to ∞ (objects are different).
Proximity refers to a similarity or dissimilarity.
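As a brief sketch, the snippet below computes one common dissimilarity measure (Euclidean distance) and one common similarity measure (cosine similarity) for two small numeric data objects; the two vectors are arbitrary examples.

import math

x = [1.0, 3.0, 5.0]     # two arbitrary numeric data objects
y = [2.0, 1.0, 4.0]

# Dissimilarity: Euclidean distance, 0 when the objects are identical, growing without bound.
distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Similarity: cosine similarity, close to 1 when the objects point in nearly the same direction.
dot = sum(a * b for a, b in zip(x, y))
cosine = dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

print(round(distance, 3), round(cosine, 3))   # about 2.449 and 0.922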
