r20 DWDM Unit 2 PART 2



C) Detection and Resolution of Data Value Conflicts:


A third important issue in data integration is the detection and resolution of data value conflicts.
For the same real-world entity, attribute values from different sources may differ.
This may be due to differences in representation.
For instance, a weight attribute may be stored in metric units in one system and in British
imperial units in another.

For a hotel chain, the price of rooms in different cities may involve not only different currencies but
also different services (such as free breakfast) and taxes. An attribute in one system may also be
recorded at a lower level of abstraction than the same attribute in another.
Careful integration of data from multiple sources can help to reduce and avoid redundancies
and inconsistencies in the resulting data set. This in turn improves the accuracy and speed of
the subsequent mining process.
2.4. Data Reduction:
Data reduction obtains a reduced representation of the data set that is much smaller in volume,
yet produces the same (or almost the same) analytical results.
Why data reduction?
A database or data warehouse may store terabytes of data, so complex data analysis may take a
very long time to run on the complete data set.

Data reduction techniques:
1. Data cube aggregation
2. Attribute subset selection
3. Numerosity reduction
4. Dimensionality reduction

Data cube aggregation:


It is a process in which information is gathered and expressed in summary form.
For example, if the analysis involves annual sales rather than quarterly figures, the quarterly
data can be aggregated so that the resulting data summarizes the total sales per year, as in the
tables below.

YEAR 2020            YEAR 2021
Quarter  Sales       Quarter  Sales
Q1       200         Q1       150
Q2       300         Q2       250
Q3       250         Q3       300
Q4       350         Q4       250

After aggregation:
Year   Sales
2020   1100
2021   950
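As a quick illustration, here is a minimal sketch of this aggregation using pandas (the column names are our illustrative choice):

```python
import pandas as pd

# Quarterly sales from the tables above
sales = pd.DataFrame({
    "year":    [2020] * 4 + [2021] * 4,
    "quarter": ["Q1", "Q2", "Q3", "Q4"] * 2,
    "sales":   [200, 300, 250, 350, 150, 250, 300, 250],
})

# Roll the quarterly figures up to yearly totals
yearly = sales.groupby("year", as_index=False)["sales"].sum()
print(yearly)
#    year  sales
# 0  2020   1100
# 1  2021    950
```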
Attribute Subset Selection:
Attribute subset selection is a technique used for data reduction in the data mining process.
Data reduction reduces the size of the data so that it can be used for analysis purposes more
efficiently.

Need of Attribute Subset Selection:


The data set may have a large number of attributes, but some of those attributes can be irrelevant
or redundant.

The goal of attribute subset selection is to find a minimum set of attributes such that dropping the
irrelevant ones does not greatly affect the utility of the data, while the cost of data analysis is
reduced.
Mining on a reduced data set also makes the discovered patterns easier to understand.

For example,
if the task is to classify customers as to whether or not they are likely to purchase
a popular new CD at AllElectronics when notified of a sale, attributes such as the customer's
telephone number are likely to be irrelevant, unlike attributes such as age or music taste.

Methods of Attribute Subset Selection:


1. Stepwise Forward Selection.
2. Stepwise Backward Elimination.
3. Combination of Forward Selection and Backward Elimination.
4. Decision Tree Induction

Note:
The "best" (and "worst") attributes are typically determined using:
– tests of statistical significance, which assume that the attributes are independent of one another
– the information gain measure used in building decision trees for classification

1. Stepwise Forward Selection:


– The procedure starts with an empty set of attributes as the reduced set.
– First, the best single attribute is picked.
– At each subsequent step, the best of the remaining original attributes is added to the set.

Initial attribute set: {A1, A2, A3, A4, A5, A6}
Initial reduced set: { }

=> {A1}
=> {A1, A3}
=> {A1, A3, A6}

Final reduced set: {A1, A3, A6}
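To make the procedure concrete, here is a minimal sketch of greedy forward selection using scikit-learn's SequentialFeatureSelector on a synthetic data set (the estimator and parameters are our illustrative choices):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 6 attributes, only some of them informative
X, y = make_classification(n_samples=200, n_features=6,
                           n_informative=3, random_state=0)

# Greedy forward selection: start from an empty set and, at each
# step, add the attribute that most improves cross-validated accuracy
selector = SequentialFeatureSelector(
    DecisionTreeClassifier(random_state=0),
    n_features_to_select=3,
    direction="forward",
)
selector.fit(X, y)
print("Selected attribute indices:", selector.get_support(indices=True))
```

Passing direction="backward" instead starts from the full set and removes the worst attribute at each step, i.e., the stepwise backward elimination described next.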


2. Stepwise Backward Elimination:
– The procedure starts with the full set of attributes.
– At each step, it removes the worst attribute remaining in the set.

Initial attribute set: {A1, A2, A3, A4, A5, A6}

=> {A1, A3, A4, A5, A6}
=> {A1, A3, A5, A6}
=> {A1, A3, A6}

Final reduced set: {A1, A3, A6}

3. Combination of Forward Selection and Backward Elimination.


– The stepwise forward selection and backward elimination methods can be combined.
– At each step, the procedure selects the best attribute and removes the worst from among the
remaining attributes.

4. Decision Tree Induction


This approach uses a decision tree for attribute selection.

It constructs a flowchart-like structure in which each internal node denotes a test on an attribute,
each branch corresponds to an outcome of the test, and each leaf node represents a class prediction.
Attributes that do not appear in the tree are considered irrelevant and are discarded.

Numerosity reduction:
In numerosity reduction, the data volume is decreased by replacing the data with an alternative,
smaller form of data representation. These techniques can be parametric or non-parametric.

Parametric:
Only the model parameters (and possibly the outliers) are stored instead of the actual data.
Regression:

Regression can be simple linear regression or multiple linear regression. When there is
only a single independent attribute, the model is called simple linear regression; if there
are multiple independent attributes, it is called multiple linear regression.
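A minimal sketch of parametric reduction via simple linear regression, using NumPy (the data here is synthetic):

```python
import numpy as np

# Hypothetical attribute pair: y is roughly linear in x
rng = np.random.default_rng(0)
x = np.arange(100, dtype=float)
y = 3.0 * x + 5.0 + rng.normal(scale=2.0, size=100)

# Parametric reduction: keep only two model parameters
# (slope, intercept) instead of the 100 raw (x, y) pairs
slope, intercept = np.polyfit(x, y, deg=1)
print(f"y ~ {slope:.2f} * x + {intercept:.2f}")

# Any original value can now be approximated from the parameters
y_approx = slope * x + intercept
```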
Log-Linear Models:

A log-linear model can be used to estimate the probability of each data point in a
multidimensional space for a set of attributes.

Non-parametric:
The data is stored in a reduced form such as histograms, clusters, or samples.
Histogram:
A histogram represents the data in terms of frequency. It uses binning to approximate the data
distribution and is a popular form of data reduction.
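A minimal sketch of equal-width binning with NumPy (the values are hypothetical):

```python
import numpy as np

# Hypothetical attribute values reduced to a 4-bucket histogram
values = np.array([1, 1, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12,
                   14, 14, 15, 15, 15, 15, 18, 18, 20, 20])
counts, edges = np.histogram(values, bins=4)

# Only the bucket boundaries and counts are stored, not the raw values
for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"{lo:.2f}-{hi:.2f}: {c} values")
```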

Clustering:
Clustering divides the data into groups (clusters). In data reduction, the cluster representation
of the data (for example, the cluster centroids) is used to replace the actual data. It also helps
to detect outliers in the data.
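A minimal sketch with scikit-learn's KMeans, replacing many records with a few cluster centroids (the data is synthetic):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
points = rng.normal(size=(300, 2))  # stand-in for 300 two-attribute records

# Replace the 300 records with 5 cluster centroids
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(points)
print(kmeans.cluster_centers_)  # the reduced representation
print(kmeans.labels_[:10])      # which cluster each record maps to
```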
Sampling:
Sampling can be used for data reduction because it allows a large data set to be represented by a
much smaller random data sample (or subset).
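A minimal sketch of simple random sampling without replacement, using NumPy (the sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
data = np.arange(1_000_000)  # stand-in for a large data set

# A 1% simple random sample without replacement represents the whole set
sample = rng.choice(data, size=10_000, replace=False)
print(f"{sample.size} sampled records represent {data.size} records")
```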

Dimensionality Reduction:

In dimensionality reduction, data encoding or transformations are applied so as to
obtain a reduced or "compressed" representation of the original data.

Types:
Lossless: If the original data can be reconstructed from the compressed data without any loss of
information, the data reduction is called lossless.
Lossy: If the original data can be reconstructed from the compressed data only with some loss of
information, the data reduction is called lossy.

Effective methods for lossy dimensionality reduction:

a) Wavelet transforms

b) Principal components analysis.


2.5. Data Transformation and Discretization

Data transformation is the process of converting data from one format or structure
into another format or structure.
In data transformation, the data are transformed or consolidated into forms appropriate for mining.
Strategies for data transformation include the following:

1. Smoothing, which works to remove noise from the data. Techniques include
binning, regression, and clustering.
2. Attribute construction (or feature construction), where new attributes are
constructed and added from the given set of attributes to help the mining process.
3. Aggregation, where summary or aggregation operations are applied to the data.
For example, the daily sales data may be aggregated so as to compute monthly and annual total
amounts. This step is typically used in constructing a data cube for data analysis at multiple
abstraction levels.
4. Normalization, where the attribute data are scaled so as to fall within a smaller range, such as
-1.0 to 1.0, or 0.0 to 1.0.
5. Discretization, where the raw values of a numeric attribute (e.g., age) are replaced by interval
labels (e.g., 0–10, 11–20, etc.) or conceptual labels (e.g., youth, adult, senior). The labels, in turn,
can be recursively organized into higher-level concepts, resulting in a concept hierarchy for the
numeric attribute.
6. Concept hierarchy generation for nominal data, where attributes such as street can be
generalized to higher-level concepts, like city or country. Many hierarchies for nominal
attributes are implicit within the database schema and can be automatically defined at the schema
definition level.

2.5.1 Data Transformation by Normalization:


The attribute data are scaled to fall within a smaller range, such as

-1.0 to 1.0
0 to 1
Why normalization?

 The measurement unit used can affect the data analysis. To help avoid
dependence on the choice of measurement units, the data should
be normalized or standardized.
 Normalizing the data attempts to give all attributes an equal weight.

Methods for Normalization:
1. Min-Max Normalization
2. Z-Score Normalization
3. Decimal Scaling

Let A be a numeric attribute with n observed values v1, v2, v3, ..., vn.


1. Min-Max Normalization:
It performs a linear transformation of the original data. Suppose minA
and maxA are the minimum and maximum values of attribute A. Min-max normalization
maps a value v of A to v' in the range [new_minA, new_maxA] by computing:

v' = ((v - minA) / (maxA - minA)) * (new_maxA - new_minA) + new_minA

Here,

v = value to be normalized
minA = minimum value in the given data set
maxA = maximum value in the given data set
new_minA = start of the resulting range (e.g., 0 or -1)
new_maxA = end of the resulting range (e.g., 1)

EXAMPLE:

Apply min-max normalization to the following list of marks, mapping them to the range 0 to 1:

marks: 8, 10, 15, 20

Here, minA = 8, maxA = 20, new_minA = 0, new_maxA = 1.

marks    marks after min-max normalization
8        0
10       0.17
15       0.58
20       1
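The same computation as a minimal Python sketch (the function name is ours):

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into [new_min, new_max]."""
    v_min, v_max = min(values), max(values)
    return [(v - v_min) / (v_max - v_min) * (new_max - new_min) + new_min
            for v in values]

marks = [8, 10, 15, 20]
print([round(v, 2) for v in min_max_normalize(marks)])
# [0.0, 0.17, 0.58, 1.0]
```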
2. Z-Score Normalization:
In z-score (zero-mean) normalization, the values of attribute A are normalized based on the
mean and standard deviation of A:

v' = (v - meanA) / σA

where meanA is the mean and σA is the standard deviation of A. Each value is expressed as
the number of standard deviations it lies above or below the mean, which makes attributes
with different scales directly comparable.

EXAMPLE: Calculate the z-scores of the following marks: 8, 10, 15, 20.

Here, meanA = 13.25 and σA ≈ 4.66.

marks    marks after z-score normalization
8        -1.13
10       -0.70
15       0.38
20       1.45
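The same computation as a minimal Python sketch, using the population standard deviation as in the worked example:

```python
import statistics

def z_score_normalize(values):
    """Express each value as standard deviations from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / std for v in values]

marks = [8, 10, 15, 20]
print([round(z, 2) for z in z_score_normalize(marks)])
# [-1.13, -0.7, 0.38, 1.45]
```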

3. Decimal Scaling:
Decimal scaling normalizes by moving the decimal point of the values of A:

v' = v / 10^j

where j is the smallest integer such that max(|v'|) < 1.
Suppose that the recorded values of A range from -845 to 945. The maximum absolute value
is 945, so each value is divided by 1000 (j = 3). So that:

-845 normalizes to -0.845
945 normalizes to 0.945
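A minimal Python sketch of decimal scaling (the function name is ours):

```python
import math

def decimal_scale(values):
    """Normalize by moving the decimal point: v' = v / 10^j,
    where j is the smallest integer making every |v'| < 1."""
    largest = max(abs(v) for v in values)
    j = math.floor(math.log10(largest)) + 1
    return [v / 10 ** j for v in values]

print(decimal_scale([-845, 945]))  # [-0.845, 0.945]
```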
2.5.2 Data Discretization:
Data discretization refers to a method of converting a large number of continuous data values into a
smaller number of intervals or labels, so that the evaluation and management of the data become easier.

Now, we can understand this concept with the help of an example.

Suppose we have an attribute Age with the values:

1, 5, 9, 4, 7, 11, 14, 17, 13, 18, 19, 31, 33, 36, 42, 44, 46, 70, 74, 78, 77

After discretization:

Age values                  Label
1, 5, 4, 9, 7               Child
11, 14, 17, 13, 18, 19      Young
31, 33, 36, 42, 44, 46      Mature
70, 74, 77, 78              Old
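A minimal sketch of this discretization with pandas (the bin edges are our illustrative choice):

```python
import pandas as pd

ages = pd.Series([1, 5, 9, 4, 7, 11, 14, 17, 13, 18, 19,
                  31, 33, 36, 42, 44, 46, 70, 74, 78, 77])

# Replace raw ages with conceptual labels
labels = pd.cut(ages, bins=[0, 10, 30, 60, 120],
                labels=["Child", "Young", "Mature", "Old"])
print(labels.value_counts())
# Child 5, Young 6, Mature 6, Old 4
```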

Techniques of data discretization:


Histogram analysis: a histogram is a plot used to represent the underlying frequency distribution of a
continuous data set; the resulting buckets can serve as discretization intervals.

Binning: a data smoothing technique that groups a large number of continuous values into a smaller
number of bins.

Data discretization using decision tree analysis: a top-down splitting technique. For a numeric
attribute, the split point that gives the least entropy (i.e., that best separates the classes) is
selected, and the process is applied recursively, dividing the values into disjoint discretized
intervals from top to bottom using the same splitting criterion.

Data discretization using correlation analysis: a bottom-up technique in which the most similar
neighboring intervals are identified and then merged recursively into larger intervals to form the
final set of intervals.

Data discretization and concept hierarchy generation:

The term hierarchy represents an organizational structure or mapping in which items are ranked according to their levels of
importance.

Let's understand this concept hierarchy for the dimension location with the help of an example.

A particular city can be mapped to the country to which it belongs. For example, New Delhi can be
mapped to India, and India can be mapped to Asia.
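A minimal sketch of such a location hierarchy as a simple mapping (the entries are illustrative):

```python
# Each value maps to its more general concept (city -> country -> continent)
hierarchy = {"New Delhi": "India", "India": "Asia"}

def generalize(value, levels=1):
    """Walk up the concept hierarchy the requested number of levels."""
    for _ in range(levels):
        value = hierarchy.get(value, value)
    return value

print(generalize("New Delhi"))            # India
print(generalize("New Delhi", levels=2))  # Asia
```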

Top-down mapping:

Top-down mapping generally starts at the top with general information and ends at the bottom with
the specialized information.

Bottom-up mapping:

Bottom-up mapping generally starts at the bottom with specialized information and ends at the top
with the generalized information.
