2 - Data Mining and Warehousing - L2
===========================================================
Page 1 of 6 Dr. Maalim Aljabery_L2
Data Mining and Warehousing 10/2023
===========================================================
Multimedia data:
Any type of information medium that can be represented, processed, stored, and
transmitted over a network in digital form. Examples include multilingual text,
numeric, image, video, audio, graphical, temporal, relational, and categorical data.
Relation with conventional data mining terms.
Spatial data:
Generalizing detailed geographic points into clustered regions, such as
business, residential, industrial, or agricultural areas according to land usage,
requires merging a set of geographic areas by spatial operations.
Image data:
Features are extracted by aggregation and/or approximation: the size, color,
shape, texture, orientation, and relative positions and structures of the objects
or regions contained in the image.
===========================================================
Example:
Given an attribute, say price, how can we smooth the data to remove noise?
Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into (equal frequency) bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
Smoothing by bin means:
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29
Smoothing by bin boundaries:
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34
In this example, the data for price are first sorted and then partitioned into
equal-frequency bins of size 3 (i.e., each bin contains three values).
In smoothing by bin means, each value in a bin is replaced by the mean value
of the bin. For example, the mean of the values 4, 8, and 15 in Bin 1 is 9. Therefore,
each original value in this bin is replaced by 9.
Similarly, smoothing by bin medians can be employed, in which each bin
value is replaced by the bin median. In smoothing by bin boundaries, the minimum
and maximum values in each bin are identified as the bin boundaries. Each bin value
is then replaced by the closest boundary value. In general, the larger the width, the
greater the effect of the smoothing. Alternatively, bins may be equal in width, where
the interval range of values in each bin is constant.
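The binning example above can be sketched in Python. This is a minimal
illustration, not a reference implementation: the function names are
hypothetical, and the rule that a value equally close to both boundaries goes to
the lower one is an assumption the notes do not specify.

```python
from statistics import mean, median


def equal_frequency_bins(values, bin_size):
    """Sort the values and split them into consecutive bins of bin_size items."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]


def smooth_by_means(bins):
    """Replace every value in a bin by the mean of that bin."""
    return [[mean(b)] * len(b) for b in bins]


def smooth_by_medians(bins):
    """Replace every value in a bin by the median of that bin."""
    return [[median(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value in a bin by the closest bin boundary (min or max)."""
    smoothed = []
    for b in bins:
        low, high = min(b), max(b)
        # Assumption: a tie between the two boundaries goes to the lower one.
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)
print("Bins:         ", bins)
print("By means:     ", smooth_by_means(bins))
print("By medians:   ", smooth_by_medians(bins))
print("By boundaries:", smooth_by_boundaries(bins))
```

With the price data from the example, the means and boundaries results match
the bins listed above.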
2. Regression:
Data can be smoothed by fitting the data to a function, for example by regression.
Linear regression involves finding the “best” line to fit two attributes (or variables)
so that one attribute can be used to predict the other. Multiple linear regression is
an extension of linear regression, where more than two attributes are involved and
the data are fit to a multidimensional surface.
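As a rough sketch of smoothing by linear regression, the code below fits a
least-squares line to two attributes and replaces each observed value by the
value the line predicts. The function names and the sample data are
hypothetical and chosen only for illustration; ordinary least squares is
assumed as the meaning of the “best” line.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def smooth_by_regression(xs, ys):
    """Replace each y value by the value predicted by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]


# Hypothetical noisy observations of two attributes.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
print(fit_line(xs, ys))
print(smooth_by_regression(xs, ys))
```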
3. Clustering:
Outliers may be detected by clustering, where similar values are organized
into groups, or “clusters”. Intuitively, values that fall outside the set of clusters
may be considered outliers.
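One simple way to illustrate this idea for a single numeric attribute is
sketched below. It is an assumption made for illustration, not a method given in
these notes: sorted values are grouped into a cluster while the gap to the
previous value stays within max_gap, and values that end up in clusters smaller
than min_cluster_size are reported as outliers. The function names, both
parameters, and the extra value 95 are hypothetical.

```python
def cluster_by_gap(values, max_gap):
    """Group sorted 1-D values; a gap larger than max_gap starts a new cluster."""
    ordered = sorted(values)
    if not ordered:
        return []
    clusters = [[ordered[0]]]
    for v in ordered[1:]:
        if v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return clusters


def find_outliers(values, max_gap, min_cluster_size):
    """Return the values that belong to clusters smaller than min_cluster_size."""
    return [
        v
        for cluster in cluster_by_gap(values, max_gap)
        if len(cluster) < min_cluster_size
        for v in cluster
    ]


# Price data from the binning example plus one hypothetical extreme value.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34, 95]
print(cluster_by_gap(prices, max_gap=10))
print(find_outliers(prices, max_gap=10, min_cluster_size=2))
```

The choice of max_gap and min_cluster_size determines what counts as an
outlier, so in practice they would need to be tuned to the data.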