2 - Data Mining and Warehousing - L2

Data Mining and Warehousing 10/2023

===========================================================

2. Advantages and Disadvantages of Data Mining

➢ Advantages of Data Mining


1. Predict future trends and customer purchase habits
2. Help with decision making
3. Improve company revenue and lower costs
4. Market basket analysis
5. Fraud detection

➢ Disadvantages of Data Mining


1. User privacy/security
2. The amount of data is overwhelming
3. Great cost at the implementation stage
4. Possible misuse of information
5. Possible inaccuracy of data

3. Data Mining Applications


There is a rapidly growing body of successful applications in a wide range of
areas. Below is a list of applications where data mining is currently used:

• Analysis of organic compounds
• Credit card fraud detection
• Medical diagnosis
• Financial forecasting
• Electric load prediction
• Product design
• Real estate valuation
• Targeted marketing
• Toxic hazard analysis
• Weather forecasting
• Electronic commerce, etc.

===========================================================
Page 1 of 6 Dr. Maalim Aljabery_L2

4. Data Mining: On What Kind of Data?


Mining can be performed on several different kinds of data repositories. In
principle, data mining should be applicable to any kind of data repository, as well
as to data streams. Thus, the scope of our examination of data repositories will
include relational databases, data warehouses, transactional databases, advanced
database systems, flat files, data streams, and the World Wide Web. Advanced
database systems include object-relational databases and specific
application-oriented databases such as spatial, time-series, text, and multimedia
databases.

4.1 Generalizing Spatial and Multimedia Data

Multimedia data:
Any type of information medium that can be represented, processed, stored, and
transmitted over a network in digital form. Examples include multilingual text,
numeric, image, video, audio, graphical, temporal, relational, and categorical
data. Some of these terms relate to those used in conventional data mining.

Spatial data:
Generalizing detailed geographic points into clustered regions (such as
business, residential, industrial, or agricultural areas, according to land usage)
requires merging a set of geographic areas by spatial operations.

Image data:
For an image, the size, color, shape, texture, orientation, and relative
positions and structures of the contained objects or regions can be extracted by
aggregation and/or approximation.

5. Preparing Data for Data Mining


Data mining is one part of a larger process called Knowledge Discovery in
Databases (KDD). Knowledge discovery as a process consists of an iterative
sequence of the following steps, as shown in the following figure:

Step 1: Data Integration:


Combines multiple and heterogeneous data sources to form an integrated
database.

Step 2: Data Selection:


Appropriate data for the mining task is taken from the databases.

Step 3: Data Cleaning:


Removes noise and inconsistencies from the data.

Step 4: Data Transformation:


Data is transformed or consolidated into forms appropriate for mining by
performing summary or aggregation operations.

Step 5: Data Mining:


Different mining methods like association rule generation, clustering, or
classification are applied to discover the patterns.

Step 6: Pattern Evaluation:


Interesting patterns are identified using constraints such as support and
confidence.

Step 7: Knowledge Presentation:


Visualization or knowledge presentation techniques are used to present the
knowledge.

Steps 1 to 4 are different forms of data pre-processing, where the data is
prepared for mining. The data mining step may interact with the user or a
knowledge base. Interesting patterns are presented to the user and may be stored as
new knowledge in the knowledge base. According to this view, data mining is only
one step in the entire process, albeit an essential one because it uncovers hidden
patterns for evaluation.
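
The seven steps can be illustrated with a small sketch. It is only one possible
reading of the process, not a reference implementation: the two "store" data
sources, the item names, the function names, and the thresholds (minimum support
0.4, minimum confidence 0.7) are all made up for illustration. The mining step is
deliberately naive and simply proposes every rule A -> B between two items.

```python
from itertools import combinations

# Hypothetical raw sources: two stores recording transactions as item lists.
store_a = [
    {"id": 1, "items": ["bread", "milk"]},
    {"id": 2, "items": ["bread", "butter", "milk"]},
    {"id": 3, "items": ["Milk ", "bread"]},   # noisy: case and whitespace
]
store_b = [
    {"id": 4, "items": ["butter"]},
    {"id": 5, "items": None},                 # incomplete record
    {"id": 6, "items": ["bread", "butter"]},
]

def integrate(*sources):
    """Step 1: combine several sources into one collection."""
    return [record for source in sources for record in source]

def select(records):
    """Step 2: keep only the attribute needed for the task (the item lists)."""
    return [record["items"] for record in records]

def clean(item_lists):
    """Step 3: drop missing records and normalize inconsistent spellings."""
    return [[item.strip().lower() for item in items]
            for items in item_lists if items]

def transform(item_lists):
    """Step 4: consolidate each transaction into a set of items."""
    return [frozenset(items) for items in item_lists]

def mine(transactions):
    """Step 5: propose candidate rules A -> B for every pair of items."""
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in combinations(items, 2):
        rules.extend([(a, b), (b, a)])
    return rules

def evaluate(rules, transactions, min_support=0.4, min_confidence=0.7):
    """Step 6: keep rules that meet the support and confidence thresholds."""
    n = len(transactions)
    kept = []
    for a, b in rules:
        both = sum(1 for t in transactions if a in t and b in t)
        antecedent = sum(1 for t in transactions if a in t)
        if antecedent == 0:
            continue
        support = both / n               # share of transactions with A and B
        confidence = both / antecedent   # share of A-transactions that have B
        if support >= min_support and confidence >= min_confidence:
            kept.append((a, b, support, confidence))
    return kept

transactions = transform(clean(select(integrate(store_a, store_b))))
patterns = evaluate(mine(transactions), transactions)

# Step 7: present the knowledge (here, simply printed).
for a, b, support, confidence in patterns:
    print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")
```

With this toy data, only the rules between bread and milk survive the
evaluation step. The rules involving butter are dropped because their support or
confidence is too low.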


(An overview of the KDD Process)

5.1 Why preprocess the data?


Users of database systems often report errors, unusual values, and
inconsistencies in the data recorded for some transactions. In other words, the
data we wish to analyze with data mining techniques may be incomplete (lacking
attribute values), noisy (containing errors), and inconsistent (containing
discrepancies).

Incomplete, noisy, and inconsistent data are commonplace properties of large
real-world databases and data warehouses.

Incomplete data can occur for several reasons:


1. Attributes of interest may not always be available.
2. Relevant data may not have been recorded because of a misunderstanding or an
equipment malfunction.
3. Data that were inconsistent with other recorded data may have been deleted.
4. Missing data, particularly for tuples with missing values for some attributes,
may need to be inferred.
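
As a small illustration of incomplete data, the sketch below counts the missing
values per attribute and then infers the missing ones. The records, the attribute
names, and the use of None to mark a missing value are hypothetical. Filling with
the attribute mean is just one simple way to infer a value, chosen here as an
assumption; the notes do not prescribe a particular method.

```python
# Hypothetical customer records; None marks a value that was never recorded.
records = [
    {"name": "Ali", "age": 34, "income": 52000},
    {"name": "Sara", "age": None, "income": 61000},
    {"name": "Omar", "age": 45, "income": None},
    {"name": "Lina", "age": 29, "income": 48000},
]

def missing_counts(rows):
    """Count, per attribute, how many rows lack a value."""
    counts = {}
    for row in rows:
        for attribute, value in row.items():
            counts[attribute] = counts.get(attribute, 0) + (value is None)
    return counts

def fill_with_mean(rows, attribute):
    """Infer missing numeric values as the mean of the recorded ones."""
    known = [row[attribute] for row in rows if row[attribute] is not None]
    mean = sum(known) / len(known)
    return [
        {**row, attribute: mean if row[attribute] is None else row[attribute]}
        for row in rows
    ]

print(missing_counts(records))   # {'name': 0, 'age': 1, 'income': 1}
filled = fill_with_mean(records, "age")
print([row["age"] for row in filled])   # [34, 36.0, 45, 29]
```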

➢ There are many methods for preparing data as follows:


1- Data Cleaning:
This refers to pre-processing the data to remove or reduce noise and to treat
missing values. Although most classification algorithms have some mechanisms for
handling noisy or missing data, this step can help to reduce confusion during
learning.
The data that must be cleaned is as follows:
A. Noisy data:
Noise is a random error or variance in a measured variable. Noisy data contain
incorrect attribute values, which smoothing techniques try to remove.
➢ There are many reasons for noisy data as follows:
1- The collection methods used may be faulty.
2- There may have been human or computer errors occurring at data entry.
3- Errors in data transmission can also occur.
➢ Many methods for data smoothing are also methods for data reduction
involving discretization. For example:
1. Binning:
Binning methods smooth a sorted data value by consulting its
“neighborhood,” the values around it. The sorted values are distributed into several
“buckets,” or bins. Because binning methods consult the neighborhood of values,
they perform local smoothing.

Example:
Given an attribute, say price, how can we smooth the data to remove the noise?
Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into (equal frequency) bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
Smoothing by bin means:
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29
Smoothing by bin boundaries:
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34
In this example, the data for price are first sorted and then partitioned into
equal-frequency bins of size 3 (i.e., each bin contains three values).
In smoothing by bin means, each value in a bin is replaced by the mean value
of the bin. For example, the mean of the values 4, 8, and 15 in Bin 1 is 9. Therefore,
each original value in this bin is replaced by 9.
Similarly, smoothing by bin medians can be employed, in which each bin
value is replaced by the bin median. In smoothing by bin boundaries, the minimum
and maximum values in each bin are identified as the bin boundaries. Each bin value
is then replaced by the closest boundary value. In general, the larger the width, the
greater the effect of the smoothing. Alternatively, bins may be equal in width, where
the interval range of values in each bin is constant.
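
The binning example can be reproduced with a short sketch using the price values
given above. The function names are illustrative, and two details are assumptions
not stated in the notes: a value exactly halfway between the two boundaries is
replaced by the lower one, and in equal-width binning the maximum value is placed
in the last bin.

```python
from statistics import mean, median

def equal_frequency_bins(values, size):
    """Sort the values and split them into consecutive bins of `size` values."""
    ordered = sorted(values)
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]

def equal_width_bins(values, count):
    """Split the value range into `count` intervals of constant width."""
    low, high = min(values), max(values)
    width = (high - low) / count
    bins = [[] for _ in range(count)]
    for v in sorted(values):
        # The maximum would fall just past the last interval, so clamp it.
        index = min(int((v - low) / width), count - 1) if width else 0
        bins[index].append(v)
    return bins

def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean."""
    return [[mean(b)] * len(b) for b in bins]

def smooth_by_medians(bins):
    """Replace every value in a bin by the bin median."""
    return [[median(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin minimum and maximum."""
    smoothed = []
    for b in bins:
        low, high = min(b), max(b)
        # Assumption: ties go to the lower boundary.
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)

print(bins)                        # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smooth_by_means(bins))       # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(smooth_by_medians(bins))     # [[8, 8, 8], [21, 21, 21], [28, 28, 28]]
print(smooth_by_boundaries(bins))  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
print(equal_width_bins(prices, 3)) # [[4, 8], [15, 21, 21], [24, 25, 28, 34]]
```

Note that equal-width binning produces bins with different numbers of values
(here 2, 3, and 4), because each bin covers a price interval of width 10 rather
than a fixed count of values.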

2. Regression:
Data can be smoothed by fitting the data to a function, as in regression.
Linear regression involves finding the “best” line to fit two attributes (or
variables) so that one attribute can be used to predict the other. Multiple linear
regression is an extension of linear regression in which more than two attributes
are involved and the data are fit to a multidimensional surface.
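
A minimal sketch of smoothing by simple linear regression follows. It assumes
that the "best" line is the ordinary least-squares line, which is the usual
choice but is not stated in the notes. The data values and function names are
made up for illustration.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope*x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def smooth_by_regression(xs, ys):
    """Replace each observed y by the value the fitted line predicts for its x."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]

# Hypothetical noisy measurements that roughly follow y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

slope, intercept = fit_line(xs, ys)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
print([round(y, 2) for y in smooth_by_regression(xs, ys)])
```

The smoothed values all lie on the fitted line, so the random scatter of the
original measurements around the trend is removed.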

3. Clustering:
Outliers may be detected by clustering, where similar values are organized
into groups, or “clusters”. Intuitively, values that fall outside the set of
clusters may be considered outliers.
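
The idea can be sketched for one-dimensional data with a very simple grouping
rule. The notes do not name a clustering algorithm, so the gap-based grouping
below is a stand-in chosen for brevity: sorted values belong to the same cluster
as long as the gap to the previous value is at most max_gap, and clusters with
fewer than min_size values are treated as outliers. Both parameters, the function
names, and the extra value 95 added to the price data are assumptions for
illustration.

```python
def gap_clusters(values, max_gap):
    """Group sorted values, starting a new cluster when the gap exceeds max_gap."""
    if not values:
        return []
    ordered = sorted(values)
    clusters = [[ordered[0]]]
    for v in ordered[1:]:
        if v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return clusters

def find_outliers(values, max_gap, min_size=2):
    """Return the values that end up in clusters smaller than min_size."""
    return [v
            for cluster in gap_clusters(values, max_gap)
            if len(cluster) < min_size
            for v in cluster]

# The price data from the binning example plus one hypothetical extreme value.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34, 95]

print(gap_clusters(prices, max_gap=10))
print(find_outliers(prices, max_gap=10))   # [95]
```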
