Data Preprocessing in Data Mining
Data Preprocessing in Data Mining
Data Preprocessing in Data Mining
Data preprocessing is an important process of data mining. In this process, raw data is converted into an
understandable format and made ready for further analysis. The motive is to improve data quality and
make it up to mark for specific tasks.
Data cleaning
Data cleaning help us remove inaccurate, incomplete and incorrect data from the dataset. Some
techniques used in data cleaning are −
Standard values can be used to fill up the missing values in a manual way but only for a small
dataset.
Attribute's mean and median values can be used to replace the missing values in normal and non-
normal distribution of data respectively.
Tuples can be ignored if the dataset is quite large and many values are missing within a tuple.
Most appropriate value can be used while using regression or decision tree algorithms
Noisy Data
Noisy data are the data that cannot be interpreted by machine and are containing unnecessary faulty data.
Some ways to handle them are −
Binning − This method handle noisy data to make it smooth. Data gets divided equally and stored
in form of bins and then methods are applied to smoothing or completing the tasks. The methods
are Smoothing by a bin mean method(bin values are replaced by mean values), Smoothing by bin
median(bin values are replaced by median values) and Smoothing by bin
boundary(minimum/maximum bin values are taken and replaced by closest boundary values).
Regression − Regression functions are used to smoothen the data. Regression can be
linear(consists of one independent variable) or multiple(consists of multiple independent
variables).
Clustering − It is used for grouping the similar data in clusters and is used for finding outliers.
Data integration
The process of combining data from multiple sources (databases, spreadsheets,text files) into a single
dataset. Single and consistent view of data is created in this process. Major problems during data
integration are Schema integration(Integrates set of data collected from various sources), Entity
identification(identifying entities from different databases) and detecting and resolving data values
concept.
Data transformation
In this part, change in format or structure of data in order to transform the data suitable for mining
process. Methods for data transformation are −
Normalization − Method of scaling data to represent it in a specific smaller range( -1.0 to 1.0)
Discretization − It helps reduce the data size and make continuous data divide into intervals.
Attribute Selection − To help the mining process, new attributes are derived from the given attributes.
Concept Hierarchy Generation − In this, the attributes are changed from lower level to higher level in
hierarchy.
Aggregation − In this, a summary of data gets stored which depends upon quality and quantity of data to
make the result more optimal.
Data reduction
It helps in increasing storage efficiency and reducing data storage to make the analysis easier by producing
almost the same results. Analysis becomes harder while working with huge amounts of data, so reduction
is used to get rid of that.
Data Compression
Data is compressed to make efficient analysis. Lossless compression is when there is no loss of data
while compression. loss compression is when unnecessary information is removed during compression.
Numerosity Reduction
There is a reduction in volume of data i.e. only store model of data instead of whole data, which provides
smaller representation of data without any loss of data.
Dimensionality reduction
In this, reduction of attributes or random variables are done so as to make the data set dimension low.
Attributes are combined without losing its original characteristics.
Conclusion
This article comprises of data preprocessing which help data to get converted into usable format. Tasks
which helps data preprocessing are Data cleaning, data integration, data transformation and data
reduction. Data cleaning remove incomplete data by handling missing values and smoothing noises with
the help of binning, regression and clustering. Data integration combines data from multiple source to
make a single dataset. Data transformation help change the format of data by using discretization,
attribute selection, concept hierarchy generation and aggregation to make the data usable for mining. Data
reduction helps with reducing storage of data to make the analysis easier with the help of some steps like
data compression, numerosity reduction and dimensionality reduction.