DM Unit 1


DATA MINING UNIT - 1

UNIT - I
DATA MINING

Data Mining: Data – Types of Data – Data Mining Functionalities –
Interestingness Patterns – Classification of Data Mining Systems –
Data Mining Task Primitives – Integration of Data Mining System with
a Data Warehouse – Major Issues in Data Mining – Data Preprocessing.

Introduction

Data mining refers to the process or method that extracts or
"mines" interesting knowledge or patterns from large amounts of
data.
Data mining involves an integration, rather than a simple
transformation, of techniques from multiple disciplines such as
database technology, statistics, machine learning, high-performance
computing, pattern recognition, neural networks, data visualization,
information retrieval, image and signal processing, and spatial data
analysis.
Data mining techniques can be used to support a wide range of
business intelligence applications, such as customer profiling,
targeted marketing, workflow management, store layout, fraud
detection, and automated buying and selling.
Data mining is the process of automatically discovering useful
information in large data repositories. Data mining has also been
quick to adopt ideas from other areas, including optimization,
evolutionary computing, information theory, signal processing,
visualization, and information retrieval, and extending them to solve
the challenges of mining big data.


Database systems are needed to provide support for efficient
storage, indexing, and query processing. Techniques from high
performance (parallel) computing are often important in addressing
the massive size of some data sets.

SWCET
DATA MINING UNIT - 1

Distributed techniques can also help address the issue of size and
are essential when the data cannot be gathered in one location.
The figure below shows the relationship of data mining to other areas.

Data Mining

Data mining uses statistics, artificial intelligence, machine
learning systems, and databases to find hidden patterns in the
data. It supports business-related queries that would otherwise be
time-consuming to resolve.

Data is extracted and analyzed to fetch useful information. In
data mining, hidden patterns are searched for in the dataset to
predict future behavior. Data mining is used to indicate and discover
relationships within the data.
Classification
Categorization is the process of sorting data into distinct
categories. This data mining strategy determines a document's
classification by analyzing the values of its characteristics. The
goal is to categorize information into classes that are already
known.
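The idea above can be sketched in code. The following is a minimal 1-nearest-neighbour classifier: each new record is assigned the class of the closest labelled record. The training data (age, income pairs and loan labels) is purely illustrative.

```python
# A minimal sketch of classification: assign each new record the class
# of its nearest labelled record. Training data is made up for illustration.

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def classify(record, training_data):
    # training_data: list of (features, class_label) pairs
    nearest = min(training_data, key=lambda t: euclidean(record, t[0]))
    return nearest[1]

training = [
    ((25, 40000), "no-loan"),   # (age, income) -> known class
    ((45, 90000), "loan"),
    ((35, 60000), "loan"),
    ((22, 20000), "no-loan"),
]

print(classify((40, 85000), training))  # prints "loan"
```

Real systems use models such as decision trees or neural networks rather than raw nearest-neighbour lookup, but the principle of assigning records to predefined classes is the same.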
Clustering
The clustering method organizes data by grouping similar rows
into larger groups or clusters. In contrast to classification, which
assigns records to predetermined categories, clustering first
locates subsets within the dataset and then groups them according
to their attributes.
Cluster analysis is used in many data mining applications, such as
Web analytics, text mining, bioinformatics, medical diagnosis, and
social media mining.
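The grouping step can be sketched with k-means, a standard clustering algorithm: points are assigned to the nearest centroid, and each centroid is then moved to the mean of its group. The 1-D points and initial centroids below are illustrative assumptions.

```python
# A minimal sketch of clustering with k-means (k = 2) on 1-D points.

def kmeans(points, centroids, iterations=10):
    for _ in range(iterations):
        # assignment step: group each point with its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            idx = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[idx].append(p)
        # update step: move each centroid to the mean of its cluster
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

centroids, clusters = kmeans([1, 2, 3, 10, 11, 12], centroids=[1, 12])
print(centroids)   # [2.0, 11.0]
```

Note that the two groups were not predefined anywhere; they emerge from the data, which is exactly the contrast with classification described above.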
The Association Rule Learning
Association rule learning is used to identify if-then relationships
between two or more variables. The most basic example is bread and
butter: they are often bought together, which is why you will often
see these two items sold together at a supermarket.
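The two standard measures behind such rules can be computed directly: support(X) is the fraction of transactions containing X, and confidence(X -> Y) is support(X and Y) divided by support(X). The transactions below are illustrative.

```python
# A minimal sketch of association rule measures over market-basket data.

transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"butter", "milk"},
]

def support(itemset):
    # fraction of transactions that contain every item in the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    # conditional probability of the consequent given the antecedent
    return support(antecedent | consequent) / support(antecedent)

print(support({"bread", "butter"}))        # 0.5
print(confidence({"bread"}, {"butter"}))   # about 0.667
```

A rule such as "bread -> butter" would be reported only if both values clear user-chosen minimum thresholds.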
Regression
Regression is used to establish a connection between variables. The
goal is to find a function that adequately characterizes the
connection. Fitting a linear function (y = ax + b) is known as
linear regression analysis.

Data Mining Architecture

Data mining architecture describes the components used to select,
explore, and model large amounts of data in order to discover
previously unknown regularities or relationships and to generate
clear and valuable findings for the database owner. Data mining is
exploring and analyzing large amounts of data using automated or
semi-automated processes to identify practical patterns and rules.


Data Sources
The actual sources of data include databases, data warehouses, and
the World Wide Web (WWW). Sometimes data may reside in plain text
files or spreadsheets. The World Wide Web, or the Internet, is a
particularly large source of data.
Database or Data Warehouse Server
The database server contains the actual data that is ready to be
processed. The server handles the retrieval of the relevant data,
based on the user's data mining request.
Data Mining Engine
The data mining engine is the core component of a data mining
system. It consists of a number of modules used to perform data
mining tasks, including association, classification,
characterization, clustering, and prediction.

Pattern Evaluation Module
This module is mainly responsible for measuring the interestingness
of discovered patterns, typically against a threshold value. It
interacts with the data mining engine to focus the search on
interesting patterns.
Graphical User Interface
This interface handles communication between the user and the data
mining system, helping the user work with the system easily and
efficiently without being exposed to the underlying complexity of
the process. When the user specifies a query, this module passes it
to the data mining system and displays the result in an easily
understandable manner.
Knowledge Base
The knowledge base is used throughout the data mining process to
guide the search for result patterns. It may contain user beliefs
and data from past user experience that are useful in the mining
process. The data mining engine may take input from the knowledge
base to make results more accurate and reliable, and the pattern
evaluation module interacts with the knowledge base regularly,
both to obtain input and to update it.

Knowledge Discovery in Databases (KDD)

Knowledge discovery in databases (KDD) is the process of
discovering useful knowledge from a collection of data. This widely
used data mining technique is a process that includes data
preparation and selection, data cleansing, incorporating prior
knowledge on data sets, and interpreting accurate solutions from the
observed results.

Data mining is discovering patterns and relationships in massive
datasets that both data science and BI (Business Intelligence) can
benefit from. In other words, it is a knowledge discovery or
extraction process from data stored in databases. It is essential to
understand the types of data mining techniques and the related
knowledge discovery process.

The knowledge discovery process is shown in Figure as an iterative
sequence of the following steps:

1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be
combined)
3. Data selection (where data relevant to the analysis task are
retrieved from the database)
4. Data transformation (where data are transformed and
consolidated into forms appropriate for mining by performing
summary or aggregation operations)
5. Data mining (an essential process where intelligent methods
are applied to extract data patterns)
6. Pattern evaluation (to identify the truly interesting patterns
representing knowledge based on interestingness measures)
7. Knowledge presentation (where visualization and knowledge
representation techniques are used to present mined
knowledge to users)
Data mining is the process of discovering interesting patterns and
knowledge from large amounts of data. The data sources can include
databases, data warehouses, the Web, other information
repositories, or data that are streamed into the system dynamically.
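The seven steps above can be sketched as a pipeline of small functions. Each stage here is deliberately simplified, and the records and attribute names are illustrative assumptions, not a real dataset.

```python
# A minimal sketch of the KDD steps as a pipeline of functions.

raw = [
    {"age": 25, "income": 40000},
    {"age": None, "income": 60000},   # noisy record with a missing value
    {"age": 45, "income": 90000},
]

def clean(records):
    # data cleaning: drop records with missing values
    return [r for r in records if None not in r.values()]

def select(records, fields):
    # data selection: keep only task-relevant attributes
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    # data transformation: derive an income band suitable for mining
    return [dict(r, band="high" if r["income"] > 50000 else "low")
            for r in records]

def mine(records):
    # "data mining": count pattern frequencies (a trivial stand-in)
    counts = {}
    for r in records:
        counts[r["band"]] = counts.get(r["band"], 0) + 1
    return counts

patterns = mine(transform(select(clean(raw), ["age", "income"])))
print(patterns)  # {'low': 1, 'high': 1}
```

Pattern evaluation and knowledge presentation would follow, filtering the counts against interestingness thresholds and rendering them as charts or tables.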


Data–Types of Data

Nominal Attributes
Nominal means “relating to names.” The values of a nominal
attribute are symbols or names of things. Each value represents
some kind of category, code, or state, and so nominal attributes are
also referred to as categorical. The values do not have any
meaningful order.
Nominal attributes. Suppose that hair color and marital status are
two attributes describing person objects. In our application, possible
values for hair color are black, brown, blond, red, auburn, gray, and
white.
The attribute marital status can take on the values single, married,
divorced, and widowed. Both hair color and marital status are
nominal attributes. Another example of a nominal attribute is
occupation, with the values teacher, dentist, programmer, farmer,
and so on.

Binary Attributes
A binary attribute is a nominal attribute with only two categories or
states: 0 or 1, where 0 typically means that the attribute is absent,
and 1 means that it is present. Binary attributes are referred to as
Boolean if the two states correspond to true and false.
Binary attributes. Given the attribute smoker describing a patient
object, 1 indicates that the patient smokes, while 0 indicates that
the patient does not.
Ordinal Attributes
An ordinal attribute is an attribute with possible values that have a
meaningful order or ranking among them, but the magnitude
between successive values is not known.
Ordinal attributes. Suppose that drink size corresponds to the size of
drinks available at a fast-food restaurant. This ordinal attribute has
three possible values: small, medium, and large.
Other examples of ordinal attributes include grade (e.g., A+, A, A−,
B+, and so on) and professional rank. Professional ranks can be
enumerated in a sequential order: for example, assistant, associate,
and full for professors, and private, private first class, specialist,
corporal, and sergeant for army ranks.
Numeric Attributes
A numeric attribute is quantitative; that is, it is a measurable
quantity, represented in integer or real values. Numeric attributes
can be interval-scaled or ratio-scaled.
Interval-Scaled Attributes
Interval-scaled attributes are measured on a
scale of equal-size units. The values of interval-scaled attributes
have order and can be positive, 0, or negative.
Interval-scaled attributes. A temperature attribute is interval-scaled.
Suppose that we have the outdoor temperature value for a number
of different days, where each day is an object.

Ratio-Scaled Attributes
A ratio-scaled attribute is a numeric attribute with an inherent
zero-point. That is, if a measurement is ratio-scaled, we can speak
of a value as being a multiple (or ratio) of another value.
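The attribute types above can be contrasted in one small sketch. The person record and the rank encoding are illustrative assumptions.

```python
# A minimal sketch contrasting the attribute types discussed above.

person = {
    "hair_color": "brown",      # nominal: a name, no meaningful order
    "smoker": 1,                # binary: present (1) / absent (0)
    "drink_size": "medium",     # ordinal: ordered, distances unknown
    "temperature_c": 21.5,      # interval-scaled: zero point is arbitrary
    "income": 60000,            # ratio-scaled: true zero, ratios meaningful
}

# Ordinal values can be mapped to ranks that preserve their order.
size_rank = {"small": 0, "medium": 1, "large": 2}
print(size_rank[person["drink_size"]] < size_rank["large"])  # True

# Ratio-scaled values support meaningful ratios; interval-scaled values
# do not (40 degrees C is not "twice as hot" as 20 degrees C).
print(person["income"] / 30000)  # 2.0
```

Choosing the right treatment per attribute type matters later: nominal attributes cannot be averaged, and ordinal ranks only support comparisons, not arithmetic on differences.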

Data Mining Functionalities

Data mining functionalities are used to specify the kinds of
patterns to be discovered in data mining tasks. In general, data
mining tasks can be classified into two types: descriptive and
predictive.

Descriptive mining tasks characterize the general properties of the
data in the database, while predictive mining tasks perform
inference on the current data in order to make predictions.
1. Class/Concept Descriptions: Data entries can be associated with
classes or concepts. It can be helpful to describe individual classes
and concepts in summarized, concise, and yet accurate terms. Such
descriptions are referred to as class/concept descriptions.


 Data Characterization: This refers to the summary of general
characteristics or features of the class that is under study.
For example, to study the characteristics of a software product
whose sales increased by 15% two years ago, one can collect the
data related to such products by running SQL queries.
 Data Discrimination: It compares the general features of the class
under study with those of contrasting classes. The output of this
process can be represented in many forms, e.g., bar charts, curves,
and pie charts.
2. Mining Frequent Patterns, Associations, and Correlations: Frequent
patterns are nothing but things that are found to be most common in
the data. There are different kinds of frequencies that can be
observed in the dataset.
 Frequent item set: This applies to a number of items that can be
seen together regularly, e.g., milk and sugar.
 Frequent Subsequence: This refers to the pattern series that
often occurs regularly such as purchasing a phone followed by
a back cover.
 Frequent Substructure: It refers to the different kinds of data
structures such as trees and graphs that may be combined with
the itemset or subsequence.
Association Analysis: The process involves uncovering the
relationship between data and deciding the rules of the association.
It is a way of discovering the relationship between various items. For
example, it can be used to determine the sales of items that are
frequently purchased together.
Correlation Analysis: Correlation is a mathematical technique that
shows whether and how strongly pairs of attributes are related to
each other. For example, taller people tend to have higher weight.
There are various data mining functionalities which are as follows −
 Data characterization − It is a summarization of the general
characteristics of an object class of data. The data corresponding
to the user-specified class is generally collected by a database
query. The output of data characterization can be presented in
multiple forms.
 Data discrimination − It is a comparison of the general
characteristics of target class data objects with the general
characteristics of objects from one or a set of contrasting classes.
The target and contrasting classes can be specified by the user, and
the corresponding data objects fetched through database queries.
 Association Analysis − It analyses the set of items that
generally occur together in a transactional dataset. Two
parameters are used for determining the association rules −
o Support, which identifies the common item sets in the
database.
o Confidence, which is the conditional probability that an item
occurs in a transaction when another item occurs.
 Classification − Classification is the procedure of discovering a
model that represents and distinguishes data classes or
concepts, with the objective of using the model to
predict the class of objects whose class label is unknown.
The derived model is based on the analysis of a set of
training data (i.e., data objects whose class label is known).
 Prediction − It predicts some unavailable data values or
pending trends. An object can be anticipated based on the
attribute values of the object and the attribute values of the
classes. It can be a prediction of missing numerical values or of
increase/decrease trends in time-related information.
 Clustering − It is similar to classification, but the classes are
not predefined; they are derived from the data attributes. It is
unsupervised learning. The objects are clustered or grouped
based on the principle of maximizing the intraclass similarity
and minimizing the interclass similarity.
 Outlier analysis − Outliers are data elements that cannot be
grouped into a given class or cluster. These are the data objects
whose behaviour differs from the general behaviour of other data
objects. The analysis of this type of data can be essential for
mining knowledge.

 Evolution analysis − It describes the trends for objects whose
behaviour changes over time.

Interestingness Patterns

Data mining functions are used to define the trends or correlations
contained in data mining activities. In general, data mining
activities can be divided into two categories:
1. Descriptive Data Mining:
This category of data mining is concerned with finding patterns and
relationships in the data that can provide insight into the underlying
structure of the data.
Descriptive data mining is often used to summarize or explore the
data, and it can be used to answer questions such as: What are the
most common patterns or relationships in the data? Are there any
clusters or groups of data points that share common characteristics?
What are the outliers in the data, and what do they represent? Some
common techniques used in descriptive data mining include:
Cluster analysis:
This technique is used to identify groups of data points that share
similar characteristics. Clustering can be used for segmentation,
anomaly detection, and summarization.
Association rule mining:
This technique is used to identify relationships between variables in
the data. It can be used to discover co-occurring events or to identify
patterns in transaction data.
Visualization:
This technique is used to represent the data in a visual format that
can help users to identify patterns or trends that may not be
apparent in the raw data.

2. Predictive Data Mining:
This category of data mining is concerned with developing models
that can predict future behavior or outcomes based on historical
data.
Predictive data mining is often used for classification or regression
tasks, and it can be used to answer questions such as: What is the
likelihood that a customer will churn? What is the expected revenue
for a new product launch? What is the probability of a loan
defaulting? Some common techniques used in predictive data
mining include:
Decision trees: This technique is used to create a model that can
predict the value of a target variable based on the values of several
input variables. Decision trees are often used for classification tasks.
Neural networks: This technique is used to create a model that can
learn to recognize patterns in the data. Neural networks are often
used for image recognition, speech recognition, and natural
language processing.
Regression analysis: This technique is used to create a model that
can predict the value of a target variable based on the values of
several input variables. Regression analysis is often used for
prediction tasks.
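The regression technique above can be sketched concretely: fitting y = a*x + b by ordinary least squares. The (x, y) points are illustrative and chosen to lie exactly on a line so the fit is easy to verify.

```python
# A minimal sketch of linear regression via ordinary least squares.

def fit_line(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # slope: covariance of x and y divided by variance of x
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - a * mean_x
    return a, b

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]           # exactly y = 2x + 1
a, b = fit_line(xs, ys)
print(a, b)                 # 2.0 1.0
predicted = a * 5 + b       # predict the target for an unseen x
print(predicted)            # 11.0
```

The fitted function is then used for prediction on new inputs, which is exactly the predictive task described above.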
Both descriptive and predictive data mining techniques are important for
gaining insights and making better decisions. Descriptive data
mining can be used to explore the data and identify patterns, while
predictive data mining can be used to make predictions based on
those patterns.
Together, these techniques can help organizations to understand
their data and make informed decisions based on that
understanding.


Classification of Data Mining Systems


Data mining refers to the process of extracting important data
from raw data. It analyses data patterns in huge sets of data
with the help of software. Ever since its development, data mining
has been incorporated by researchers in research and development.


With data mining, businesses can gain more profit. It has
not only helped in understanding customer demand but also in
developing effective strategies to improve overall business turnover.
It has helped in determining business objectives for making clear
decisions.
Data collection and data warehousing, and computer processing are
some of the strongest pillars of data mining. Data mining utilizes the
concept of mathematical algorithms to segment the data and assess
the possibility of occurrence of future events.
To understand the system and meet the desired requirements, data
mining can be classified into the following systems:
o Classification based on the mined Databases
o Classification based on the type of mined knowledge
o Classification based on statistics
o Classification based on Machine Learning
o Classification based on visualization
o Classification based on Information Science
o Classification based on utilized techniques
o Classification based on adapted applications
Classification Based on the mined Databases
A data mining system can be classified based on the types of
databases that have been mined. A database system can be further
segmented based on distinct principles, such as data models, types
of data, etc., which further assist in classifying a data mining
system.
For example, if we want to classify a database based on the data
model, we need to select a relational, transactional,
object-relational, or data warehouse mining system.
Classification Based on the type of Knowledge Mined
A data mining system categorized based on the kind of knowledge
mined may have the following functionalities:

1. Characterization
2. Discrimination
3. Association and Correlation Analysis
4. Classification
5. Prediction
6. Outlier Analysis
7. Evolution Analysis
Classification Based on the Techniques Utilized
A data mining system can also be classified based on the type of
techniques that are incorporated. These techniques can be assessed
based on the degree of user interaction involved or the methods of
analysis employed.
Classification Based on the Applications Adapted
Data mining systems classified based on the applications adapted
are as follows:
1. Finance
2. Telecommunications
3. DNA
4. Stock Markets
5. E-mail
Examples of Classification Task
Following is some of the main examples of classification tasks:
o Classification helps in determining tumor cells as benign or
malignant.
o Classification of credit card transactions as fraudulent or
legitimate.
o Classification of secondary structures of protein as alpha-helix,
beta-sheet, or random coil.
o Classification of news stories into distinct categories such as
finance, weather, entertainment, sports, etc.

Data mining Task primitives


Data mining task primitives refer to the basic building blocks or
components that are used to construct a data mining process. These
primitives are used to represent the most common and fundamental
tasks that are performed during the data mining process.
The use of data mining task primitives can provide a modular and
reusable approach, which can improve the performance, efficiency,
and understandability of the data mining process.

The Data Mining Task Primitives are as follows:


1. The set of task relevant data to be mined: It refers to the specific
data that is relevant and necessary for a particular task or
analysis being conducted using data mining techniques. This
data may include specific attributes, variables, or
characteristics that are relevant to the task at hand, such as
customer demographics, sales data, or website usage statistics.
The data selected for mining is typically a subset of the overall
data available, as not all data may be necessary or relevant
for the task. For example: extracting the database name, the
database tables, and the relevant attributes from the provided
input database.


2. Kind of knowledge to be mined: It refers to the type of
information or insights that are being sought through the use of
data mining techniques. This describes the data mining tasks
that must be carried out. It includes various tasks such as
classification, clustering, discrimination, characterization,
association, and evolution analysis. For example, it determines
the task to be performed on the relevant data, such as
classification, clustering, prediction, discrimination, outlier
detection, or correlation analysis.
3. Background knowledge to be used in the discovery process: It
refers to any prior information or understanding that is used to
guide the data mining process. This can include domain-specific
knowledge, such as industry-specific terminology, trends, or
best practices, as well as knowledge about the data itself. The
use of background knowledge can help to improve the accuracy
and relevance of the insights obtained from the data mining
process.
For example: using background knowledge such as concept
hierarchies and user beliefs about relationships in the data in
order to evaluate patterns more efficiently.
4. Interestingness measures and thresholds for pattern evaluation: It
refers to the methods and criteria used to evaluate the quality
and relevance of the patterns or insights discovered through
data mining. Interestingness measures are used to quantify the
degree to which a pattern is considered to be interesting or
relevant based on certain criteria, such as its frequency,
confidence, or lift. These measures are used to identify patterns
that are meaningful or relevant to the task. Thresholds for
pattern evaluation, on the other hand, are used to set a
minimum level of interestingness that a pattern must meet in
order to be considered for further analysis or action.

For example: evaluating interestingness measures such as utility,
certainty, and novelty for the data, and setting an appropriate
threshold value for pattern evaluation.
5. Representation for visualizing the discovered pattern: It refers to
the methods used to represent the patterns or insights
discovered through data mining in a way that is easy to
understand and interpret. Visualization techniques such as
charts, graphs, and maps are commonly used to represent the
data and can help to highlight important trends, patterns, or
relationships within the data. Visualizing the discovered pattern
helps to make the insights obtained from the data mining
process more accessible and understandable to a wider
audience, including non-technical stakeholders.
For example: presentation and visualization of discovered
patterns using techniques such as bar plots, charts, graphs,
tables, etc.
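The five primitives above can be gathered into a single mining-task specification, which is essentially what a data mining query expresses. All names and values below are illustrative assumptions, not a real query language.

```python
# A minimal sketch of the five task primitives as one specification.

mining_task = {
    # 1. the set of task-relevant data to be mined
    "data": {
        "database": "sales_db",
        "tables": ["customers", "orders"],
        "attributes": ["age", "income", "items"],
    },
    # 2. the kind of knowledge to be mined
    "knowledge": "association",
    # 3. background knowledge (e.g., a concept hierarchy on age)
    "background": {"age": ["young", "middle_aged", "senior"]},
    # 4. interestingness measures and thresholds
    "thresholds": {"min_support": 0.05, "min_confidence": 0.7},
    # 5. representation for visualizing discovered patterns
    "presentation": ["rules", "table", "bar_chart"],
}

print(sorted(mining_task))  # the five primitive groups, sorted by name
```

A data mining system would read such a specification, fetch the relevant data, run the requested task, filter patterns by the thresholds, and render the results in the requested form.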
Advantages of Data Mining Task Primitives
The use of data mining task primitives has several advantages,
including:
1. Modularity: Data mining task primitives provide a modular
approach to data mining, which allows for flexibility and the
ability to easily modify or replace specific steps in the process.
2. Reusability: Data mining task primitives can be reused across
different data mining projects, which can save time and effort.
3. Standardization: Data mining task primitives provide a
standardized approach to data mining, which can improve the
consistency and quality of the data mining process.
4. Understandability: Data mining task primitives are easy to
understand and communicate, which can improve collaboration
and communication among team members.

5. Improved Performance: Data mining task primitives can improve
the performance of the data mining process by reducing the
amount of data that needs to be processed, and by optimizing
the data for specific data mining algorithms.
6. Flexibility: Data mining task primitives can be combined and
repeated in various ways to achieve the goals of the data
mining process, making it more adaptable to the specific needs
of the project.
7. Efficient use of resources: Data mining task primitives can help
to make more efficient use of resources, as they allow specific
tasks to be performed with the right tools, avoiding unnecessary
steps and reducing the time and computational power needed.

Integration of Data mining system with a Data warehouse


No Coupling
In the no-coupling scheme, the data mining system does not use any
database or data warehouse system functions.
Loose Coupling
In loose coupling, data mining utilizes some of the database or data
warehouse system functionalities. It mainly fetches the data from
the data repository managed by these systems and then performs
data mining. The results are kept either in a file or in a
designated place in the database or data warehouse.
Semi-Tight Coupling
In semi-tight coupling, data mining is linked to either the DB or DW
system and provides an efficient implementation of data mining
primitives within the database.
Tight Coupling
A data mining system can be effortlessly combined with a database
or data warehouse system in tight coupling.

Major issues in Data Mining


Data mining is not an easy task, as the algorithms used can get very
complex, and data is not always available in one place; it needs to
be integrated from various heterogeneous data sources.

 Mining Methodology and User Interaction


 Performance Issues
 Diverse Data Types Issues
Mining Methodology and User Interaction Issues
 Mining different kinds of knowledge in databases − Different users
may be interested in different kinds of knowledge. Therefore, it
is necessary for data mining to cover a broad range of knowledge
discovery tasks.
 Interactive mining of knowledge at multiple levels of abstraction −
The data mining process needs to be interactive because it allows
users to focus the search for patterns, providing and refining data
mining requests based on the returned results.
 Incorporation of background knowledge − Background knowledge can
be used to guide the discovery process and to express the
discovered patterns, not only in concise terms but at multiple
levels of abstraction.
 Data mining query languages and ad hoc data mining − A data
mining query language that allows the user to describe ad hoc
mining tasks should be integrated with a data warehouse query
language and optimized for efficient and flexible data mining.
 Presentation and visualization of data mining results − Once
patterns are discovered, they need to be expressed in high-level
languages and visual representations. These representations
should be easily understandable.
 Handling noisy or incomplete data − Data cleaning methods are
required to handle noise and incomplete objects while mining the
data regularities. Without data cleaning, the accuracy of the
discovered patterns will be poor.
 Pattern evaluation − Not all discovered patterns are interesting;
some may represent common knowledge or lack novelty. Measures are
therefore needed to assess the interestingness of patterns.
Performance Issues
There can be performance-related issues such as follows −
 Efficiency and scalability of data mining algorithms − In order to
effectively extract information from the huge amount of data in
databases, data mining algorithms must be efficient and scalable.
 Parallel, distributed, and incremental mining algorithms − Factors
such as the huge size of databases, the wide distribution of
data, and the complexity of data mining methods motivate the
development of parallel and distributed data mining algorithms.
These algorithms divide the data into partitions, which are
processed in parallel, and the results from the partitions are
then merged. Incremental algorithms update the mined knowledge
without mining the data again from scratch.
Diverse Data Types Issues
 Handling of relational and complex types of data − The database may contain complex data objects, multimedia data objects, spatial data, temporal data, etc. It is not possible for one system to mine all these kinds of data.
 Mining information from heterogeneous databases and global information systems − The data is available at different data sources on a LAN or WAN. These data sources may be structured, semi-structured, or unstructured. Therefore, mining knowledge from them adds challenges to data mining.

Data Preprocessing

Data preprocessing is an important step in the data mining process that involves cleaning and transforming raw data to make it suitable for analysis. Some common steps in data preprocessing include:
Data Cleaning: This involves identifying and correcting errors or
inconsistencies in the data, such as missing values, outliers, and
duplicates. Various techniques can be used for data cleaning, such
as imputation, removal, and transformation.
Data Integration: This involves combining data from multiple sources to create a unified dataset. Data integration can be challenging as it requires handling data with different formats, structures, and semantics. Techniques such as record linkage and data fusion can be used for data integration.
Data Transformation: This involves converting the data into a suitable format for analysis. Common techniques used in data transformation include normalization, standardization, and discretization. Normalization is used to scale the data to a common range, while standardization is used to transform the data to have zero mean and unit variance.
Discretization is used to convert continuous data into discrete
categories.
Data Reduction: This involves reducing the size of the dataset while
preserving the important information. Data reduction can be
achieved through techniques such as feature selection and feature
extraction. Feature selection involves selecting a subset of relevant
features from the dataset, while feature extraction involves
transforming the data into a lower-dimensional space while
preserving the important information.
Data Discretization: This involves dividing continuous data into
discrete categories or intervals. Discretization is often used in data
mining and machine learning algorithms that require categorical
data. Discretization can be achieved through techniques such as
equal width binning, equal frequency binning, and clustering.
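The equal-width and equal-frequency strategies mentioned above can be sketched in Python; the bin count and sample values below are invented for illustration:

```python
# Equal-width and equal-frequency binning (a minimal sketch).

def equal_width_bins(values, k):
    """Assign each value to one of k equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # Clamp the maximum value into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency_bins(values, k):
    """Split values into k bins of (roughly) equal size by rank."""
    ordered = sorted(values)
    n = len(ordered)
    rank = {v: i for i, v in enumerate(ordered)}
    return [min(rank[v] * k // n, k - 1) for v in values]

data = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(equal_width_bins(data, 3))      # width = 10: [0, 0, 1, 1, 1, 2, 2, 2, 2]
print(equal_frequency_bins(data, 3))  # three values per bin
```

Note that equal-width bins can be unbalanced when the data is skewed, which is why equal-frequency binning is often preferred for discretization.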
Data Normalization: This involves scaling the data to a common
range, such as between 0 and 1 or -1 and 1. Normalization is often
used to handle data with different units and scales. Common normalization techniques include min-max normalization, z-score
normalization, and decimal scaling.
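As a rough illustration of two of these techniques, min-max and z-score normalization might look like this (the income figures are made up):

```python
# Min-max and z-score normalization (sketch with invented sample values).
from statistics import mean, pstdev

def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Transform values to have zero mean and unit variance."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

incomes = [30, 50, 70, 90]   # e.g. thousands of dollars
print(min_max(incomes))      # [0.0, 0.333..., 0.666..., 1.0]
print(z_score(incomes))      # zero mean, unit variance
```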
Data preprocessing plays a crucial role in ensuring the quality of
data and the accuracy of the analysis results. The specific steps
involved in data preprocessing may vary depending on the nature of
the data and the analysis goals.
By performing these steps, the data mining process becomes more
efficient and the results become more accurate.
Preprocessing in Data Mining:
Data preprocessing is a data mining technique which is used to transform the raw data into a useful and efficient format.
Steps Involved in Data Preprocessing:


1. Data Cleaning: The data can have many irrelevant and missing parts. To handle this, data cleaning is done. It involves handling missing data, noisy data, etc.
 (a). Missing Data: This situation arises when some values are missing from the data. It can be handled in various ways. Some of them are:
1. Ignore the tuples: This approach is suitable only when the dataset we have is quite large and multiple values are missing within a tuple.
2. Fill the Missing values: There are various ways to do this
task. You can choose to fill the missing values manually,
by attribute mean or the most probable value.
3. Use a global constant to fill in the missing value: Replace all missing attribute values by the same constant, such as a label like "Unknown" or −∞.
4. Use a measure of central tendency for the attribute (e.g., the
mean or median) to fill in the missing value.
5. Use the attribute mean or median for all samples belonging to
the same class as the given tuple: For example, if classifying
customers according to credit risk, we may replace the
missing value with the mean income value for customers
in the same credit risk category as that of the given tuple.
If the data distribution for a given class is skewed, the
median value is a better choice.
6. Use the most probable value to fill in the missing value: This may be determined with regression, inference-based tools using a Bayesian formalism, or decision tree induction.
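Methods 4 and 5 above (the attribute mean and the class-conditional mean) can be sketched as follows; the income values and credit-risk labels are invented for illustration:

```python
# Filling missing values (None) with the attribute mean, and with the
# class-conditional mean (per credit-risk category) — a minimal sketch.
from statistics import mean

rows = [  # (income, credit_risk); None marks a missing income
    (40, "low"), (60, "low"), (None, "low"),
    (20, "high"), (30, "high"), (None, "high"),
]

# Method 4: global mean imputation over all known values
known = [inc for inc, _ in rows if inc is not None]
global_mean = mean(known)  # (40 + 60 + 20 + 30) / 4 = 37.5

# Method 5: class-conditional mean imputation
by_class = {}
for inc, cls in rows:
    if inc is not None:
        by_class.setdefault(cls, []).append(inc)
class_mean = {cls: mean(vals) for cls, vals in by_class.items()}

filled = [(inc if inc is not None else class_mean[cls], cls) for inc, cls in rows]
print(filled)  # missing "low" income -> 50, missing "high" income -> 25
```

Using the class-conditional mean preserves the difference between the two risk groups, which the single global mean would blur.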
 (b). Noisy Data: Noisy data is meaningless data that cannot be interpreted by machines. It can be generated due to faulty data collection, data entry errors, etc. It can be handled in the following ways:
1. Binning Method: This method works on sorted data in order to smooth it. The whole data is divided into segments of equal size and then various methods are performed to complete the task. Each segment is handled separately. One can replace all the data in a segment by its mean, or boundary values can be used to complete the task.
2. Regression: Here data can be made smooth by fitting it to
a regression function. The regression used may be linear
(having one independent variable) or multiple (having
multiple independent variables).
3. Clustering: This approach groups similar data into clusters. Outliers may go undetected, or they will fall outside the clusters.
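The two smoothing variants from the binning method above can be sketched in Python (the sorted price list is an invented illustrative example):

```python
# Smoothing sorted data by bin means and by bin boundaries,
# using equal-size bins of three values each.

prices = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
bins = [prices[i:i + 3] for i in range(0, len(prices), 3)]

# Smoothing by bin means: every value becomes its bin's mean.
by_means = [[sum(b) / len(b)] * len(b) for b in bins]

# Smoothing by bin boundaries: each value snaps to the nearer edge.
by_bounds = [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

print(by_means)   # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
print(by_bounds)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```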
2. Data Transformation: This step is taken in order to transform the data into appropriate forms suitable for the mining process. This involves the following ways:
1. Normalization: It is done in order to scale the data values in a specified range (-1.0 to 1.0 or 0.0 to 1.0).
2. Attribute Selection: In this strategy, new attributes are
constructed from the given set of attributes to help the mining
process.
3. Discretization: This is done to replace the raw values of a numeric attribute by interval levels or conceptual levels.
4. Concept Hierarchy Generation: Here attributes are converted from a lower level to a higher level in the hierarchy. For example, the attribute "city" can be converted to "country".
3. Data Reduction: Data reduction is a crucial step in the data mining
process that involves reducing the size of the dataset while
preserving the important information. This is done to improve the
efficiency of data analysis and to avoid overfitting of the model.
Some common steps involved in data reduction are:
Feature Selection: This involves selecting a subset of relevant features from the dataset. Feature selection is often performed to remove irrelevant or redundant features from the dataset. It can be done using various techniques such as correlation analysis, mutual information, and principal component analysis (PCA).
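As one possible sketch of correlation analysis for feature selection, the filter below keeps a feature only if it is not too correlated with any feature already kept. The feature matrix and the 0.9 threshold are assumptions for illustration:

```python
# Correlation-based feature filtering: drop one of each pair of
# features whose absolute Pearson correlation exceeds a threshold.
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

features = {
    "height_cm": [150, 160, 170, 180],
    "height_in": [59, 63, 67, 71],   # redundant: linear in height_cm
    "weight_kg": [55, 80, 62, 78],
}

selected = []
for name, col in features.items():
    # Keep the feature only if it is weakly correlated with all kept ones.
    if all(abs(pearson(col, features[kept])) < 0.9 for kept in selected):
        selected.append(name)
print(selected)  # ['height_cm', 'weight_kg'] — the redundant copy is dropped
```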

Feature Extraction: This involves transforming the data into a lower-dimensional space while preserving the important information. Feature extraction is often used when the original features are high-dimensional and complex. It can be done using techniques such as PCA, linear discriminant analysis (LDA), and non-negative matrix factorization (NMF).
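A minimal sketch of PCA-style feature extraction: projecting 2-D points onto their first principal component using power iteration. The toy points are invented; a real PCA implementation would extract multiple components and handle numerical edge cases:

```python
# Project points onto the first principal component (a 1-D embedding).

def first_pc_projection(points, iters=200):
    n, d = len(points), len(points[0])
    # Center the data.
    means = [sum(p[j] for p in points) / n for j in range(d)]
    centered = [[p[j] - means[j] for j in range(d)] for p in points]
    # Covariance matrix (d x d).
    cov = [[sum(r[i] * r[j] for r in centered) / n for j in range(d)]
           for i in range(d)]
    # Power iteration converges to the dominant eigenvector of cov.
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    # 1-D coordinate: dot product of each centered row with v.
    return [sum(r[j] * v[j] for j in range(d)) for r in centered]

pts = [(1, 1), (2, 2), (3, 3), (4, 4)]   # perfectly correlated features
print(first_pc_projection(pts))          # spread along the diagonal direction
```

Because the two coordinates are perfectly correlated, one principal component captures all the variance, so the 2-D data reduces to 1-D without losing information.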
Sampling: This involves selecting a subset of data points from the
dataset. Sampling is often used to reduce the size of the dataset
while preserving the important information. It can be done using
techniques such as random sampling, stratified sampling, and
systematic sampling.
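Random and stratified sampling might be sketched as follows; the class labels and the 10% sampling fraction are invented:

```python
# Simple random sampling and proportional stratified sampling.
import random

random.seed(0)  # reproducible for the example
population = [("A", i) for i in range(80)] + [("B", i) for i in range(20)]

# Simple random sampling without replacement
simple = random.sample(population, 10)

# Stratified sampling: sample each class ("stratum") proportionally,
# so minority classes keep their share of the sample.
def stratified_sample(rows, frac):
    strata = {}
    for label, value in rows:
        strata.setdefault(label, []).append((label, value))
    picked = []
    for label, members in strata.items():
        picked.extend(random.sample(members, int(len(members) * frac)))
    return picked

sample = stratified_sample(population, 0.10)
print(len(sample))                                # 10
print(sum(1 for lbl, _ in sample if lbl == "A"))  # exactly 8 from stratum A
```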
Clustering: This involves grouping similar data points together into
clusters. Clustering is often used to reduce the size of the dataset by
replacing similar data points with a representative centroid. It can be
done using techniques such as k-means, hierarchical clustering, and
density-based clustering.
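A tiny 1-D k-means sketch shows how clustering can shrink a dataset to k representative centroids; the data and initial centroids are hand-picked here for simplicity, whereas real implementations choose starting centroids randomly:

```python
# 1-D k-means: each cluster of values is replaced by its centroid.

def kmeans_1d(values, centroids, iters=10):
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        # Assignment step: each value joins its nearest centroid.
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to its cluster's mean.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

data = [1, 2, 3, 20, 21, 22, 100, 101]
print(kmeans_1d(data, centroids=[0, 20, 90]))  # [2.0, 21.0, 100.5]
```

The eight original values reduce to three representatives, which is the data-reduction effect described above.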
Compression: This involves compressing the dataset while
preserving the important information. Compression is often used to
reduce the size of the dataset for storage and transmission
purposes. It can be done using techniques such as wavelet
compression, JPEG compression, and gzip compression.
