
Advanced Data Mining 221ECS001 Module -1 Part -3

Module - 1

Data Mining and Knowledge Discovery

Desirable properties of Discovered Knowledge - Knowledge representation, Data Mining Functionalities, Motivation and Importance of Data Mining, Classification of Data Mining Systems, Integration of a data mining system with a Database or Data Warehouse System, Classification, Clustering, Regression, Data Pre-Processing: Data Cleaning, Data Integration and Transformation, normalization, standardization, Data Reduction, Feature vector representation, importance of feature engineering in machine learning; forward selection and backward selection for feature selection; curse of dimensionality; data imputation techniques; No Free Lunch theorem in the context of machine learning, Data Discretization and Concept Hierarchy Generation.

Feature vector representation


● A feature vector is an n-dimensional vector of numerical features that describes some object in pattern recognition. A feature is a measurable property of an observable phenomenon; for example, the height and weight of a person are observable, measurable features.
● A feature vector is an ordered list of numerical properties of the observed phenomenon. It represents the input features to a machine learning model that makes a prediction. Humans can analyze qualitative data to make a decision: for example, we see the cloudy sky, feel the damp breeze, and decide to take an umbrella when going outside. Our five senses transform outside stimuli into neural activity in our brains, handling multiple inputs as they occur in no particular order. A machine learning model, in contrast, needs its inputs encoded as numbers.
● Feature vectors are used in classification problems, artificial neural networks, and k-nearest neighbors algorithms in machine learning (a minimal sketch follows).
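A minimal sketch of a feature vector in practice: two illustrative measurements (height and weight) are encoded as one numeric vector per person and passed to a k-nearest neighbors classifier. The values, labels, and column meanings are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Each row is one feature vector: [height_cm, weight_kg] for one person (illustrative values)
X = np.array([
    [150.0, 45.0],
    [160.0, 55.0],
    [175.0, 80.0],
    [185.0, 95.0],
])
y = np.array(["small", "small", "large", "large"])  # illustrative class labels

clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(clf.predict(np.array([[170.0, 70.0]])))  # classify a new, unseen feature vector
```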

Importance Of Feature Engineering In Machine Learning


● Feature engineering is the pre-processing step of machine learning that transforms raw data into features that can be used for creating a predictive model using machine learning or statistical modelling.
● Feature engineering in machine learning aims to improve the performance of models.
● A feature is an attribute of the data that has an impact on the problem or is useful for solving it.


What is Feature Engineering?


● Feature engineering is the pre-processing step of machine learning that extracts features from raw data. It helps to represent the underlying problem to predictive models in a better way, which in turn improves the accuracy of the model on unseen data. A predictive model contains predictor variables and an outcome variable, and the feature engineering process selects the most useful predictor variables for the model.
● Feature engineering in ML contains mainly four processes: Feature Creation, Transformations, Feature Extraction, and Feature Selection.

These processes are described below; a combined code sketch illustrating all four follows the list:

1. Feature Creation: Feature creation is finding the most useful variables to be used in a predictive model. The process is subjective and requires human creativity and intervention. New features are created by combining existing features, for example through addition, subtraction, or ratios, and these new features offer great flexibility.

2. Transformations: The transformation step of feature engineering involves adjusting the predictor variables to improve the accuracy and performance of the model. For example, it ensures that the model can flexibly accept a variety of input data and that all the variables are on the same scale, making the model easier to understand. It improves the model's accuracy and keeps all the features within an acceptable range, avoiding computational errors.

3. Feature Extraction: Feature extraction is an automated feature engineering process that generates new variables by extracting them from the raw data. The main aim of this step is to reduce the volume of data so that it can be easily used and managed for data modeling. Feature extraction methods include cluster analysis, text analytics, edge detection algorithms, and principal components analysis (PCA).

4. Feature Selection: While developing a machine learning model, only a few variables in the dataset are useful for building the model; the remaining features are either redundant or irrelevant. If we feed the dataset with all these redundant and irrelevant features into the model, it may negatively impact and reduce the overall performance and accuracy of the model. Hence it is very important to identify and select the most appropriate features from the data and remove the irrelevant or less important ones, which is done with the help of feature selection in machine learning. "Feature selection is a way of selecting the subset of the most relevant features from the original feature set by removing the redundant, irrelevant, or noisy features."
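The following is a compact, hedged sketch of all four processes on a toy pandas DataFrame; the column names, values, and parameter choices are invented for illustration. A ratio feature is created, the predictors are standardized, PCA extracts two components, and a univariate filter (SelectKBest) keeps the two predictors most related to the outcome.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

# Toy data; the column names are purely illustrative
df = pd.DataFrame({
    "income":  [40_000, 60_000, 85_000, 30_000, 120_000, 75_000],
    "debt":    [10_000, 20_000, 15_000, 12_000,  30_000, 40_000],
    "age":     [25, 34, 45, 29, 52, 41],
    "default": [0, 0, 0, 1, 0, 1],          # outcome variable
})

# 1. Feature creation: combine existing features (here, a ratio)
df["debt_to_income"] = df["debt"] / df["income"]

X = df.drop(columns="default")
y = df["default"]

# 2. Transformation: put all predictors on the same scale
X_scaled = StandardScaler().fit_transform(X)

# 3. Feature extraction: compress the predictors into two principal components
X_pca = PCA(n_components=2).fit_transform(X_scaled)

# 4. Feature selection: keep the two predictors most related to the outcome
X_selected = SelectKBest(score_func=f_classif, k=2).fit_transform(X_scaled, y)

print(X_pca.shape, X_selected.shape)
```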

Need for Feature Engineering in Machine Learning


● Feature engineering influences how machine learning models perform and how accurate
they are. It helps uncover the hidden patterns in the data and boosts the predictive power
of machine learning.
● For machine learning algorithms to work properly, users must supply input data in a form that the algorithms can understand. Feature engineering transforms that input data into a single aggregated form that is optimized for machine learning, and thus enables machine learning to do its job, e.g., predicting churn for retailers or preventing fraud for financial institutions.
● The aim of feature engineering is to prepare an input data set that best fits the machine
learning algorithm as well as to enhance the performance of machine learning models.
● Forward Selection: Forward selection is an iterative method in which we start with no features in the model. In each iteration, we add the feature that best improves the model, until adding a new variable no longer improves the performance of the model.
● Backward Elimination: In backward elimination, we start with all the features and remove the least significant feature at each iteration, which improves the performance of the model. We repeat this until no improvement is observed on removal of features. A code sketch of both strategies follows this list.
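A hedged sketch of both strategies using scikit-learn's SequentialFeatureSelector; the dataset (iris), the estimator, and the number of features to keep are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SequentialFeatureSelector

X, y = load_iris(return_X_y=True)
estimator = LogisticRegression(max_iter=1000)

# Forward selection: start with no features, add the most helpful one each round
forward = SequentialFeatureSelector(
    estimator, n_features_to_select=2, direction="forward", cv=5
).fit(X, y)

# Backward elimination: start with all features, drop the least helpful one each round
backward = SequentialFeatureSelector(
    estimator, n_features_to_select=2, direction="backward", cv=5
).fit(X, y)

print("forward keeps:", forward.get_support())
print("backward keeps:", backward.get_support())
```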

What is the curse of dimensionality?

● Curse of Dimensionality refers to a set of problems that arise when working with
high-dimensional data. The dimension of a dataset corresponds to the number of
attributes/features that exist in a dataset.

● A dataset with a large number of attributes, generally of the order of a hundred or more, is referred to as high-dimensional data. Some of the difficulties that come with high-dimensional data manifest while analyzing or visualizing the data to identify patterns, and some manifest while training machine learning models. The difficulties related to training machine learning models on high-dimensional data are referred to as the 'Curse of Dimensionality'.

● The Curse of Dimensionality also describes the explosive growth of the data space as dimensions are added, and the resulting exponential increase in the computational effort required to process or analyze the data. The term was first introduced by Richard E. Bellman, in the area of dynamic programming, to explain the increase in the volume of Euclidean space associated with adding extra dimensions. A small numeric illustration of its effect on distances follows.
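A minimal numeric sketch of one symptom of the curse of dimensionality: for uniformly random points, as the number of dimensions grows the distances between points concentrate, so the nearest and farthest neighbors become almost equally far away. The point count and dimensions below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points = 500

for d in [2, 10, 100, 1000]:
    X = rng.random((n_points, d))                 # uniform random points in the unit hypercube
    dists = np.linalg.norm(X[1:] - X[0], axis=1)  # distances from the first point to all others
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:5d}  relative distance contrast = {contrast:.3f}")  # shrinks as d grows
```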

Data Imputation Techniques


● Data imputation is a method for retaining the majority of the dataset's data and
information by substituting missing data with a different value.
Importance of Data Imputation:
We employ imputation since missing data can lead to the following problems:
● Distorts Dataset: Large amounts of missing data can lead to anomalies in the variable
distribution, which can change the relative importance of different categories in the
dataset.
● Impacts on the Final Model: Missing data may lead to bias in the dataset, which could
affect the final model's analysis.
Data Imputation Techniques:
● Next or Previous Value
● K Nearest Neighbors
● Maximum or Minimum Value
● Missing Value Prediction
● Most Frequent Value
● Average or Linear Interpolation
● (Rounded) Mean or Moving Average or Median Value
● Fixed Value

1. Next or Previous Value

For time-series data or other ordered data, there are specific imputation techniques. These techniques take into account the dataset's sorted structure, in which nearby values are likely more similar than far-off ones. A common method for imputing incomplete data in a time series is to substitute the next or previous value in the series for the missing value. This strategy is effective for both nominal and numerical values.
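A hedged pandas sketch; the timestamps and sensor readings are invented. ffill() carries the previous value forward, bfill() carries the next value backward.

```python
import numpy as np
import pandas as pd

ts = pd.Series(
    [20.5, np.nan, np.nan, 22.0, np.nan, 23.5],
    index=pd.date_range("2024-01-01", periods=6, freq="D"),
)

print(ts.ffill())  # previous value ("last observation carried forward")
print(ts.bfill())  # next value ("next observation carried backward")
```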


2. K Nearest Neighbors

The objective is to find the k nearest examples in the data for which the value of the relevant feature is not missing, and then substitute the value of that feature which occurs most frequently within the group.
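A hedged sketch using scikit-learn's KNNImputer on an invented matrix. Note that for numeric features KNNImputer fills a gap with the mean of the k nearest rows' values, a common variant of the mode-based scheme described above.

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [1.0,    2.0, np.nan],
    [3.0,    4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0,    8.0, 7.0],
])

# Each missing entry is filled from the k nearest rows, measured on the non-missing features
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))
```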

3. Maximum or Minimum Value

You can use the minimum or maximum of the range as the replacement for missing values if you know that the data must fall within a specific range [minimum, maximum], and if you know from the data-collection process that the measurement instrument stops recording and the measurement saturates beyond one of these boundaries. For instance, if a price cap has been reached in a financial exchange and the exchange procedure has been halted, the missing price can be substituted with the corresponding boundary value.
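A minimal sketch, assuming the saturation boundary (here a hypothetical price cap) is known from the data-collection process; the values are invented.

```python
import numpy as np
import pandas as pd

PRICE_CAP = 100.0  # hypothetical known boundary at which recording stops

prices = pd.Series([98.5, 99.2, np.nan, np.nan, 97.8])

# Replace the missing prices with the known boundary value
print(prices.fillna(PRICE_CAP))
```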

4. Missing Value Prediction

Using a machine learning model to determine the final imputation value for feature x based on the other features is another popular method for single imputation. The model is trained using the values in the remaining columns, and the rows of feature x without missing values are used as the training set.

Depending on the type of feature, we can employ any regression or classification model in this situation. After training, the model is used to forecast the most likely value for each missing entry of the feature.

When there is missing data in more than one feature, a basic imputation approach, such as the mean value, is used to temporarily fill all missing values. Then, one column's values are restored to missing, a model is trained on the other columns, and the trained model is used to fill in that column's missing values. In this manner, a model is trained for every feature that has missing values, until all of the missing values have been imputed.
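The round-robin, model-based scheme described above is available in scikit-learn as IterativeImputer (still flagged experimental, hence the enabling import). A hedged sketch on an invented matrix:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (required to expose IterativeImputer)
from sklearn.impute import IterativeImputer

X = np.array([
    [1.0, 2.0,    np.nan],
    [3.0, np.nan, 6.0],
    [5.0, 6.0,    9.0],
    [7.0, 8.0,    12.0],
])

# Missing values start at a simple estimate (the mean); then each feature with missing
# entries is predicted from the other features, cycling until the estimates stabilize.
imputer = IterativeImputer(max_iter=10, random_state=0)
print(imputer.fit_transform(X))
```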


5. Most Frequent Value

Another popular technique, effective for both nominal and numerical features, replaces the missing values with the most frequent value in the column.
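A minimal sketch on an invented nominal column; scikit-learn's SimpleImputer(strategy="most_frequent") does the same over whole arrays.

```python
import numpy as np
import pandas as pd

colour = pd.Series(["red", "blue", np.nan, "red", np.nan, "red"])

# Replace missing entries with the most frequent value (the mode) of the column
print(colour.fillna(colour.mode()[0]))
```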

6. Average or Linear Interpolation

Average or linear interpolation, which computes a value between the previous and next available values and substitutes it for the missing value, is similar to previous/next value imputation but is applicable only to numerical data. As with other operations on ordered data, it is crucial to sort the data correctly in advance, for example by timestamp in the case of time-series data.
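A hedged pandas sketch; the series values are invented so the interpolated results are easy to check by hand.

```python
import numpy as np
import pandas as pd

ts = pd.Series(
    [10.0, np.nan, np.nan, 16.0],
    index=pd.date_range("2024-01-01", periods=4, freq="D"),
)

# Linear interpolation spaces the filled values evenly between the
# previous and next observed points: 10, 12, 14, 16
print(ts.interpolate(method="linear"))
```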

7. (Rounded) Mean or Moving Average or Median Value

Median, Mean, or rounded mean are further popular imputation techniques for numerical
features. The technique, in this instance, replaces the null values with mean, rounded mean, or
median values determined for that feature across the whole dataset. It is advised to utilize the
median rather than the mean when your dataset has a significant number of outliers.
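A short sketch contrasting mean and median imputation on an invented column containing one outlier, using scikit-learn's SimpleImputer.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0], [2.0], [np.nan], [4.0], [100.0]])  # 100 is an outlier

mean_imp = SimpleImputer(strategy="mean").fit_transform(X)
median_imp = SimpleImputer(strategy="median").fit_transform(X)

print(mean_imp.ravel())    # the gap becomes 26.75, pulled up by the outlier
print(median_imp.ravel())  # the gap becomes 3.0, robust to the outlier
```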

8. Fixed Value

Fixed value imputation is a universal technique that replaces the null data with a fixed value and
is applicable to all data types. You can impute the null values in a survey using "not answered" as
an example of using fixed imputation on nominal features.
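A minimal sketch of fixed-value imputation on an invented survey column.

```python
import numpy as np
import pandas as pd

answers = pd.Series(["yes", np.nan, "no", np.nan])

# Replace missing survey answers with a fixed label
print(answers.fillna("not answered"))
```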


No Free Lunch theorem in the context of machine learning


● The NFLT was proposed in 1997 by Wolpert and Macready. It states that no algorithm is universally better than the others across all types of optimization problems.
● The No Free Lunch Theorem, often abbreviated as NFL or NFLT, is a theoretical finding which suggests that all optimization algorithms perform equally well when their performance is averaged over all possible objective functions.
● According to the "No Free Lunch" theorem, there is no single model that works best for every situation. Because the assumptions of a great model for one problem may not hold true for another, it is typical in machine learning to try many models and pick the one that performs best for a specific problem. Likewise, all optimization methods perform equally well when averaged over all optimization tasks without re-sampling.
● The theorem states that all optimization algorithms perform equally well when their performance is averaged across all possible problems.
● It implies that there is no single best optimization algorithm. Because of the close relationship between optimization, search, and machine learning, it also implies that there is no single best machine learning algorithm for predictive modeling problems such as classification and regression. A short model-comparison sketch follows this list.
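A hedged sketch of the practical consequence: benchmark several candidate models on the problem at hand rather than assuming one is best. The synthetic dataset and the model choices are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# No single algorithm wins on every problem, so compare several on *this* problem
models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "k-nearest neighbors": KNeighborsClassifier(),
}
for name, model in models.items():
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: mean CV accuracy = {score:.3f}")
```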

Data Discretization and Concept Hierarchy Generation

Data Discretization
● Dividing the range of a continuous attribute into intervals.
● Interval labels can then be used to replace actual data values.
● Reduce the number of values for a given continuous attribute.
● Some classification algorithms only accept categorical attributes.
● This leads to a concise, easy-to-use, knowledge-level representation of mining results.
● Discretization techniques can be categorized based on whether or not they use class information, as follows:
○ Supervised Discretization - This discretization process uses class information.
○ Unsupervised Discretization - This discretization process does not use class information.
● Discretization techniques can also be categorized based on the direction in which they proceed, as follows:
Top-down Discretization -
● The process starts by finding one or a few points, called split points or cut points, with which to split the entire attribute range, and then repeats this recursively on the resulting intervals.


Bottom-up Discretization -
● Starts by considering all of the continuous values as potential split points.
● Removes some of them by merging neighboring values to form intervals, and then applies this process recursively to the resulting intervals.

Some Famous techniques of data discretization

Histogram analysis

A histogram is a plot used to represent the underlying frequency distribution of a continuous attribute. Histograms assist in inspecting the data distribution, for example for outliers, skewness, or approximate normality.
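A minimal sketch of equal-width histogram counts with NumPy; the age values are invented.

```python
import numpy as np

ages = np.array([22, 25, 27, 31, 35, 38, 41, 44, 52, 58, 63, 70])

# Frequency counts over 5 equal-width intervals reveal the shape of the distribution
counts, bin_edges = np.histogram(ages, bins=5)
print(counts)     # how many values fall in each interval
print(bin_edges)  # the interval boundaries
```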

Binning

Binning refers to a data-smoothing technique that groups a large number of continuous values into a smaller number of bins (intervals). This technique can also be used for data discretization and the development of concept hierarchies.
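A hedged sketch of equal-width and equal-frequency binning with pandas; the ages and bin labels are invented.

```python
import pandas as pd

ages = pd.Series([22, 25, 27, 31, 35, 38, 41, 44, 52, 58, 63, 70])

# Equal-width binning: each interval spans the same range of values
print(pd.cut(ages, bins=3, labels=["young", "middle", "senior"]))

# Equal-frequency binning: each interval holds roughly the same number of values
print(pd.qcut(ages, q=3, labels=["low", "mid", "high"]))
```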

Cluster Analysis

Cluster analysis is a form of data discretization: a clustering algorithm is applied to the values of a numeric attribute x, partitioning them into clusters or groups, and each cluster then becomes one interval of the discretization.
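A hedged sketch using scikit-learn's KBinsDiscretizer with a k-means strategy on an invented single-attribute column; each cluster index then serves as the interval label.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

x = np.array([[1.0], [1.2], [1.1], [5.0], [5.3], [9.8], [10.1], [10.4]])

# k-means clustering on the single attribute; cluster membership becomes the interval label
disc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="kmeans")
print(disc.fit_transform(x).ravel())
```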

Data discretization using decision tree analysis

Decision tree analysis discretizes a numeric attribute using a top-down splitting technique. It is a supervised procedure: split points are chosen so as to minimize the entropy (i.e., to best separate the classes), and the procedure is applied recursively, dividing the attribute's range into disjoint discretized intervals from top to bottom using the same splitting criterion. A sketch using a shallow decision tree follows.
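A hedged sketch of the idea: fit a shallow decision tree on a single numeric attribute and read the learned split points as interval boundaries. The ages, class labels, and the cap on leaf nodes are invented; criterion="entropy" selects entropy-based splits.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[18], [22], [25], [31], [38], [45], [52], [61]], dtype=float)  # ages
y = np.array([0, 0, 0, 1, 1, 1, 2, 2])  # class labels that guide the splits

tree = DecisionTreeClassifier(criterion="entropy", max_leaf_nodes=3, random_state=0).fit(X, y)

# Internal nodes store the cut points that partition the attribute's range
cut_points = sorted(t for t in tree.tree_.threshold if t != -2)  # -2 marks leaf nodes
print(cut_points)
```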

Data discretization using correlation analysis

Correlation-based discretization (for example, ChiMerge) proceeds bottom-up: the best (most similar) neighboring intervals are found and then merged recursively into larger intervals to form the final discretization.


Concept Hierarchies
● Discretization can be performed recursively on an attribute to provide a hierarchical partitioning of the attribute values, known as a Concept Hierarchy.
● Concept hierarchies can be used to reduce the data by collecting and replacing low-level
concepts with higher-level concepts.
● In the multidimensional model, data are organized into multiple dimensions, and each
dimension contains multiple levels of abstraction defined by concept hierarchies.
● This organization provides users with the flexibility to view data from different
perspectives.
● Data mining on a reduced data set means fewer input and output operations and is more
efficient than mining on a larger data set.
● Because of these benefits, discretization techniques and concept hierarchies are typically
applied before data mining, rather than during mining.
In data mining, a concept hierarchy refers to the organization of data into a tree-like structure, where each level of the hierarchy represents a concept that is more general than the level below it. This hierarchical organization allows for more efficient and effective data analysis, as well as the ability to drill down to more specific levels of detail when needed. Concept hierarchies are used to organize and classify data in a way that makes it more understandable and easier to analyze. The main idea is that the same data can be viewed at different levels of granularity or detail, and that organizing the data hierarchically makes it easier to understand and analyze.


Example:
Consider a concept hierarchy for the dimension location, from which a user can easily retrieve data. To make it easy to evaluate, the data is represented in a tree-like structure. The top of the tree is the dimension location itself, which splits into sub-nodes. The root node, location, splits into two country nodes, i.e., USA and India. These countries are further split into sub-nodes that represent the provinces or states, i.e., New York, Illinois, Gujarat, and UP. Thus the concept hierarchy organizes the data into a tree-like structure in which each level describes the data in more general terms than the level below it.

The hierarchical structure represents the abstraction levels of the dimension location, which consists of various footprints of the dimension such as street, city, province or state, and country. A small roll-up sketch along this hierarchy follows.
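A hedged sketch of a roll-up along this hierarchy with pandas: each row carries the full location hierarchy for one transaction, and grouping by a higher-level concept (country) replaces the lower-level detail (street, city). The place names and amounts are invented.

```python
import pandas as pd

# Each row carries the full location hierarchy for one transaction (illustrative data)
sales = pd.DataFrame({
    "street":  ["5th Ave", "Lake St", "CG Road", "Hazratganj"],
    "city":    ["New York City", "Chicago", "Ahmedabad", "Lucknow"],
    "state":   ["New York", "Illinois", "Gujarat", "UP"],
    "country": ["USA", "USA", "India", "India"],
    "amount":  [120.0, 80.0, 60.0, 40.0],
})

# Roll up: replace low-level concepts (city) with the higher-level concept (country)
print(sales.groupby("country")["amount"].sum())
```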

Types of Concept Hierarchies

Schema Hierarchy: Schema Hierarchy is a type of concept hierarchy that is used to organize
the schema of a database in a logical and meaningful way, grouping similar objects together. A
schema hierarchy can be used to organize different types of data, such as tables, attributes, and
relationships, in a logical and meaningful way. This can be useful in data warehousing, where
data from multiple sources needs to be integrated into a single database.

Set-Grouping Hierarchy: Set-Grouping Hierarchy is a type of concept hierarchy that is based


on set theory, where each set in the hierarchy is defined in terms of its membership in other sets.
Set-grouping hierarchy can be used for data cleaning, data pre-processing and data integration.
This type of hierarchy can be used to identify and remove outliers, noise, or inconsistencies from
the data and to integrate data from multiple sources.

Operation-Derived Hierarchy: An Operation-Derived Hierarchy is a type of concept hierarchy


that is used to organize data by applying a series of operations or transformations to the data. The
operations are applied in a top-down fashion, with each level of the hierarchy representing a
more general or abstract view of the data than the level below it. This type of hierarchy is
typically used in data mining tasks such as clustering and dimensionality reduction. The
operations applied can be mathematical or statistical operations such as aggregation,
normalization, and so on.

Rule-based Hierarchy: Rule-based Hierarchy is a type of concept hierarchy that is used to


organize data by applying a set of rules or conditions to the data. This type of hierarchy is useful
in data mining tasks such as classification, decision-making, and data exploration. It allows the
assignment of a class label or decision to each data point based on its characteristics and
identifies patterns and relationships between different attributes of the data.

Need of Concept Hierarchy in Data Mining

There are several reasons why a concept hierarchy is useful in data mining:
Improved Data Analysis: A concept hierarchy can help to organize and simplify data, making it
more manageable and easier to analyze. By grouping similar concepts together, a concept
hierarchy can help to identify patterns and trends in the data that would otherwise be difficult to
spot. This can be particularly useful in uncovering hidden or unexpected insights that can inform
business decisions or inform the development of new products or services.

Improved Data Visualization and Exploration: A concept hierarchy can help to improve data
visualization and data exploration by organizing data into a tree-like structure, allowing users to
easily navigate and understand large and complex data sets. This can be particularly useful in
creating interactive dashboards and reports that allow users to easily drill down to more specific
levels of detail when needed.

Improved Algorithm Performance: The use of a concept hierarchy can also help to improve
the performance of data mining algorithms. By organizing data into a hierarchical structure,
algorithms can more easily process and analyze the data, resulting in faster and more accurate
results.

Data Cleaning and Preprocessing: A concept hierarchy can also be used in data cleaning and
preprocessing, to identify and remove outliers and noise from the data.

Domain Knowledge: A concept hierarchy can also be used to represent the domain knowledge
in a more structured way, which can help in a better understanding of the data and the problem
domain.


Applications of Concept Hierarchy

Data Warehousing: Concept hierarchy can be used in data warehousing to organize data from
multiple sources into a single, consistent and meaningful structure. This can help to improve the
efficiency and effectiveness of data analysis and reporting.
Business Intelligence: Concept hierarchy can be used in business intelligence to organize and
analyze data in a way that can inform business decisions. For example, it can be used to analyze
customer data to identify patterns and trends that can inform the development of new products or
services.

Online Retail: Concept hierarchy can be used in online retail to organize products into categories, subcategories, and sub-subcategories, helping customers find the products they are looking for more quickly and easily.

Healthcare: Concept hierarchy can be used in healthcare to organize patient data, for example to group patients by diagnosis or treatment plan, which can help to identify patterns and trends that inform the development of new treatments or improve the effectiveness of existing ones.

Natural Language Processing: Concept hierarchy can be used in natural language processing to organize and analyze text data, for example to identify topics and themes in a text, which helps to extract useful information from unstructured data.

Fraud Detection: Concept hierarchy can be used in fraud detection to organize and analyze
financial data, for example, to identify patterns and trends that can indicate fraudulent activity.

Prepared by : Sandra Raju , Asst. Professor, Department of CSE, MLMCE
