DM - MOD - 1 Part III
Module - 1
1. Feature Creation: Feature creation is the process of finding or constructing the most useful variables to be used in a predictive model. The process is subjective, and it requires human creativity and intervention. New features are created by combining existing features using operations such as addition, subtraction, and ratios, and these new features offer great flexibility.
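As a quick illustration (not from the notes), the pandas sketch below builds new features from a small, made-up housing table by combining existing columns with a ratio and an addition; the column names are purely hypothetical.

```python
import pandas as pd

# Hypothetical housing data used only to illustrate feature creation.
df = pd.DataFrame({
    "total_price": [300000, 450000, 250000],
    "area_sqft":   [1500, 2200, 1100],
    "rooms":       [3, 4, 2],
    "bathrooms":   [2, 3, 1],
})

# New features built by combining existing ones with a ratio and an addition.
df["price_per_sqft"] = df["total_price"] / df["area_sqft"]   # ratio
df["rooms_total"]    = df["rooms"] + df["bathrooms"]          # addition
print(df)
```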
4. Feature Selection: While developing a machine learning model, only a few of the variables in the dataset are useful for building the model; the remaining features are either redundant or irrelevant. If we feed the dataset with all of these redundant and irrelevant features into the model, it may negatively impact and reduce the overall performance and accuracy of the model. Hence it is very important to identify and select the most appropriate features from the data and remove the irrelevant or less important ones, which is done with the help of feature selection in machine learning. "Feature selection is a way of selecting the subset of the most relevant features from the original feature set by removing the redundant, irrelevant, or noisy features."
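A minimal sketch of filter-style feature selection, assuming scikit-learn is available: SelectKBest scores each feature against the class label (here with the ANOVA F-test) and keeps only the top k, discarding the less relevant columns.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Keep the 2 features most related to the class label (filter method).
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print("original shape:", X.shape)          # (150, 4)
print("reduced shape:", X_selected.shape)  # (150, 2)
print("kept feature indices:", selector.get_support(indices=True))
```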
● Curse of Dimensionality refers to a set of problems that arise when working with
high-dimensional data. The dimension of a dataset corresponds to the number of
attributes/features that exist in a dataset.
● A dataset with a large number of attributes, generally of the order of a hundred or more,
is referred to as high dimensional data. Some of the difficulties that come with high
dimensional data manifest during analyzing or visualizing the data to identify patterns,
and some manifest while training machine learning models. The difficulties related to
training machine learning models due to high dimensional data are referred to as the
‘Curse of Dimensionality’.
● The Curse of Dimensionality describes the explosive nature of increasing data dimensions and the resulting exponential increase in the computational effort required for processing and/or analysis, as illustrated in the sketch below. The term was first introduced by Richard E. Bellman, in the area of dynamic programming, to explain the increase in the volume of Euclidean space associated with adding extra dimensions.
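A small numerical sketch of this effect (my own illustration, not from the notes): as the number of dimensions grows, distances between random points become nearly indistinguishable, which is one of the practical symptoms of the curse of dimensionality.

```python
import numpy as np

rng = np.random.default_rng(0)

# As dimensionality grows, pairwise distances between random points concentrate:
# the gap between the nearest and farthest point shrinks relative to the mean distance.
for d in (2, 10, 100, 1000):
    points = rng.random((500, d))          # 500 random points in the unit hypercube
    ref = points[0]
    dists = np.linalg.norm(points[1:] - ref, axis=1)
    print(f"d={d:4d}  min={dists.min():.2f}  max={dists.max():.2f}  "
          f"spread/mean={(dists.max() - dists.min()) / dists.mean():.2f}")
```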
1. Next or Previous Value
For time-series data or other ordered data, there are specific imputation techniques. These techniques take the dataset's sorted structure into consideration, wherein nearby values are likely more similar than far-off ones. A common method for imputing incomplete data in a time series is to substitute the next or previous value in the series for the missing one. This strategy is effective for both nominal and numerical values.
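A minimal sketch of previous/next value imputation with pandas, using a made-up daily series: ffill carries the previous observation forward and bfill carries the next one backward.

```python
import pandas as pd

# A small hypothetical time series with gaps, indexed by timestamp.
ts = pd.Series(
    [10.0, None, None, 14.0, None, 16.0],
    index=pd.date_range("2024-01-01", periods=6, freq="D"),
)

filled_prev = ts.ffill()   # carry the previous observation forward
filled_next = ts.bfill()   # carry the next observation backward
print(pd.DataFrame({"raw": ts, "previous": filled_prev, "next": filled_next}))
```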
2. K Nearest Neighbors
The objective is to find the k nearest examples in the data in which the value of the relevant feature is not missing, and then substitute the value of the feature that occurs most frequently in that group.
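A short sketch using scikit-learn's KNNImputer on a toy numeric matrix. Note that for numerical features this implementation averages the neighbours' values rather than taking the most frequent one; the data are invented for illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix with missing entries (np.nan).
X = np.array([
    [1.0,    2.0, np.nan],
    [3.0,    4.0,    3.0],
    [np.nan, 6.0,    5.0],
    [8.0,    8.0,    7.0],
])

# Each missing value is replaced using the k=2 most similar rows.
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))
```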
3. Maximum or Minimum Value
If you know that the data must fit within a specific range [minimum, maximum], and if you know from the data collection process that the measurement instrument stops recording and the value saturates beyond one of these boundaries, you can use the minimum or maximum of the range as the replacement for missing values. For instance, if a price cap has been reached in a financial exchange and the exchange procedure has been halted, the missing price can be substituted with the minimum value of the exchange boundary.
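A minimal sketch of boundary-value imputation, assuming (hypothetically) that a sensor saturates at an upper bound of 100 and stops recording beyond it.

```python
import pandas as pd

# Hypothetical sensor readings known to lie in [0, 100]; the instrument stops
# recording once the reading saturates at the upper boundary.
readings = pd.Series([42.0, 97.5, None, None, 63.1])

SENSOR_MAX = 100.0                  # known upper bound of the measurement range
imputed = readings.fillna(SENSOR_MAX)
print(imputed)
```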
4. Missing Value Prediction
Another popular method for single imputation is to use a machine learning model to determine the final imputation value for feature x based on the other features. The rows in which feature x has no missing value are used as the training set, and the model is trained using the values in the remaining columns.
Depending on the type of feature, we can employ any regression or classification model in this situation. Once trained, the model is used to predict the most likely value for each missing entry of that feature across all samples.
When there is missing data in more than one feature field, a basic imputation approach, such as the mean value, is first used to temporarily fill in all missing values. Then, the values of one column are set back to missing, a model is trained on the remaining columns, and the trained model is used to fill in that column's missing values. In this manner, a model is trained for every feature that has missing values, until all of the missing values have been imputed.
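A sketch of this round-robin, model-based imputation using scikit-learn's IterativeImputer, which implements essentially the procedure described above; the matrix is invented for illustration.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([
    [7.0,   2.0,    3.0],
    [4.0,   np.nan, 6.0],
    [10.0,  5.0,    9.0],
    [8.0,   8.0,    np.nan],
])

# Missing entries are first filled with a simple statistic (the mean); then each
# feature with missing values is modelled in turn from the other columns and
# re-imputed, round-robin, until the estimates stabilise.
imputer = IterativeImputer(max_iter=10, random_state=0)
print(imputer.fit_transform(X))
```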
5. Most Frequent Value
Another popular technique, effective for both nominal and numerical features, replaces the missing values with the most frequent value in the column.
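A minimal sketch of most-frequent-value imputation with scikit-learn's SimpleImputer on a toy nominal column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# A toy nominal column with one missing entry.
X = np.array([["red"], ["blue"], [np.nan], ["red"]], dtype=object)

# Replace missing entries with the most frequent value in the column ("red").
imputer = SimpleImputer(strategy="most_frequent")
print(imputer.fit_transform(X))
```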
6. Average or Linear Interpolation
Average or linear interpolation, which is calculated between the previous and next available values and substituted for the missing value, is similar to previous/next value imputation but is only applicable to numerical data. Of course, as with other operations on ordered data, it is crucial to sort the data correctly in advance, for example according to a timestamp in the case of time-series data.
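A short pandas sketch of linear interpolation on a made-up, timestamp-indexed series; the data is sorted by the index before interpolating.

```python
import pandas as pd

ts = pd.Series(
    [10.0, None, None, 16.0],
    index=pd.date_range("2024-01-01", periods=4, freq="D"),
)

# Sort by timestamp first, then interpolate linearly between the
# previous and next available values: 10 -> 12 -> 14 -> 16.
print(ts.sort_index().interpolate(method="linear"))
```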
7. Mean, Rounded Mean, or Median Value
Mean, rounded mean, or median values are further popular imputation techniques for numerical features. In this case, the technique replaces the null values with the mean, rounded mean, or median value computed for that feature across the whole dataset. It is advisable to use the median rather than the mean when the dataset contains a significant number of outliers.
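A minimal pandas sketch contrasting mean and median imputation on a made-up column that contains an outlier.

```python
import pandas as pd

ages = pd.Series([23, 25, 27, 29, None, 95])   # 95 is an outlier

print(ages.fillna(ages.mean()))     # mean is pulled upward by the outlier
print(ages.fillna(ages.median()))   # median is more robust to the outlier
```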
8. Fixed Value
Fixed value imputation is a universal technique that replaces the null data with a fixed value and
is applicable to all data types. You can impute the null values in a survey using "not answered" as
an example of using fixed imputation on nominal features.
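A one-line pandas sketch of fixed-value imputation on a hypothetical survey column.

```python
import pandas as pd

survey = pd.Series(["yes", None, "no", None])

# Replace nulls with a fixed label for a nominal survey question.
print(survey.fillna("not answered"))
```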
Data Discretization
● Dividing the range of a continuous attribute into intervals.
● Interval labels can then be used to replace actual data values.
● Reduce the number of values for a given continuous attribute.
● Some classification algorithms only accept categorical attributes.
● This leads to a concise, easy-to-use, knowledge-level representation of mining results.
● Discretization techniques can be categorized based on whether or not they use class information, as follows:
○ Supervised Discretization - This discretization process uses class information.
○ Unsupervised Discretization - This discretization process does not use class
information.
● Discretization techniques can also be categorized based on the direction in which they proceed, as follows:
Top-down Discretization -
● The process starts by first finding one or a few points, called split points or cut points, to split the entire attribute range, and then repeats this recursively on the resulting intervals.
Bottom-up Discretization -
● Start by considering all of the continuous values as potential split-points.
● Remove some of them by merging neighboring values to form intervals, and then recursively apply this process to the resulting intervals.
Histogram analysis
A histogram is a plot used to represent the underlying frequency distribution of a continuous data set. It helps in inspecting the data distribution, for example to reveal outliers, skewness, or an approximately normal shape.
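A minimal sketch of histogram analysis with NumPy on synthetic data: np.histogram returns the frequency counts per interval, which is the same information a histogram plot draws as bars.

```python
import numpy as np

# Synthetic, roughly normal data used only for illustration.
values = np.random.default_rng(1).normal(loc=50, scale=10, size=1000)

# Frequency counts per interval (the bars of a histogram).
counts, bin_edges = np.histogram(values, bins=10)
for lo, hi, c in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"[{lo:6.1f}, {hi:6.1f})  {c}")
```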
Binning
Binning is a data smoothing technique that groups a large number of continuous values into a smaller number of bins. This technique can also be used for data discretization and for the development of concept hierarchies.
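A short pandas sketch of binning a made-up age column: pd.cut performs equal-width binning (with interval labels replacing the raw values), while pd.qcut performs equal-frequency binning.

```python
import pandas as pd

ages = pd.Series([5, 17, 23, 31, 44, 58, 67, 72])

# Equal-width binning: split the range into 3 intervals and replace raw values with labels.
equal_width = pd.cut(ages, bins=3, labels=["young", "middle", "senior"])

# Equal-frequency binning: each bin holds roughly the same number of values.
equal_freq = pd.qcut(ages, q=4)

print(pd.DataFrame({"age": ages, "width_bin": equal_width, "freq_bin": equal_freq}))
```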
Cluster Analysis
A clustering algorithm can be applied to discretize a numerical attribute by partitioning its values into clusters or groups; each cluster then becomes one interval of the discretization.
Decision Tree Analysis
Decision tree analysis is a top-down splitting technique for discretization and is a supervised procedure. To discretize a numeric attribute, you first select the split point that yields the least entropy and then repeat this step recursively. The recursive process divides the attribute into discretized disjoint intervals, from top to bottom, using the same splitting criterion.
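A minimal sketch of entropy-based (decision tree) discretization with scikit-learn, using invented data: a shallow tree trained on the attribute and the class labels yields split points that can serve as interval boundaries.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical numeric attribute and class labels (supervised discretization).
x = np.array([18, 22, 25, 31, 38, 45, 52, 60]).reshape(-1, 1)
y = np.array([0, 0, 0, 1, 1, 1, 0, 0])

# A shallow tree chooses split points that reduce class impurity (entropy);
# the thresholds it learns become the interval boundaries.
tree = DecisionTreeClassifier(criterion="entropy", max_leaf_nodes=3, random_state=0)
tree.fit(x, y)

split_points = sorted(t for t in tree.tree_.threshold if t != -2)  # -2 marks leaf nodes
print("cut points:", split_points)
```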
By discretizing data with a linear regression technique, you can obtain the best neighboring intervals; the large intervals are then combined to form the final overlapping intervals.
Concept Hierarchies
● Discretization can be performed recursively on an attribute to provide a hierarchical partitioning of the attribute values, known as a Concept Hierarchy.
● Concept hierarchies can be used to reduce the data by collecting and replacing low-level
concepts with higher-level concepts.
● In the multidimensional model, data are organized into multiple dimensions, and each
dimension contains multiple levels of abstraction defined by concept hierarchies.
● This organization provides users with the flexibility to view data from different
perspectives.
● Data mining on a reduced data set means fewer input and output operations and is more
efficient than mining on a larger data set.
● Because of these benefits, discretization techniques and concept hierarchies are typically
applied before data mining, rather than during mining.
In data mining, the concept of a concept hierarchy refers to the organization of data into a
tree-like structure, where each level of the hierarchy represents a concept that is more general
than the level below it. This hierarchical organization of data allows for more efficient and
effective data analysis, as well as the ability to drill down to more specific levels of detail when
needed. A concept hierarchy is used to organize and classify data in a way that makes it more understandable and easier to analyze. The main idea is that the same data can have different levels of granularity or detail, and that organizing the data in a hierarchical fashion makes it easier to understand and analyze.
Explanation:
As shown in the above diagram, it consists of a concept hierarchy for the dimension location, from which the user can easily retrieve the data. To make evaluation easy, the data is represented in a tree-like structure. The top of the tree holds the main dimension, location, which then splits into various sub-nodes. The root node, location, first splits into two country nodes, i.e., USA and India. These countries are then further split into more sub-nodes that represent the provinces/states, i.e., New York, Illinois, Gujarat, and UP. Thus the concept hierarchy in this example organizes the data into a tree-like structure in which each level describes the data in more general terms than the level below it.
The hierarchical structure represents the abstraction levels of the dimension location, which consists of various granularities of the dimension such as street, city, province or state, and country.
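A minimal sketch of the location hierarchy described above as a nested Python dict; the city names are hypothetical additions used only to show how a low-level value rolls up to more general concepts.

```python
# A nested dict: country -> state/province -> cities (city names are made up).
location_hierarchy = {
    "USA": {
        "New York": ["New York City", "Buffalo"],
        "Illinois": ["Chicago"],
    },
    "India": {
        "Gujarat": ["Ahmedabad"],
        "UP": ["Lucknow"],
    },
}

def roll_up(city, hierarchy):
    """Climb from a low-level concept (city) to higher-level ones (state, country)."""
    for country, states in hierarchy.items():
        for state, cities in states.items():
            if city in cities:
                return city, state, country
    return None

print(roll_up("Chicago", location_hierarchy))   # ('Chicago', 'Illinois', 'USA')
```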
Schema Hierarchy: A schema hierarchy is a type of concept hierarchy that is used to organize the schema of a database in a logical and meaningful way, grouping similar objects together. It can be used to organize different types of data, such as tables, attributes, and relationships. This can be useful in data warehousing, where data from multiple sources needs to be integrated into a single database.
In another type of concept hierarchy, operations are applied in a top-down fashion, with each level of the hierarchy representing a more general or abstract view of the data than the level below it. This type of hierarchy is typically used in data mining tasks such as clustering and dimensionality reduction. The operations applied can be mathematical or statistical operations such as aggregation and normalization.
There are several reasons why a concept hierarchy is useful in data mining:
Improved Data Analysis: A concept hierarchy can help to organize and simplify data, making it
more manageable and easier to analyze. By grouping similar concepts together, a concept
hierarchy can help to identify patterns and trends in the data that would otherwise be difficult to
spot. This can be particularly useful in uncovering hidden or unexpected insights that can inform business decisions or the development of new products or services.
Improved Data Visualization and Exploration: A concept hierarchy can help to improve data
visualization and data exploration by organizing data into a tree-like structure, allowing users to
easily navigate and understand large and complex data sets. This can be particularly useful in
creating interactive dashboards and reports that allow users to easily drill down to more specific
levels of detail when needed.
Improved Algorithm Performance: The use of a concept hierarchy can also help to improve
the performance of data mining algorithms. By organizing data into a hierarchical structure,
algorithms can more easily process and analyze the data, resulting in faster and more accurate
results.
Data Cleaning and Preprocessing: A concept hierarchy can also be used in data cleaning and
preprocessing, to identify and remove outliers and noise from the data.
Domain Knowledge: A concept hierarchy can also be used to represent the domain knowledge
in a more structured way, which can help in a better understanding of the data and the problem
domain.
Data Warehousing: Concept hierarchy can be used in data warehousing to organize data from
multiple sources into a single, consistent and meaningful structure. This can help to improve the
efficiency and effectiveness of data analysis and reporting.
Business Intelligence: Concept hierarchy can be used in business intelligence to organize and
analyze data in a way that can inform business decisions. For example, it can be used to analyze
customer data to identify patterns and trends that can inform the development of new products or
services.
Online Retail: Concept hierarchy can be used in online retail to organize products into categories, subcategories, and sub-subcategories; this helps customers find the products they are looking for more quickly and easily.
Healthcare: Concept hierarchy can be used in healthcare to organize patient data, for example to group patients by diagnosis or treatment plan; this helps to identify patterns and trends that can inform the development of new treatments or improve the effectiveness of existing ones.
Natural Language Processing: Concept hierarchy can be used in natural language processing to organize and analyze text data, for example to identify topics and themes in a text; this helps to extract useful information from unstructured data.
Fraud Detection: Concept hierarchy can be used in fraud detection to organize and analyze
financial data, for example, to identify patterns and trends that can indicate fraudulent activity.