Please read this disclaimer before
proceeding:
This document is confidential and intended solely for the educational purpose of
RMK Group of Educational Institutions. If you have received this document
through email in error, please notify the system manager. This document contains
proprietary information and is intended only for the respective group / learning
community. If you are not the addressee, you should not disseminate, distribute or
copy it through e-mail. Please notify the sender immediately by e-mail if you have
received this document by mistake and delete it from your system. If you are not
the intended recipient, you are notified that disclosing, copying, distributing or
taking any action in reliance on the contents of this information is strictly
prohibited.
22AI502/22AM502
Data Exploration, Feature
Engineering and Visualization
Department: AI & DS
Batch/Year: 2022-2024 / III Year
Created by
Dr. M. Shanthi, Associate Professor / ADS
1. Table of Contents
1. Contents
2. Course Objectives
3. Pre-Requisites
4. Syllabus
5. Course Outcomes
6. CO – PO/PSO Mapping Matrix
7. Lecture Plan
8. Activity Based Learning
9. Lecture Notes
10. Assignments
12. Part B Qs
2. Course Objectives
3. Pre-Requisites
• Semester-IV: Data Analytics
• Semester-II: Introduction to Data Science
• Semester-II: Python Programming
• Semester-I: C Programming
4. SYLLABUS
22AI502   Data Exploration and Visualization                L T P C
                                                            2 0 2 3
The Seven Stages of Visualizing Data, Processing - load and displaying data - functions,
sketching and scripting, Mapping - location, data, two-sided data ranges, smooth
interpolation of values over time - Visualization of numeric data and non-numeric data.
UNIT IV TIME SERIES ANALYSIS 6+6
Overview of time series analysis-showing data as an area, drawing tabs, handling
mouse input, Connections And Correlations – Preprocessing- introducing regular
expression, sophisticated sorting, Scatterplot Maps- deployment issues
Treemaps - treemap library, directory structure, maintaining context, file item, folder
item, Networks and Graphs-approaching network problems-advanced graph example,
Acquiring data, Parsing data
5. COURSE OUTCOMES
6.CO – PO /PSO Mapping Matrix
CO    PO1  PO2  PO3  PO4  PO5  PO6  PO7  PO8  PO9  PO10  PO11  PO12  PSO1  PSO2  PSO3
1 3 2 1 1 1 1 1 1 2 3 3
2 3 2 1 3 1 1 1 1 2 3 3
3 3 2 1 3 3 3 3 3 2 3 3
4 3 3 2 3 3 3 3 3 2 3 3
5 3 2 2 3 3 3 3 3 2 3 3
Lecture Plan
Unit - II
LECTURE PLAN – Unit II - FEATURE ENGINEERING
S.No | Topic                              | No. of Periods | Proposed Date | CO | Taxonomy Level | Mode of Delivery
3    | Visual Data                        | 1              | 1.8.2024      | 2  | K2             | PPT / Chalk & Talk
4    | Feature-based Time-Series Analysis | 1              | 2.8.2024      | 2  | K2             | PPT / Chalk & Talk
5    | Feature-based Time-Series Analysis | 1              | 2.8.2024      | 2  | K2             | PPT / Chalk & Talk
8. ACTIVITY BASED LEARNING
Guidelines to do an activity:
3) Conduct peer review. (Each team will be reviewed by all other teams and mentors.)
Useful link:
https://www.analyticsvidhya.com/blog/2020/10/feature-selection-techniques-in-machine-learning/
UNIT-II
FEATURE ENGINEERING
9. LECTURE NOTES
Do you know what takes the maximum amount of time and effort in a
Machine Learning workflow?
Material
https://www.youtube.com/watch?v=YLjM5WjJEm0
https://www.youtube.com/watch?v=yQ5wTC4E5us
This pie-chart shows the results of a survey conducted by Forbes. It is
abundantly clear from the numbers that one of the main jobs of a Data
Scientist is to clean and process the raw data. This can take up to 80% of
the time of a data scientist. This is where Feature Engineering comes into
play. After the data is cleaned and processed it is then ready to be fed
into the machine learning models to train and generate outputs.
So far we have established that Feature Engineering is an extremely
important part of a Machine Learning Pipeline, but why is it needed
in the first place?
To understand that, let us understand how we collect the data in the first
place. In most cases, Data Scientists deal with data extracted from massive
open data sources such as the internet, surveys, or reviews. This data is
crude and is known as raw data. It may contain missing values,
unstructured data, incorrect inputs, and outliers. If we directly use this raw,
unprocessed data to train our models, we will end up with a model that
performs very poorly.
Since 2016, automated feature engineering has also been used in different machine learning
software packages that help in automatically extracting features from raw data. Feature
engineering in ML mainly involves four processes: Feature Creation,
Transformations, Feature Extraction, and Feature Selection.
2. Feature Selection: While developing the machine learning model, only a few
variables in the dataset are useful for building the model, and the rest of the features
are either redundant or irrelevant. If we input the dataset with all these
redundant and irrelevant features, it may negatively impact and reduce the overall
performance and accuracy of the model. Hence it is very important to identify and
select the most appropriate features from the data and remove the irrelevant or
less important features, which is done with the help of feature selection in
machine learning.
"Feature selection is a way of selecting the subset of the most relevant features from the
original features set by removing the redundant, irrelevant, or noisy features."
o It helps in the simplification of the model so that the researchers can easily
interpret it.
o Better features mean better results.
As already discussed, in machine learning the output depends on the data we
provide. So, to obtain better results, we need to use better features.
For example, rows or columns with a huge percentage of missing values can be
removed entirely. But at the same time, to maintain the data size, it is often required
to impute the missing data, which can be done as follows:
o For numerical data imputation, a default value can be imputed in a column, or
missing values can be filled with the mean or median of the column.
o For categorical data imputation, missing values can be replaced with the most
frequently occurring value in the column.
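A minimal sketch of both imputation strategies, assuming a small hypothetical pandas DataFrame and scikit-learn's SimpleImputer (the column names and values are made up):

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical raw data with missing entries
df = pd.DataFrame({
    "age":    [25, np.nan, 31, 47, np.nan],
    "salary": [50000, 62000, np.nan, 81000, 58000],
    "city":   ["Chennai", np.nan, "Mumbai", "Chennai", "Delhi"],
})

# Numerical imputation: fill missing values with the column median (or mean)
num_imputer = SimpleImputer(strategy="median")
df[["age", "salary"]] = num_imputer.fit_transform(df[["age", "salary"]])

# Categorical imputation: fill missing values with the most frequent value
cat_imputer = SimpleImputer(strategy="most_frequent")
df[["city"]] = cat_imputer.fit_transform(df[["city"]])

print(df)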
2. Handling Outliers
Outliers are deviated values or data points that lie so far away from the other data
points that they badly affect the performance of the model. Outliers can be handled
with this feature engineering technique, which first identifies the outliers and then
removes them.
Standard deviation can be used to identify outliers. For example, each value lies at
some distance from the average, but if a value lies farther away than a certain
threshold, it can be considered an outlier. The Z-score can also be used to detect
outliers.
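A small illustration of the Z-score idea on a made-up one-dimensional feature; the cut-off value is a tunable assumption, not a fixed rule:

import numpy as np

# Hypothetical 1-D feature with two values that sit far from the rest
values = np.array([10, 12, 11, 13, 12, 95, 11, 10, 14, -60], dtype=float)

# Z-score: how many standard deviations each value lies from the mean
z = (values - values.mean()) / values.std()

# Flag values beyond a chosen cut-off (2 here; 3 is also common for larger samples)
outliers = values[np.abs(z) > 2.0]
cleaned = values[np.abs(z) <= 2.0]
print(outliers)   # the detected outliers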
3. Log transform
Logarithm transformation, or log transform, is one of the commonly used
mathematical techniques in machine learning. Log transform helps in handling
skewed data, making the distribution closer to normal after transformation. It also
reduces the effect of outliers on the data: because the magnitude differences are
normalized, the model becomes more robust.
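A brief sketch of a log transform on a hypothetical right-skewed column, using numpy's log1p so that zero values are handled safely:

import numpy as np
import pandas as pd

# Hypothetical right-skewed feature (e.g., incomes)
income = pd.Series([20_000, 25_000, 30_000, 40_000, 55_000, 90_000, 750_000])

# log1p = log(1 + x); compresses large magnitudes and tolerates zeros
income_log = np.log1p(income)

print(income.skew(), income_log.skew())  # skewness drops after the transform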
4. Binning
In machine learning, overfitting is one of the main issues that degrade the
performance of the model and which occurs due to a greater number of parameters
and noisy data. However, one of the popular techniques of feature engineering,
"binning", can be used to normalize the noisy data. This process involves segmenting
different features into bins.
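A short sketch of binning with pandas, assuming a made-up age column; both fixed-width and quantile bins are shown:

import pandas as pd

ages = pd.Series([5, 17, 23, 31, 38, 45, 52, 61, 70, 84])

# Fixed-width bins with readable labels
age_bins = pd.cut(ages, bins=[0, 18, 35, 60, 100],
                  labels=["child", "young", "middle", "senior"])

# Quantile bins: each bin holds roughly the same number of observations
age_quartiles = pd.qcut(ages, q=4, labels=["q1", "q2", "q3", "q4"])

print(pd.DataFrame({"age": ages, "bin": age_bins, "quartile": age_quartiles}))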
5. Feature Split
As the name suggests, feature split is the process of splitting features intimately
into two or more parts and performing to make new features. This technique
helps the algorithms to better understand and learn the patterns in the
dataset.
The feature splitting process enables the new features to be clustered and
binned, which results in extracting useful information and improving the
performance of the data models.
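A minimal sketch of feature splitting on hypothetical name and timestamp columns using pandas:

import pandas as pd

df = pd.DataFrame({
    "full_name": ["Asha Rao", "Vikram Singh", "Meena Iyer"],
    "timestamp": pd.to_datetime(["2024-08-01 09:30", "2024-08-02 14:05",
                                 "2024-08-03 18:45"]),
})

# Split a text feature into parts
df[["first_name", "last_name"]] = df["full_name"].str.split(" ", n=1, expand=True)

# Split a datetime feature into components the model can use
df["day_of_week"] = df["timestamp"].dt.dayofweek
df["hour"] = df["timestamp"].dt.hour

print(df)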
Below, you will find an overview of some of the best libraries and frameworks you
can use for automating feature engineering.
Featuretools
Featuretools is one of the most widely used libraries for feature engineering
automation. It supports a wide range of operations, such as selecting features and
constructing new ones from relational databases. In addition, it offers simple
aggregations using max, sum, mode, and other primitives. But one of its most
important functionalities is the ability to build features using deep feature
synthesis (DFS).
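A hedged sketch of deep feature synthesis, assuming the Featuretools 1.x API and two made-up tables (customers and orders):

import featuretools as ft
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "customer_id": [101, 101, 102, 102],
    "amount": [250.0, 90.0, 400.0, 60.0],
})
customers = pd.DataFrame({"customer_id": [101, 102]})

# Build an EntitySet describing the tables and their relationship
es = ft.EntitySet(id="shop")
es = es.add_dataframe(dataframe_name="customers", dataframe=customers,
                      index="customer_id")
es = es.add_dataframe(dataframe_name="orders", dataframe=orders,
                      index="order_id")
es = es.add_relationship("customers", "customer_id", "orders", "customer_id")

# Deep feature synthesis: automatically builds aggregations such as SUM, MAX, MEAN
feature_matrix, feature_defs = ft.dfs(entityset=es,
                                      target_dataframe_name="customers",
                                      agg_primitives=["sum", "max", "mean"])
print(feature_matrix)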
TSFresh
PyCaret
Matplotlib and Seaborn that will help you with plotting and visualization.
Text Data
Text data broadly refers to all kinds of natural language data, including both
written text and spoken language. Recent years have seen a dramatic growth of
online text data with many examples such as web pages, news articles,
scientific literature, emails, enterprise documents, and social media such as blog
articles, forum posts, product reviews, and tweets. Text data contain rich
knowledge about the world, including human opinions and preferences.
Because of this, mining and analyzing vast amounts of text data (“big text
data”) can enable us to support user tasks and optimize decision making in all
application domains. Nearly all text data is generated by humans and for human
consumption. It is useful to imagine that text data is generated by humans
operating as intelligent subjective sensors: we humans observe the world
from a particular perspective (and thus are subjective) and we express our
observations in the form of text data. When we take this view, we can see that
as a special kind of big data, text data has unique values. First, since all
domains involve humans, text data are useful in all applications of big data.
Second, because text data are subjective, they offer opportunities for mining
knowledge about people’s behaviors, attitudes, and opinions. Finally, text data
directly express knowledge about our world, so small text data are also useful
(provided computers can understand it).
2. Visual Data
Most visual computing tasks involve prediction, regression or decision making using
features extracted from the original, raw visual data (images or videos). Feature
engineering typically refers to this (often creative) process of extracting new
representations from the raw data that are more conducive to a computing task.
Indeed, the performance of many machine learning algorithms heavily depends on
having insightful input representations that expose the underlying explanatory
factors of the output for the observed input. An effective data representation would also
reduce data redundancy and adapt to undesired, external factors of variation
introduced by sensor noise, labeling errors, missing samples, etc. In the case of
images or videos, dimensionality reduction is often an integral part of feature
engineering, since the raw data are typically high dimensional. Over the years, many
feature engineering schemes have been proposed and researched for producing
good representations of raw images and videos.
Many existing feature engineering approaches may be categorized into one of three
broad groups:
1. Classical, sometimes hand-crafted, feature representations: In general, these may
refer to rudimentary features such as image gradients as well as fairly sophisticated
features from elaborate algorithms such as the histogram of oriented gradients feature.
More often
than not, such features are designed by domain experts who have good knowledge
about the data properties and the demands of the task. Hence sometimes such
features are
called hand-crafted features. Hand-engineering features for each task requires a lot of
manual labor and domain knowledge, and optimality is hardly guaranteed. However, it
allows integration of human knowledge of the real world and of that specific task into
the feature design process, hence making it possible to obtain good results for the said
task. These types of features are easy to interpret. Note that it is not completely correct
to call all classical features hand-crafted, since some of them are general-purpose
features with little task-specific tuning (such as the outputs of simple gradient filters).
Time series analysis is a technique in statistics that deals with time series data and
trend analysis. Time series data consists of observations that have been measured at
regular time intervals or collected at particular points in time. In
other words, a time series is simply a series of data points ordered in time, and time
series analysis is the process of making sense of this data.
In a business context, examples of time series data include any trends that need to
be captured over a period of time. A Google trends report is a type of time series data
that can be analyzed. There are also far more complex applications such as demand
and supply forecasting based on past trends.
In economics, time series data could be the Gross Domestic Product (GDP), the
Consumer Price Index, S&P 500 Index, and unemployment rates. The data set could be
a country’s gross domestic product from the federal reserve economic data.
From a social sciences perspective, time series data could be birth rate, migration
data, population rise, and political factors.
The statistical characteristics of time series data do not always fit conventional
statistical methods. As a result, analyzing time series data accurately requires a unique
set of tools and methods, collectively known as time series analysis.
Material
https://www.youtube.com/watch?v=9QtL7m3YS9I&t=264s
https://www.youtube.com/watch?v=2vMNiSeNUjI&t=64s
Seasonality refers to periodic fluctuations. For example, if you consider
electricity consumption, it is typically high during the day and
lower during the night. In the case of shopping patterns, online
sales spike during the holidays before slowing down and dropping.
These are some of the terms and concepts associated with time series data
analysis:
Dependence: Dependence refers to the association of an observation
with observations of the same variable at prior time points.
While the exact mathematical models are beyond the scope of this article, these are
some specific applications of these models that are worth discussing here.
The Box-Jenkins models of both the ARIMA and multivariate varieties use the past
behaviour of a variable to decide which model is best to analyse it. The assumption is
that any time series data for analysis can be characterized by a linear function of its
past values, past errors, or both. When the model was first developed, the data used
was from a gas furnace and its variable behaviour over time.
In contrast, the Holt-Winters exponential smoothing model is best suited to
analyzing time series data that exhibits a defining trend and varies by seasons.
Such mathematical models are a combination of several methods of measurement; the Holt-
Winters method uses weighted averages which can seem simple enough, but these values
are layered on the equations for exponential smoothing.
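For illustration, a minimal sketch fitting both model families with statsmodels on a synthetic monthly series; the series, its frequency, and the ARIMA order are assumptions chosen only for the example:

import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing
from statsmodels.tsa.arima.model import ARIMA

# Hypothetical monthly series with a trend, yearly seasonality and noise
idx = pd.date_range("2018-01-01", periods=60, freq="MS")
y = pd.Series(100 + 0.8 * np.arange(60)
              + 10 * np.sin(2 * np.pi * np.arange(60) / 12)
              + np.random.normal(0, 2, 60), index=idx)

# Holt-Winters: additive trend and additive seasonality
hw = ExponentialSmoothing(y, trend="add", seasonal="add",
                          seasonal_periods=12).fit()
print(hw.forecast(6))          # next six months

# Box-Jenkins style ARIMA(p, d, q) model on the same series
arima = ARIMA(y, order=(1, 1, 1)).fit()
print(arima.forecast(6))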
Applications of Time Series Analysis
While the data is numerical and the analysis process seems mathematical, time series
analysis can seem almost abstract. However, any organization can realize a number of
present-day applications of such methods. For example, it is interesting to imagine that
large, global supply chains such as those of Amazon are only kept afloat due to the
interpretation of such complex data across various time periods. Even during the COVID-
19 pandemic where supply chains suffered maximum damage, the fact that they have
been able to bounce back faster is thanks to the numbers, and the comprehension of
these numbers, that continues to happen throughout each day and week.
Time series analysis is used to determine the best model that can be used to forecast
business metrics. For instance, stock market price fluctuations, sales,
turnover, and any other process that can use time series data to make predictions about
the future. It enables management to understand time-dependent patterns in
data and analyze trends in business metrics.
From a practical standpoint, time series analysis in organizations is mostly
used for:
Economic forecasting
Sales forecasting
Utility studies
Budgetary analysis
Yield projections
Census analysis
Inventory studies
Workload projections
Time series analysis and forecasting are essential processes for explaining the
dynamic and influential behaviour of financial markets. By examining
financial data, an expert can produce the required forecasts for important
financial applications in several areas such as risk evolution, option pricing &
trading, portfolio construction, etc.
For example, time series analysis has become an intrinsic part of
financial analysis and can be used in predicting interest rates, foreign
currency risk, volatility in stock markets and much more. Policymakers and
business experts use financial forecasting to make decisions about
production, purchases, market sustainability, allocation of resources, etc.
Time series analysis is extremely useful to observe how a given asset, security, or
economic variable behaves/changes over time. For example, it can be deployed to
evaluate how the underlying changes associated with some data observation
behave after shifting to other data observations in the same time period.
Medicine has evolved into a data-driven field, and time series analysis continues to
contribute to medical knowledge with enormous developments.
Case study
Consider the case of combining time series with the medical method CBR (case-based
reasoning) and data mining. These synergies are essential as the pre-processing step
for feature mining from time series data and can be useful to study the progress of
patients over time.
However, time series in the context of the epidemiology domain has emerged very
recently and incrementally as time series analysis approaches demand
recordkeeping systems such that records should be connected over time and
collected precisely at regular intervals.
Medical Instruments
Time series analysis has made its way into medicine with the advent of medical
devices such as the Electrocardiogram (ECG), invented in 1901 for diagnosing
cardiac conditions by recording the electrical pulses passing through the heart.
These inventions created more opportunities for medical practitioners to deploy time
series for medical diagnosis. With the advent of wearable sensors and smart
electronic healthcare devices, people can now take regular measurements
automatically with minimal input, resulting in a good collection of longitudinal
medical data, gathered consistently for both sick and healthy individuals.
One of the contemporary applications where time series plays a
significant role is in different areas of astronomy and astrophysics.
Time series data has had an intrinsic impact on knowing and measuring anything
about the universe; it has a long history in the astronomy domain. For example,
sunspot time series were recorded in China as early as 800 BC, which made sunspots
one of the best-recorded natural phenomena.
• To discover variable stars that are used to surmise stellar distances, and
• To observe transitory events such as supernovae to understand the
mechanism of the changing of the universe with time.
Such mechanisms are the results of constant monitoring of live streaming of time
series data depending upon the wavelengths and intensities of light that allows
astronomers to catch events as they are occurring.
In the last few decades, data-driven astronomy has introduced novel areas of research
such as Astroinformatics and Astrostatistics; these paradigms involve major disciplines such
as statistics, data mining, machine learning and computational intelligence. Here,
the role of time series analysis is to detect and classify astronomical objects
swiftly, along with characterizing novel phenomena independently.
In ancient times, the Greek philosopher Aristotle researched weather phenomena with the
idea of identifying causes and effects in weather changes. Later on, scientists started to
accumulate weather-related data using the barometer to compute the
state of atmospheric conditions; they recorded weather-related data at hourly or daily
intervals and kept the records at different locations.
Over time, customized weather forecasts began to be printed in newspapers, and later on,
with advancements in technology, forecasts now go beyond general
weather conditions.
These stations are equipped with highly functional devices and are interconnected
with each other to accumulate weather data at different geographical locations and
forecast weather conditions at any point in time, as required.
Time series forecasting helps businesses to make informed business decisions; as the
process analyses past data patterns, it can be useful in forecasting future possibilities and
events in the following ways:
Reliability: When the data incorporates a broad spectrum of time intervals in the
form of massive observations over a longer time period, time series forecasting is
highly reliable. It provides elucidating information by exploiting data observations at
various time intervals.
Growth: In order to evaluate overall financial performance and growth, as well as
endogenous growth, time series analysis is a very suitable tool. Basically, endogenous growth
is the progress within an organization's internal human capital resulting in economic growth.
For example, studying the impact of any policy variable can be carried out by applying
time series forecasting.
Trend estimation: Time series methods can be conducted to discover trends, for
example, these methods inspect data observations to identify when measurements
reflect a decrease or increase in sales of a particular product.
Seasonal patterns: Variances in recorded data points can unveil seasonal patterns and
fluctuations that act as a base for data forecasting. The obtained information is
significant for markets whose products fluctuate seasonally, and it assists organizations in
planning product development and delivery requirements.
Data cleansing filters out noise, removes outliers, or applies various averages to
gain a better overall perspective of the data. It means zeroing in on the signal by
filtering out the noise. The process of time series analysis removes the noise and
allows businesses to get a clearer picture of what is happening day-to-day.
The models used in time series analysis help to interpret the true meaning of the
data in a data set, making life easier for data analysts. Autocorrelation patterns and
seasonality measures can be applied to predict when a certain data point can be
expected. Furthermore, stationarity measures can provide an estimate of the value of
that data point.
This means that businesses can look at data and see patterns across time and space,
rather than a mass of figures and numbers that aren’t meaningful to the core function
of the organization.
Forecasting Data
Time series analysis can be the basis to forecast data. Time series analysis is
inherently equipped to uncover patterns in data which form the base to predict future
data points. It is this forecasting aspect of time series analysis that makes it extremely
popular in the business area. Where most data analytics use past data to retroactively
gain insights, time series analysis helps predict the future. It is this very edge that helps
management make better business decisions.
Linear methods for streaming feature construction are techniques that involve
creating new features in a linear fashion from streaming data, often with a focus on
efficiency and adaptability to data arriving sequentially. These methods are suitable
for real-time or near-real-time applications where data stream in continuously. Here are
some linear methods for streaming feature construction:
PCA generally tries to find the lower-dimensional surface to project the high-
dimensional data.
PCA works by considering the variance of each attribute, because high variance
indicates a good split between the classes, and hence it reduces the dimensionality.
Some real-world applications of PCA are image processing, movie
recommendation systems, and optimizing the power allocation in various
communication channels. It is a feature extraction technique, so it keeps the
important variables and drops the least important ones.
Some common terms used in PCA algorithm:
o Correlation: It signifies how strongly two variables are related to each other.
Such as if one changes, the other variable also gets changed. The correlation value
ranges from -1 to +1. Here, -1 occurs if variables are inversely proportional to each
other, and +1 indicates that variables are directly proportional to each other.
o Orthogonal: It defines that variables are not correlated to each other, and hence
the correlation between the pair of variables is zero.
o The principal component must be the linear combination of the original features.
o These components are orthogonal, i.e., the correlation between a pair of variables is
zero.
o The importance of each component decreases when going from 1 to n; the first principal
component has the most importance, and the nth principal component has the least.
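A minimal PCA sketch with scikit-learn on the Iris data, standardizing first and keeping two components:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)          # 150 samples x 4 features

# Standardize so that every feature contributes on the same scale
X_std = StandardScaler().fit_transform(X)

# Keep the first two principal components (orthogonal linear combinations)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)

print(X_pca.shape)                          # (150, 2)
print(pca.explained_variance_ratio_)        # importance decreases from PC1 to PC2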
Material
https://www.youtube.com/watch?v=fkf4IBRSeEc
https://www.youtube.com/watch?v=IbE0tbjy6JQ&list=PLBv09BD7ez_5_yapAg86Od6J
eeypkS4YM
2. Representing data into a structure
Now we will represent our dataset as a structure: a two-dimensional matrix of the
independent variables X. Here each row corresponds to a data item, and each column
corresponds to a feature. The number of columns gives the dimensionality of the dataset.
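A tiny sketch of this structure with made-up numbers:

import numpy as np

# 5 data items (rows) x 3 features (columns): the dataset's dimensionality is 3
X = np.array([
    [5.1, 3.5, 1.4],
    [4.9, 3.0, 1.4],
    [6.2, 2.9, 4.3],
    [5.9, 3.0, 5.1],
    [6.7, 3.1, 4.7],
])
print(X.shape)   # (5, 3) -> (data items, features)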
Applications of Principal Component Analysis
o It can also be used for finding hidden patterns if data has high dimensions.
Some fields where PCA is used are Finance, data mining, Psychology, etc.
To overcome the overlapping issue in the classification process, we must
increase the number of features regularly.
Example:
Let's assume we have to classify two different classes having two sets of data
points in a 2-dimensional plane, as shown in the image below:
Let's consider an example where we have two classes in a 2-D plane having an
X-Y axis, and we need to classify them efficiently. As we have already seen in the
above example, LDA enables us to draw a straight line that can completely
separate the two classes of data points. Here, LDA uses the X-Y axis to create
a new axis by separating the classes with a straight line and projecting the data onto
this new axis.
Hence, we can maximize the separation between these classes and reduce the 2-D
plane into 1-D.
To create a new axis, Linear Discriminant Analysis uses the following criteria:
Using the above two conditions, LDA generates a new axis in such a way that it
maximizes the distance between the means of the two classes and minimizes the
variation within each class.
In other words, we can say that the new axis will increase the separation between the data
points of the two classes and plot them onto the new axis.
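A minimal supervised LDA sketch with scikit-learn on the Iris data (three classes, so at most two discriminant axes):

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# Project onto at most (n_classes - 1) discriminant axes; here 2 for 3 classes
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)             # supervised: uses the labels y

print(X_lda.shape)                          # (150, 2)
print(lda.score(X, y))                      # classification accuracy on the same data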
Why LDA?
o LDA can also be used in data pre-processing to reduce the number of features,
just as PCA, which reduces the computing cost significantly.
o LDA is also used in face detection algorithms. In Fisher faces, LDA is used to
extract useful data from different faces. Coupled with eigenfaces, it produces
effective results.
LDA is specifically used to solve supervised classification problems for two or
more classes, which is not possible using logistic regression in machine learning. But LDA
also fails in some cases where the means of the distributions are shared. In this case, LDA fails
to create a new axis that makes both the classes linearly separable.
To overcome such problems, we use non-linear Discriminant analysis in machine
learning.
Linear Discriminant analysis is one of the most simple and effective methods to solve
classification problems in machine learning. It has so many extensions and variations as
follows:
Some of the common real-world applications of Linear discriminant Analysis are given
below:
o Face Recognition
Face recognition is the popular application of computer vision, where each face is
represented as the combination of a number of pixel values. In this case, LDA is
used to minimize the number of features to a manageable number before going
through the classification process. It generates a new template in which each
dimension consists of a linear combination of pixel values. If a linear combination is
generated using Fisher's linear discriminant, then it is called Fisher's face.
o Medical
In the medical field, LDA has a great application in classifying the patient disease on
the basis of various parameters of patient health and the medical treatment which is
going on. On such parameters, it classifies disease as mild, moderate, or severe. This
classification helps the doctors in either increasing or decreasing the pace of the
treatment.
o Customer Identification
LDA is currently being applied in customer identification. With the help of
LDA, we can easily identify and select the features that can specify a group of
customers who are likely to purchase a specific product in a shopping mall. This can
be helpful when we want to identify a group of customers who mostly purchase a
product in a shopping mall.
o For Predictions
LDA can also be used for making predictions, and hence in decision making. For
example, "will you buy this product?" will give a predicted result of one of
two possible classes: buying or not buying.
o In Learning
Nowadays, robots are being trained for learning and talking to simulate human work,
and it can also be considered a classification problem. In this case, LDA builds similar
groups on the basis of different parameters, including pitches, frequencies, sound,
tunes, etc.
o PCA is an unsupervised algorithm that does not care about classes and labels
and only aims to find the principal components to maximize the variance in the
given dataset. At the same time, LDA is a supervised algorithm that aims to
find the linear discriminants to represent the axes that maximize separation
between different classes of data.
o LDA is much more suitable for multi-class classification tasks than PCA.
However, PCA is assumed to perform well for comparatively small sample sizes.
o Both LDA and PCA are used as dimensionality reduction techniques; often
PCA is applied first, followed by LDA.
How to Prepare Data for LDA
Below are some suggestions that one should always consider while
preparing the data to build the LDA model:
o Same Variance: As LDA assumes that all the input variables have the same
variance, it is always better to standardize the data before implementing an
LDA model. By this, each variable will have a mean of 0 and a standard
deviation of 1.
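A short sketch of this preparation step, standardizing the inputs before LDA inside a scikit-learn pipeline (the Wine dataset is used only as a stand-in):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score

X, y = load_wine(return_X_y=True)

# Standardize first so every input variable has mean 0 and standard deviation 1
model = make_pipeline(StandardScaler(), LinearDiscriminantAnalysis())
print(cross_val_score(model, X, y, cv=5).mean())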
3. Feature-Based Time-Series Analysis
The passing of time is a fundamental component of the human experience and the
dynamics of real-world processes is a key driver of human curiosity. On observing a leaf
in the wind, we might contemplate the burstiness of the wind speed, whether the
wind direction now is related to what it was a few seconds ago, or whether the dynamics
might be similar if observed tomorrow. This line of questioning about dynamics has been
followed to understand a wide range of real-world processes, including in
seismology (e.g., recordings of earthquake tremors), biochemistry (e.g., cell
potential fluctuations), biomedicine (e.g., recordings of heart rate dynamics), ecology
(e.g., animal population levels over time), astrophysics (e.g., radiation dynamics),
meteorology (e.g., air pressure recordings), economics (e.g., inflation rates variations),
human machine interfaces (e.g., gesture recognition from accelerometer data), and
industry (e.g., quality control sensor measurements on a production line). In each case,
the dynamics can be captured as a set of repeated measurements of the system over
time, or a time series. Time series are a fundamental data type for understanding
dynamics in real-world systems. Note that throughout this work we use the convention
of hyphenating “time-series” when used as an adjective, but not when used as a noun
(as “time series”). In general, time series can be sampled non-uniformly through time,
and can therefore be represented as a vector of time stamps, ti, and associated
measurements, xi. However, time series are frequently sampled uniformly through time
(i.e., at a constant sampling period, ∆t), facilitating a more compact representation as an
ordered vector x = (x1, x2, ..., xN), where N measurements have been taken at times
t = (0, ∆t, 2∆t, ...,(N − 1)∆t). Representing a uniformly sampled time series as an
ordered vector allows other types of real-valued sequential data to be represented
in the same way, such as spectra (where measurements are ordered by frequency),
word length sequences of sentences in books (where measurements are ordered
through the text), widths of rings in tree trunks (ordered across the radius of the trunk
cross section), and even the shape of objects (where the distance from a central point
in a shape can be measured and ordered by the angle of rotation of the shape) .
Some examples are shown in Fig. 4.1. Given this common representation for
sequential data, methods developed for analysing time series (which order
measurements by time) can also be applied to understand patterns in any sequential
data.
Figure: Sequential data can be ordered in many ways, including A temperature measured
over time (a time series), B a sequence of ring widths, ordered across the cross section of
a tree trunk, and C a frequency spectrum of astrophysical data (ordered by
frequency). All of these sequential measurements can be analyzed by methods that take
their sequential ordering into account, including time-series analysis methods.
While the time series described above are the result of a single
measurement taken repeatedly through time, or univariate time series, measurements are
frequently made from multiple parts of a system simultaneously, yielding multivariate time
series. Examples of multivariate time series include measurements of the activity
dynamics of multiple brain regions through time, or measuring the air temperature, air
pressure, and humidity levels together through time. Techniques have been developed to
model and understand multivariate time series, and infer models of statistical
associations between different parts of a system that may explain its multivariate
dynamics. Methods for characterizing inter-relationships between time series are vast,
including the simple measures of statistical dependencies, like linear cross correlation,
mutual information, and to infer causal (directed) relationships using methods like
transfer entropy and Granger causality. A range of information-theoretic methods for
characterizing time series, particularly the dynamics of information transfer between time
series, are described and implemented in the excellent Java Information Dynamics
Toolkit (JIDT). Feature-based representations of multivariate systems can include both
features of individual time series, and features of inter-relationships between (e.g., pairs of)
time series. However, in this chapter we focus on individual univariate time series
sampled uniformly through time (that can be represented as ordered vectors, xi).
3.2 Time-Series Characterization
As depicted in the left box of Fig. 4.2, real-world and model-generated time series
are highly diverse, ranging from the dynamics of sets of ordinary differential
equations simulated numerically, to fast (nanosecond timescale) dynamics of
plasmas, the bursty patterns of daily rainfall, or the complex fluctuations of global
financial markets. How can we capture the different types of patterns in these data
to understand the dynamical processes underlying them? Being such a ubiquitous
data type, part of the excitement of time-series analysis is the large
interdisciplinary toolkit of analysis methods and quantitative models that have
been developed to quantify interesting structures in time series, or time-series
characterization.
We distinguish the characterization of unordered sets of data, which is
restricted to the distribution of values, and allows questions to be asked like the
following: Does the sample have a high mean or spread of values? Does the sample
contain outliers? Are the data approximately Gaussian distributed? While these types of
questions can also be asked of time series, the most interesting types of questions
probe the temporal dependencies and hence the dynamic processes that might
underlie the data, e.g., How bursty is the time series? How correlated is the value of
the time series to its value one second in the future? Does the time series contain
strong periodicities? Interpreting the answers to these questions in their domain
context provides understanding of the process being measured.
Figure: Time-series characterization. Left: A sample of nine real world time
series reveals a diverse range of temporal patterns [25, 29]. Right:
Examples of different classes of methods for quantifying different types of
structure, such as those seen in time series on the left:
(i) distribution (the distribution of values in the time series, regardless of their
sequential ordering);
(ii) autocorrelation properties (how values of a time series are
correlated to themselves through time);
(iii) stationarity (how statistical properties
change across a recording);
(iv) entropy (measures of complexity or
predictability of the time series quantified using
information theory); and
(v) nonlinear time series analysis (methods that quantify nonlinear
properties of the
dynamics).
Some key classes of methods developed for characterizing time series are
depicted in the right panel of Fig., and include autocorrelation, stationarity,
entropy, and methods from the physics-based nonlinear time-series
analysis literature. Within each broad methodological class, hundreds of
time-series analysis methods have been developed across decades of diverse
research.
In their simplest form, these methods can be represented as algorithms that
capture time-series properties as real numbers, or features. Many
different feature-based representations for time series have been developed and used in
applications ranging from time-series modeling and forecasting to classification.
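As a small illustration, the sketch below maps a synthetic univariate time series to a handful of such features (the particular features chosen are only examples):

import numpy as np

def basic_ts_features(x):
    """Map a univariate time series (ordered vector) to a few real-valued features."""
    x = np.asarray(x, dtype=float)
    diffs = np.diff(x)
    return {
        "mean": x.mean(),                                   # distribution
        "std": x.std(),                                     # spread
        "lag1_autocorr": np.corrcoef(x[:-1], x[1:])[0, 1],  # autocorrelation
        "mean_abs_change": np.abs(diffs).mean(),            # roughness / burstiness proxy
    }

rng = np.random.default_rng(0)
noisy_sine = np.sin(np.linspace(0, 8 * np.pi, 200)) + 0.3 * rng.normal(size=200)
print(basic_ts_features(noisy_sine))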
4 Feature Selection for Data Streams with Streaming Features
For the feature selection problem with streaming features, the number of instances is fixed while
candidate features arrive one at a time; the task is to timely select a subset of relevant features
from all features seen so far. A typical framework for streaming feature selection consists of
Step 1: a new feature arrives; Step 2: decide whether to add the new feature to the selected
features; Step 3: determine whether to remove features from the selected features; and Step 4:
repeat Step 1 to Step 3. Different algorithms may have distinct implementations for Step 2 and
Step 3; next we will review some representative methods. Note that Step 3 is optional and some
streaming feature selection algorithms only provide Step 2.
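The skeleton below is a schematic rendering of this four-step loop; the add and prune tests are left as user-supplied functions, since different algorithms implement Steps 2 and 3 differently:

def streaming_feature_selection(feature_stream, add_test, prune_test):
    """Schematic skeleton of the framework described above.

    feature_stream : iterable yielding one candidate feature at a time
    add_test(feature, selected) -> bool   # Step 2: accept the new feature?
    prune_test(selected) -> list          # Step 3 (optional): features to drop
    """
    selected = []
    for feature in feature_stream:          # Step 1: a new feature arrives
        if add_test(feature, selected):     # Step 2: decide whether to add it
            selected.append(feature)
            for f in prune_test(selected):  # Step 3: re-check existing features
                selected.remove(f)
    return selected                         # Step 4 is the loop itself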
e. Update the classifier by retraining it with the selected features.
Repeat: Continue this process as new data arrives. The algorithm will adaptively select features
that are most informative for the classification task.
Incremental Feature Selection: Grafting incrementally selects features one at a time, taking
into account their contributions to the classifier's performance.
Adaptive Feature Selection: It dynamically adjusts the set of selected features as new data
arrives, ensuring that only the most relevant features are retained.
Efficiency: Grafting is efficient because it avoids exhaustive search over feature subsets and
only evaluates the utility of adding or removing one feature at a time.
Thresholds: The algorithm relies on a predefined threshold for evaluating whether adding a
feature is beneficial. This threshold can be set based on domain knowledge or through cross-
validation.
Grafting is particularly useful in scenarios where you have a large number of features and
limited computational resources or when dealing with data streams where the feature set may
evolve over time. It strikes a balance between maintaining model performance and reducing
feature dimensionality, which can be beneficial for both efficiency and interpretability of
machine learning models. Keep in mind that the specific implementation and parameter
settings of the Grafting Algorithm may vary depending on the machine learning framework
and problem domain.
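A conceptual sketch of the gradient-test idea behind grafting, not the exact published algorithm: a candidate feature is added only if the gradient of the (logistic) loss with respect to its currently-zero weight exceeds a regularization threshold. The threshold, the logistic model, and the synthetic data are all illustrative assumptions.

import numpy as np
from sklearn.linear_model import LogisticRegression

def grafting_step(X_active, x_new, y, lam=0.1):
    """Decide whether a newly arriving feature x_new is worth adding.

    The magnitude of the gradient of the mean logistic loss with respect to the
    (currently zero) weight of x_new is compared against the threshold lam: if it
    exceeds lam, adding the feature can reduce the penalized loss.
    """
    if X_active.shape[1] > 0:
        model = LogisticRegression(max_iter=1000).fit(X_active, y)
        p = model.predict_proba(X_active)[:, 1]
    else:
        p = np.full(len(y), y.mean())          # intercept-only baseline
    grad = np.abs(x_new @ (p - y)) / len(y)    # |dL/dw_new| at w_new = 0
    return grad > lam

rng = np.random.default_rng(1)
y = rng.integers(0, 2, 200)
informative = y + 0.3 * rng.normal(size=200)   # correlated with the label
noise = rng.normal(size=200)                   # unrelated to the label
X0 = np.empty((200, 0))
print(grafting_step(X0, informative, y))       # True: gradient is large
print(grafting_step(X0, noise, y))             # False: gradient is near zero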
Alpha-Investing is a statistical method used for sequential hypothesis testing, primarily in the
context of multiple hypothesis testing or feature selection. It was introduced as an enhancement
to the Sequential Bonferroni method, aiming to control the Family-Wise Error Rate (FWER) while
being more powerful and efficient in adaptive and sequential settings.
Here's a high-level overview of the Alpha-Investing algorithm:
Initialization: Start with an empty set of selected hypotheses (features) and set an
initial significance level (alpha). This alpha level represents the desired FWER control
and guides the decision-making process.
Alpha Update: After each hypothesis test, update the alpha level dynamically
based on the test results and the number of hypotheses tested so far. Alpha-
Investing adjusts the significance level to maintain FWER control while adapting to
the increasing number of tests.
Output: The selected hypotheses at the end of the process are considered
statistically significant, and the others are rejected or not selected.
FWER Control: It controls the Family-Wise Error Rate, which is the probability of
making at least one false discovery among all the hypotheses tested. This makes
it suitable for applications where controlling the overall error rate is critical.
Streaming Feature Selection Loop: Repeat the above steps as new
data points continue to arrive in the stream. The feature selection
process is ongoing and adaptive to the changing data distribution.
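A rough sketch of one common alpha-investing formulation (the exact alpha-spending rule, payout, and wealth updates vary across papers; the values below are assumptions for illustration):

def alpha_investing(p_values, w0=0.05, payout=0.05):
    """Sketch of alpha-investing over a stream of p-values (one per new feature).

    w0      : initial alpha-wealth
    payout  : wealth earned back whenever a feature is selected
    Returns the indices of selected features.
    """
    wealth = w0
    selected = []
    for j, p in enumerate(p_values, start=1):
        alpha_j = wealth / (2 * j)          # spend a fraction of current wealth
        if p <= alpha_j:                    # reject the null: keep the feature
            selected.append(j - 1)
            wealth += payout - alpha_j
        else:                               # pay for the failed test
            wealth -= alpha_j / (1 - alpha_j)
        if wealth <= 0:                     # no budget left: stop testing
            break
    return selected

print(alpha_investing([0.0001, 0.2, 0.003, 0.6, 0.04, 0.5]))   # -> [0, 2]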
Data Ingestion:
Stream social media data from platforms like Twitter, Facebook, or Instagram.
Online Clustering:
Cluster the incoming data based on the extracted features. The number of
clusters can be determined using heuristics or adaptively based on data
characteristics.
Feature Importance within Clusters:
For each cluster, calculate feature importance scores. Feature importance can be
determined using various unsupervised methods, such as term frequency-inverse
document frequency (TF-IDF), mutual information, or chi-squared statistics, within the
context of each cluster.
Rank features within each cluster based on their importance scores. Select a fixed
number or percentage of top-ranked features from each cluster. Alternatively, you can
use dynamic thresholds to adaptively select features based on their importance scores
within each cluster.
Dynamic Updating:
Continuously update the clustering and feature selection process as new data
arrives.
Periodically recluster the data to adapt to changing trends and topics in social media
discussions.
Modeling or Analysis:
Utilize the selected features for various downstream tasks such as sentiment
analysis, topic modeling, recommendation systems, or anomaly detection.
Regular Maintenance:
Regularly review and update the feature selection process as the social media
landscape evolves. Consider adding new features or modifying existing ones.
Unsupervised streaming feature selection in social media data requires a flexible and
adaptive approach due to the dynamic nature of social media content. It aims to
extract relevant features that capture the current themes and trends in the data
without requiring labeled training data. Keep in mind that the choice of clustering
algorithm and feature importance metric should be tailored to your specific social
media data and objectives.
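To make the clustering and per-cluster feature-importance steps described above concrete, here is a minimal sketch using TF-IDF features and MiniBatchKMeans (which supports incremental updates); the posts and the number of clusters are made up:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import MiniBatchKMeans

posts = [
    "new phone launch today, camera looks amazing",
    "huge discount on phone accessories this weekend",
    "election results announced, turnout was record high",
    "candidates debate the election results live",
    "camera comparison: which phone wins this year?",
    "voters react to the election outcome",
]

# TF-IDF features for the current window of posts
vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(posts)

# MiniBatchKMeans supports partial_fit, so clusters can be updated as posts stream in
km = MiniBatchKMeans(n_clusters=2, random_state=0)
labels = km.fit_predict(X)

# Rank terms inside each cluster by their mean TF-IDF weight
terms = np.array(vec.get_feature_names_out())
for c in range(2):
    weights = np.asarray(X[labels == c].mean(axis=0)).ravel()
    top = terms[np.argsort(weights)[::-1][:3]]
    print(f"cluster {c}: {top}")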
Non-linear methods for streaming feature construction are essential for extracting
meaningful patterns and representations from streaming data where the relationships
among features may not be linear. These methods transform the input data into a new
feature space, often with higher dimensionality, to capture complex and non-linear
relationships that may exist in the data. Here are some non-linear methods commonly
used for streaming feature construction:
Kernel Methods:
Kernel Trick: Apply the kernel trick to transform data into a higher-dimensional space
without explicitly computing the feature vectors. Common kernels include the Radial
Basis Function (RBF) kernel and polynomial kernels.
Online Kernel Methods: Adapt kernel methods to streaming data by updating kernel
matrices incrementally as new data arrives. Online kernel principal component analysis
(KPCA) and online kernel support vector machines (SVM) are examples.
Neural Networks:
Deep Learning: Utilize deep neural networks, including convolutional neural networks
(CNNs) and recurrent neural networks (RNNs), for feature extraction. Deep
architectures can capture intricate non-linear relationships in the data.
Autoencoders:
Manifold Learning:
Isomap:
Isomap is another manifold learning method that can be used for non-linear
feature construction in streaming data by incrementally updating the geodesic
distances between data points.
Random Features:
Feature Mapping:
Ensemble Techniques:
Online Clustering and Density Estimation:
Clustering and density estimation methods, such as DBSCAN and Gaussian Mixture
Models (GMM), can be used to create features that represent the underlying non-linear
structures in streaming data.
When selecting a non-linear feature construction method for streaming data, consider
factors such as computational efficiency, scalability, and the adaptability of the method to
evolving data distributions. The choice of method should align with the specific
characteristics and requirements of your streaming data application.
2. Sliding Window:
Construct local linear models for each data point based on its neighbors. This
involves finding weights that best reconstruct the data point as a linear
combination of its neighbors.
1. Local Reconstruction Weights:
Calculate the reconstruction weights for each data point in the local
neighborhood. These weights represent the contribution of each neighbor to the
reconstruction of the data point.
2. Global Embedding:
Combine the local linear models and reconstruction weights to compute a global
embedding for the entire dataset. This embedding represents the lower-
dimensional representation of the data stream.
3. Continuous Update:
4. Memory Management:
Manage memory efficiently to ensure that the sliding window remains within
a predefined size limit. You may need to adjust the window size dynamically
based on available memory and computational resources.
5. Hyperparameter Tuning:
7. Application:
Kernel learning for data streams is an area of machine learning that focuses on
adapting kernel methods, which are originally designed for batch data, to the
streaming data setting. Kernel methods are powerful techniques for dealing with non-
linear relationships and high-dimensional data. Adapting them to data streams requires
efficient processing and storage of data as it arrives in a sequential and potentially
infinite manner. Here are some key considerations and techniques for kernel learning in
data streams:
3. Memory Management:
Efficiently manage memory to ensure that the kernel matrix doesn't grow
too large as data accumulates. This may involve storing only a subset of the
most recent data points or employing methods like forgetting mechanisms.
1. Streaming Feature Selection:
Monitor the data stream for concept drift, which occurs when the data
distribution changes over time. When drift is detected, consider retraining
or adapting the kernel model.
4. Kernel Approximations:
1. Resource Constraints:
Kernel learning for data streams is an active area of research, and various algorithms and
techniques have been proposed to address the unique challenges posed by streaming
data. The choice of approach should be based on the specific requirements and constraints
of your streaming data application.
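As one concrete illustration of kernel approximation on a stream, the sketch below uses scikit-learn's RBFSampler (random Fourier features) together with an SGD classifier updated batch by batch via partial_fit; the simulated stream and its circular decision boundary are assumptions for the example:

import numpy as np
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
rbf = RBFSampler(gamma=1.0, n_components=200, random_state=0)
clf = SGDClassifier(random_state=0)
classes = np.array([0, 1])

# Simulate a stream of mini-batches with a non-linear (circular) class boundary
for step in range(50):
    X = rng.normal(size=(32, 2))
    y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)
    Z = rbf.fit_transform(X) if step == 0 else rbf.transform(X)   # approximate RBF kernel features
    clf.partial_fit(Z, y, classes=classes)                        # incremental linear update

X_test = rng.normal(size=(500, 2))
y_test = (X_test[:, 0] ** 2 + X_test[:, 1] ** 2 > 1.0).astype(int)
print(clf.score(rbf.transform(X_test), y_test))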
Using neural networks for data streams, where data arrives continuously and in a
potentially infinite sequence, presents unique challenges and opportunities. Neural
networks are powerful models for various machine learning tasks, including
classification, regression, and sequence modelling. Adapting them to data streams
requires specialized techniques to handle the dynamic nature of the data. Here's an
overview of considerations when using neural networks for data streams:
1. Online Learning:
2. Sliding Window:
3. Model Architecture:
1. Mini-Batch Learning:
3. Memory-efficient Models:
4. Feature Engineering:
5. Regularization:
6. Hyperparameter Tuning:
7. Ensemble Methods:
4. Resource Constraints:
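A minimal online-learning sketch along these lines, using scikit-learn's MLPClassifier with partial_fit on simulated mini-batches (the synthetic stream and network size are illustrative assumptions):

import numpy as np
from sklearn.neural_network import MLPClassifier

clf = MLPClassifier(hidden_layer_sizes=(32,), random_state=0)
classes = np.array([0, 1])
rng = np.random.default_rng(0)

# Treat each incoming mini-batch as the only data available at that moment
for step in range(200):
    X_batch = rng.normal(size=(16, 4))
    y_batch = (X_batch.sum(axis=1) > 0).astype(int)
    clf.partial_fit(X_batch, y_batch, classes=classes)    # one incremental update

X_eval = rng.normal(size=(1000, 4))
print(clf.score(X_eval, (X_eval.sum(axis=1) > 0).astype(int)))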
• Internet traffic
• Sensors data
• Call records
• Satellite data
• Audio listening
• Watching videos
• Online transactions
Data Streams in Data Mining is the process of extracting knowledge and valuable insights
from a continuous stream of data using stream processing software. Data
Streams in Data Mining can be considered a subset of the general concepts of
machine learning, knowledge extraction, and data mining. In Data Streams in
Data Mining, analysis of a large amount of data needs to be done in real time.
The knowledge extracted in data stream mining is represented in the form of models
and patterns of infinite streams of information. Data arrives continuously as a
stream, resulting in big data, and in data streaming, multiple data streams are passed
simultaneously.
• Time Sensitive: Data streams are time-sensitive, and elements of data
streams carry timestamps with them. A data stream is relevant only for a
certain period; after that, it loses its significance.
• Data Volatility: No data is stored in data streaming, as it is volatile. Once
the data mining and analysis are done, the information is summarized or discarded.
• Concept Drifting: Data streams are very unpredictable. The data changes
or evolves with time, as in this dynamic world nothing is constant.
Data Stream is generated through various data stream generators. Then,
data mining
techniques are implemented to extract knowledge and patterns from the data
streams. Therefore, these techniques need to process multi-dimensional,
multi-level, single pass, and online data streams.
4.5 Tools and Software for Data Streams in Data Mining
There are many tools available for Data Streams in Data Mining. Let’s learn
about the most used tools for Data Streams in Data Mining.
Scikit-Multiflow
Scikit-Multiflow is also a free and open-source machine learning framework
for multi-output learning and Data Streams in Data Mining, implemented in
Python. scikit-multiflow contains stream generators, concept drift
detectors, stream learning methods for single and multiple targets, data
transformers, and evaluation and visualization methods.
RapidMiner
RapidMiner is a data science platform written in the Java programming language and
used for data loading and transformation (ETL) and analytical workflows.
StreamDM
StreamDM is an open-source framework for Data Streams in Data Mining that uses
Spark Streaming, extending the core Spark API. It is a specialized framework for
Spark Streaming that handles much of the complexity of the underlying data sources,
such as out-of-order data and recovery from failures.
River
River is a new Python framework for machine learning with online data streams. It is
the product of merging the best parts of the creme and scikit-multiflow libraries, both
of which were built with the same objective.
5. Feature Selection and Evaluation
Feature selection is one of the important tasks in machine learning and data mining. It is
an important and frequently used technique for dimension reduction by removing
irrelevant and redundant information from the data set to obtain an optimal feature
subset. It is also a knowledge discovery tool for providing insights into the problems
through the interpretation of the most relevant features. Feature selection research dates
back to the 1960s. Hughes used a general parametric model to study the accuracy of
a Bayesian classifier as a function of the number of features. Since then the research
in feature selection has been a challenging field and some
researchers have doubted its computational feasibility. Despite the
computationally challenging scenario, the research in this direction continued. To deal
with these data, feature selection faces some new challenges. So it is timely and
significant to review the relevant topics of these emerging challenges and give some
suggestions to the practitioners.
Feature selection brings the immediate effects of speeding up a data mining algorithm,
improving learning accuracy, and enhancing model comprehensibility. However, finding an
optimal feature subset is usually intractable and many problems related to feature
selection have been shown to be NP-hard. To efficiently solve this problem, two
frameworks are proposed up to now. One is the search-based framework, and the other
is the correlation- based framework. For the former, the search strategy and evaluation
criterion are two key components. The search strategy is about how to produce a
candidate feature subset, and each candidate subset is evaluated and compared with the
previous best one according to a certain evaluation criterion. The process of subset
generation and evaluation is repeated until a given stopping criterion is satisfied. For the
latter, the redundancy and relevance of features are calculated based on some correlation
measure. The entire original feature set can then be divided into four basic disjoint
subsets: (1) irrelevant features, (2) redundant features, (3) weakly relevant but non-
redundant features, and (4) strongly relevant features. An optimal feature selection
algorithm should select non-redundant and strongly relevant features. When the best
subset is selected, generally, it will be validated by prior knowledge or different tests via
synthetic and/or real-world data sets. One of the most
well-known data repositories is in UCI, which contains many kinds of data sets with
different sizes of sample and dimensionality. Feature Selection @ ASU
(http://featureselection.asu.edu) also provides many benchmark data sets and source
codes for different feature selection algorithms. In addition, some microarray data, such as
Leukemia, Prostate, Lung and Colon, are often used to evaluate the performance of
feature selection algorithms on the high-dimensionality small sample size (HDSSS)
problem.
For the search-based framework, a typical feature selection process consists of three basic
steps, namely, subset generation, subset evaluation, and stopping criterion. Subset
generation aims to generate a candidate feature subset. Each candidate subset is
evaluated and compared with the previous best one according to a certain evaluation
criterion. If the newly generated subset is better than the previous one, it will be the latest best
subset. The first two steps of search-based feature selection are repeated until a
given stopping criterion is satisfied. According to the evaluation criterion, feature selection
algorithms are categorized into filter, wrapper, and hybrid (embedded) models. Feature
selection algorithms under the filter model rely on analyzing the general characteristics of
data and evaluating features without involving any learning algorithms. Wrapper utilizes a
predefined learning algorithm instead of an independent measure for subset evaluation.
A typical hybrid algorithm makes use of both an independent measure and a learning
algorithm to evaluate feature subsets.
On the other hand, search strategies are usually categorized into complete, sequential, and
random models. Complete search evaluates all feature subsets and guarantees to find the
optimal result according to the evaluation criterion. Sequential search adds or removes
features from the previous subset one at a time. Random search starts with a randomly
selected subset and injects randomness into the procedure of subsequent search.
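A short sketch of a sequential (greedy forward) search using scikit-learn's SequentialFeatureSelector; the dataset and the number of features to keep are arbitrary choices for illustration:

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

X, y = load_breast_cancer(return_X_y=True)

# Sequential (greedy) search: add one feature at a time, keeping the subset
# whose cross-validated score improves the most at each step
estimator = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))
sfs = SequentialFeatureSelector(estimator, n_features_to_select=5,
                                direction="forward", cv=3)
sfs.fit(X, y)
print(sfs.get_support(indices=True))   # indices of the selected features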
Nowadays, as big data with high dimensionality are emerging, the filter
model has attracted more attention than ever. Feature selection algorithms under the filter
model rely on analyzing the general characteristics of data and evaluating features without
involving any learning algorithms; therefore most of them do not have a bias toward specific
learner models. Moreover, the filter model has a straightforward search strategy and
feature evaluation criterion, so its structure is usually simple. The advantages of the
simple structure are evident: First, it is easy to design and easily understood by other
researchers. Second, it is usually very fast, and is often appropriate for high-
dimensional data.
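A rough illustration of that speed advantage (synthetic data; exact timings vary by machine): a filter computes one statistic per feature, with no model trained per candidate subset, so even thousands of features are scored quickly.

```python
import time
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# An HDSSS-style setting: few samples, many features
X, y = make_classification(n_samples=200, n_features=5000, n_informative=20, random_state=0)

start = time.time()
X_reduced = SelectKBest(score_func=f_classif, k=50).fit_transform(X, y)
print(X_reduced.shape, f"scored {X.shape[1]} features in {time.time() - start:.2f}s")
```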
A feature is an attribute that has an impact on a problem or is useful for the problem,
and choosing the important features for the model is known as feature selection. Each
machine learning process depends on feature engineering, which mainly comprises two
processes: Feature Selection and Feature Extraction. Although feature
selection and extraction processes may have the same objective, both are completely
different from each other. The main difference between them is that feature selection is
about selecting the subset of the original feature set, whereas feature extraction
creates new features. Feature selection is a way of reducing the number of input variables
for the model by using only relevant features, in order to reduce overfitting.
So, we can define feature Selection as, "It is a process of automatically or manually
selecting the subset of most appropriate and relevant features to be used in
model building." Feature selection is performed by either including the important features
or excluding the irrelevant features in the dataset without changing them.
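A small sketch of that distinction, using scikit-learn's Iris data as an assumed example: selection keeps a subset of the original columns unchanged, while extraction builds entirely new columns.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Feature selection: keep 2 of the 4 original measurements, values unchanged
X_selected = SelectKBest(score_func=f_classif, k=2).fit_transform(X, y)

# Feature extraction: create 2 brand-new features as combinations of all 4
X_extracted = PCA(n_components=2).fit_transform(X)

print(X_selected.shape, X_extracted.shape)  # both (150, 2), but only X_selected holds original columns
```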
Fig: Feature Selection Techniques
1. Wrapper Methods
In wrapper methods, feature selection is treated as a search problem: a predefined learning
algorithm evaluates each candidate subset of features. Some techniques of wrapper methods are:
• Forward Selection- Forward selection starts with an empty set and adds one feature at a
time. On the basis of the output of the model, features are added or subtracted, and with
this feature set the model is trained again.
• Backward Elimination- Backward elimination is also an iterative approach, but it is the
opposite of forward selection. This technique begins the process by considering all the
features and removes the least significant feature. This elimination process continues until
removing a feature no longer improves the performance of the model.
• Exhaustive Feature Selection- Exhaustive feature selection is one of the best feature
selection methods, which evaluates every feature subset by brute force. It tries each
possible combination of features and returns the best-performing feature set.
• Recursive Feature Elimination- Recursive feature elimination is a recursive greedy
optimization approach, where features are selected by recursively considering smaller and
smaller subsets of features. An estimator is trained on each set of features, and the
importance of each feature is determined using the coef_ or feature_importances_
attribute (see the sketch after this list).
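A minimal RFE sketch with scikit-learn; the breast-cancer dataset and the choice of 10 features are assumptions made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Repeatedly fit the estimator and drop the least important feature (judged by coef_)
rfe = RFE(estimator=LogisticRegression(max_iter=5000), n_features_to_select=10, step=1).fit(X, y)

print(rfe.support_)   # boolean mask of the 10 retained features
print(rfe.ranking_)   # 1 = selected; higher numbers were eliminated earlier
```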
2. Filter Methods
In filter methods, features are selected on the basis of statistical measures. This method
does not depend on the learning algorithm and chooses the features as a pre-processing
step. The filter method filters out irrelevant and redundant columns by ranking features
with different metrics. The advantage of filter methods is that they need little
computational time and do not overfit the data.
Some common techniques of filter methods are as follows (a short example of the first two follows the list):
• Information Gain
• Chi-square Test
• Fisher's Score
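Assuming scikit-learn and the Iris data, the short example below computes the first two measures: chi2 returns the chi-square statistic and p-value per feature (inputs must be non-negative), and mutual_info_classif estimates information gain.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import chi2, mutual_info_classif

X, y = load_iris(return_X_y=True)          # all Iris features are non-negative, as chi2 requires

chi2_scores, p_values = chi2(X, y)                      # chi-square test per feature
mi_scores = mutual_info_classif(X, y, random_state=0)   # information-gain estimate per feature

print(chi2_scores.round(2), p_values.round(4))
print(mi_scores.round(3))
```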
Fisher's Score:
Fisher's score is one of the popular supervised techniques for feature selection. It ranks
variables by Fisher's criterion in descending order, and we can then select the variables
with a large Fisher's score.
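Since the text does not give the computation, here is a small NumPy sketch of one common form of Fisher's score: the between-class scatter of each feature's class means divided by its weighted within-class variance (larger is better). The Iris data is an assumed example.

```python
import numpy as np
from sklearn.datasets import load_iris

def fisher_score(X, y):
    """Per-feature Fisher score: sum_k n_k (mu_k - mu)^2 / sum_k n_k var_k."""
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        between += len(Xc) * (Xc.mean(axis=0) - overall_mean) ** 2
        within += len(Xc) * Xc.var(axis=0)
    return between / within

X, y = load_iris(return_X_y=True)
scores = fisher_score(X, y)
print(np.argsort(scores)[::-1])   # feature indices ranked by descending Fisher score
```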
3. Embedded Methods
Embedded methods combine the advantages of both filter and wrapper methods by
considering feature interactions while keeping the computational cost low. They are fast,
like filter methods, but usually more accurate. These methods are also iterative: each
training iteration is evaluated, and the features that contribute most to that iteration are
identified. Some techniques of embedded methods are:
• Regularization- Regularization adds a penalty term to the parameters of the machine
learning model to avoid overfitting. With an L1 penalty, the coefficients of uninformative
features are shrunk to exactly zero, and those features can be removed from the dataset.
Common choices are L1 regularization (Lasso) and Elastic Net (a combination of L1 and L2
regularization). See the sketch after this list.
• Random Forest Importance- Different tree-based methods of feature selection provide
feature importance scores that give a way of selecting features. Here, feature importance
specifies which features matter more in model building or have a greater impact on the
target variable. Random Forest is such a tree-based method: a bagging algorithm that
aggregates a number of decision trees. It ranks the nodes by their performance, that is,
the decrease in impurity (Gini impurity) over all the trees. Nodes are arranged by their
impurity values, which allows pruning of the trees below a specific node; the remaining
nodes correspond to a subset of the most important features.
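The two techniques above can be sketched as follows. The synthetic regression data for the Lasso part and the breast-cancer data for the random-forest part are assumptions, and whether a Lasso coefficient reaches exactly zero depends on the chosen alpha.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer, make_regression
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Regularization: an L1 penalty shrinks some coefficients to exactly zero
X, y = make_regression(n_samples=200, n_features=15, n_informative=5, noise=10, random_state=0)
lasso = Lasso(alpha=1.0).fit(StandardScaler().fit_transform(X), y)
print("kept:", np.flatnonzero(lasso.coef_), "dropped:", np.flatnonzero(lasso.coef_ == 0))

# Random Forest importance: mean decrease in Gini impurity aggregated over all trees
data = load_breast_cancer()
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(data.data, data.target)
top = np.argsort(rf.feature_importances_)[::-1][:5]
for i in top:
    print(f"{data.feature_names[i]}: {rf.feature_importances_[i]:.3f}")
```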
Which statistical measure to use depends on the data types involved. To know this, we
need to first identify the types of the input and output variables. In machine learning,
variables are mainly of two types: numerical variables (such as integer and float values)
and categorical variables (such as Boolean, ordinal and nominal values).
Below are some univariate statistical measures, which can be used for filter-
based feature selection:
1. Numerical Input, Numerical Output:
Numerical input variables with a numerical output correspond to predictive regression
modelling. The common measure used in this case is the correlation coefficient (for
example, Pearson's for linear relationships).
2. Numerical Input, Categorical Output:
Numerical input with a categorical output is the case for classification predictive
modelling problems. Here too, correlation-like measures are used, but ones that handle a
categorical output, such as the ANOVA F-test (linear) or Kendall's rank coefficient
(nonlinear); both cases are illustrated in the sketch below.
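A brief sketch of both cases with scikit-learn (synthetic regression data and the Wine data are assumptions): f_regression is the correlation-based score for a numerical target, and f_classif is the ANOVA score for a categorical target.

```python
from sklearn.datasets import load_wine, make_regression
from sklearn.feature_selection import SelectKBest, f_classif, f_regression

# Case 1: numerical input, numerical output -> correlation-based F-statistic
X_num, y_num = make_regression(n_samples=200, n_features=8, n_informative=3, noise=5, random_state=0)
reg_sel = SelectKBest(score_func=f_regression, k=3).fit(X_num, y_num)
print("regression case, kept features:", reg_sel.get_support(indices=True))

# Case 2: numerical input, categorical output -> ANOVA F-test
X_cls, y_cls = load_wine(return_X_y=True)
cls_sel = SelectKBest(score_func=f_classif, k=5).fit(X_cls, y_cls)
print("classification case, kept features:", cls_sel.get_support(indices=True))
```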
VIDEO LINKS
Unit – II
Sl.No.  Topic        Video Link
1       Visual Data  https://www.youtube.com/watch?v=rf5dGtn4Nkk
10. ASSIGNMENT : UNIT – II
Toppers:
1. Given a dataset with text data, extract meaningful features using techniques such as TF-IDF, word
embeddings, or N-grams. Demonstrate the impact of these features on the performance of a text
classification model.
2. Use a dataset with numerical features and create polynomial features to capture non-linear
relationships. Compare the performance of a regression model with and without polynomial features
and discuss the results.
Above Average Learners:
1. For a given dataset, create interaction features by multiplying or combining existing features.
Analyze the effect of these interaction features on the performance of a predictive model.
2. Given a dataset with missing values, compare different methods for handling missing data (e.g.,
mean imputation, median imputation, KNN imputation, and using a model to predict missing
values). Assess the impact of these methods on model performance.
Average Learners:
1. Take a dataset with date and time information and create new features such as day of the week,
month, hour, and time since the last event. Show how these features can improve the performance
of a time series or event prediction model.
2. Use a dataset with features of different scales and apply various feature scaling techniques (e.g.,
standardization, normalization, min-max scaling). Evaluate how feature scaling affects the
performance of a model, such as a support vector machine or k-nearest neighbors.
Below Average Learners:
1. Given a dataset with categorical variables, compare different encoding techniques (e.g., one-hot
encoding, label encoding, target encoding) and analyze their impact on the performance of a
machine learning model.
2. Apply dimensionality reduction techniques such as Principal Component Analysis (PCA) or t-
Distributed Stochastic Neighbor Embedding (t-SNE) to a high-dimensional dataset. Visualize the
results and discuss how dimensionality reduction affects the performance of a classifier or
regressor.
Slow Learners:
1. Using a dataset with many features, apply feature selection techniques (e.g., recursive feature
elimination, LASSO, mutual information) to identify the most important features. Compare the
performance of a model before and after feature selection.
2. Choose a dataset from a specific domain (e.g., finance, healthcare, e-commerce) and create domain-specific
features that could improve model performance. Explain the rationale behind the new features and
how they contribute to better predictions.
11 : PART A: Q & A UNIT – II
Q3: What are the main types of feature selection methods? CO2,K1
A3: Feature selection methods can be broadly categorized into three types:
1. Filter Methods: These methods assess the relationship between each feature
and the target variable independently. Examples include correlation-based
selection and statistical tests.
2. Wrapper Methods: These methods use a specific machine learning algorithm
to evaluate subsets of features, considering their combined predictive power.
Examples include recursive feature elimination and forward/backward selection.
3. Embedded Methods: These methods incorporate feature selection as part of the
model training process itself. Examples include LASSO regularization and tree-based
feature importance.
Q4: How can you evaluate the importance of features in a dataset? CO2,K1
A4: Feature importance can be evaluated through various methods, including:
• Correlation analysis: Measuring the correlation between each feature and the target
variable.
• Tree-based models: Assessing the decrease in impurity (e.g., Gini impurity)
attributed to each feature across the trees of the model.
Q8: Can you explain the concept of recursive feature elimination (RFE)?
CO2,K1
A8: Recursive Feature Elimination (RFE) is a wrapper method for feature selection.
It involves repeatedly training a model, removing the least important feature(s)
based on a defined criterion, and then retraining the model. This process is
repeated until a specified number of features is reached or until the desired model
performance is achieved. RFE helps identify the subset of features that contribute
the most to the model's predictive power.
Q9: What is the curse of dimensionality, and how does feature selection
address it? CO2,K1
A9: The curse of dimensionality refers to the challenges posed by high-dimensional
data, where the number of features is much larger than the number of
observations. In such cases, data becomes sparse, and model performance can
degrade. Feature selection helps address this issue by reducing the number of
dimensions, making the data more manageable, improving model generalization,
and reducing the risk of overfitting.
Q10: In what scenarios would you prefer using embedded feature
selection methods? CO2,K1
A10: Embedded feature selection methods are preferred when the feature selection
process is integrated into the model training itself. They are especially useful when:
• The dataset has a high number of features.
Q12: How do you extract features from time series data? CO2,K1
A12: Feature extraction from time series data can involve methods like:
• Statistical features: Mean, variance, skewness, kurtosis, etc.
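A tiny illustration of such statistical features with pandas; the random-walk series below is an assumed toy example.

```python
import numpy as np
import pandas as pd

series = pd.Series(np.random.default_rng(0).normal(size=200).cumsum())  # assumed toy series

# Whole-series statistical features usable as inputs to a downstream model
features = {
    "mean": series.mean(),
    "variance": series.var(),
    "skewness": series.skew(),
    "kurtosis": series.kurt(),
}
print(features)
```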
Q13: What are some applications of feature-based time series analysis? CO2,K1
A13: Feature-based time series analysis has various applications, such as:
• Stock price prediction.
12. PART B QUESTIONS : UNIT – II
13. SUPPORTIVE ONLINE CERTIFICATION COURSES
NPTEL : https://nptel.ac.in/courses/110106064
Coursera : https://www.coursera.org/learn/time-series-features
Udemy : https://www.udemy.com/course/feature-selection-for-machine-learning/
14. REAL TIME APPLICATIONS
Feature selection and evaluation are critical steps in various real-time applications
across different domains. Here are some real-time applications where feature
selection and evaluation play a significant role:
Real-Time Predictive Maintenance: In industries such as manufacturing and
aviation, feature selection and evaluation help identify the most relevant sensor data for
predicting equipment failures. By selecting key features and evaluating their
importance, predictive maintenance models can detect anomalies and trigger
maintenance actions before critical failures occur.
Financial Fraud Detection: Feature selection and evaluation are crucial for
identifying patterns indicative of fraudulent transactions in real-time financial data. By
selecting informative features and evaluating their impact, fraud detection
algorithms can swiftly detect and flag potentially fraudulent activities.
Healthcare Monitoring and Diagnostics: In real-time patient monitoring,
selecting relevant physiological features and evaluating their significance can aid in
early disease detection and patient risk assessment. Continuous evaluation of
features can provide timely alerts for medical interventions.
Web Traffic Anomaly Detection: For websites and online services, feature
selection and evaluation assist in detecting unusual patterns in real-time web traffic
data. Relevant features are chosen to capture normal behavior, and evaluation helps
distinguish anomalies that could indicate cyberattacks or system glitches.
Autonomous Vehicles and Robotics: Feature selection and evaluation are
essential in real-time decision-making for autonomous vehicles and robots. Selecting
pertinent sensory inputs and evaluating their impact on navigation and control
algorithms ensure safe and efficient operations.
Energy Management and Smart Grids: Feature selection and evaluation are
used in real-time energy consumption analysis and forecasting. By selecting relevant
features related to energy usage patterns and evaluating their influence, smart grids can
optimize energy distribution and consumption.
Network Intrusion Detection: In cybersecurity, feature selection and
evaluation help identify network anomalies and potential security
breaches in real-time. By selecting key network traffic features and
evaluating their importance, intrusion detection systems can quickly flag
suspicious activities.
Social Media Sentiment Analysis: Real-time sentiment analysis on
social media platforms relies on feature selection and evaluation to
determine which textual and linguistic features are most indicative of user
sentiment. This enables businesses to respond promptly to customer
feedback and trends.
Environmental Monitoring: Real-time analysis of environmental data,
such as air quality and pollution levels, benefits from feature selection
and evaluation. By
selecting relevant environmental indicators and evaluating their impact,
authorities can make informed decisions to protect public health.
Human Activity Recognition: In wearable devices and IoT
applications, feature selection and evaluation help in real-time recognition
of human activities. Relevant
sensor data features are chosen, and evaluation ensures accurate
tracking and classification of activities.
16. Assessment Schedule
(Proposed Date & Actual Date)
Sl.No.  ASSESSMENT                  Proposed Date        Actual Date
1       FIRST INTERNAL ASSESSMENT   22-8-24 to 30-8-24
17. PRESCRIBED TEXT BOOKS & REFERENCE BOOKS
TEXT BOOKS:
1. Suresh Kumar Mukhiya and Usman Ahmed, “Hands-On Exploratory Data Analysis with Python”, Packt Publishing, March 2020.
2. Guozhu Dong, Huan Liu, “Feature Engineering for Machine Learning and Data Analytics”, CRC Press, First edition, 2018.
REFERENCES:
1. Danyel Fisher & Miriah Meyer, “Making Data Visual: A Practical Guide to Using Visualization for Insight”, O’Reilly Publications, 2018.
2. Claus O. Wilke, “Fundamentals of Data Visualization”, O’Reilly Publications, 2019.
3. EMC Education Services, “Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data”, Wiley Publishers, 2015.
4. Tamara Munzner, “Visualization Analysis and Design”, A K Peters/CRC Press, 1st edition, 2014.
5. Matthew O. Ward, Georges Grinstein, Daniel Keim, “Interactive Data Visualization: Foundations, Techniques, and Applications”, 2nd Edition, CRC Press, 2015.
18. MINI PROJECT SUGGESTION
Mini Project Ideas:
Toppers:
1. Spam Email Classification:
Dataset: A collection of emails labeled as spam or not spam.
Task: Perform feature selection and evaluation to build a spam email classifier.
Techniques: Compare different feature selection methods such as the chi-squared test, mutual
information, and recursive feature elimination. Evaluate model performance using
metrics like accuracy, precision, recall, and F1-score.
Above Average Learners
2. House Price Prediction:
Dataset: Housing dataset containing various features related to houses.
Task: Select relevant features and build a regression model to predict house prices.
Techniques: Utilize correlation analysis and feature importance from tree-based
models. Compare the performance of different regression algorithms and evaluate
using metrics like RMSE and R-squared.
Average Learners
3. Customer Churn Prediction:
Dataset: Customer data from a telecommunications company, including various
customer attributes.
Task: Identify key features affecting customer churn and build a churn prediction
model.
Techniques: Apply feature importance from ensemble methods like Random Forest or
XGBoost. Evaluate model performance using accuracy, precision, recall, and ROC-AUC.
Below Average Learners
4. Medical Diagnostics:
Dataset: Medical dataset with patient attributes and diagnostic outcomes.
Task: Perform feature selection and evaluation to develop a diagnostic model.
Techniques: Implement L1-based regularization to select important features. Evaluate the
model using sensitivity, specificity, and overall accuracy.
Slow Learners:
5. Take a dataset with date and time information and create new features such as day of the
week, month, hour, and time since the last event. Show how these features can improve
the performance of a time series or event prediction model.
Thank you