Please read this disclaimer before
proceeding:
This document is confidential and intended solely for the educational purpose of
RMK Group of Educational Institutions. If you have received this document
through email in error, please notify the system manager. This document contains
proprietary information and is intended only for the respective group / learning
community. If you are not the addressee, you should not disseminate, distribute or
copy it through e-mail. Please notify the sender immediately by e-mail if you have
received this document by mistake and delete it from your system. If you are not
the intended recipient, you are notified that disclosing, copying, distributing or
taking any action in reliance on the contents of this information is strictly
prohibited.
22AI502/22AM502
Data Exploration, Feature
Engineering and Visualization
Department: AI & DS
Batch/Year: 2022-2024 / III Year
Created by
Dr. M. Shanthi, Associate Professor / ADS
1. Table of Contents
1. Contents
2. Course Objectives
3. Pre-Requisites
4. Syllabus
5. Course Outcomes
6. CO – PO/PSO Mapping Matrix
7. Lecture Plan
8. Activity Based Learning
9. Lecture Notes
10. Assignments
12. Part B Qs
2. Course Objectives
3. Pre-Requisites
• Semester-IV: Data Analytics
• Semester-II: Introduction to Data Science
• Semester-II: Python Programming
• Semester-I: C Programming
4. SYLLABUS
22AI502   Data Exploration and Visualization                L T P C
                                                            2 0 2 3
The Seven Stages of Visualizing Data, Processing - load and displaying data - functions,
sketching and scripting, Mapping - location, data, two-sided data ranges, smooth
interpolation of values over time - Visualization of numeric data and non-numeric data.
UNIT IV TIME SERIES ANALYSIS 6+6
Overview of time series analysis-showing data as an area, drawing tabs, handling
mouse input, Connections And Correlations – Preprocessing- introducing regular
expression, sophisticated sorting, Scatterplot Maps- deployment issues
Treemaps - treemap library, directory structure, maintaining context, file item, folder
item, Networks and Graphs-approaching network problems-advanced graph example,
Acquiring data, Parsing data
5. COURSE OUTCOMES
6.CO – PO /PSO Mapping Matrix
CO    PO1  PO2  PO3  PO4  PO5  PO6  PO7  PO8  PO9  PO10  PO11  PO12  PSO1  PSO2  PSO3
1 3 2 1 1 1 1 1 1 2 3 3
2 3 2 1 3 1 1 1 1 2 3 3
3 3 2 1 3 3 3 3 3 2 3 3
4 3 3 2 3 3 3 3 3 2 3 3
5 3 2 2 3 3 3 3 3 2 3 3
Lecture Plan
Unit - II
LECTURE PLAN – Unit II - FEATURE ENGINEERING
S.No | Topic                              | No. of Periods | Proposed Date | CO | Taxonomy Level | Mode of Delivery
3    | Visual Data                        | 1              | 1.8.2024      | 2  | K2             | PPT / Chalk & Talk
4    | Feature-based Time-Series Analysis | 1              | 2.8.2024      | 2  | K2             | PPT / Chalk & Talk
5    | Feature-based Time-Series Analysis | 1              | 2.8.2024      | 2  | K2             | PPT / Chalk & Talk
8. ACTIVITY BASED LEARNING
Guidelines to do an activity:
3) Conduct peer review. (Each team will be reviewed by all other teams and mentors.)
Useful link:
https://www.analyticsvidhya.com/blog/2020/10/feature-selection-techniques-in-machine-learning/
UNIT-II
FEATURE ENGINEERING
9. LECTURE NOTES
Do you know what takes the maximum amount of time and effort in a
Machine Learning workflow?
Material
https://www.youtube.com/watch?v=YLjM5WjJEm0
https://www.youtube.com/watch?v=yQ5wTC4E5us
This pie-chart shows the results of a survey conducted by Forbes. It is
abundantly clear from the numbers that one of the main jobs of a Data
Scientist is to clean and process the raw data. This can take up to 80% of
the time of a data scientist. This is where Feature Engineering comes into
play. After the data is cleaned and processed it is then ready to be fed
into the machine learning models to train and generate outputs.
So far we have established that Feature Engineering is an extremely
important part of a Machine Learning Pipeline, but why is it needed
in the first place?
To understand that, let us understand how we collect the data in the first
place. In most cases, Data Scientists deal with data extracted from massive
open data sources such as the internet, surveys, or reviews. This data is
crude and is known as raw data. It may contain missing values,
unstructured data, incorrect inputs, and outliers. If we directly use this raw,
unprocessed data to train our models, we will end up with a model that
performs very poorly.
Since 2016, automated feature engineering has also been used in different machine learning
software packages that help in automatically extracting features from raw data. Feature
engineering in ML mainly involves four processes: Feature Creation,
Transformations, Feature Extraction, and Feature Selection.
2. Feature Selection: While developing the machine learning model, only a few
variables in the dataset are useful for building the model, and the rest of the features
are either redundant or irrelevant. If we input the dataset with all these
redundant and irrelevant features, it may negatively impact and reduce the overall
performance and accuracy of the model. Hence it is very important to identify and
select the most appropriate features from the data and remove the irrelevant or
less important features, which is done with the help of feature selection in
machine learning.
"Feature selection is a way of selecting the subset of the most relevant features from the
original features set by removing the redundant, irrelevant, or noisy features."
o It helps in the simplification of the model so that the researchers can easily
interpret it.
o Better features mean better results.
As already discussed, in machine learning the output depends on the data we
provide. So, to obtain better results, we need to use better features.
For example, rows or columns with a huge percentage of missing values can be
removed entirely. But at the same time, to maintain the data size, it is often required
to impute the missing data, which can be done as follows:
o For numerical data imputation, a default value can be imputed in a column, or
missing values can be filled with the mean or median of the column.
o For categorical data imputation, missing values can be replaced with the most
frequently occurring value in the column.
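A minimal sketch of both imputation strategies, assuming a small hypothetical pandas DataFrame and scikit-learn's SimpleImputer (the column names and values are made up):

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical raw data with missing entries
df = pd.DataFrame({
    "age":    [25, np.nan, 31, 47, np.nan],
    "salary": [50000, 62000, np.nan, 81000, 58000],
    "city":   ["Chennai", np.nan, "Mumbai", "Chennai", "Delhi"],
})

# Numerical imputation: fill missing values with the column median (or mean)
num_imputer = SimpleImputer(strategy="median")
df[["age", "salary"]] = num_imputer.fit_transform(df[["age", "salary"]])

# Categorical imputation: fill missing values with the most frequent value
cat_imputer = SimpleImputer(strategy="most_frequent")
df[["city"]] = cat_imputer.fit_transform(df[["city"]])

print(df)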
2. Handling Outliers
Outliers are deviated values or data points that lie so far away from the other data
points that they badly affect the performance of the model. Outliers can be handled
with this feature engineering technique, which first identifies the outliers and then
removes them.
Standard deviation can be used to identify outliers. For example, each value lies at
some distance from the average, but if a value lies farther away than a certain
threshold, it can be considered an outlier. The Z-score can also be used to detect
outliers.
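A small illustration of the Z-score idea on a made-up one-dimensional feature; the cut-off value is a tunable assumption, not a fixed rule:

import numpy as np

# Hypothetical 1-D feature with two values that sit far from the rest
values = np.array([10, 12, 11, 13, 12, 95, 11, 10, 14, -60], dtype=float)

# Z-score: how many standard deviations each value lies from the mean
z = (values - values.mean()) / values.std()

# Flag values beyond a chosen cut-off (2 here; 3 is also common for larger samples)
outliers = values[np.abs(z) > 2.0]
cleaned = values[np.abs(z) <= 2.0]
print(outliers)   # the detected outliers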
3. Log transform
Logarithm transformation, or log transform, is one of the commonly used
mathematical techniques in machine learning. Log transform helps in handling
skewed data, making the distribution closer to normal after transformation. It also
reduces the effect of outliers on the data: because the magnitude differences are
normalized, the model becomes more robust.
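A brief sketch of a log transform on a hypothetical right-skewed column, using numpy's log1p so that zero values are handled safely:

import numpy as np
import pandas as pd

# Hypothetical right-skewed feature (e.g., incomes)
income = pd.Series([20_000, 25_000, 30_000, 40_000, 55_000, 90_000, 750_000])

# log1p = log(1 + x); compresses large magnitudes and tolerates zeros
income_log = np.log1p(income)

print(income.skew(), income_log.skew())  # skewness drops after the transform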
4. Binning
In machine learning, overfitting is one of the main issues that degrade the
performance of the model and which occurs due to a greater number of parameters
and noisy data. However, one of the popular techniques of feature engineering,
"binning", can be used to normalize the noisy data. This process involves segmenting
different features into bins.
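A short sketch of binning with pandas, assuming a made-up age column; both fixed-width and quantile bins are shown:

import pandas as pd

ages = pd.Series([5, 17, 23, 31, 38, 45, 52, 61, 70, 84])

# Fixed-width bins with readable labels
age_bins = pd.cut(ages, bins=[0, 18, 35, 60, 100],
                  labels=["child", "young", "middle", "senior"])

# Quantile bins: each bin holds roughly the same number of observations
age_quartiles = pd.qcut(ages, q=4, labels=["q1", "q2", "q3", "q4"])

print(pd.DataFrame({"age": ages, "bin": age_bins, "quartile": age_quartiles}))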
5. Feature Split
As the name suggests, feature split is the process of splitting features intimately
into two or more parts and performing to make new features. This technique
helps the algorithms to better understand and learn the patterns in the
dataset.
The feature splitting process enables the new features to be clustered and
binned, which results in extracting useful information and improving the
performance of the data models.
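A minimal sketch of feature splitting on hypothetical name and timestamp columns using pandas:

import pandas as pd

df = pd.DataFrame({
    "full_name": ["Asha Rao", "Vikram Singh", "Meena Iyer"],
    "timestamp": pd.to_datetime(["2024-08-01 09:30", "2024-08-02 14:05",
                                 "2024-08-03 18:45"]),
})

# Split a text feature into parts
df[["first_name", "last_name"]] = df["full_name"].str.split(" ", n=1, expand=True)

# Split a datetime feature into components the model can use
df["day_of_week"] = df["timestamp"].dt.dayofweek
df["hour"] = df["timestamp"].dt.hour

print(df)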
Below, you will find an overview of some of the best libraries and frameworks you
can use for automating feature engineering.
Featuretools
Featuretools is one of the most widely used libraries for feature engineering
automation. It supports a wide range of operations, such as selecting features and
constructing new ones from relational databases. In addition, it offers simple
aggregations using max, sum, mode, and other primitives. But one of its most
important functionalities is the ability to build features using deep feature
synthesis (DFS).
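A hedged sketch of deep feature synthesis, assuming the Featuretools 1.x API and two made-up tables (customers and orders):

import featuretools as ft
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "customer_id": [101, 101, 102, 102],
    "amount": [250.0, 90.0, 400.0, 60.0],
})
customers = pd.DataFrame({"customer_id": [101, 102]})

# Build an EntitySet describing the tables and their relationship
es = ft.EntitySet(id="shop")
es = es.add_dataframe(dataframe_name="customers", dataframe=customers,
                      index="customer_id")
es = es.add_dataframe(dataframe_name="orders", dataframe=orders,
                      index="order_id")
es = es.add_relationship("customers", "customer_id", "orders", "customer_id")

# Deep feature synthesis: automatically builds aggregations such as SUM, MAX, MEAN
feature_matrix, feature_defs = ft.dfs(entityset=es,
                                      target_dataframe_name="customers",
                                      agg_primitives=["sum", "max", "mean"])
print(feature_matrix)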
TSFresh
PyCaret
Matplotlib and Seaborn that will help you with plotting and visualization.
Text Data
Text data broadly refers to all kinds of natural language data, including both
written text and spoken language. Recent years have seen a dramatic growth of
online text data with many examples such as web pages, news articles,
scientific literature, emails, enterprise documents, and social media such as blog
articles, forum posts, product reviews, and tweets. Text data contain rich
knowledge about the world, including human opinions and preferences.
Because of this, mining and analyzing vast amounts of text data (“big text
data”) can enable us to support user tasks and optimize decision making in all
application domains. Nearly all text data is generated by humans and for human
consumption. It is useful to imagine that text data is generated by humans
operating as intelligent subjective sensors: we humans observe the world
from a particular perspective (and thus are subjective) and we express our
observations in the form of text data. When we take this view, we can see that
as a special kind of big data, text data has unique values. First, since all
domains involve humans, text data are useful in all applications of big data.
Second, because text data are subjective, they offer opportunities for mining
knowledge about people’s behaviors, attitudes, and opinions. Finally, text data
directly express knowledge about our world, so small text data are also useful
(provided computers can understand it).
2. Visual Data
Most visual computing tasks involve prediction, regression or decision making using
features extracted from the original, raw visual data (images or videos). Feature
engineering typically refers to this (often creative) process of extracting new
representations from the raw data that are more conducive to a computing task.
Indeed, the performance of many machine learning algorithms heavily depends on
having insightful input representations that expose the underlying explanatory
factors of the output for the observed input. An effective data representation would also
reduce data redundancy and adapt to undesired, external factors of variation
introduced by sensor noise, labeling errors, missing samples, etc. In the case of
images or videos, dimensionality reduction is often an integral part of feature
engineering, since the raw data are typically high dimensional. Over the years, many
feature engineering schemes have been proposed and researched for producing
good representations of raw images and videos.
Many existing feature engineering approaches may be categorized into one of three
broad groups:
1. Classical, sometimes hand-crafted, feature representations: In general, these may
refer to rudimentary features such as image gradients as well as fairly sophisticated
features from elaborate algorithms such as the histogram of oriented gradients feature.
More often
than not, such features are designed by domain experts who have good knowledge
about the data properties and the demands of the task. Hence sometimes such
features are
called hand-crafted features. Hand-engineering features for each task requires a lot of
manual labor and domain knowledge, and optimality is hardly guaranteed. However, it
allows integration of human knowledge of the real world and of that specific task into
the feature design process, hence making it possible to obtain good results for the said
task. These types of features are easy to interpret. Note that it is not completely correct
to call all classical features hand-crafted, since some of them are general-purpose
features with little task-specific tuning (such as the outputs of simple gradient filters).
Time series analysis is a technique in statistics that deals with time series data and
trend analysis. Time series data consists of observations that have been measured at
regular time intervals or collected at particular points in time. In
other words, a time series is simply a series of data points ordered in time, and time
series analysis is the process of making sense of this data.
In a business context, examples of time series data include any trends that need to
be captured over a period of time. A Google trends report is a type of time series data
that can be analyzed. There are also far more complex applications such as demand
and supply forecasting based on past trends.
In economics, time series data could be the Gross Domestic Product (GDP), the
Consumer Price Index, S&P 500 Index, and unemployment rates. The data set could be
a country’s gross domestic product from the federal reserve economic data.
From a social sciences perspective, time series data could be birth rate, migration
data, population rise, and political factors.
The statistical characteristics of time series data do not always fit conventional
statistical methods. As a result, analyzing time series data accurately requires a unique
set of tools and methods, collectively known as time series analysis.
Material
https://www.youtube.com/watch?v=9QtL7m3YS9I&t=264s
https://www.youtube.com/watch?v=2vMNiSeNUjI&t=64s
Seasonality refers to periodic fluctuations. For example, if you consider
electricity consumption, it is typically high during the day and
lower during the night. In the case of shopping patterns, online
sales spike during the holidays before slowing down and dropping.
These are some of the terms and concepts associated with time series data
analysis:
Dependence: Dependence refers to the association of an observation
with observations of the same variable at prior time points.
While the exact mathematical models are beyond the scope of this article, these are
some specific applications of these models that are worth discussing here.
The Box-Jenkins models of both the ARIMA and multivariate varieties use the past
behaviour of a variable to decide which model is best to analyse it. The assumption is
that any time series data for analysis can be characterized by a linear function of its
past values, past errors, or both. When the model was first developed, the data used
was from a gas furnace and its variable behaviour over time.
In contrast, the Holt-Winters exponential smoothing model is best suited to
analyzing time series data that exhibits a defining trend and varies by seasons.
Such mathematical models are a combination of several methods of measurement; the Holt-
Winters method uses weighted averages which can seem simple enough, but these values
are layered on the equations for exponential smoothing.
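For illustration, a minimal sketch fitting both model families with statsmodels on a synthetic monthly series; the series, its frequency, and the ARIMA order are assumptions chosen only for the example:

import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing
from statsmodels.tsa.arima.model import ARIMA

# Hypothetical monthly series with a trend, yearly seasonality and noise
idx = pd.date_range("2018-01-01", periods=60, freq="MS")
y = pd.Series(100 + 0.8 * np.arange(60)
              + 10 * np.sin(2 * np.pi * np.arange(60) / 12)
              + np.random.normal(0, 2, 60), index=idx)

# Holt-Winters: additive trend and additive seasonality
hw = ExponentialSmoothing(y, trend="add", seasonal="add",
                          seasonal_periods=12).fit()
print(hw.forecast(6))          # next six months

# Box-Jenkins style ARIMA(p, d, q) model on the same series
arima = ARIMA(y, order=(1, 1, 1)).fit()
print(arima.forecast(6))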
Applications of Time Series Analysis
While the data is numerical and the analysis process seems mathematical, time series
analysis can seem almost abstract. However, any organization can realize a number of
present-day applications of such methods. For example, it is interesting to imagine that
large, global supply chains such as those of Amazon are only kept afloat due to the
interpretation of such complex data across various time periods. Even during the COVID-
19 pandemic where supply chains suffered maximum damage, the fact that they have
been able to bounce back faster is thanks to the numbers, and the comprehension of
these numbers, that continues to happen throughout each day and week.
Time series analysis is used to determine the best model that can be used to forecast
business metrics. For instance, stock market price fluctuations, sales,
turnover, and any other process that can use time series data to make predictions about
the future. It enables management to understand time-dependent patterns in
data and analyze trends in business metrics.
From a practical standpoint, time series analysis in organizations is mostly
used for:
Economic forecasting
Sales forecasting
Utility studies
Budgetary analysis
Yield projections
Census analysis
Inventory studies
Workload projections
Time series analysis and forecasting are essential processes for explaining the
dynamic and influential behaviour of financial markets. By examining
financial data, an expert can produce the required forecasts for important
financial applications in several areas such as risk evolution, option pricing &
trading, portfolio construction, etc.
For example, time series analysis has become an intrinsic part of
financial analysis and can be used in predicting interest rates, foreign
currency risk, volatility in stock markets and much more. Policymakers and
business experts use financial forecasting to make decisions about
production, purchases, market sustainability, allocation of resources, etc.
Time series analysis is extremely useful to observe how a given asset, security, or
economic variable behaves/changes over time. For example, it can be deployed to
evaluate how the underlying changes associated with some data observation
behave after shifting to other data observations in the same time period.
Medicine has evolved into a data-driven field, and time series analysis continues to
contribute to medical knowledge with enormous developments.
Case study
Consider the case of combining time series with the medical method CBR (case-based
reasoning) and data mining. These synergies are essential as the pre-processing step
for feature mining from time series data and can be useful to study the progress of
patients over time.
However, time series in the context of the epidemiology domain has emerged very
recently and incrementally as time series analysis approaches demand
recordkeeping systems such that records should be connected over time and
collected precisely at regular intervals.
Medical Instruments
Time series analysis has made its way into medicine with the advent of medical
devices such as the Electrocardiogram (ECG), invented in 1901 for diagnosing
cardiac conditions by recording the electrical pulses passing through the heart.
These inventions created more opportunities for medical practitioners to deploy time
series for medical diagnosis. With the advent of wearable sensors and smart
electronic healthcare devices, people can now take regular measurements
automatically with minimal input, resulting in a good collection of longitudinal
medical data, gathered consistently for both sick and healthy individuals.
One of the contemporary applications where time series plays a
significant role is in different areas of astronomy and astrophysics.
Time series data has had an intrinsic impact on knowing and measuring anything
about the universe; it has a long history in the astronomy domain. For example,
sunspot time series were recorded in China as early as 800 BC, which made sunspots
one of the best-recorded natural phenomena.
• To discover variable stars that are used to surmise stellar distances, and
• To observe transitory events such as supernovae to understand the
mechanism of the changing of the universe with time.
Such mechanisms are the results of constant monitoring of live streaming of time
series data depending upon the wavelengths and intensities of light that allows
astronomers to catch events as they are occurring.
In the last few decades, data-driven astronomy has introduced novel areas of research
such as Astroinformatics and Astrostatistics; these paradigms involve major disciplines such
as statistics, data mining, machine learning and computational intelligence. Here,
the role of time series analysis is to detect and classify astronomical objects
swiftly, along with characterizing novel phenomena independently.
In ancient times, the Greek philosopher Aristotle researched weather phenomena with the
idea of identifying causes and effects in weather changes. Later on, scientists started to
accumulate weather-related data using the barometer to compute the
state of atmospheric conditions; they recorded weather-related data at hourly or daily
intervals and kept the records at different locations.
Over time, customized weather forecasts began to be printed in newspapers, and later on,
with advancements in technology, forecasts now go beyond general
weather conditions.
These stations are equipped with highly functional devices and are interconnected
with each other to accumulate weather data at different geographical locations and
forecast weather conditions at any point in time, as required.
Time series forecasting helps businesses to make informed business decisions; as the
process analyses past data patterns, it can be useful in forecasting future possibilities and
events in the following ways:
Reliability: When the data incorporates a broad spectrum of time intervals in the
form of massive observations over a longer time period, time series forecasting is
highly reliable. It provides elucidating information by exploiting data observations at
various time intervals.
Growth: In order to evaluate overall financial performance and growth, as well as
endogenous growth, time series analysis is a very suitable tool. Basically, endogenous growth
is the progress within an organization's internal human capital resulting in economic growth.
For example, studying the impact of any policy variable can be carried out by applying
time series forecasting.
Trend estimation: Time series methods can be conducted to discover trends, for
example, these methods inspect data observations to identify when measurements
reflect a decrease or increase in sales of a particular product.
Seasonal patterns: Variances in recorded data points can unveil seasonal patterns and
fluctuations that act as a base for data forecasting. The obtained information is
significant for markets whose products fluctuate seasonally, and it assists organizations in
planning product development and delivery requirements.
Data cleansing filters out noise, removes outliers, or applies various averages to
gain a better overall perspective of the data. It means zeroing in on the signal by
filtering out the noise. The process of time series analysis removes the noise and
allows businesses to get a clearer picture of what is happening day-to-day.
The models used in time series analysis help to interpret the true meaning of the
data in a data set, making life easier for data analysts. Autocorrelation patterns and
seasonality measures can be applied to predict when a certain data point can be
expected. Furthermore, stationarity measures can provide an estimate of the value of
that data point.
This means that businesses can look at data and see patterns across time and space,
rather than a mass of figures and numbers that aren’t meaningful to the core function
of the organization.
Forecasting Data
Time series analysis can be the basis to forecast data. Time series analysis is
inherently equipped to uncover patterns in data which form the base to predict future
data points. It is this forecasting aspect of time series analysis that makes it extremely
popular in the business area. Where most data analytics use past data to retroactively
gain insights, time series analysis helps predict the future. It is this very edge that helps
management make better business decisions.
Linear methods for streaming feature construction are techniques that involve
creating new features in a linear fashion from streaming data, often with a focus on
efficiency and adaptability to data arriving sequentially. These methods are suitable
for real-time or near-real-time applications where data stream in continuously. Here are
some linear methods for streaming feature construction:
PCA generally tries to find the lower-dimensional surface to project the high-
dimensional data.
PCA works by considering the variance of each attribute, because high variance
indicates a good split between the classes, and hence it reduces the dimensionality.
Some real-world applications of PCA are image processing, movie
recommendation systems, and optimizing the power allocation in various
communication channels. It is a feature extraction technique, so it keeps the
important variables and drops the least important ones.
Some common terms used in PCA algorithm:
o Correlation: It signifies how strongly two variables are related to each other.
Such as if one changes, the other variable also gets changed. The correlation value
ranges from -1 to +1. Here, -1 occurs if variables are inversely proportional to each
other, and +1 indicates that variables are directly proportional to each other.
o Orthogonal: It defines that variables are not correlated to each other, and hence
the correlation between the pair of variables is zero.
o The principal component must be the linear combination of the original features.
o These components are orthogonal, i.e., the correlation between a pair of variables is
zero.
o The importance of each component decreases when going from 1 to n; the first principal
component has the most importance, and the nth principal component has the least.
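A minimal PCA sketch with scikit-learn on the Iris data, standardizing first and keeping two components:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)          # 150 samples x 4 features

# Standardize so that every feature contributes on the same scale
X_std = StandardScaler().fit_transform(X)

# Keep the first two principal components (orthogonal linear combinations)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)

print(X_pca.shape)                          # (150, 2)
print(pca.explained_variance_ratio_)        # importance decreases from PC1 to PC2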
Material
https://www.youtube.com/watch?v=fkf4IBRSeEc
https://www.youtube.com/watch?v=IbE0tbjy6JQ&list=PLBv09BD7ez_5_yapAg86Od6J
eeypkS4YM
2. Representing data into a structure
Now we will represent our dataset as a structure: a two-dimensional matrix of the
independent variables X. Here each row corresponds to a data item, and each column
corresponds to a feature. The number of columns gives the dimensionality of the dataset.
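A tiny sketch of this structure with made-up numbers:

import numpy as np

# 5 data items (rows) x 3 features (columns): the dataset's dimensionality is 3
X = np.array([
    [5.1, 3.5, 1.4],
    [4.9, 3.0, 1.4],
    [6.2, 2.9, 4.3],
    [5.9, 3.0, 5.1],
    [6.7, 3.1, 4.7],
])
print(X.shape)   # (5, 3) -> (data items, features)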
Applications of Principal Component Analysis
o It can also be used for finding hidden patterns if data has high dimensions.
Some fields where PCA is used are Finance, data mining, Psychology, etc.
To overcome the overlapping issue in the classification process, we must
increase the number of features regularly.
Example:
Let's assume we have to classify two different classes having two sets of data
points in a 2-dimensional plane, as shown in the image below:
Let's consider an example where we have two classes in a 2-D plane having an
X-Y axis, and we need to classify them efficiently. As we have already seen in the
above example, LDA enables us to draw a straight line that can completely
separate the two classes of data points. Here, LDA uses the X-Y axis to create
a new axis by separating the classes with a straight line and projecting the data onto
this new axis.
Hence, we can maximize the separation between these classes and reduce the 2-D
plane into 1-D.
To create a new axis, Linear Discriminant Analysis uses the following criteria:
Using the above two conditions, LDA generates a new axis in such a way that it
maximizes the distance between the means of the two classes and minimizes the
variation within each class.
In other words, we can say that the new axis will increase the separation between the data
points of the two classes and plot them onto the new axis.
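A minimal supervised LDA sketch with scikit-learn on the Iris data (three classes, so at most two discriminant axes):

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# Project onto at most (n_classes - 1) discriminant axes; here 2 for 3 classes
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)             # supervised: uses the labels y

print(X_lda.shape)                          # (150, 2)
print(lda.score(X, y))                      # classification accuracy on the same data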
Why LDA?
o LDA can also be used in data pre-processing to reduce the number of features,
just as PCA, which reduces the computing cost significantly.
o LDA is also used in face detection algorithms. In Fisher faces, LDA is used to
extract useful data from different faces. Coupled with eigenfaces, it produces
effective results.
LDA is specifically used to solve supervised classification problems for two or
more classes, which is not possible using logistic regression in machine learning. But LDA
also fails in some cases where the means of the distributions are shared. In this case, LDA fails
to create a new axis that makes both the classes linearly separable.
To overcome such problems, we use non-linear Discriminant analysis in machine
learning.
Linear Discriminant analysis is one of the most simple and effective methods to solve
classification problems in machine learning. It has so many extensions and variations as
follows:
Some of the common real-world applications of Linear discriminant Analysis are given
below:
o Face Recognition
Face recognition is the popular application of computer vision, where each face is
represented as the combination of a number of pixel values. In this case, LDA is
used to minimize the number of features to a manageable number before going
through the classification process. It generates a new template in which each
dimension consists of a linear combination of pixel values. If a linear combination is
generated using Fisher's linear discriminant, then it is called Fisher's face.
o Medical
In the medical field, LDA has a great application in classifying the patient disease on
the basis of various parameters of patient health and the medical treatment which is
going on. On such parameters, it classifies disease as mild, moderate, or severe. This
classification helps the doctors in either increasing or decreasing the pace of the
treatment.
o Customer Identification
LDA is currently being applied in customer identification. With the help of
LDA, we can easily identify and select the features that can specify a group of
customers who are likely to purchase a specific product in a shopping mall. This can
be helpful when we want to identify a group of customers who mostly purchase a
product in a shopping mall.
o For Predictions
LDA can also be used for making predictions, and hence in decision making. For
example, "will you buy this product?" will give a predicted result of one of
two possible classes: buying or not buying.
o In Learning
Nowadays, robots are being trained for learning and talking to simulate human work,
and it can also be considered a classification problem. In this case, LDA builds similar
groups on the basis of different parameters, including pitches, frequencies, sound,
tunes, etc.
o PCA is an unsupervised algorithm that does not care about classes and labels
and only aims to find the principal components to maximize the variance in the
given dataset. At the same time, LDA is a supervised algorithm that aims to
find the linear discriminants to represent the axes that maximize separation
between different classes of data.
o LDA is much more suitable for multi-class classification tasks than PCA.
However, PCA is assumed to perform well for comparatively small sample sizes.
o Both LDA and PCA are used as dimensionality reduction techniques; often
PCA is applied first, followed by LDA.
How to Prepare Data for LDA
Below are some suggestions that one should always consider while
preparing the data to build the LDA model:
o Same Variance: As LDA assumes that all the input variables have the same
variance, it is always better to standardize the data before implementing an
LDA model. By this, each variable will have a mean of 0 and a standard
deviation of 1.
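A short sketch of this preparation step, standardizing the inputs before LDA inside a scikit-learn pipeline (the Wine dataset is used only as a stand-in):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score

X, y = load_wine(return_X_y=True)

# Standardize first so every input variable has mean 0 and standard deviation 1
model = make_pipeline(StandardScaler(), LinearDiscriminantAnalysis())
print(cross_val_score(model, X, y, cv=5).mean())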
3. Feature-Based Time-Series Analysis
The passing of time is a fundamental component of the human experience and the
dynamics of real-world processes is a key driver of human curiosity. On observing a leaf
in the wind, we might contemplate the burstiness of the wind speed, whether the
wind direction now is related to what it was a few seconds ago, or whether the dynamics
might be similar if observed tomorrow. This line of questioning about dynamics has been
followed to understand a wide range of real-world processes, including in
seismology (e.g., recordings of earthquake tremors), biochemistry (e.g., cell
potential fluctuations), biomedicine (e.g., recordings of heart rate dynamics), ecology
(e.g., animal population levels over time), astrophysics (e.g., radiation dynamics),
meteorology (e.g., air pressure recordings), economics (e.g., inflation rates variations),
human machine interfaces (e.g., gesture recognition from accelerometer data), and
industry (e.g., quality control sensor measurements on a production line). In each case,
the dynamics can be captured as a set of repeated measurements of the system over
time, or a time series. Time series are a fundamental data type for understanding
dynamics in real-world systems. Note that throughout this work we use the convention
of hyphenating “time-series” when used as an adjective, but not when used as a noun
(as “time series”). In general, time series can be sampled non-uniformly through time,
and can therefore be represented as a vector of time stamps, ti, and associated
measurements, xi. However, time series are frequently sampled uniformly through time
(i.e., at a constant sampling period, ∆t), facilitating a more compact representation as an
ordered vector x = (x1, x2, ..., xN), where N measurements have been taken at times
t = (0, ∆t, 2∆t, ...,(N − 1)∆t). Representing a uniformly sampled time series as an
ordered vector allows other types of real-valued sequential data to be represented
in the same way, such as spectra (where measurements are ordered by frequency),
word length sequences of sentences in books (where measurements are ordered
through the text), widths of rings in tree trunks (ordered across the radius of the trunk
cross section), and even the shape of objects (where the distance from a central point
in a shape can be measured and ordered by the angle of rotation of the shape) .
Some examples are shown in Fig. 4.1. Given this common representation for
sequential data, methods developed for analysing time series (which order
measurements by time) can also be applied to understand patterns in any sequential
data.
Figure: Sequential data can be ordered in many ways, including A temperature measured
over time (a time series), B a sequence of ring widths, ordered across the cross section of
a tree trunk, and C a frequency spectrum of astrophysical data (ordered by
frequency). All of these sequential measurements can be analyzed by methods that take
their sequential ordering into account, including time-series analysis methods.
While the time series described above are the result of a single
measurement taken repeatedly through time, or univariate time series, measurements are
frequently made from multiple parts of a system simultaneously, yielding multivariate time
series. Examples of multivariate time series include measurements of the activity
dynamics of multiple brain regions through time, or measuring the air temperature, air
pressure, and humidity levels together through time. Techniques have been developed to
model and understand multivariate time series, and infer models of statistical
associations between different parts of a system that may explain its multivariate
dynamics. Methods for characterizing inter-relationships between time series are vast,
including the simple measures of statistical dependencies, like linear cross correlation,
mutual information, and to infer causal (directed) relationships using methods like
transfer entropy and Granger causality. A range of information-theoretic methods for
characterizing time series, particularly the dynamics of information transfer between time
series, are described and implemented in the excellent Java Information Dynamics
Toolkit (JIDT). Feature-based representations of multivariate systems can include both
features of individual time series, and features of inter-relationships between (e.g., pairs of)
time series. However, in this chapter we focus on individual univariate time series
sampled uniformly through time (that can be represented as ordered vectors, xi).
3.2 Time-Series Characterization
As depicted in the left box of Fig. 4.2, real-world and model-generated time series
are highly diverse, ranging from the dynamics of sets of ordinary differential
equations simulated numerically, to fast (nanosecond timescale) dynamics of
plasmas, the bursty patterns of daily rainfall, or the complex fluctuations of global
financial markets. How can we capture the different types of patterns in these data
to understand the dynamical processes underlying them? Being such a ubiquitous
data type, part of the excitement of time-series analysis is the large
interdisciplinary toolkit of analysis methods and quantitative models that have
been developed to quantify interesting structures in time series, or time-series
characterization.
We distinguish the characterization of unordered sets of data, which is
restricted to the distribution of values, and allows questions to be asked like the
following: Does the sample have a high mean or spread of values? Does the sample
contain outliers? Are the data approximately Gaussian distributed? While these types of
questions can also be asked of time series, the most interesting types of questions
probe the temporal dependencies and hence the dynamic processes that might
underlie the data, e.g., How bursty is the time series? How correlated is the value of
the time series to its value one second in the future? Does the time series contain
strong periodicities? Interpreting the answers to these questions in their domain
context provides understanding of the process being measured.
Figure: Time-series characterization. Left: A sample of nine real world time
series reveals a diverse range of temporal patterns [25, 29]. Right:
Examples of different classes of methods for quantifying different types of
structure, such as those seen in time series on the left:
(i) distribution (the distribution of values in the time series, regardless of their
sequential ordering);
(ii) autocorrelation properties (how values of a time series are
correlated to themselves through time);
(iii) stationarity (how statistical properties
change across a recording);
(iv) entropy (measures of complexity or
predictability of the time series quantified using
information theory); and
(v) nonlinear time series analysis (methods that quantify nonlinear
properties of the
dynamics).
Some key classes of methods developed for characterizing time series are
depicted in the right panel of Fig., and include autocorrelation, stationarity,
entropy, and methods from the physics-based nonlinear time-series
analysis literature. Within each broad methodological class, hundreds of
time-series analysis methods have been developed across decades of diverse
research.
In their simplest form, these methods can be represented as algorithms that
capture time-series properties as real numbers, or features. Many
different feature-based representations for time series have been developed and used in
applications ranging from time-series modeling and forecasting to classification.
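As a small illustration, the sketch below maps a synthetic univariate time series to a handful of such features (the particular features chosen are only examples):

import numpy as np

def basic_ts_features(x):
    """Map a univariate time series (ordered vector) to a few real-valued features."""
    x = np.asarray(x, dtype=float)
    diffs = np.diff(x)
    return {
        "mean": x.mean(),                                   # distribution
        "std": x.std(),                                     # spread
        "lag1_autocorr": np.corrcoef(x[:-1], x[1:])[0, 1],  # autocorrelation
        "mean_abs_change": np.abs(diffs).mean(),            # roughness / burstiness proxy
    }

rng = np.random.default_rng(0)
noisy_sine = np.sin(np.linspace(0, 8 * np.pi, 200)) + 0.3 * rng.normal(size=200)
print(basic_ts_features(noisy_sine))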
4 Feature Selection for Data Streams with Streaming Features
For the feature selection problem with streaming features, the number of instances is fixed while
candidate features arrive one at a time; the task is to timely select a subset of relevant features
from all features seen so far. A typical framework for streaming feature selection consists of
Step 1: a new feature arrives; Step 2: decide whether to add the new feature to the selected
features; Step 3: determine whether to remove features from the selected features; and Step 4:
repeat Step 1 to Step 3. Different algorithms may have distinct implementations for Step 2 and
Step 3; next we will review some representative methods. Note that Step 3 is optional and some
streaming feature selection algorithms only provide Step 2.
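The skeleton below is a schematic rendering of this four-step loop; the add and prune tests are left as user-supplied functions, since different algorithms implement Steps 2 and 3 differently:

def streaming_feature_selection(feature_stream, add_test, prune_test):
    """Schematic skeleton of the framework described above.

    feature_stream : iterable yielding one candidate feature at a time
    add_test(feature, selected) -> bool   # Step 2: accept the new feature?
    prune_test(selected) -> list          # Step 3 (optional): features to drop
    """
    selected = []
    for feature in feature_stream:          # Step 1: a new feature arrives
        if add_test(feature, selected):     # Step 2: decide whether to add it
            selected.append(feature)
            for f in prune_test(selected):  # Step 3: re-check existing features
                selected.remove(f)
    return selected                         # Step 4 is the loop itself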
e. Update the classifier by retraining it with the selected features.
Repeat: Continue this process as new data arrives. The algorithm will adaptively select features
that are most informative for the classification task.
Incremental Feature Selection: Grafting incrementally selects features one at a time, taking
into account their contributions to the classifier's performance.
Adaptive Feature Selection: It dynamically adjusts the set of selected features as new data
arrives, ensuring that only the most relevant features are retained.
Efficiency: Grafting is efficient because it avoids exhaustive search over feature subsets and
only evaluates the utility of adding or removing one feature at a time.
Thresholds: The algorithm relies on a predefined threshold for evaluating whether adding a
feature is beneficial. This threshold can be set based on domain knowledge or through cross-
validation.
Grafting is particularly useful in scenarios where you have a large number of features and
limited computational resources or when dealing with data streams where the feature set may
evolve over time. It strikes a balance between maintaining model performance and reducing
feature dimensionality, which can be beneficial for both efficiency and interpretability of
machine learning models. Keep in mind that the specific implementation and parameter
settings of the Grafting Algorithm may vary depending on the machine learning framework
and problem domain.
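A conceptual sketch of the gradient-test idea behind grafting, not the exact published algorithm: a candidate feature is added only if the gradient of the (logistic) loss with respect to its currently-zero weight exceeds a regularization threshold. The threshold, the logistic model, and the synthetic data are all illustrative assumptions.

import numpy as np
from sklearn.linear_model import LogisticRegression

def grafting_step(X_active, x_new, y, lam=0.1):
    """Decide whether a newly arriving feature x_new is worth adding.

    The magnitude of the gradient of the mean logistic loss with respect to the
    (currently zero) weight of x_new is compared against the threshold lam: if it
    exceeds lam, adding the feature can reduce the penalized loss.
    """
    if X_active.shape[1] > 0:
        model = LogisticRegression(max_iter=1000).fit(X_active, y)
        p = model.predict_proba(X_active)[:, 1]
    else:
        p = np.full(len(y), y.mean())          # intercept-only baseline
    grad = np.abs(x_new @ (p - y)) / len(y)    # |dL/dw_new| at w_new = 0
    return grad > lam

rng = np.random.default_rng(1)
y = rng.integers(0, 2, 200)
informative = y + 0.3 * rng.normal(size=200)   # correlated with the label
noise = rng.normal(size=200)                   # unrelated to the label
X0 = np.empty((200, 0))
print(grafting_step(X0, informative, y))       # True: gradient is large
print(grafting_step(X0, noise, y))             # False: gradient is near zero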
Alpha-Investing is a statistical method used for sequential hypothesis testing, primarily in the
context of multiple hypothesis testing or feature selection. It was introduced as an enhancement
to the Sequential Bonferroni method, aiming to control the Family-Wise Error Rate (FWER) while
being more powerful and efficient in adaptive and sequential settings.
Here's a high-level overview of the Alpha-Investing algorithm:
Initialization: Start with an empty set of selected hypotheses (features) and set an
initial significance level (alpha). This alpha level represents the desired FWER control
and guides the decision-making process.
Alpha Update: After each hypothesis test, update the alpha level dynamically
based on the test results and the number of hypotheses tested so far. Alpha-
Investing adjusts the significance level to maintain FWER control while adapting to
the increasing number of tests.
Output: The selected hypotheses at the end of the process are considered
statistically significant, and the others are rejected or not selected.
FWER Control: It controls the Family-Wise Error Rate, which is the probability of
making at least one false discovery among all the hypotheses tested. This makes
it suitable for applications where controlling the overall error rate is critical.
Streaming Feature Selection Loop: Repeat the above steps as new
data points continue to arrive in the stream. The feature selection
process is ongoing and adaptive to the changing data distribution.
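A rough sketch of one common alpha-investing formulation (the exact alpha-spending rule, payout, and wealth updates vary across papers; the values below are assumptions for illustration):

def alpha_investing(p_values, w0=0.05, payout=0.05):
    """Sketch of alpha-investing over a stream of p-values (one per new feature).

    w0      : initial alpha-wealth
    payout  : wealth earned back whenever a feature is selected
    Returns the indices of selected features.
    """
    wealth = w0
    selected = []
    for j, p in enumerate(p_values, start=1):
        alpha_j = wealth / (2 * j)          # spend a fraction of current wealth
        if p <= alpha_j:                    # reject the null: keep the feature
            selected.append(j - 1)
            wealth += payout - alpha_j
        else:                               # pay for the failed test
            wealth -= alpha_j / (1 - alpha_j)
        if wealth <= 0:                     # no budget left: stop testing
            break
    return selected

print(alpha_investing([0.0001, 0.2, 0.003, 0.6, 0.04, 0.5]))   # -> [0, 2]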
Data Ingestion:
Stream social media data from platforms like Twitter, Facebook, or Instagram.
Online Clustering:
Cluster the incoming data based on the extracted features. The number of
clusters can be determined using heuristics or adaptively based on data
characteristics.
Feature Importance within Clusters:
For each cluster, calculate feature importance scores. Feature importance can be
determined using various unsupervised methods, such as term frequency-inverse
document frequency (TF-IDF), mutual information, or chi-squared statistics, within the
context of each cluster.
Rank features within each cluster based on their importance scores. Select a fixed
number or percentage of top-ranked features from each cluster. Alternatively, you can
use dynamic thresholds to adaptively select features based on their importance scores
within each cluster.
Dynamic Updating:
Continuously update the clustering and feature selection process as new data
arrives.
Periodically recluster the data to adapt to changing trends and topics in social media
discussions.
Modeling or Analysis:
Utilize the selected features for various downstream tasks such as sentiment
analysis, topic modeling, recommendation systems, or anomaly detection.
Regular Maintenance:
Regularly review and update the feature selection process as the social media
landscape evolves. Consider adding new features or modifying existing ones.
Unsupervised streaming feature selection in social media data requires a flexible and
adaptive approach due to the dynamic nature of social media content. It aims to
extract relevant features that capture the current themes and trends in the data
without requiring labeled training data. Keep in mind that the choice of clustering
algorithm and feature importance metric should be tailored to your specific social
media data and objectives.
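To make the clustering and per-cluster feature-importance steps described above concrete, here is a minimal sketch using TF-IDF features and MiniBatchKMeans (which supports incremental updates); the posts and the number of clusters are made up:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import MiniBatchKMeans

posts = [
    "new phone launch today, camera looks amazing",
    "huge discount on phone accessories this weekend",
    "election results announced, turnout was record high",
    "candidates debate the election results live",
    "camera comparison: which phone wins this year?",
    "voters react to the election outcome",
]

# TF-IDF features for the current window of posts
vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(posts)

# MiniBatchKMeans supports partial_fit, so clusters can be updated as posts stream in
km = MiniBatchKMeans(n_clusters=2, random_state=0)
labels = km.fit_predict(X)

# Rank terms inside each cluster by their mean TF-IDF weight
terms = np.array(vec.get_feature_names_out())
for c in range(2):
    weights = np.asarray(X[labels == c].mean(axis=0)).ravel()
    top = terms[np.argsort(weights)[::-1][:3]]
    print(f"cluster {c}: {top}")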
Non-linear methods for streaming feature construction are essential for extracting
meaningful patterns and representations from streaming data where the relationships
among features may not be linear. These methods transform the input data into a new
feature space, often with higher dimensionality, to capture complex and non-linear
relationships that may exist in the data. Here are some non-linear methods commonly
used for streaming feature construction:
Kernel Methods:
Kernel Trick: Apply the kernel trick to transform data into a higher-dimensional space
without explicitly computing the feature vectors. Common kernels include the Radial
Basis Function (RBF) kernel and polynomial kernels.
Online Kernel Methods: Adapt kernel methods to streaming data by updating kernel
matrices incrementally as new data arrives. Online kernel principal component analysis
(KPCA) and online kernel support vector machines (SVM) are examples.
Neural Networks:
Deep Learning: Utilize deep neural networks, including convolutional neural networks
(CNNs) and recurrent neural networks (RNNs), for feature extraction. Deep
architectures can capture intricate non-linear relationships in the data.
Autoencoders:
Manifold Learning:
Isomap:
Isomap is another manifold learning method that can be used for non-linear
feature construction in streaming data by incrementally updating the geodesic
distances between data points.
Random Features:
Feature Mapping:
Ensemble Techniques:
Online Clustering and Density Estimation:
Clustering and density estimation methods, such as DBSCAN and Gaussian Mixture
Models (GMM), can be used to create features that represent the underlying non-linear
structures in streaming data.
When selecting a non-linear feature construction method for streaming data, consider
factors such as computational efficiency, scalability, and the adaptability of the method to
evolving data distributions. The choice of method should align with the specific
characteristics and requirements of your streaming data application.
2. Sliding Window:
Construct local linear models for each data point based on its neighbors. This
involves finding weights that best reconstruct the data point as a linear
combination of its neighbors.
1. Local Reconstruction Weights:
Calculate the reconstruction weights for each data point in the local
neighborhood. These weights represent the contribution of each neighbor to the
reconstruction of the data point.
2. Global Embedding:
Combine the local linear models and reconstruction weights to compute a global
embedding for the entire dataset. This embedding represents the lower-
dimensional representation of the data stream.
3. Continuous Update:
4. Memory Management:
Manage memory efficiently to ensure that the sliding window remains within
a predefined size limit. You may need to adjust the window size dynamically
based on available memory and computational resources.
5. Hyperparameter Tuning:
7. Application:
Kernel learning for data streams is an area of machine learning that focuses on
adapting kernel methods, which are originally designed for batch data, to the
streaming data setting. Kernel methods are powerful techniques for dealing with non-
linear relationships and high-dimensional data. Adapting them to data streams requires
efficient processing and storage of data as it arrives in a sequential and potentially
infinite manner. Here are some key considerations and techniques for kernel learning in
data streams:
3. Memory Management:
Efficiently manage memory to ensure that the kernel matrix doesn't grow
too large as data accumulates. This may involve storing only a subset of the
most recent data points or employing methods like forgetting mechanisms.
1. Streaming Feature Selection:
Monitor the data stream for concept drift, which occurs when the data
distribution changes over time. When drift is detected, consider retraining
or adapting the kernel model.
4. Kernel Approximations:
1. Resource Constraints:
Kernel learning for data streams is an active area of research, and various algorithms and
techniques have been proposed to address the unique challenges posed by streaming
data. The choice of approach should be based on the specific requirements and constraints
of your streaming data application.
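As one concrete illustration of kernel approximation on a stream, the sketch below uses scikit-learn's RBFSampler (random Fourier features) together with an SGD classifier updated batch by batch via partial_fit; the simulated stream and its circular decision boundary are assumptions for the example:

import numpy as np
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
rbf = RBFSampler(gamma=1.0, n_components=200, random_state=0)
clf = SGDClassifier(random_state=0)
classes = np.array([0, 1])

# Simulate a stream of mini-batches with a non-linear (circular) class boundary
for step in range(50):
    X = rng.normal(size=(32, 2))
    y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)
    Z = rbf.fit_transform(X) if step == 0 else rbf.transform(X)   # approximate RBF kernel features
    clf.partial_fit(Z, y, classes=classes)                        # incremental linear update

X_test = rng.normal(size=(500, 2))
y_test = (X_test[:, 0] ** 2 + X_test[:, 1] ** 2 > 1.0).astype(int)
print(clf.score(rbf.transform(X_test), y_test))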
Using neural networks for data streams, where data arrives continuously and in a
potentially infinite sequence, presents unique challenges and opportunities. Neural
networks are powerful models for various machine learning tasks, including
classification, regression, and sequence modelling. Adapting them to data streams
requires specialized techniques to handle the dynamic nature of the data. Here's an
overview of considerations when using neural networks for data streams:
1. Online Learning:
2. Sliding Window:
3. Model Architecture:
1. Mini-Batch Learning:
3. Memory-efficient Models:
4. Feature Engineering:
5. Regularization:
6. Hyperparameter Tuning:
7. Ensemble Methods:
4. Resource Constraints:
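A minimal online-learning sketch along these lines, using scikit-learn's MLPClassifier with partial_fit on simulated mini-batches (the synthetic stream and network size are illustrative assumptions):

import numpy as np
from sklearn.neural_network import MLPClassifier

clf = MLPClassifier(hidden_layer_sizes=(32,), random_state=0)
classes = np.array([0, 1])
rng = np.random.default_rng(0)

# Treat each incoming mini-batch as the only data available at that moment
for step in range(200):
    X_batch = rng.normal(size=(16, 4))
    y_batch = (X_batch.sum(axis=1) > 0).astype(int)
    clf.partial_fit(X_batch, y_batch, classes=classes)    # one incremental update

X_eval = rng.normal(size=(1000, 4))
print(clf.score(X_eval, (X_eval.sum(axis=1) > 0).astype(int)))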
• Internet traffic
• Sensors data
• Call records
• Satellite data
• Audio listening
• Watching videos
• Online transactions
Data Streams in Data Mining is the process of extracting knowledge and valuable insights
from a continuous stream of data using stream processing software. Data
Streams in Data Mining can be considered a subset of the general concepts of
machine learning, knowledge extraction, and data mining. In Data Streams in
Data Mining, analysis of a large amount of data needs to be done in real time.
The knowledge extracted in data stream mining is represented in the form of models
and patterns of infinite streams of information. Data arrives continuously as a
stream, resulting in big data, and in data streaming, multiple data streams are passed
simultaneously.
• Time Sensitive: Data streams are time-sensitive, and elements of data
streams carry timestamps with them. A data stream is relevant only for a
certain period; after that, it loses its significance.
• Data Volatility: No data is stored in data streaming, as it is volatile. Once
the data mining and analysis are done, the information is summarized or discarded.
• Concept Drifting: Data streams are very unpredictable. The data changes
or evolves with time, as in this dynamic world nothing is constant.
Data Stream is generated through various data stream generators. Then,
data mining
techniques are implemented to extract knowledge and patterns from the data
streams. Therefore, these techniques need to process multi-dimensional,
multi-level, single pass, and online data streams.
4.5 Tools and Software for Data Streams in Data Mining
There are many tools available for Data Streams in Data Mining. Let’s learn
about the most used tools for Data Streams in Data Mining.
Scikit-Multiflow
Scikit-Multiflow is also a free and open-source machine learning framework
for multi-output learning and Data Streams in Data Mining, implemented in
Python. scikit-multiflow contains stream generators, concept drift
detectors, stream learning methods for single and multiple targets, data
transformers, and evaluation and visualization methods.
RapidMiner
RapidMiner is a data science platform written in the Java programming language and
used for data loading and transformation (ETL) and analytical workflows.
StreamDM
StreamDM is an open-source framework for Data Streams in Data Mining that uses
Spark Streaming, extending the core Spark API. It is a specialized framework for
Spark Streaming that handles much of the complexity of the underlying data sources,
such as out-of-order data and recovery from failures.
River
River is a new Python framework for machine learning with online data streams. It is
the product of merging the best parts of the creme and scikit-multiflow libraries, both
of which were built with the same objective.
5. Feature Selection and Evaluation
Feature selection is one of the important tasks in machine learning and data mining. It is
an important and frequently used technique for dimension reduction by removing
irrelevant and redundant information from the data set to obtain an optimal feature
subset. It is also a knowledge discovery tool for providing insights into the problems
through the interpretation of the most relevant features. Feature selection research dates
back to the 1960s. Hughes used a general parametric model to study the accuracy of
a Bayesian classifier as a function of the number of features. Since then the research
in feature selection has been a challenging field and some
researchers have doubted its computational feasibility. Despite the
computationally challenging scenario, the research in this direction continued. To deal
with these data, feature selection faces some new challenges. So it is timely and
significant to review the relevant topics of these emerging challenges and give some
suggestions to the practitioners.
Feature selection brings the immediate effects of speeding up a data mining algorithm,
improving learning accuracy, and enhancing model comprehensibility. However, finding an
optimal feature subset is usually intractable and many problems related to feature
selection have been shown to be NP-hard. To efficiently solve this problem, two
frameworks are proposed up to now. One is the search-based framework, and the other
is the correlation- based framework. For the former, the search strategy and evaluation
criterion are two key components. The search strategy is about how to produce a
candidate feature subset, and each candidate subset is evaluated and compared with the
previous best one according to a certain evaluation criterion. The process of subset
generation and evaluation is repeated until a given stopping criterion is satisfied. For the
latter, the redundancy and relevance of features are calculated based on some correlation
measure. The entire original feature set can then be divided into four basic disjoint
subsets: (1) irrelevant features, (2) redundant features, (3) weakly relevant but non-
redundant features, and (4) strongly relevant features. An optimal feature selection
algorithm should select non-redundant and strongly relevant features. When the best
subset is selected, generally, it will be validated by prior knowledge or different tests via
synthetic and/or real-world data sets. One of the most
well-known data repositories is in UCI, which contains many kinds of data sets with
different sizes of sample and dimensionality. Feature Selection @ ASU
(http://featureselection.asu.edu) also provides many benchmark data sets and source
codes for different feature selection algorithms. In addition, some microarray data, such as
Leukemia, Prostate, Lung and Colon, are often used to evaluate the performance of
feature selection algorithms on the high-dimensionality small sample size (HDSSS)
problem.
For the search-based framework, a typical feature selection process consists of three basic
steps, namely, subset generation, subset evaluation, and stopping criterion. Subset
generation aims to generate a candidate feature subset. Each candidate subset is
evaluated and compared with the previous best one according to a certain evaluation
criterion. If the newly generated subset is better than the previous one, it will be the latest best
subset. The first two steps of search-based feature selection are repeated until a
given stopping criterion is satisfied. According to the evaluation criterion, feature selection
algorithms are categorized into filter, wrapper, and hybrid (embedded) models. Feature
selection algorithms under the filter model rely on analyzing the general characteristics of
data and evaluating features without involving any learning algorithms. Wrapper utilizes a
predefined learning algorithm instead of an independent measure for subset evaluation.
A typical hybrid algorithm makes use of both an independent measure and a learning
algorithm to evaluate feature subsets.
On the other hand, search strategies are usually categorized into complete, sequential, and
random models. Complete search evaluates all feature subsets and guarantees to find the
optimal result according to the evaluation criterion. Sequential search adds or removes
features from the previous subset one at a time. Random search starts with a randomly
selected subset and injects randomness into the procedure of subsequent search.
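A short sketch of a sequential (greedy forward) search using scikit-learn's SequentialFeatureSelector; the dataset and the number of features to keep are arbitrary choices for illustration:

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

X, y = load_breast_cancer(return_X_y=True)

# Sequential (greedy) search: add one feature at a time, keeping the subset
# whose cross-validated score improves the most at each step
estimator = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))
sfs = SequentialFeatureSelector(estimator, n_features_to_select=5,
                                direction="forward", cv=3)
sfs.fit(X, y)
print(sfs.get_support(indices=True))   # indices of the selected features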
Nowadays, as big data with high dimensionality are emerging, the filter
model has attracted more attention than ever. Feature selection algorithms under the filter
model rely on analyzing the general characteristics of data and evaluating features without
involving any learning algorithms; therefore most of them do not have a bias toward specific
learner models. Moreover, the filter model has a straightforward search strategy and
feature evaluation criterion, so its structure is usually simple. The advantages of the
simple structure are evident: First, it is easy to design and easily understood by other
researchers. Second, it is usually very fast, and is often appropriate for high-
dimensional data.
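A rough illustration of that speed advantage (synthetic data; exact timings vary by machine): a filter computes one statistic per feature, with no model trained per candidate subset, so even thousands of features are scored quickly.

```python
import time
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# An HDSSS-style setting: few samples, many features
X, y = make_classification(n_samples=200, n_features=5000, n_informative=20, random_state=0)

start = time.time()
X_reduced = SelectKBest(score_func=f_classif, k=50).fit_transform(X, y)
print(X_reduced.shape, f"scored {X.shape[1]} features in {time.time() - start:.2f}s")
```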
A feature is an attribute that has an impact on a problem or is useful for the problem,
and choosing the important features for the model is known as feature selection. Each
machine learning process depends on feature engineering, which mainly comprises two
processes: Feature Selection and Feature Extraction. Although feature
selection and extraction processes may have the same objective, both are completely
different from each other. The main difference between them is that feature selection is
about selecting the subset of the original feature set, whereas feature extraction
creates new features. Feature selection is a way of reducing the number of input variables
for the model by using only relevant features, in order to reduce overfitting.
So, we can define feature Selection as, "It is a process of automatically or manually
selecting the subset of most appropriate and relevant features to be used in
model building." Feature selection is performed by either including the important features
or excluding the irrelevant features in the dataset without changing them.
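A small sketch of that distinction, using scikit-learn's Iris data as an assumed example: selection keeps a subset of the original columns unchanged, while extraction builds entirely new columns.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Feature selection: keep 2 of the 4 original measurements, values unchanged
X_selected = SelectKBest(score_func=f_classif, k=2).fit_transform(X, y)

# Feature extraction: create 2 brand-new features as combinations of all 4
X_extracted = PCA(n_components=2).fit_transform(X)

print(X_selected.shape, X_extracted.shape)  # both (150, 2), but only X_selected holds original columns
```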
Fig: Feature Selection Techniques
1. Wrapper Methods
In wrapper methods, feature selection is treated as a search problem: a predefined learning
algorithm evaluates each candidate subset of features. Some techniques of wrapper methods are:
• Forward Selection- Forward selection starts with an empty set and adds one feature at a
time. On the basis of the output of the model, features are added or subtracted, and with
this feature set the model is trained again.
• Backward Elimination- Backward elimination is also an iterative approach, but it is the
opposite of forward selection. This technique begins the process by considering all the
features and removes the least significant feature. This elimination process continues until
removing a feature no longer improves the performance of the model.
• Exhaustive Feature Selection- Exhaustive feature selection is one of the best feature
selection methods, which evaluates every feature subset by brute force. It tries each
possible combination of features and returns the best-performing feature set.
• Recursive Feature Elimination- Recursive feature elimination is a recursive greedy
optimization approach, where features are selected by recursively considering smaller and
smaller subsets of features. An estimator is trained on each set of features, and the
importance of each feature is determined using the coef_ or feature_importances_
attribute (see the sketch after this list).
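A minimal RFE sketch with scikit-learn; the breast-cancer dataset and the choice of 10 features are assumptions made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Repeatedly fit the estimator and drop the least important feature (judged by coef_)
rfe = RFE(estimator=LogisticRegression(max_iter=5000), n_features_to_select=10, step=1).fit(X, y)

print(rfe.support_)   # boolean mask of the 10 retained features
print(rfe.ranking_)   # 1 = selected; higher numbers were eliminated earlier
```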
2. Filter Methods
In filter methods, features are selected on the basis of statistical measures. This method
does not depend on the learning algorithm and chooses the features as a pre-processing
step. The filter method filters out irrelevant and redundant columns by ranking features
with different metrics. The advantage of filter methods is that they need little
computational time and do not overfit the data.
Some common techniques of filter methods are as follows (a short example of the first two follows the list):
• Information Gain
• Chi-square Test
• Fisher's Score
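Assuming scikit-learn and the Iris data, the short example below computes the first two measures: chi2 returns the chi-square statistic and p-value per feature (inputs must be non-negative), and mutual_info_classif estimates information gain.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import chi2, mutual_info_classif

X, y = load_iris(return_X_y=True)          # all Iris features are non-negative, as chi2 requires

chi2_scores, p_values = chi2(X, y)                      # chi-square test per feature
mi_scores = mutual_info_classif(X, y, random_state=0)   # information-gain estimate per feature

print(chi2_scores.round(2), p_values.round(4))
print(mi_scores.round(3))
```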
Fisher's Score:
Fisher's score is one of the popular supervised techniques for feature selection. It ranks
variables by Fisher's criterion in descending order, and we can then select the variables
with a large Fisher's score.
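Since the text does not give the computation, here is a small NumPy sketch of one common form of Fisher's score: the between-class scatter of each feature's class means divided by its weighted within-class variance (larger is better). The Iris data is an assumed example.

```python
import numpy as np
from sklearn.datasets import load_iris

def fisher_score(X, y):
    """Per-feature Fisher score: sum_k n_k (mu_k - mu)^2 / sum_k n_k var_k."""
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        between += len(Xc) * (Xc.mean(axis=0) - overall_mean) ** 2
        within += len(Xc) * Xc.var(axis=0)
    return between / within

X, y = load_iris(return_X_y=True)
scores = fisher_score(X, y)
print(np.argsort(scores)[::-1])   # feature indices ranked by descending Fisher score
```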
3. Embedded Methods
Embedded methods combine the advantages of both filter and wrapper methods by
considering feature interactions while keeping the computational cost low. They are fast,
like filter methods, but usually more accurate. These methods are also iterative: each
training iteration is evaluated, and the features that contribute most to that iteration are
identified. Some techniques of embedded methods are:
• Regularization- Regularization adds a penalty term to the parameters of the machine
learning model to avoid overfitting. With an L1 penalty, the coefficients of uninformative
features are shrunk to exactly zero, and those features can be removed from the dataset.
Common choices are L1 regularization (Lasso) and Elastic Net (a combination of L1 and L2
regularization). See the sketch after this list.
• Random Forest Importance- Different tree-based methods of feature selection provide
feature importance scores that give a way of selecting features. Here, feature importance
specifies which features matter more in model building or have a greater impact on the
target variable. Random Forest is such a tree-based method: a bagging algorithm that
aggregates a number of decision trees. It ranks the nodes by their performance, that is,
the decrease in impurity (Gini impurity) over all the trees. Nodes are arranged by their
impurity values, which allows pruning of the trees below a specific node; the remaining
nodes correspond to a subset of the most important features.
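The two techniques above can be sketched as follows. The synthetic regression data for the Lasso part and the breast-cancer data for the random-forest part are assumptions, and whether a Lasso coefficient reaches exactly zero depends on the chosen alpha.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer, make_regression
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Regularization: an L1 penalty shrinks some coefficients to exactly zero
X, y = make_regression(n_samples=200, n_features=15, n_informative=5, noise=10, random_state=0)
lasso = Lasso(alpha=1.0).fit(StandardScaler().fit_transform(X), y)
print("kept:", np.flatnonzero(lasso.coef_), "dropped:", np.flatnonzero(lasso.coef_ == 0))

# Random Forest importance: mean decrease in Gini impurity aggregated over all trees
data = load_breast_cancer()
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(data.data, data.target)
top = np.argsort(rf.feature_importances_)[::-1][:5]
for i in top:
    print(f"{data.feature_names[i]}: {rf.feature_importances_[i]:.3f}")
```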
Which statistical measure to use depends on the data types involved. To know this, we
need to first identify the types of the input and output variables. In machine learning,
variables are mainly of two types: numerical variables (such as integer and float values)
and categorical variables (such as Boolean, ordinal and nominal values).
Below are some univariate statistical measures, which can be used for filter-
based feature selection:
1. Numerical Input, Numerical Output:
Numerical input variables with a numerical output correspond to predictive regression
modelling. The common measure used in this case is the correlation coefficient (for
example, Pearson's for linear relationships).
2. Numerical Input, Categorical Output:
Numerical input with a categorical output is the case for classification predictive
modelling problems. Here too, correlation-like measures are used, but ones that handle a
categorical output, such as the ANOVA F-test (linear) or Kendall's rank coefficient
(nonlinear); both cases are illustrated in the sketch below.
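A brief sketch of both cases with scikit-learn (synthetic regression data and the Wine data are assumptions): f_regression is the correlation-based score for a numerical target, and f_classif is the ANOVA score for a categorical target.

```python
from sklearn.datasets import load_wine, make_regression
from sklearn.feature_selection import SelectKBest, f_classif, f_regression

# Case 1: numerical input, numerical output -> correlation-based F-statistic
X_num, y_num = make_regression(n_samples=200, n_features=8, n_informative=3, noise=5, random_state=0)
reg_sel = SelectKBest(score_func=f_regression, k=3).fit(X_num, y_num)
print("regression case, kept features:", reg_sel.get_support(indices=True))

# Case 2: numerical input, categorical output -> ANOVA F-test
X_cls, y_cls = load_wine(return_X_y=True)
cls_sel = SelectKBest(score_func=f_classif, k=5).fit(X_cls, y_cls)
print("classification case, kept features:", cls_sel.get_support(indices=True))
```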
VIDEO LINKS
Unit – II
Sl.No.  Topic        Video Link
1       Visual Data  https://www.youtube.com/watch?v=rf5dGtn4Nkk
10. ASSIGNMENT : UNIT – II
Toppers:
1. Given a dataset with text data, extract meaningful features using techniques such as TF-IDF, word
embeddings, or N-grams. Demonstrate the impact of these features on the performance of a text
classification model.
2. Use a dataset with numerical features and create polynomial features to capture non-linear
relationships. Compare the performance of a regression model with and without polynomial features
and discuss the results.
Above Average Learners:
1. For a given dataset, create interaction features by multiplying or combining existing features.
Analyze the effect of these interaction features on the performance of a predictive model.
2. Given a dataset with missing values, compare different methods for handling missing data (e.g.,
mean imputation, median imputation, KNN imputation, and using a model to predict missing
values). Assess the impact of these methods on model performance.
Average Learners:
1. Take a dataset with date and time information and create new features such as day of the week,
month, hour, and time since the last event. Show how these features can improve the performance
of a time series or event prediction model.
2. Use a dataset with features of different scales and apply various feature scaling techniques (e.g.,
standardization, normalization, min-max scaling). Evaluate how feature scaling affects the
performance of a model, such as a support vector machine or k-nearest neighbors.
Below Average Learners:
1. Given a dataset with categorical variables, compare different encoding techniques (e.g., one-hot
encoding, label encoding, target encoding) and analyze their impact on the performance of a
machine learning model.
2. Apply dimensionality reduction techniques such as Principal Component Analysis (PCA) or t-
Distributed Stochastic Neighbor Embedding (t-SNE) to a high-dimensional dataset. Visualize the
results and discuss how dimensionality reduction affects the performance of a classifier or
regressor.
Slow Learners:
1. Using a dataset with many features, apply feature selection techniques (e.g., recursive feature
elimination, LASSO, mutual information) to identify the most important features. Compare the
performance of a model before and after feature selection.
2. Choose a dataset from a specific domain (e.g., finance, healthcare, e-commerce) and create domain-specific
features that could improve model performance. Explain the rationale behind the new features and
how they contribute to better predictions.
11 : PART A: Q & A UNIT – II
Q3: What are the main types of feature selection methods? CO2,K1
A3: Feature selection methods can be broadly categorized into three types:
1. Filter Methods: These methods assess the relationship between each feature
and the target variable independently. Examples include correlation-based
selection and statistical tests.
2. Wrapper Methods: These methods use a specific machine learning algorithm
to evaluate subsets of features, considering their combined predictive power.
Examples include recursive feature elimination and forward/backward selection.
3. Embedded Methods: These methods incorporate feature selection as part of the
model training process itself. Examples include LASSO regularization and tree-based
feature importance.
Q4: How can you evaluate the importance of features in a dataset? CO2,K1
A4: Feature importance can be evaluated through various methods, including:
• Correlation analysis: Measuring the correlation between each feature and the target
variable.
• Tree-based models: Assessing the decrease in impurity (e.g., Gini impurity)
attributed to each feature across the trees of the model.
Q8: Can you explain the concept of recursive feature elimination (RFE)?
CO2,K1
A8: Recursive Feature Elimination (RFE) is a wrapper method for feature selection.
It involves repeatedly training a model, removing the least important feature(s)
based on a defined criterion, and then retraining the model. This process is
repeated until a specified number of features is reached or until the desired model
performance is achieved. RFE helps identify the subset of features that contribute
the most to the model's predictive power.
Q9: What is the curse of dimensionality, and how does feature selection
address it? CO2,K1
A9: The curse of dimensionality refers to the challenges posed by high-dimensional
data, where the number of features is much larger than the number of
observations. In such cases, data becomes sparse, and model performance can
degrade. Feature selection helps address this issue by reducing the number of
dimensions, making the data more manageable, improving model generalization,
and reducing the risk of overfitting.
Q10: In what scenarios would you prefer using embedded feature
selection methods? CO2,K1
A10: Embedded feature selection methods are preferred when the feature selection
process is integrated into the model training itself. They are especially useful when:
• The dataset has a high number of features.
Q12: How do you extract features from time series data? CO2,K1
A12: Feature extraction from time series data can involve methods like:
• Statistical features: Mean, variance, skewness, kurtosis, etc.
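A tiny illustration of such statistical features with pandas; the random-walk series below is an assumed toy example.

```python
import numpy as np
import pandas as pd

series = pd.Series(np.random.default_rng(0).normal(size=200).cumsum())  # assumed toy series

# Whole-series statistical features usable as inputs to a downstream model
features = {
    "mean": series.mean(),
    "variance": series.var(),
    "skewness": series.skew(),
    "kurtosis": series.kurt(),
}
print(features)
```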
Q13: What are some applications of feature-based time series analysis? CO2,K1
A13: Feature-based time series analysis has various applications, such as:
• Stock price prediction.
12. PART B QUESTIONS : UNIT – II
13. SUPPORTIVE ONLINE CERTIFICATION COURSES
NPTEL : https://nptel.ac.in/courses/110106064
Coursera : https://www.coursera.org/learn/time-series-features
Udemy : https://www.udemy.com/course/feature-selection-for-machine-learning/
14. REAL TIME APPLICATIONS
Feature selection and evaluation are critical steps in various real-time applications
across different domains. Here are some real-time applications where feature
selection and evaluation play a significant role:
Real-Time Predictive Maintenance: In industries such as manufacturing and
aviation, feature selection and evaluation help identify the most relevant sensor data for
predicting equipment failures. By selecting key features and evaluating their
importance, predictive maintenance models can detect anomalies and trigger
maintenance actions before critical failures occur.
Financial Fraud Detection: Feature selection and evaluation are crucial for
identifying patterns indicative of fraudulent transactions in real-time financial data. By
selecting informative features and evaluating their impact, fraud detection
algorithms can swiftly detect and flag potentially fraudulent activities.
Healthcare Monitoring and Diagnostics: In real-time patient monitoring,
selecting relevant physiological features and evaluating their significance can aid in
early disease detection and patient risk assessment. Continuous evaluation of
features can provide timely alerts for medical interventions.
Web Traffic Anomaly Detection: For websites and online services, feature
selection and evaluation assist in detecting unusual patterns in real-time web traffic
data. Relevant features are chosen to capture normal behavior, and evaluation helps
distinguish anomalies that could indicate cyberattacks or system glitches.
Autonomous Vehicles and Robotics: Feature selection and evaluation are
essential in real-time decision-making for autonomous vehicles and robots. Selecting
pertinent sensory inputs and evaluating their impact on navigation and control
algorithms ensure safe and efficient operations.
Energy Management and Smart Grids: Feature selection and evaluation are
used in real-time energy consumption analysis and forecasting. By selecting relevant
features related to energy usage patterns and evaluating their influence, smart grids can
optimize energy distribution and consumption.
Network Intrusion Detection: In cybersecurity, feature selection and
evaluation help identify network anomalies and potential security
breaches in real-time. By selecting key network traffic features and
evaluating their importance, intrusion detection systems can quickly flag
suspicious activities.
Social Media Sentiment Analysis: Real-time sentiment analysis on
social media platforms relies on feature selection and evaluation to
determine which textual and linguistic features are most indicative of user
sentiment. This enables businesses to respond promptly to customer
feedback and trends.
Environmental Monitoring: Real-time analysis of environmental data,
such as air quality and pollution levels, benefits from feature selection
and evaluation. By
selecting relevant environmental indicators and evaluating their impact,
authorities can make informed decisions to protect public health.
Human Activity Recognition: In wearable devices and IoT
applications, feature selection and evaluation help in real-time recognition
of human activities. Relevant
sensor data features are chosen, and evaluation ensures accurate
tracking and classification of activities.
16. Assessment Schedule
(Proposed Date & Actual Date)
Sl.No.  ASSESSMENT                  Proposed Date        Actual Date
1       FIRST INTERNAL ASSESSMENT   22-8-24 to 30-8-24
17. PRESCRIBED TEXT BOOKS & REFERENCE BOOKS
TEXT BOOKS:
1. Suresh Kumar Mukhiya and Usman Ahmed, “Hands-On Exploratory Data Analysis with Python”, Packt Publishing, March 2020.
2. Guozhu Dong, Huan Liu, “Feature Engineering for Machine Learning and Data Analytics”, CRC Press, First edition, 2018.
REFERENCES:
1. Danyel Fisher & Miriah Meyer, “Making Data Visual: A Practical Guide to Using Visualization for Insight”, O’Reilly Publications, 2018.
2. Claus O. Wilke, “Fundamentals of Data Visualization”, O’Reilly Publications, 2019.
3. EMC Education Services, “Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data”, Wiley Publishers, 2015.
4. Tamara Munzner, “Visualization Analysis and Design”, A K Peters/CRC Press, 1st edition, 2014.
5. Matthew O. Ward, Georges Grinstein, Daniel Keim, “Interactive Data Visualization: Foundations, Techniques, and Applications”, 2nd Edition, CRC Press, 2015.
18. MINI PROJECT SUGGESTION
Mini Project Ideas:
Toppers:
1. Spam Email Classification:
Dataset: A collection of emails labeled as spam or not spam.
Task: Perform feature selection and evaluation to build a spam email classifier.
Techniques: Compare different feature selection methods such as the chi-squared test, mutual
information, and recursive feature elimination. Evaluate model performance using
metrics like accuracy, precision, recall, and F1-score.
Above Average Learners
2. House Price Prediction:
Dataset: Housing dataset containing various features related to houses.
Task: Select relevant features and build a regression model to predict house prices.
Techniques: Utilize correlation analysis and feature importance from tree-based
models. Compare the performance of different regression algorithms and evaluate
using metrics like RMSE and R-squared.
Average Learners
3. Customer Churn Prediction:
Dataset: Customer data from a telecommunications company, including various
customer attributes.
Task: Identify key features affecting customer churn and build a churn prediction
model.
Techniques: Apply feature importance from ensemble methods like Random Forest or
XGBoost. Evaluate model performance using accuracy, precision, recall, and ROC-AUC.
Below Average Learners
4. Medical Diagnostics:
Dataset: Medical dataset with patient attributes and diagnostic outcomes.
Task: Perform feature selection and evaluation to develop a diagnostic model.
Techniques: Implement L1-based regularization to select important features. Evaluate the
model using sensitivity, specificity, and overall accuracy.
Slow Learners:
5. Take a dataset with date and time information and create new features such as day of the
week, month, hour, and time since the last event. Show how these features can improve
the performance of a time series or event prediction model.
Thank you