SRU ADA Unit-3



Advanced Data Analytics

Term: 2024-25
Unit-3
Regression – Concepts, Least Square Estimation,
Variable Rationalization, and Model Building etc.
Logistic Regression: Model Theory, Model fit Statistics,
Model Construction, Analytics applications to various
Business Domains etc.

Introduction

Classification
Classifier algorithms are designed to decide which of several predefined classes a new data
point belongs to. Think of sorting emails into spam or inbox, categorising images as cat or
dog, or predicting whether a loan applicant is a credible borrower. Classification models
learn from labeled examples of each category: during training they discover the correlations
and relationships in the data that distinguish one class from the others. Once these patterns
are learned, the model can assign class labels to unseen data points.

Common Classification Algorithms:
● Logistic Regression: An efficient technique for binary classification problems (two
classes, for example spam/not spam).
● Support Vector Machine (SVM): Well suited to classification tasks, especially when the
data has a large number of features.
● Decision Tree: Builds a tree of branching decisions over the features and follows the
branches to a class prediction.
● Random Forest: Generates an “ensemble” of decision trees, which raises accuracy and
reduces overfitting (where the model performs well on the training data but poorly on
unseen data). A fit/predict sketch follows below.
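A minimal sketch of the fit/predict workflow these classifiers share, assuming scikit-learn is installed; its built-in breast cancer dataset is used purely as a stand-in two-class problem (the slides name no dataset or library):

```python
# Binary classification with logistic regression on a stand-in two-class dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

clf = LogisticRegression(max_iter=5000)   # extra iterations so the solver converges
clf.fit(X_train, y_train)                 # learn from labeled examples

y_pred = clf.predict(X_test)              # assign class labels to unseen data
print("Accuracy:", accuracy_score(y_test, y_pred))
```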
Regression
Regression algorithms forecast a continuous output variable from the input features. The
target could be almost anything: real estate prices, stock market trends, customer churn
(how likely customers are to stay), or sales forecasts. Regression models use the features
to learn the relationship between the inputs and the continuous output variable, and then
apply that learned pattern to estimate the value for new data points.

Common Regression Algorithms
● Linear Regression: Fits a straight line to the data to model the relationship between
the features and the continuous output.
● Polynomial Regression: Similar to linear regression, but uses more complex polynomial
functions (quadratic, cubic, etc.) to accommodate non-linear relationships in the data.
● Decision Tree Regression: Uses a decision tree to predict a continuous output variable
through a series of branching decisions.
● Random Forest Regression: Combines several decision trees to produce more accurate and
robust regression predictions.
● Support Vector Regression (SVR): Adapts the Support Vector Machine ideas to regression,
where we try to find a function that fits the data within a margin of tolerance. A minimal
linear regression sketch follows below.
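A minimal sketch of fitting a straight line by ordinary least squares, assuming scikit-learn and NumPy; the data is synthetic and the slope/intercept values are illustrative only:

```python
# Simple linear regression on synthetic data: recover a known slope and intercept.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))               # one input feature
y = 3.0 * X.ravel() + 5.0 + rng.normal(0, 1, 100)   # true line plus noise

model = LinearRegression().fit(X, y)
print("slope:", model.coef_[0], "intercept:", model.intercept_)
print("prediction at x=7:", model.predict([[7.0]])[0])
```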
Unsupervised learning tackles the difficult task of working with data that comes without
predefined categories or labels.
Clustering
Imagine being given a basket of fruits with no labels on them. Clustering algorithms group
the fruits according to their inherent similarities. With a technique like K-means
clustering, you specify the number of clusters in advance (say, “red fruits” and “green
fruits”), and each data point (fruit) is then assigned to the most similar cluster based on
its features (colour, size, texture). A small sketch follows below.
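A minimal K-means sketch, assuming scikit-learn; the synthetic two-feature data stands in for measurable fruit attributes such as colour value and size:

```python
# Group unlabeled points into two clusters with K-means.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two loose blobs of points, no labels attached.
fruits = np.vstack([rng.normal([2, 2], 0.5, size=(50, 2)),
                    rng.normal([7, 7], 0.5, size=(50, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(fruits)
print(kmeans.labels_[:10])        # cluster index assigned to each point
print(kmeans.cluster_centers_)    # the two cluster centres
```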

By contrast, hierarchical clustering builds a hierarchy of clusters, which makes it easier
to study the structure of the groups. Density-Based Spatial Clustering of Applications with
Noise (DBSCAN) detects groups of high-density data points and copes well even when the data
contains sparse regions or outliers.
Dimensionality Reduction
● When the feature space (number of dimensions) is large, the data becomes difficult to
visualize and analyze. Dimensionality reduction methods aim to reduce the number of
dimensions while preserving the key information. Principal component analysis (PCA)
identifies the most important directions and concentrates the data into fewer dimensions
that capture the highest variance. This speeds up model training and also makes the data
easier to visualize.
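A minimal PCA sketch, assuming scikit-learn; the digits dataset and the choice of two components are illustrative:

```python
# Project 64-dimensional digit images onto their 2 highest-variance directions.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)      # 1797 samples, 64 features
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)         # keep the 2 directions of highest variance

print(X_reduced.shape)                   # (1797, 2)
print(pca.explained_variance_ratio_)     # variance captured by each component
```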
Anomaly Detection
Unsupervised learning can also be used to find data points that differ greatly from the
majority. A statistical model may flag these outliers, or anomalies, as signs of errors,
fraud, or simply something unusual. Local Outlier Factor (LOF) compares the local density of
a given data point with the densities of its neighbours and flags points with significantly
lower density as outliers or potential anomalies. Isolation Forest takes a different
approach: it recursively isolates data points according to their features, and anomalies are
usually easier to isolate because they require fewer splits than a typical normal point.
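A minimal sketch running both detectors on the same synthetic data, assuming scikit-learn; the planted outliers and parameter values are illustrative:

```python
# Flag planted outliers with Local Outlier Factor and Isolation Forest.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(0, 1, size=(200, 2))          # dense cluster of normal points
outliers = np.array([[6, 6], [-7, 5], [8, -6]])   # planted anomalies
X = np.vstack([normal, outliers])

lof = LocalOutlierFactor(n_neighbors=20)          # compares local densities
print(lof.fit_predict(X)[-3:])                    # -1 marks the planted outliers

iso = IsolationForest(random_state=0).fit(X)      # isolates points via random splits
print(iso.predict(X)[-3:])                        # -1 again marks the outliers
```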

Supervised learning works with labeled data, while unsupervised learning solves tasks where
no labels are available. Semi-supervised learning fills the gap between the two: it draws on
the strengths of both approaches by training on labeled data together with unlabeled data.
This is especially useful when labeled data is sparse or prohibitively expensive to acquire,
while unlabeled data is available in abundance.

Generative Semi-Supervised Learning
Imagine having a few labeled pictures of cats and a vast collection of unlabeled photos.
Generative semi-supervised learning is built for exactly this scenario. It uses a generative
model to study the unlabeled pictures and discover the underlying factors that characterize
the data. The model can then generate new synthetic data points that share the same
characteristics as the unlabeled data, and these are assigned pseudo-labels inferred by the
generative model. The existing labeled data is combined with the newly generated labeled
data to train the final model, which is likely to perform better than a model trained only
on the limited amount of original labeled data. A much-simplified sketch of the
pseudo-labelling idea follows below.
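A much-simplified sketch of the pseudo-labelling idea using scikit-learn's SelfTrainingClassifier: this is plain self-training rather than a full generative model, and the digits dataset and the 50-label split are illustrative only. A classifier trained on the few labeled points assigns pseudo-labels to confident unlabeled points and is retrained on the enlarged set:

```python
# Self-training: unlabeled samples are marked with -1 and receive pseudo-labels.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.metrics import accuracy_score

X, y = load_digits(return_X_y=True)
y_partial = np.copy(y)
y_partial[50:] = -1                       # pretend only the first 50 labels are known

base = LogisticRegression(max_iter=5000)
model = SelfTrainingClassifier(base, threshold=0.9).fit(X, y_partial)
print("accuracy on the true labels:", accuracy_score(y, model.predict(X)))
```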
Graph-based Semi-Supervised Learning
This approach uses the relationships between data points to propagate labels from labeled
points to unlabeled ones. Picture a social network where some users have been marked as
sports fans (labeled data). Graph-based methods analyze the links between users
(friendships) and use them to infer that if a user is connected to someone with a “sports”
label, that user is probably also interested in sports (the unlabeled point receives a
propagated label). Both the individual links and the overall structure of the network matter
for how labels spread. The method is particularly useful when the data points are naturally
connected to each other and this connectivity can be exploited when labelling new data. A
minimal label-propagation sketch follows below.
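A minimal graph-based sketch, assuming scikit-learn's LabelPropagation; the two-moons data and the choice of which points keep their labels are illustrative:

```python
# Spread labels from a handful of labeled points to their unlabeled neighbours
# over a similarity graph built from the features.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.semi_supervised import LabelPropagation

X, y = make_moons(n_samples=200, noise=0.05, random_state=0)
y_partial = np.full_like(y, -1)     # -1 marks an unlabeled point
y_partial[::25] = y[::25]           # only a handful of points keep their label

model = LabelPropagation().fit(X, y_partial)
print("accuracy on all points:", model.score(X, y))
```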

Value-based learning
Imagine a robot trying to find its way through a maze. It has neither a map nor
instructions, but it earns points for reaching the cheese at the end and is penalised
(losing time) when it runs into a wall. Value-based learning is about predicting the
expected future reward of taking an action in a particular state. For example, the
Q-learning algorithm learns a Q-value for each state-action combination; this Q-value is the
expected reward for taking that action in that specific state. Through a repeated cycle of
observing the state, collecting rewards, and updating the Q-values, the agent learns which
actions are most valuable in each state, which eventually guides it along the most rewarding
path. The update rule is sketched below.
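A minimal sketch of the tabular Q-learning update on a made-up five-state corridor, using NumPy only; the reward scheme and hyperparameters are illustrative:

```python
# Tabular Q-learning: the agent moves left or right and is rewarded only
# when it reaches the final state (the "cheese").
import numpy as np

n_states, n_actions = 5, 2               # actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.2    # learning rate, discount, exploration

rng = np.random.default_rng(0)
for _ in range(500):                     # episodes
    s = 0
    while s != n_states - 1:
        # epsilon-greedy choice between exploring and exploiting
        a = rng.integers(n_actions) if rng.random() < epsilon else int(Q[s].argmax())
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == n_states - 1 else 0.0
        # core update: move Q(s,a) toward reward + discounted best future value
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(Q)    # the "right" action should dominate in every state
```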
Policy-based learning
In contrast to value-based learning, where we learn a value for each state-action pair,
policy-based learning tries to learn a policy directly: a mapping from states to actions.
This policy, in essence, tells the agent how to act in different situations. Actor-Critic is
a common approach that combines two models: an actor that learns the policy and a critic
that learns the value function (just like value-based methods). The actor uses the critic's
feedback to update its policy and make better decisions. Proximal Policy Optimization (PPO)
is a specific policy-based method that addresses the high-variance issues that complicated
early policy-based learning methods.
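For reference, the core quantity these methods estimate is the policy gradient; one standard form, with the critic's advantage estimate A(s, a) used to reduce variance, is:

$$\nabla_{\theta} J(\theta) = \mathbb{E}_{\pi_{\theta}}\!\left[\nabla_{\theta} \log \pi_{\theta}(a \mid s)\, A(s, a)\right]$$

The actor nudges its parameters θ in this direction, while the critic supplies the advantage estimate.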
Deep Learning
Deep learning is a subfield of machine learning that utilizes artificial
neural networks with multiple layers to achieve complex pattern
recognition. These networks are particularly effective for tasks
involving large amounts of data, such as image recognition and natural
language processing.

● Artificial Neural Networks (ANNs) – A popular model inspired by the structure and
function of the human brain. It consists of interconnected nodes organised into layers and
is used for a wide range of ML tasks.
● Convolutional Neural Networks (CNNs) – A deep learning model that automatically learns
spatial hierarchies of features from the input data. It is commonly used for image
recognition and classification.
● Recurrent Neural Networks (RNNs) – A model designed for processing sequential data. Its
feedback connections give the network a form of memory over previous inputs.
● Long Short-Term Memory Networks (LSTMs) – A variant of Recurrent Neural Networks whose
gating mechanism allows the network to retain information over long sequences, addressing
the vanishing-gradient problem of standard RNNs. A small network sketch follows below.
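A minimal sketch of a small fully connected (ANN-style) network, assuming TensorFlow/Keras is installed; the layer sizes and the toy binary task are illustrative only:

```python
# A tiny feed-forward network trained on a synthetic binary task.
import numpy as np
import tensorflow as tf

X = np.random.rand(500, 20).astype("float32")   # 500 samples, 20 features
y = (X.sum(axis=1) > 10).astype("float32")      # toy binary target

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # probability of the positive class
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=32, verbose=0)
print(model.evaluate(X, y, verbose=0))           # [loss, accuracy]
```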
Regression Concepts
● Regression analysis is a predictive modelling technique that investigates the
relationship between a dependent variable and one or more independent variables.
● Linear regression models the linear relationship between the independent (predictor)
variable on the X-axis and the dependent (output) variable on the Y-axis.
● If there is a single input variable X (independent variable), the model is called simple
linear regression.

● The graph presents the linear relationship between the output (y) variable and the
predictor (X) variable.
● The blue line is referred to as the best-fit straight line: based on the given data
points, we attempt to plot the line that fits the points best.
● To calculate the best-fit line, linear regression uses the traditional slope-intercept
form, given below.
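In standard notation, with m the slope and c the intercept estimated by least squares, the form is:

$$y = mx + c, \qquad m = \frac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i}(x_i - \bar{x})^{2}}, \qquad c = \bar{y} - m\,\bar{x}$$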

Assumptions of Linear Regression:
Regression is a parametric approach, which means that it makes
assumptions about the data for the purpose of analysis. For successful
regression analysis, it’s essential to validate the following assumptions:
1. Linearity of residuals: There needs to be a linear relationship
between the dependent variable and independent variable(s).

2. Independence of residuals:
● The error terms should not be dependent on one another (as in time-series data, where
the next value depends on the previous one).
● There should be no correlation between the residual terms; the presence of such
correlation is known as autocorrelation.
● There should not be any visible patterns in the error terms.

3. Normal distribution of residuals:
● The residuals should follow a normal distribution with a mean equal to zero, or close to
zero.
● This is checked to confirm whether the selected line is actually the line of best fit.
● If the error terms are not normally distributed, it suggests that there are a few unusual
data points that must be studied closely to build a better model.

4. Equal variance of residuals:
● The error terms must have constant variance; this property is known as homoscedasticity.
● The presence of non-constant variance in the error terms is referred to as
heteroscedasticity.
● Non-constant variance generally arises in the presence of outliers or extreme leverage
values. A residual-diagnostics sketch follows below.
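A minimal sketch of checking these residual assumptions after an ordinary least squares fit, assuming statsmodels and SciPy are available; the synthetic data and the specific tests chosen (Durbin-Watson, Breusch-Pagan, Shapiro-Wilk) are illustrative:

```python
# Fit OLS, then inspect the residuals for mean-zero, independence,
# constant variance, and normality.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.diagnostic import het_breuschpagan
from scipy import stats

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.0 * X.ravel() + 1.0 + rng.normal(0, 1, 100)

results = sm.OLS(y, sm.add_constant(X)).fit()
resid = results.resid

print("mean of residuals (should be ~0):", resid.mean())
print("Durbin-Watson (~2 means no autocorrelation):", durbin_watson(resid))
print("Breusch-Pagan p-value (homoscedasticity):",
      het_breuschpagan(resid, results.model.exog)[1])
print("Shapiro-Wilk p-value (normality):", stats.shapiro(resid).pvalue)
```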

Multiple Linear Regression

Feature Selection: With more variables present, selecting the optimal set of predictors from
the pool of available features (many of which may be redundant) becomes an important task
for building a relevant and better model; a small selection sketch follows below.
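A minimal feature-selection sketch using recursive feature elimination (RFE) from scikit-learn; the synthetic dataset, in which only 5 of the 15 features are informative, is illustrative:

```python
# Pick a subset of predictors by recursively dropping the weakest features.
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=15, n_informative=5,
                       noise=5.0, random_state=0)

selector = RFE(LinearRegression(), n_features_to_select=5).fit(X, y)
print("kept features:", [i for i, keep in enumerate(selector.support_) if keep])
```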
Overfitting and Underfitting in Linear Regression
● There are often situations where a model performs well on the training data but not on
the test data.
● Overfitting and underfitting are the most common problems encountered when training
models on a dataset.
● Before understanding overfitting and underfitting, one must first understand bias and
variance. A short sketch that exposes overfitting by comparing training and test scores
follows below.
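A minimal sketch that exposes overfitting by comparing training and test scores, assuming scikit-learn; the deliberately high polynomial degree is there only to make the gap visible:

```python
# Compare train vs. test R^2 as model complexity grows: a large gap signals overfitting.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.2, 40)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in (1, 3, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(degree, "train R2:", round(model.score(X_train, y_train), 3),
          "test R2:", round(model.score(X_test, y_test), 3))
```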
Thank You

