Assignment 4 - Answer Key

What is a Regularization Technique in Machine Learning? Discuss different
types of regularization techniques in detail. ✅
Regularization in Machine Learning:
Definition:
Regularization is a technique used in machine learning to prevent overfitting and improve the
generalization of models. Overfitting occurs when a model fits the training data too closely,
capturing noise and irrelevant patterns, leading to poor performance on new, unseen data.
Regularization introduces a penalty term to the model's objective function, discouraging the use
of overly complex models.
Objective Function with Regularization Term:
The regularized objective function J(θ) is a combination of the original loss function L(θ)
and a regularization term R(θ):

J(θ) = L(θ) + α · R(θ)

Here, α is the regularization parameter, controlling the strength of regularization. The two most
common types of regularization are L1 regularization (Lasso) and L2 regularization (Ridge), but
there are other variations as well.
Types of Regularization Techniques:
1. L1 Regularization (Lasso):
Regularization Term: R(θ) = ‖θ‖_1 = ∑_{j=1}^{n} |θ_j|

Effect: Encourages sparsity in the model by driving some weights to exactly zero.
Use Case: Feature selection, when there is a belief that some features are irrelevant.
2. L2 Regularization (Ridge):
Regularization Term: R(θ) = ‖θ‖_2^2 = ∑_{j=1}^{n} θ_j^2

Effect: Shrinks the weights smoothly, preventing them from taking extreme values.
Use Case: Generally used to prevent multicollinearity.
3. Elastic Net:
Regularization Term: R(θ) = α · ‖θ‖_1 + (1 − α) · ‖θ‖_2^2, where α here acts as the mixing ratio between the L1 and L2 penalties.

Combination: Combines L1 and L2 regularization.


Use Case: When a dataset has a large number of features and may exhibit
multicollinearity.
4. Dropout:
Technique: Randomly drops (sets to zero) a fraction of neurons during training.
Effect: Prevents co-adaptation of hidden units, acting as a form of regularization for
neural networks.
Use Case: Commonly used in deep learning for neural network regularization.
5. Early Stopping:
Technique: Halts training when performance on a validation set starts to degrade.
Effect: Prevents the model from overfitting the training data too closely.
Use Case: Simple and effective for iterative optimization algorithms.
6. Batch Normalization:
Technique: Normalizes the inputs of each layer to have zero mean and unit variance.
Effect: Acts as a form of regularization by reducing internal covariate shift.
Use Case: Commonly used in deep learning for stabilizing and accelerating training.
Regularization techniques are crucial for building models that generalize well to new data and
avoid overfitting. The choice of regularization method depends on the characteristics of the data
and the specific requirements of the problem at hand.
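As an illustrative sketch only (it assumes scikit-learn is available, and the synthetic dataset and parameter values below are made up), the L1, L2, and Elastic Net penalties correspond to the Lasso, Ridge, and ElasticNet estimators, with alpha controlling the regularization strength:

```python
# Hedged sketch: comparing L1, L2, and Elastic Net regularization with scikit-learn.
import numpy as np
from sklearn.linear_model import Lasso, Ridge, ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=100)    # only feature 0 is truly relevant

lasso = Lasso(alpha=0.1).fit(X, y)                     # L1: drives many weights to exactly zero
ridge = Ridge(alpha=1.0).fit(X, y)                     # L2: shrinks weights, rarely zeroes them
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)   # mix of L1 and L2 penalties

print("non-zero Lasso weights:", np.count_nonzero(lasso.coef_))
print("non-zero Ridge weights:", np.count_nonzero(ridge.coef_))
```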
Give a detailed note on kernel methods. ✅
Kernel methods are a class of algorithms in machine learning that operate by implicitly
mapping input data into a higher-dimensional space, known as the feature space. These
methods are particularly prevalent in tasks such as classification, regression, and clustering.
The essence of kernel methods lies in the use of kernels, which measure the similarity
between pairs of data points in the input space without explicitly computing the
transformation.
Kernel Function
A kernel is a function that computes the similarity (or inner product) between pairs of data
points in the original input space or the transformed feature space.
E.g., linear, polynomial, radial basis function (RBF), etc.
Feature Mapping
The kernel implicitly represents a mapping of the input data into a higher-dimensional feature
space.
Feature mapping allows the algorithm to operate in a space where the data might be more
separable or where certain patterns are more evident.
Kernel Trick
The kernel trick is a computational shortcut that allows the computation of dot products in
the feature space without explicitly calculating the transformed feature vectors.
It is computationally more efficient and avoids the need to store or compute the high-
dimensional feature vectors explicitly.
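A minimal sketch of the kernel trick (the helper names below are illustrative): the degree-2 polynomial kernel (x·z + 1)^2 equals the dot product of an explicit quadratic feature map, but it never constructs that map.

```python
# Sketch: kernel value in input space vs. dot product in the explicit feature space.
import numpy as np

def poly_kernel(x, z):
    # degree-2 polynomial kernel, computed directly in the input space
    return (x @ z + 1.0) ** 2

def explicit_map(v):
    # the corresponding explicit feature map for 2-D inputs
    x1, x2 = v
    return np.array([x1**2, x2**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1,
                     np.sqrt(2) * x2,
                     1.0])

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])
print(poly_kernel(x, z))                  # 0.25, no feature map ever built
print(explicit_map(x) @ explicit_map(z))  # 0.25, same value via the explicit mapping
```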
Advantages
1. Nonlinearity:
Kernel methods allow capturing complex, nonlinear relationships in the data, making
them suitable for a wide range of real-world problems.
2. Efficiency with the Kernel Trick:
The kernel trick enables efficient computations in high-dimensional spaces without the
need to explicitly transform the data.
3. Versatility:
Kernel methods are versatile and applicable to various machine learning tasks, including
classification, regression, and dimensionality reduction.

Define clustering. What are the different types of clustering? Explain in detail. ✅
Clustering Definition
The task of grouping a set of objects in such a way that objects in the same group (cluster) are more
similar, in some sense, to each other than to those in other clusters.
A main task of data mining, used in many fields, including machine learning, pattern
recognition, image analysis, etc.

k-means clustering
The k-means clustering algorithm is one of the simplest unsupervised learning algorithms for
solving the clustering problem.
1. For a given dataset, fix the number of clusters, k.
2. k points are chosen as the "centers" of the clusters, one for each cluster.
3. We then associate each data point with the nearest center.
4. We then take the average of the data points associated with each center and replace that
center with the average.
5. We repeat the process until the centers converge (stop changing).
6. The data points nearest to each center form the various clusters in the dataset.
The end objective is to partition the set X into k mutually disjoint subsets
S = {S_1, S_2, ..., S_k}, with a corresponding set of cluster centers V = {v_1, ..., v_k},
that minimize the following within-cluster sum of squares:

∑_{i=1}^{k} ∑_{x ∈ S_i} ‖x − v_i‖^2
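A minimal NumPy sketch of the steps above (illustrative only, not an optimized implementation; the random initialization and stopping rule are simplifying assumptions):

```python
# Hedged k-means sketch: assign points to the nearest centers, then move centers to the means.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]    # step 2: pick k initial centers
    for _ in range(n_iter):
        # step 3: associate each data point with the nearest center
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        # step 4: replace each center with the average of its assigned points
        new_centers = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                                else centers[i] for i in range(k)])
        if np.allclose(new_centers, centers):                 # step 5: stop when centers converge
            break
        centers = new_centers
    return centers, labels

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5.0])
centers, labels = kmeans(X, k=2)
```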

Hierarchical Clustering
A method of cluster analysis which seeks to build a hierarchy of clusters in a given dataset.
Produces clusters in which clusters at each level are created by merging clusters at the next
lower level.
At the lowest level, each cluster contains a single observation.
At the highest level, there is only one cluster containing all of the data.
Whether two clusters should be merged or not is decided on the basis of a
measure of dissimilarity between the clusters.
Density-based Clustering
Clusters are defined as areas of higher density than the remainder of the data set.
Objects in the sparse areas (which are used to separate clusters) are usually considered to be
noise or border points.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is the most popular
density-based clustering method.
For test, not for assignment

Discuss PCA and Kernel PCA. ✅


Principal Component Analysis (PCA):
Definition: Principal Component Analysis (PCA) is a dimensionality reduction technique used in
machine learning and statistics. Its primary goal is to transform high-dimensional data into a
lower-dimensional representation while retaining as much of the original variance as possible.
PCA identifies a set of orthogonal axes, called principal components, along which the data varies
the most.
Steps of PCA:
1. Standardization:
Standardize the features of the dataset to have zero mean and unit variance.
2. Covariance Matrix Calculation:
Compute the covariance matrix of the standardized data.
3. Eigen decomposition:
Perform eigen decomposition on the covariance matrix to obtain the eigenvalues and
eigenvectors.
4. Selection of Principal Components:
Select the top k eigenvectors corresponding to the k largest eigenvalues, where k is the
desired dimensionality of the reduced data.
5. Projection:
Project the original data onto the selected principal components to obtain the reduced-
dimensional representation.
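A minimal NumPy sketch of steps 1-5 above (illustrative only; it assumes every feature has non-zero variance):

```python
# Hedged PCA sketch following the steps above.
import numpy as np

def pca(X, k):
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)   # 1. standardize to zero mean, unit variance
    cov = np.cov(X_std, rowvar=False)              # 2. covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)         # 3. eigendecomposition (symmetric matrix)
    order = np.argsort(eigvals)[::-1]              # 4. keep the k largest eigenvalues
    components = eigvecs[:, order[:k]]
    return X_std @ components                      # 5. project onto the principal components

X = np.random.randn(100, 5)
X_reduced = pca(X, k=2)
print(X_reduced.shape)   # (100, 2)
```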
Kernel Principal Component Analysis (Kernel PCA):
Kernel Principal Component Analysis (Kernel PCA) is an extension of PCA that incorporates the
use of kernel functions to handle non-linear relationships in the data. Traditional PCA is effective
for linear relationships, but when the data exhibits non-linear patterns, Kernel PCA is a valuable
alternative.
Steps of Kernel PCA:
1. Kernel Matrix Calculation:
Compute the kernel matrix using a chosen kernel function (e.g., radial basis function
(RBF), polynomial).
2. Centering the Kernel Matrix:
Center the kernel matrix to ensure zero mean in the feature space.
3. Eigenvalue Decomposition:
Perform eigendecomposition on the centered kernel matrix to obtain eigenvalues and
eigenvectors.
4. Selection of Principal Components:
Select the top k eigenvectors corresponding to the k largest eigenvalues.
5. Projection:
Project the original data onto the selected principal components to obtain the reduced-
dimensional representation.
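As an illustrative comparison (it assumes scikit-learn; the dataset and gamma value are arbitrary choices), kernel PCA with an RBF kernel can unfold non-linear structure that linear PCA cannot:

```python
# Hedged sketch: linear PCA vs. kernel PCA on concentric circles.
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

X_pca = PCA(n_components=2).fit_transform(X)                                  # linear projection
X_kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X)   # non-linear projection

# After linear PCA the two rings still overlap; after RBF kernel PCA the projected
# coordinates make the inner circle separable from the outer one.
```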

Explain in detail the concepts of Kernels and K-Means. ✅


Kernel in Machine Learning:
Definition: A kernel in machine learning is a function that measures the similarity or proximity
between pairs of data points. Kernels play a crucial role in algorithms that involve inner products
or distances between data points. They are often used to implicitly map data into a higher-
dimensional space, allowing linear algorithms to operate effectively in non-linear feature spaces.
k-means Clustering
The k-means clustering algorithm is one of the simplest unsupervised learning algorithms for
solving the clustering problem.
1. For a given dataset, fix the number of clusters, k.
2. k points are chosen as the "centers" of the clusters, one for each cluster.
3. We then associate each data point with the nearest center.
4. We then take the average of the data points associated with each center and replace that
center with the average.
5. We repeat the process until the centers converge (stop changing).
6. The data points nearest to each center form the various clusters in the dataset.
The end objective is to partition the set X into k mutually disjoint subsets
S = {S_1, S_2, ..., S_k}, with a corresponding set of cluster centers V = {v_1, ..., v_k},
that minimize the following within-cluster sum of squares:

∑_{i=1}^{k} ∑_{x ∈ S_i} ‖x − v_i‖^2
What is ensemble learning? Discuss different types of ensemble learning
techniques. ✅
Ensemble Learning
A technique that creates multiple models and then combines them to produce improved
results. It usually produces more accurate solutions than a single model would.

Some components that can be combined include:

Different learning algorithms
The same learning algorithm trained in different ways
The same learning algorithm trained in the same way
Model Combination Schemes - Combining Multiple Learners
Models composed of multiple learners that complement each other so that combining them
gives a higher accuracy.
Multi-Expert Combination
Multi-Expert combination methods have base-learners that work in parallel. These methods can
in turn be divided into two:
In the global approach, also called learner fusion, given an input, all base-learners generate an
output and all these outputs are used.
Examples are voting and stacking.
In the local approach, or learner selection, for example, in mixture of experts, there is a gating
model, which looks at the input and chooses one (or very few) of the learners as responsible
for generating the output.
Multi-Stage Combination
These models use a serial approach where the next base-learner is trained with or tested on
only the instances where the previous base-learners are not accurate enough.
Base-Learners are sorted on the basis of complexity. This ensures that a more complex base-
learner is not used unless the preceding simpler base-learners are not confident.
Example: Cascading
With L base-learners:
- The prediction of base-learner M_j, given some input x, is d_j(x).
- The final prediction is y = f(d_1, d_2, d_3, ..., d_L | ϕ), where ϕ are the parameters of the combiner.
- For multi-class classification, there will be K outputs d_ji(x) for each learner; these values are
combined into y_i, and we choose the class with the maximum value.

Voting
The simplest way to combine multiple classifiers is by voting, which corresponds to taking a
linear combination of the learners' outputs.
This scheme is also known as an ensemble or a linear opinion pool.
In the simplest case, all learners have equal weight and we have simple voting, which corresponds
to taking an average.
There are several other combination rules as well (max, min, product, median, average).
If the outputs are not posterior probabilities, they need to be normalized to the same scale.
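A tiny illustrative sketch of simple voting (the probability values below are made up): with equal weights, combining learners amounts to averaging their outputs and picking the class with the largest combined value.

```python
# Hedged sketch: equal-weight voting over three learners' posterior probabilities.
import numpy as np

p1 = np.array([0.2, 0.5, 0.3])   # learner 1's posteriors over 3 classes
p2 = np.array([0.1, 0.7, 0.2])   # learner 2
p3 = np.array([0.4, 0.4, 0.2])   # learner 3

combined = np.mean([p1, p2, p3], axis=0)   # simple voting = averaging
print(combined, "-> predicted class:", int(np.argmax(combined)))
```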
Error-Correcting Output Codes
A technique that allows a multi-class classification problem to be reframed as multiple binary
classification problems, allowing the use of native binary classification models directly.
These are different from one-vs-rest and one-vs-one methods.
Here each class is encoded as a binary code, and each bit of the code is handled by a separate binary classifier.
1. Binary Decomposition:
In ECOC, each class is represented by a unique binary code (a set of bits). The number of
bits in the code is determined by the total number of classes in the problem.
For example, if there are K classes, each class is assigned a unique binary code with at least
⌈log2(K)⌉ bits; longer codes add redundancy for error correction.
2. Training Binary Classifiers:
For each bit in the binary code, a binary classifier is trained to distinguish between the
classes that have a 1 in that bit and those that have a 0.
This results in one binary classifier per code bit, each solving a binary classification problem.
3. Decoding Predictions:
During prediction, each binary classifier produces a binary decision for its assigned bit.
The final predicted class is determined by decoding the binary outputs from all
classifiers, and the class with the closest binary code to the decoded output is chosen.
4. Error Correction:
The binary codes are designed such that the Hamming distance between any two class
codes is large. Hamming distance is the number of bits in which two binary strings differ.
This design allows for error correction. Even if some binary classifiers make incorrect
predictions, the correct class can still be identified based on the overall closest match.
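A minimal decoding sketch (the code matrix and bit outputs below are hypothetical): each class has a binary code, and prediction picks the class whose code is closest in Hamming distance to the classifiers' outputs, so a single flipped bit can be corrected.

```python
# Hedged ECOC decoding sketch.
import numpy as np

code_matrix = np.array([      # 4 classes, 6-bit codes (hypothetical design)
    [0, 0, 0, 1, 1, 1],
    [0, 1, 1, 0, 0, 1],
    [1, 0, 1, 0, 1, 0],
    [1, 1, 0, 1, 0, 0],
])

bit_outputs = np.array([1, 0, 1, 0, 0, 0])        # predictions of the 6 binary classifiers
hamming = (code_matrix != bit_outputs).sum(axis=1)
print(hamming, "-> predicted class:", int(np.argmin(hamming)))   # class 2, despite one wrong bit
```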
Bagging
Bootstrap Aggregating: involves having each model in the ensemble vote with equal weight.
However, to promote model variance, bagging trains each model on a randomly drawn subset
(a bootstrap sample) of the training set.
Random Forest Algorithm: Combines random decision trees with bagging to achieve very high
classification accuracy
The benefit is that we get many learners that each perform slightly differently.
Classification accuracy can be improved without complicating the analysis.
What is Boosting? Discuss with a neat, relevant example. ✅
Boosting in Machine Learning:
Definition:
Boosting is an ensemble learning technique that combines the predictions of multiple weak
learners (typically simple models) to create a strong learner with improved predictive
performance. The key idea behind boosting is to sequentially train weak models, giving more
weight to the instances that were misclassified by the previous models. This adaptive boosting
process aims to correct errors and improve overall accuracy.
Steps in Boosting:
1. Initialization:
Assign equal weights to all training instances.
2. Model Training:
Train a weak learner on the dataset with the current instance weights.
3. Compute Error:
Calculate the error of the weak learner on the training set.
4. Adjust Weights:
Increase the weights of misclassified instances, making them more influential in the next
iteration.
5. Repeat:
Repeat steps 2-4 for a predefined number of iterations or until a specified level of
accuracy is achieved.
6. Combine Models:
Combine the predictions of all weak learners, typically using a weighted sum.
Example of Boosting: AdaBoost (Adaptive Boosting):
Let's consider AdaBoost, one of the most popular boosting algorithms, with a simple example:
Problem
Classify whether emails are spam or not based on two features (e.g., number of words and
presence of specific keywords).
Steps:
1. Iteration 1:
Train a weak learner (e.g., a decision stump) on the dataset with equal weights.
Misclassified instances are given higher weights.
2. Iteration 2:
Train another weak learner, giving more weight to the previously misclassified instances.
Adjust weights based on errors.
3. Iteration 3:
Repeat the process for multiple iterations, each time focusing more on instances that
were misclassified in the previous iterations.
4. Final Model:
Combine the predictions of all weak learners with weighted voting.
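As an illustrative sketch only (it assumes scikit-learn; the synthetic dataset stands in for the spam example, and by default the weak learner is a depth-1 decision stump):

```python
# Hedged sketch: AdaBoost combining many weak learners on a toy two-feature task.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=500, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

model = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
print("training accuracy:", model.score(X, y))
```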
Benefits of Boosting:
Improved Accuracy: Boosting often results in higher accuracy compared to individual weak
learners.
Robustness: Boosting can handle noise and outliers effectively.
Adaptability: The algorithm adapts to misclassifications, focusing on instances that are
challenging to classify.
Considerations:
Overfitting: Boosting can be prone to overfitting, especially if the number of iterations is too
high.
Sensitive to Noisy Data: Boosting may not perform well with noisy data.
Boosting, with algorithms like AdaBoost, is widely used in various domains, including image
recognition, text classification, and bioinformatics, where accurate predictions are crucial, and
the data may be complex and noisy.
Explain the concept of Bagging and its uses. ✅
Bootstrap Aggregating (Bagging): involves having each model in the ensemble vote with equal weight.
However, to promote model variance, bagging trains each model on a randomly drawn subset
(a bootstrap sample) of the training set.
The benefit is that we get many learners that each perform slightly differently.
Classification accuracy can be improved without complicating the analysis.
Uses
1. Random Forest Algorithm:
Random Forest is a popular algorithm that combines bagging with random decision
trees to achieve high classification accuracy.
Benefit: By introducing randomness in the construction of each decision tree, Random
Forest creates diverse learners, improving the ensemble's overall performance.
2. Diversity for Improved Generalization:
Bagging promotes diversity among base models by training them on randomly drawn
subsets of the training set.
Advantage: The diverse set of learners captures different aspects of the underlying
patterns in the data, enhancing the ensemble's generalization to new, unseen data.
3. Simplicity and Accuracy:
Bagging provides a simple yet effective way to improve the accuracy of classification
models.
Advantage: The ensemble approach aggregates predictions from multiple models,
leading to a more robust and accurate overall prediction without complicating the
analysis.
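An illustrative sketch (it assumes scikit-learn; by default BaggingClassifier uses decision trees as base learners, each trained on its own bootstrap sample):

```python
# Hedged sketch: bagging decision trees on bootstrap samples of the training set.
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
bagging = BaggingClassifier(n_estimators=50, random_state=0)   # 50 trees, each on a bootstrap sample
print("cross-validated accuracy:", cross_val_score(bagging, X, y, cv=5).mean())
```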
What is Random Forest? Explain with an example. ✅
A random forest is an ensemble learning method where multiple decision trees are constructed
and then they are merged to get more accurate predictions
1. The random forest algorithm generates many classification trees. Each tree is generated as
follows:
1. If the number of examples in the training set is N, take a sample of N examples at
random, but with replacement from the original data.
2. The sample will be the set for generating the tree
3. If there are M input variables, a number m is specified such that at each node, m
variables are selected at random out of the M and the best split on these m is used to
split the node.
1. The value of m is held constant during the generation of the various trees in the
forest
4. Each tree is grown to the largest extent possible
2. To classify a new object from an input vector, put the input vector down each of the trees in
the forest. Each tree gives a classification, and the trees vote for the class.
3. The forest then chooses the classification having the most votes over all the trees.
Example
Dataset:
Suppose we have the following training dataset:
Hours of Study   Previous Exam Score   Outcome
2                50                    Fail
5                80                    Pass
1                30                    Fail
6                90                    Pass
3                60                    Fail
8                85                    Pass
7                75                    Pass
4                45                    Fail
Steps to Demonstrate Random Forest:
1. Create Decision Trees:
Random Forest builds multiple decision trees. Each tree is trained on a random subset of
the data with replacement (bootstrapping). For simplicity, let's say we build three
decision trees:
Tree 1: Trained on a subset of data: (2, 50), (6, 90), (4, 45), (3, 60), (5, 80)
Tree 2: Trained on a subset of data: (7, 75), (5, 80), (3, 60), (1, 30), (8, 85)
Tree 3: Trained on a subset of data: (2, 50), (1, 30), (7, 75), (4, 45), (6, 90)
2. Make Predictions:
Each tree makes a prediction for each input. For classification, the majority vote is taken.
For regression, the average is calculated.
For simplicity, let's say Tree 1 predicts (Fail, Pass, Fail, Fail, Pass) for the five instances it
was trained on. Similarly, we get predictions from Tree 2 and Tree 3.
3. Majority Voting:
Combining the predictions of all trees, we take the majority vote. For example, if two
trees predict "Pass" and one predicts "Fail," the final prediction is "Pass."
4. Final Prediction:
Based on the majority vote, we determine the final prediction for each instance
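A hedged sketch of the same idea (it assumes scikit-learn; the tiny dataset above is used directly, and the query point is made up):

```python
# Hedged sketch: a 3-tree random forest on the study-hours dataset, with majority voting.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.array([[2, 50], [5, 80], [1, 30], [6, 90], [3, 60], [8, 85], [7, 75], [4, 45]])
y = np.array(["Fail", "Pass", "Fail", "Pass", "Fail", "Pass", "Pass", "Fail"])

forest = RandomForestClassifier(n_estimators=3, random_state=0).fit(X, y)
print(forest.predict([[5, 70]]))   # each tree votes; the majority class is returned
```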

Explain boosting and the AdaBoost algorithm with a neat sketch.


Boosting:
Definition: Boosting is an ensemble learning technique that combines the predictions of multiple
weak learners (simple models) to create a strong learner with improved predictive performance.
The key idea behind boosting is to sequentially train weak models, giving more weight to the
instances that were misclassified by the previous models. This adaptive boosting process aims to
correct errors and improve overall accuracy.
AdaBoost (Adaptive Boosting):
Training
AdaBoost refers to a particular method of training a boosted classifier. A boosted classifier is a
classifier of the form
F_T(x) = ∑_{t=1}^{T} f_t(x)

where each f_t is a weak learner that takes an object x as input and returns a value indicating the
class of the object. For example, in the two-class problem, the sign of the weak learner's output
identifies the predicted object class and the absolute value gives the confidence in that
classification. Similarly, the t-th classifier's output is positive if the sample belongs to the positive
class and negative otherwise.
Each weak learner produces an output hypothesis h which fixes a prediction h(x_i) for each sample
x_i in the training set. At each iteration t, a weak learner is selected and assigned a coefficient α_t
such that the total training error E_t of the resulting t-stage boosted classifier is minimized:

E_t = ∑_i E[F_{t−1}(x_i) + α_t h(x_i)]

Here F_{t−1}(x) is the boosted classifier that has been built up to the previous stage of training, and
f_t(x) = α_t h(x) is the weak learner that is being considered for addition to the final classifier.
Weighting
At each iteration t of the training process, a weight w_{i,t} is assigned to each sample in the training
set, equal to the current error E(F_{t−1}(x_i)) on that sample. These weights can be used in the
training of the weak learner. For instance, decision trees can be grown which favor the splitting of
sets of samples with large weights.
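A minimal sketch of one re-weighting step (discrete AdaBoost with labels in {-1, +1}; the toy predictions are made up, and the small constant guards against division by zero):

```python
# Hedged sketch: AdaBoost's sample re-weighting after one weak learner.
import numpy as np

def update_weights(w, y_true, y_pred):
    err = np.sum(w * (y_pred != y_true)) / np.sum(w)     # weighted training error of the weak learner
    alpha = 0.5 * np.log((1 - err) / (err + 1e-10))      # coefficient alpha_t for this learner
    w_new = w * np.exp(-alpha * y_true * y_pred)         # misclassified samples get larger weights
    return w_new / w_new.sum(), alpha

w = np.full(6, 1 / 6)                                    # start with equal weights
y_true = np.array([+1, +1, -1, -1, +1, -1])
y_pred = np.array([+1, -1, -1, -1, +1, +1])              # the weak learner makes two mistakes
w, alpha = update_weights(w, y_true, y_pred)
print(np.round(w, 3), round(alpha, 3))                   # the two misclassified samples now weigh more
```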

Key Differences from normal Boosting


1. Weighted Training:
In AdaBoost, each training example is assigned an initial weight, and these weights are
adjusted in each iteration based on the performance of the weak learner. This weighting
helps the algorithm focus on examples that are difficult to classify.
2. Sequential Training:
AdaBoost trains weak learners sequentially. In each iteration, a new weak learner is
trained to correct the mistakes of the combined model from the previous iterations.
3. Weighted Voting:
The final model in AdaBoost is a weighted combination of weak learners. The weight of
each weak learner depends on its accuracy, with higher accuracy leading to a higher
weight in the final combination.
4. Adaptive Nature:
AdaBoost adaptively adjusts the importance of different training examples over
iterations, giving more emphasis to misclassified examples. This adaptability
contributes to the overall robustness of the algorithm.
