Assignment4 - AnswerKey
What is a Regularization Technique in Machine Learning? Discuss different types of regularization techniques in detail. ✅
Regularization in Machine Learning:
Definition:
Regularization is a technique used in machine learning to prevent overfitting and improve the
generalization of models. Overfitting occurs when a model fits the training data too closely,
capturing noise and irrelevant patterns, leading to poor performance on new, unseen data.
Regularization introduces a penalty term to the model's objective function, discouraging the use
of overly complex models.
Objective Function with Regularization Term:
The regularized objective function J(θ) is a combination of the original loss function L(θ) and a penalty term R(θ):

J(θ) = L(θ) + α · R(θ)

Here, α is the regularization parameter, controlling the strength of regularization. The two most common types of regularization are L1 regularization (Lasso) and L2 regularization (Ridge), but there are other variations as well.
Types of Regularization Techniques:
1. L1 Regularization (Lasso):
Regularization Term: R(θ) = ∥θ∥_1 = ∑_{j=1}^{n} |θ_j|
Effect: Encourages sparsity in the model by penalizing some weights to exactly zero.
Use Case: Feature selection, when there is a belief that some features are irrelevant.
2. L2 Regularization (Ridge):
Regularization Term: R(θ) = ∥θ∥_2² = ∑_{j=1}^{n} θ_j²
Effect: Smooths the weights, preventing them from taking extreme values.
Use Case: Generally used to prevent multicollinearity.
3. Elastic Net:
Regularization Term: R(θ) = α · ∥θ∥_1 + (1 − α) · ∥θ∥_2²
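A brief sketch of these three penalties with scikit-learn (the data below is synthetic and purely illustrative; scikit-learn's alpha plays the role of the regularization strength, and ElasticNet's l1_ratio plays roughly the role of the mixing coefficient α in the Elastic Net term):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge, ElasticNet

# Synthetic regression data: 10 features, only the first 3 are informative
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
true_coef = np.array([3.0, -2.0, 1.5] + [0.0] * 7)
y = X @ true_coef + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.1).fit(X, y)                     # L1: many coefficients driven exactly to zero
ridge = Ridge(alpha=0.1).fit(X, y)                     # L2: coefficients shrunk, rarely exactly zero
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)   # mix of the L1 and L2 penalties

print("Lasso coefficients:     ", np.round(lasso.coef_, 2))
print("Ridge coefficients:     ", np.round(ridge.coef_, 2))
print("ElasticNet coefficients:", np.round(enet.coef_, 2))
```

Comparing the printed coefficients illustrates the stated effects: the Lasso solution is sparse, while the Ridge solution keeps all weights small but non-zero.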
k-means clustering
The k-means clustering algorithm is one of the simplest unsupervised learning algorithms for
solving the clustering problem.
1. For a given dataset, decide on the number of clusters k.
2. k points are chosen as the "centers" for the clusters, one for each cluster.
3. We then associate each data point with the nearest center.
4. We then take averages of the data points associated with a center and replace the center
with the average (for each center point)
5. We repeat the process until the centers converge, i.e. stop changing.
6. The data points nearest to each center form the various clusters in the dataset.
The end objective is to partition the set X into k mutually disjoint subsets S = {S_1, S_2, ..., S_k} and to find a set of cluster centers V = {v_1, ..., v_k} which minimize the following within-cluster sum of squared distances:

∑_{i=1}^{k} ∑_{x ∈ S_i} ∥x − v_i∥²
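A minimal sketch with scikit-learn's KMeans (the 2-D points below are hypothetical; the fit loop alternates the assign-to-nearest-center and recompute-average steps described above):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D data: two loose groups of points
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
              [5.0, 5.2], [5.1, 4.8], [4.9, 5.0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print("Cluster labels: ", kmeans.labels_)           # which S_i each point belongs to
print("Cluster centers:", kmeans.cluster_centers_)  # the v_i after convergence
print("Within-cluster sum of squares:", kmeans.inertia_)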
Hierarchical Clustering
A method of cluster analysis which seeks to build a hierarchy of clusters in a given dataset.
Produces clusters in which clusters at each level are created by merging clusters at the next
lower level.
At the lowest level, each cluster contains a single observation.
At the highest level, there is only one cluster containing all of the data.
Whether two clusters should be merged or not is a decision taken on the basis of the
measure of dissimilarity between the clusters.
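A brief sketch of bottom-up (agglomerative) hierarchical clustering with SciPy (the observations are hypothetical; Ward linkage is one possible choice of dissimilarity measure between clusters):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical 2-D observations
X = np.array([[1.0, 1.0], [1.1, 0.9], [5.0, 5.0], [5.2, 4.9], [9.0, 1.0]])

# Start with one cluster per observation, then repeatedly merge the two
# least-dissimilar clusters until a single cluster remains.
Z = linkage(X, method="ward")

# Cut the hierarchy at the level that yields 2 clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```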
Density based Clustering
Clusters are defined as areas of higher density than the remainder of the data set.
Objects in the sparse areas (that are used to separate clusters) are usually considered to be
noise and border points.
DBSCAN (the most popular density-based clustering method): Density-Based Spatial Clustering of Applications with Noise.
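A minimal sketch with scikit-learn's DBSCAN (the points are hypothetical; eps is the neighbourhood radius and min_samples is the number of points needed to form a dense region):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical data: two dense blobs plus one isolated point
X = np.array([[1.0, 1.0], [1.1, 1.0], [0.9, 1.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.1],
              [9.0, 9.0]])  # sparse point, expected to be labelled as noise

db = DBSCAN(eps=0.5, min_samples=2).fit(X)

# Label -1 marks noise points in the sparse areas between clusters
print(db.labels_)   # e.g. [0 0 0 1 1 1 -1]
```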
For test, not for assignment
What is ensemble learning? Discuss different types of ensemble learning techniques. ✅
Ensemble Learning
A technique that creates multiple models and then combines them to produce improved results. It usually produces more accurate solutions than a single model would.
For multi-class classification, each learner produces K outputs y_i, one per class. These values are combined, and then we choose the class with the maximum combined value.
Voting
The simplest way to combine multiple classifiers is by voting, which corresponds to taking a
linear combination of the learners
This is known as ensembles and linear opinion pools.
In the simplest case, all learners have equal weight and we have simple voting, which corresponds to taking an average.
There are several other combination rules as well (max, min, product, median, average)
If the outputs are not posterior probabilities, then they need to be normalized to the same scale.
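A brief sketch of equal-weight voting with scikit-learn's VotingClassifier (the data is synthetic and the three base learners are illustrative choices; "soft" voting averages per-class probabilities, "hard" voting takes a majority vote on predicted labels):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic multi-class data
X, y = make_classification(n_samples=200, n_classes=3, n_informative=5, random_state=0)

ensemble = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("nb", GaussianNB()),
                ("dt", DecisionTreeClassifier(max_depth=3))],
    voting="soft",   # average the learners' class probabilities with equal weights
)
ensemble.fit(X, y)
print(ensemble.predict(X[:5]))
```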
Error-Correcting Output Codes
A technique that allows a multi-class classification problem to be reframed as multiple binary
classification problems, allowing the use of native binary classification models directly.
These are different from one-vs-rest and one-vs-one methods.
Here each class is encoded as a binary code, and a separate binary classifier is trained for each bit of the code.
1. Binary Decomposition:
In ECOC, each class is represented by a unique binary code (a set of bits). The number of
bits in the code is determined by the total number of classes in the problem.
For example, if there are K classes, each class is assigned a unique binary code of at least log₂(K) bits (in practice more, to allow error correction).
2. Training Binary Classifiers:
For each bit in the binary code, a binary classifier is trained to distinguish between the
classes that have a 1 in that bit and those that have a 0.
This results in one binary classifier per code bit, each solving a binary classification problem.
3. Decoding Predictions:
During prediction, each binary classifier produces a binary decision for its assigned bit.
The final predicted class is determined by decoding the binary outputs from all
classifiers, and the class with the closest binary code to the decoded output is chosen.
4. Error Correction:
The binary codes are designed such that the Hamming distance between any two class
codes is large. Hamming distance is the number of bits in which two binary strings differ.
This design allows for error correction. Even if some binary classifiers make incorrect
predictions, the correct class can still be identified based on the overall closest match.
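A brief sketch with scikit-learn's OutputCodeClassifier (iris is used only as a convenient multi-class example, and LinearSVC is one possible choice of binary base classifier):

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OutputCodeClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

# code_size is the ratio of code length to number of classes; values above 1
# add redundant bits, which is what makes error correction possible.
ecoc = OutputCodeClassifier(estimator=LinearSVC(), code_size=2.0, random_state=0)
ecoc.fit(X, y)

# Each binary classifier votes on its bit; the class whose code is closest
# (in Hamming distance) to the decoded output is returned.
print(ecoc.predict(X[:5]))
```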
Bagging
Bootstrap Aggregating: Involves having each model in the ensemble vote with equal weights
However, to promote model variance, bagging trains each model on a randomly drawn subset of
the training set.
Random Forest Algorithm: Combines random decision trees with bagging to achieve very high
classification accuracy
The benefit is that we get lots of learners that perform slightly differently.
The accuracy of the classification can be improved without complicating the analysis work.
What is Boosting? Discuss with a neat, relevant example. ✅
Boosting in Machine Learning:
Definition:
Boosting is an ensemble learning technique that combines the predictions of multiple weak
learners (typically simple models) to create a strong learner with improved predictive
performance. The key idea behind boosting is to sequentially train weak models, giving more
weight to the instances that were misclassified by the previous models. This adaptive boosting
process aims to correct errors and improve overall accuracy.
Steps in Boosting:
1. Initialization:
Assign equal weights to all training instances.
2. Model Training:
Train a weak learner on the dataset with the current instance weights.
3. Compute Error:
Calculate the error of the weak learner on the training set.
4. Adjust Weights:
Increase the weights of misclassified instances, making them more influential in the next
iteration.
5. Repeat:
Repeat steps 2-4 for a predefined number of iterations or until a specified level of
accuracy is achieved.
6. Combine Models:
Combine the predictions of all weak learners, typically using a weighted sum.
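A minimal sketch of these steps in Python (assumptions beyond the list above: binary labels coded as −1/+1, scikit-learn decision stumps as the weak learners, the standard discrete-AdaBoost coefficient α = ½ ln((1−ε)/ε), and illustrative function names such as adaboost_fit):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=10):
    """Minimal AdaBoost-style loop; y must contain labels -1 and +1."""
    n = len(y)
    w = np.full(n, 1.0 / n)                    # 1. equal weights for all instances
    learners, alphas = [], []
    for _ in range(n_rounds):                  # 5. repeat for a fixed number of rounds
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)       # 2. train a weak learner with current weights
        pred = stump.predict(X)
        err = np.clip(np.sum(w * (pred != y)), 1e-10, 1 - 1e-10)  # 3. weighted error
        alpha = 0.5 * np.log((1 - err) / err)  # coefficient of this weak learner
        w = w * np.exp(-alpha * y * pred)      # 4. increase weights of misclassified instances
        w = w / w.sum()
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas

def adaboost_predict(X, learners, alphas):
    # 6. combine the weak learners with a weighted sum; the sign gives the class
    total = sum(a * m.predict(X) for a, m in zip(alphas, learners))
    return np.sign(total)
```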
Example of Boosting: AdaBoost (Adaptive Boosting):
Let's consider AdaBoost, one of the most popular boosting algorithms, with a simple example:
Problem
Classify whether emails are spam or not based on two features (e.g., number of words and
presence of specific keywords).
Steps:
1. Iteration 1:
Train a weak learner (e.g., a decision stump) on the dataset with equal weights.
Misclassified instances are given higher weights.
2. Iteration 2:
Train another weak learner, giving more weight to the previously misclassified instances.
Adjust weights based on errors.
3. Iteration 3:
Repeat the process for multiple iterations, each time focusing more on instances that
were misclassified in the previous iterations.
4. Final Model:
Combine the predictions of all weak learners with weighted voting.
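A sketch of this spam example with scikit-learn's AdaBoostClassifier (the feature values below are hypothetical; the library's default weak learner is a decision stump, i.e. a depth-1 decision tree):

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

# Hypothetical emails: [number of words, spam keyword present (1) or not (0)]
X = np.array([[120, 1], [300, 0], [45, 1], [500, 0], [80, 1], [250, 0]])
y = np.array([1, 0, 1, 0, 1, 0])   # 1 = spam, 0 = not spam

# Each boosting round re-weights the misclassified emails, as described above,
# and the final prediction is a weighted vote over all the stumps.
clf = AdaBoostClassifier(n_estimators=50, random_state=0)
clf.fit(X, y)

print(clf.predict([[100, 1], [400, 0]]))   # likely: spam, not spam
```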
Benefits of Boosting:
Improved Accuracy: Boosting often results in higher accuracy compared to individual weak
learners.
Robustness: The combined model is typically more robust than any single weak learner (though see the caveat on noisy data below).
Adaptability: The algorithm adapts to misclassifications, focusing on instances that are
challenging to classify.
Considerations:
Overfitting: Boosting can be prone to overfitting, especially if the number of iterations is too
high.
Sensitive to Noisy Data: Boosting may not perform well with noisy data.
Boosting, with algorithms like AdaBoost, is widely used in various domains, including image
recognition, text classification, and bioinformatics, where accurate predictions are crucial, and
the data may be complex and noisy.
Explain the concept of Bagging with its uses. ✅
Bootstrap Aggregating: Involves having each model in the ensemble vote with equal weights
However, to promote model variance, bagging trains each model on a randomly drawn subset of
the training set.
The benefit is that we get lots of learners that perform slightly differently.
The accuracy of the classification can be improved without complicating the analysis work.
Uses
1. Random Forest Algorithm:
Random Forest is a popular algorithm that combines bagging with random decision
trees to achieve high classification accuracy.
Benefit: By introducing randomness in the construction of each decision tree, Random
Forest creates diverse learners, improving the ensemble's overall performance.
2. Diversity for Improved Generalization:
Bagging promotes diversity among base models by training them on randomly drawn
subsets of the training set.
Advantage: The diverse set of learners captures different aspects of the underlying
patterns in the data, enhancing the ensemble's generalization to new, unseen data.
3. Simplicity and Accuracy:
Bagging provides a simple yet effective way to improve the accuracy of classification
models.
Advantage: The ensemble approach aggregates predictions from multiple models,
leading to a more robust and accurate overall prediction without complicating the
analysis.
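A brief sketch with scikit-learn's BaggingClassifier (the data is synthetic and purely illustrative; the library's default base learner is a decision tree):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

# Synthetic data set for illustration
X, y = make_classification(n_samples=300, random_state=0)

# Each of the 25 base learners is trained on a bootstrap sample drawn with
# replacement from the training set; predictions are combined by equal-weight voting.
bagging = BaggingClassifier(n_estimators=25, bootstrap=True, random_state=0)
bagging.fit(X, y)
print(bagging.predict(X[:5]))
```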
What is Random Forest? Explain with an example. ✅
A random forest is an ensemble learning method where multiple decision trees are constructed
and then they are merged to get more accurate predictions
1. The random forest algorithm generates many classification trees. Each tree is generated as
follows:
1. If the number of examples in the training set is N, take a sample of N examples at
random, but with replacement from the original data.
2. The sample will be the set for generating the tree
3. If there are M input variables, a number m is specified such that at each node, m
variables are selected at random out of the M and the best split on these m is used to
split the node.
1. The value of m is held constant during the generation of the various trees in the
forest
4. Each tree is grown to the largest extent possible
2. To classify a new object from an input vector, put the input vector down each of the trees in
the forest. Each tree gives a classification, and the trees vote for the class.
3. The forest then chooses the classification having the most votes over all the trees.
Example
Dataset:
Suppose we have the following training dataset:
| Hours of Study | Previous Exam Score | Outcome |
| --- | --- | --- |
| 2 | 50 | Fail |
| 5 | 80 | Pass |
| 1 | 30 | Fail |
| 6 | 90 | Pass |
| 3 | 60 | Fail |
| 8 | 85 | Pass |
| 7 | 75 | Pass |
| 4 | 45 | Fail |
Steps to Demonstrate Random Forest:
1. Create Decision Trees:
Random Forest builds multiple decision trees. Each tree is trained on a random subset of
the data with replacement (bootstrapping). For simplicity, let's say we build three
decision trees:
Tree 1: Trained on a subset of data: (2, 50), (6, 90), (4, 45), (3, 60), (5, 80)
Tree 2: Trained on a subset of data: (7, 75), (5, 80), (3, 60), (1, 30), (8, 85)
Tree 3: Trained on a subset of data: (2, 50), (1, 30), (7, 75), (4, 45), (6, 90)
2. Make Predictions:
Each tree makes a prediction for each input. For classification, the majority vote is taken.
For regression, the average is calculated.
For simplicity, let's say Tree 1 predicts (Fail, Pass, Fail, Fail, Pass) for the five instances it
was trained on. Similarly, we get predictions from Tree 2 and Tree 3.
3. Majority Voting:
Combining the predictions of all trees, we take the majority vote. For example, if two
trees predict "Pass" and one predicts "Fail," the final prediction is "Pass."
4. Final Prediction:
Based on the majority vote, we determine the final prediction for each instance.
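A sketch of this example with scikit-learn's RandomForestClassifier, using the table above as training data (only three trees, to mirror the walkthrough; real forests use many more, and the query point below is hypothetical):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Training data from the table: [hours of study, previous exam score] -> outcome
X = np.array([[2, 50], [5, 80], [1, 30], [6, 90],
              [3, 60], [8, 85], [7, 75], [4, 45]])
y = np.array(["Fail", "Pass", "Fail", "Pass", "Fail", "Pass", "Pass", "Fail"])

# bootstrap=True: each tree sees a random sample of the rows drawn with replacement;
# max_features controls how many variables (m out of M) are tried at each split.
forest = RandomForestClassifier(n_estimators=3, bootstrap=True,
                                max_features="sqrt", random_state=0)
forest.fit(X, y)

# The trees vote and the majority class is returned.
print(forest.predict([[5, 70]]))   # hypothetical new student
```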
A boosted classifier has the form

F_T(x) = ∑_{t=1}^{T} f_t(x)

where each f_t is a weak learner that takes an object x as input and returns a value indicating the class of the object. For example, in the two-class problem, the sign of the weak learner's output identifies the predicted object class and the absolute value gives the confidence in that classification. Similarly, the t-th classifier's output is positive if the sample is in the positive class and negative otherwise.
Each weak learner produces an output hypothesis h which fixes a prediction h(x_i) for each sample x_i in the training set. At each iteration t, a weak learner is selected and assigned a coefficient α_t such that the total training error E_t of the resulting t-stage boosted classifier is minimized. Here F_{t−1}(x) is the boosted classifier that has been built up to the previous stage of training, and f_t(x) = α_t h(x) is the weak learner that is being considered for addition to the final classifier.
Weighting
At each iteration t of the training process, a weight w_{i,t} is assigned to each sample x_i in the training set, equal to the current error E(F_{t−1}(x_i)) on that sample. These weights can be used in the training of the weak learner; for instance, decision trees can be grown which favor the splitting of sets of samples with large weights.
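A compact summary of these relations in the notation above (a sketch, not a full derivation; E denotes the per-sample error function used to weight the samples):

```latex
\[
\begin{aligned}
F_T(x)  &= \sum_{t=1}^{T} f_t(x), \qquad f_t(x) = \alpha_t h_t(x) \\
F_t(x)  &= F_{t-1}(x) + \alpha_t h_t(x) \\
E_t     &= \sum_i E\bigl(F_{t-1}(x_i) + \alpha_t h_t(x_i)\bigr)
           && \text{total training error minimized at stage } t \\
w_{i,t} &= E\bigl(F_{t-1}(x_i)\bigr)
           && \text{weight of sample } x_i \text{ when training the next weak learner}
\end{aligned}
\]
```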