ML Module V
MODULE-V
1. Explain Reinforcement Learning
A. Reinforcement learning is an area of machine learning concerned with taking suitable actions to maximize reward in a particular situation. It is employed by various software systems and machines to find the best possible behavior or path to take in a specific situation. Reinforcement learning differs from supervised learning: in supervised learning the training data carries the answer key, so the model is trained with the correct answers themselves, whereas in reinforcement learning there is no answer key and the reinforcement agent decides what to do to perform the given task. In the absence of a labeled training dataset, it is bound to learn from its own experience.
Reinforcement Learning (RL) is the science of decision making: learning the optimal behavior in an environment so as to obtain maximum reward. In RL, data is accumulated by the learning system itself through trial and error; there is no fixed labeled dataset of the kind used in supervised or unsupervised machine learning.
Reinforcement learning uses algorithms that learn from outcomes and decide which action to
take next. After each action, the algorithm receives feedback that helps it determine whether
the choice it made was correct, neutral or incorrect. It is a good technique to use for automated
systems that have to make a lot of small decisions without human guidance.
Reinforcement learning is an autonomous, self-teaching system that essentially learns by trial and error. It performs actions with the aim of maximizing rewards; in other words, it is learning by doing in order to achieve the best outcomes.
Main points in Reinforcement learning –
Input: The input should be an initial state from which the model will start.
Output: There are many possible outputs, as there are a variety of solutions to a particular problem.
Training: The training is based upon the input. The model returns a state, and the user decides whether to reward or punish the model based on its output.
The model continues to learn.
The best solution is decided based on the maximum reward.
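The reward-driven loop described above can be sketched with a toy three-action problem. Everything below (the hidden reward probabilities, the epsilon-greedy strategy, the step counts) is an illustrative assumption, not part of the notes:

```python
import random

# Toy illustration of the reward/punish training loop: an epsilon-greedy
# agent learns which of three actions pays best by trial and error.
random.seed(0)
TRUE_REWARD_PROB = [0.2, 0.5, 0.8]  # hidden from the agent (made-up values)

def pull(action):
    """Environment: returns reward 1 with the action's hidden probability."""
    return 1 if random.random() < TRUE_REWARD_PROB[action] else 0

estimates = [0.0, 0.0, 0.0]  # the agent's running value estimates
counts = [0, 0, 0]
epsilon = 0.1                # fraction of steps spent exploring

for step in range(5000):
    if random.random() < epsilon:
        action = random.randrange(3)              # explore a random action
    else:
        action = estimates.index(max(estimates))  # exploit best-so-far
    reward = pull(action)
    counts[action] += 1
    # Incremental mean: the estimate moves toward the observed rewards.
    estimates[action] += (reward - estimates[action]) / counts[action]

best = estimates.index(max(estimates))  # the "best solution" by max reward
```

After enough trials the agent's estimates rank the actions by expected reward, so the action with the highest true probability is selected as best.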
Types of Reinforcement:
Positive Reinforcement - an event occurring because of a particular behavior strengthens that behavior and increases its frequency.
Negative Reinforcement - a behavior is strengthened because a negative condition is stopped or avoided.
Advantages of Gaussian Mixture Models
There are several advantages to using Gaussian Mixture Models, which are stated below:
Robustness- Gaussian Mixture Models are relatively robust to outliers in the data, as they can accommodate multiple modes ("peaks") in the distribution.
Speed- Gaussian Mixture Models are relatively fast to fit to a dataset, especially when using an efficient optimization algorithm such as the expectation-maximization (EM) algorithm.
Disadvantages of Gaussian Mixture Models
There are a few drawbacks to using Gaussian Mixture Models which are stated below:
Sensitivity To Initialization- Gaussian Mixture Models can be sensitive to the initial values of the model parameters, especially when there are many components in the mixture. This can sometimes lead to convergence at a poor local optimum rather than the true maximum-likelihood solution.
Assumption Of Normality- Gaussian Mixture Models assume that the data are
generated from a mixture of normal distributions, which may not always be the case in
practice. If the data deviate significantly from normality, GMMs may not be the most
appropriate model.
Number Of Components- Choosing the appropriate number of components in a Gaussian Mixture Model can be challenging: adding too many components may overfit the data, while using too few may underfit it. Both extremes make model selection a difficult task.
In the M step, the algorithm determines the parameters that maximize the expected
log-likelihood obtained in the E step, and corresponding model parameters are
updated based on the estimated latent variables.
By iteratively repeating these steps, the EM algorithm seeks to maximize the likelihood of the
observed data. It is commonly used for unsupervised learning tasks, such as clustering, where
latent variables are inferred and has applications in various fields, including machine learning,
computer vision, and natural language processing.
Key Terms in Expectation-Maximization (EM) Algorithm
Some of the most commonly used key terms in the Expectation-Maximization (EM) Algorithm
are as follows:
Latent Variables: Latent variables are unobserved variables in statistical models
that can only be inferred indirectly through their effects on observable variables.
They cannot be directly measured but can be detected by their impact on the
observable variables.
Likelihood: It is the probability of observing the given data given the parameters of
the model. In the EM algorithm, the goal is to find the parameters that maximize the
likelihood.
Log-Likelihood: It is the logarithm of the likelihood function, which measures the
goodness of fit between the observed data and the model. EM algorithm seeks to
maximize the log-likelihood.
1. Initialization:
Initially, a set of initial values of the parameters are considered. A set of
incomplete observed data is given to the system with the assumption that
the observed data comes from a specific model.
2. E-Step (Expectation Step): In this step, we use the observed data in order to
estimate or guess the values of the missing or incomplete data. It is basically used to
update the variables.
Compute the posterior probability or responsibility of each latent
variable given the observed data and current parameter estimates.
Estimate the missing or incomplete data values using the current
parameter estimates.
Compute the log-likelihood of the observed data based on the current
parameter estimates and estimated missing data.
3. M-step (Maximization Step): In this step, we use the complete data generated in
the preceding “Expectation” – step in order to update the values of the parameters.
It is basically used to update the hypothesis.
Update the parameters of the model by maximizing the expected
complete data log-likelihood obtained from the E-step.
This typically involves solving optimization problems to find the
parameter values that maximize the log-likelihood.
The specific optimization technique used depends on the nature of the
problem and the model being used.
4. Convergence: In this step, it is checked whether the values are converging. If they are, stop; otherwise repeat step 2 and step 3, i.e. the "Expectation" and "Maximization" steps, until convergence occurs.
Check for convergence by comparing the change in log-likelihood or the
parameter values between iterations.
If the change is below a predefined threshold, stop and consider the
algorithm converged.
Otherwise, go back to the E-step and repeat the process until
convergence is achieved.
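The four steps above (initialization, E-step, M-step, convergence check) can be sketched for a two-component 1D Gaussian mixture. This is a minimal plain-Python illustration on made-up synthetic data, not a production implementation:

```python
import math
import random

# Minimal EM for a two-component 1D Gaussian mixture, following the
# Initialization / E-step / M-step / Convergence loop described above.
random.seed(1)
# Synthetic data: two well-separated clusters (illustrative only).
data = [random.gauss(0.0, 1.0) for _ in range(200)] + \
       [random.gauss(5.0, 1.0) for _ in range(200)]

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# 1. Initialization: rough starting values for the parameters.
mu = [min(data), max(data)]
var = [1.0, 1.0]
pi = [0.5, 0.5]
prev_ll = -math.inf

for iteration in range(100):
    # 2. E-step: responsibility (posterior probability) of each component
    #    for each point, given the current parameter estimates.
    resp = []
    for x in data:
        p = [pi[k] * normal_pdf(x, mu[k], var[k]) for k in (0, 1)]
        total = p[0] + p[1]
        resp.append([p[0] / total, p[1] / total])
    # 3. M-step: re-estimate weights, means, and variances by maximizing
    #    the expected complete-data log-likelihood (closed form here).
    for k in (0, 1):
        nk = sum(r[k] for r in resp)
        pi[k] = nk / len(data)
        mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
        var[k] = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, data)) / nk
    # 4. Convergence: stop when the log-likelihood barely changes.
    ll = sum(math.log(pi[0] * normal_pdf(x, mu[0], var[0]) +
                      pi[1] * normal_pdf(x, mu[1], var[1])) for x in data)
    if abs(ll - prev_ll) < 1e-6:
        break
    prev_ll = ll
```

On this data the fitted means land near the true cluster centers (0 and 5), and the mixture weights near 0.5 each, illustrating how the E and M steps cooperate.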
Applications of the EM algorithm
It can be used to fill in the missing data in a sample.
It can be used as the basis of unsupervised learning of clusters.
It can be used for the purpose of estimating the parameters of the Hidden Markov
Model (HMM).
It can be used for discovering the values of latent variables.
Advantages of EM algorithm
It is guaranteed that the likelihood will never decrease with each iteration.
The E-step and M-step are often pretty easy for many problems in terms of
implementation.
Solutions to the M-steps often exist in the closed form.
Disadvantages of EM algorithm
It has slow convergence.
Soft clustering, often associated with Gaussian Mixture Models (GMMs), is a probabilistic
approach to clustering where data points are not assigned exclusively to a single cluster, but
rather, each point has a probability distribution over all clusters. GMMs are a type of
probabilistic model that assumes the data is generated by a mixture of several Gaussian
distributions.
Here's a breakdown of how soft clustering works in the context of GMMs:
1. Gaussian Mixture Model (GMM): A GMM represents the data as a combination of
multiple Gaussian distributions. Each Gaussian distribution is associated with a cluster,
and the overall data distribution is a weighted sum of these Gaussian components.
2. Probability Distributions: Instead of assigning each data point to a single cluster,
GMMs provide a probability distribution for each point across all clusters. The
probability that a data point belongs to a particular cluster is given by the weight of the
corresponding Gaussian component.
3. Parameters of the Model:
Means (μ): Represent the center of each Gaussian distribution.
Covariances (Σ): Indicate the shape and orientation of each Gaussian.
Weights (π): Reflect the importance of each Gaussian component in the overall
mixture.
4. Expectation-Maximization (EM) Algorithm:
E-step (Expectation): Calculate the probability that each data point belongs to
each cluster. This step involves computing the responsibility of each cluster for
each data point using Bayes' theorem.
M-step (Maximization): Update the parameters (means, covariances, and
weights) of the Gaussian components based on the responsibilities computed in
the E-step.
5. Soft Assignment: After running the EM algorithm, you obtain a set of probabilities for
each data point across all clusters. These probabilities reflect the likelihood that a point
belongs to each cluster.
The soft clustering property of GMMs is beneficial when the boundaries between clusters are not
well-defined. It allows for a more nuanced representation of the data, capturing the uncertainty or
overlap between clusters. Soft clustering is particularly useful when dealing with complex data
distributions where instances may exhibit characteristics of multiple clusters simultaneously.
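The soft-assignment idea can be sketched with assumed (not fitted) parameters for two 1D components; the means, variances, and weights below are made up for illustration:

```python
import math

# Soft assignment: with fixed parameters for two 1D Gaussian components,
# every point receives a probability per cluster rather than a hard label.
means, variances, weights = [0.0, 4.0], [1.0, 1.0], [0.5, 0.5]

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def responsibilities(x):
    """Posterior probability of each cluster for point x (Bayes' theorem)."""
    joint = [w * normal_pdf(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    total = sum(joint)
    return [j / total for j in joint]

clear_point = responsibilities(0.1)   # close to the first component
border_point = responsibilities(2.0)  # midway between the two components
```

A point near one component gets responsibility close to 1 for that cluster, while the midway point gets roughly 0.5/0.5, which is exactly the "uncertainty between clusters" that hard clustering cannot express.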
The temporal difference algorithm always aims to bring the expected prediction and the new
prediction together, thus matching expectations with reality and gradually increasing the
accuracy of the entire chain of prediction.
Temporal Difference Learning aims to predict a combination of the immediate reward and
its own reward prediction at the next moment in time.
In TD Learning, the training signal for a prediction is a future prediction. This method is a
combination of the Monte Carlo (MC) method and the Dynamic Programming (DP) method.
Monte Carlo methods adjust their estimates only after the final outcome is known, but
temporal difference methods tend to adjust predictions to match later, more accurate,
predictions for the future, well before the final outcome is clear and known. This is essentially a form of bootstrapping.
Temporal difference learning in machine learning got its name from the way it uses changes,
or differences, in predictions over successive time steps for the purpose of driving the
learning process.
The prediction at any particular time step gets updated to bring it nearer to the prediction of
the same quantity at the next time step.
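The update described above reduces to a one-line TD(0) rule: move the current prediction toward the reward plus the discounted prediction at the next step. The two-state chain, rewards, and step size below are made up for illustration:

```python
# TD(0) sketch: each state's value estimate is nudged toward the immediate
# reward plus the (discounted) value estimate of the next state.
alpha, gamma = 0.1, 1.0          # learning rate and discount factor
values = {"A": 0.0, "B": 0.0, "end": 0.0}

# A tiny deterministic episode, repeated many times:
# A -> B with reward 0, then B -> end with reward 1.
for _ in range(1000):
    for s, s_next, r in [("A", "B", 0.0), ("B", "end", 1.0)]:
        # The "temporal difference": new prediction minus old prediction.
        td_error = r + gamma * values[s_next] - values[s]
        values[s] += alpha * td_error
```

The estimate for B is pulled toward the reward of 1, and the estimate for A is then pulled toward B's estimate, so the correction propagates backward through the chain of predictions, exactly the bootstrapping behavior the text describes.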
What are the parameters of temporal difference learning?
The main parameters are the learning rate (alpha), which controls how far each prediction is moved toward its new target; the discount factor (gamma), which weights future rewards relative to immediate ones; and, in TD(lambda), the trace-decay parameter lambda, which blends one-step and multi-step targets.
Key properties of Dynamic Programming:
1. Optimal Substructure:
An optimal solution to the problem can be constructed from optimal solutions to its subproblems.
2. Overlapping Subproblems:
The problem can be broken down into smaller, overlapping subproblems. Solving
these subproblems independently and combining their solutions leads to an
optimal solution for the overall problem.
3. Memoization:
To avoid redundant computations, dynamic programming often involves
memoization, which is a technique of storing and reusing solutions to
subproblems in a table or cache.
4. Bottom-Up or Top-Down Approach:
Dynamic programming solutions can be implemented using either a bottom-up
approach, starting from the smallest subproblems and building up to the overall
problem, or a top-down approach, solving the problem by recursively breaking it
into smaller subproblems.
Applications of Dynamic Programming:
1. Fibonacci Sequence:
One of the classic examples of dynamic programming is computing Fibonacci
numbers efficiently using memoization to avoid redundant calculations.
2. Shortest Path Problems:
Algorithms like Dijkstra's and Floyd-Warshall for finding the shortest paths in
graphs use dynamic programming principles. These algorithms avoid
recomputing paths by storing solutions to subproblems.
3. Longest Common Subsequence (LCS):
Dynamic programming is employed to find the longest common subsequence
between two sequences. This has applications in bioinformatics, text comparison,
and version control systems.
4. Knapsack Problem:
The 0/1 Knapsack problem, where items have weights and values, and the goal is
to maximize the total value without exceeding a given weight limit, can be
efficiently solved using dynamic programming.
5. Matrix Chain Multiplication:
Dynamic programming determines the most efficient order in which to multiply a chain of matrices, minimizing the total number of scalar multiplications.
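The Fibonacci application above can be made concrete. The sketch below shows both the top-down (memoized) and bottom-up styles mentioned earlier:

```python
from functools import lru_cache

# Top-down dynamic programming: each subproblem is solved once and cached,
# turning the naive exponential recursion into linear time.
@lru_cache(maxsize=None)
def fib(n):
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

# Bottom-up variant of the same recurrence: build from the smallest
# subproblems upward, keeping only the last two values.
def fib_bottom_up(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a
```

Both compute the same sequence; the memoized version mirrors the recursive definition directly, while the bottom-up version trades that clarity for constant memory.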
Disadvantages:
1. Classification costs are high.
2. A large amount of memory is required to store the data, and each query involves building a local model from scratch.
Overfitting: Since instance-based learning essentially memorizes the training data, it can
be prone to overfitting. If the training set contains noise or outliers, the model may
perform poorly on new, unseen data.
Lack of Generalization: Instance-based learning may struggle to generalize well to
instances outside the training set. It tends to be more focused on individual instances,
which can limit its ability to capture broader patterns in the data.
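A minimal illustration of these trade-offs is 1-nearest-neighbour classification, the simplest instance-based learner. The toy points and labels below are made up:

```python
# Instance-based learning in its simplest form: 1-nearest-neighbour.
# The "model" is just the stored training set; every query scans all of it
# from scratch, which is why memory use and per-query cost grow with the data.
training_set = [((1.0, 1.0), "red"), ((1.2, 0.8), "red"),
                ((5.0, 5.0), "blue"), ((5.3, 4.9), "blue")]

def classify(point):
    def sq_dist(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
    # Linear scan over all stored instances -- there is no training phase.
    _, label = min(training_set, key=lambda item: sq_dist(item[0], point))
    return label

label = classify((1.1, 0.9))
```

Because the prediction depends entirely on the single closest stored instance, one noisy training point can flip nearby predictions, which is the overfitting risk described above.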
8(a).
8(b).
Given 100 hypothesis functions, each trained with 10^6 samples, what is the lower bound on the probability that there does not exist a hypothesis function with error greater than 0.1?
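One standard way to attack this question, assuming the intended tool is Hoeffding's inequality combined with a union bound (and that each hypothesis's training error is near zero, so true error above epsilon = 0.1 requires a deviation of more than 0.1 over m = 10^6 samples):

```latex
% Hoeffding's inequality for one fixed hypothesis h with m samples:
%   P\big(|\widehat{\mathrm{err}}(h) - \mathrm{err}(h)| > \epsilon\big) \le 2e^{-2\epsilon^2 m}
% Union bound over all |H| = 100 hypotheses, with \epsilon = 0.1, m = 10^6:
P\big(\exists\, h \in H : \mathrm{err}(h) > 0.1\big)
  \;\le\; |H| \cdot 2e^{-2\epsilon^2 m}
  \;=\; 100 \cdot 2e^{-2(0.1)^2 \cdot 10^6}
  \;=\; 200\,e^{-20000}.
```

Under these assumptions the probability that no hypothesis has error greater than 0.1 is at least 1 - 200e^{-20000}, which is vanishingly close to 1. If a one-sided bound is intended instead, the factor of 2 drops and the lower bound becomes 1 - 100e^{-20000}.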