Machine Learning
CODE      COURSE NAME         CATEGORY   L   T   P   CREDIT   YEAR OF INTRODUCTION
CST413    MACHINE LEARNING    PEC        2   1   0   3        2019
Preamble: This course enables the learners to understand the advanced concepts and algorithms
in machine learning. The course covers the standard and most popular supervised learning
algorithms such as linear regression, logistic regression, decision trees, Bayesian learning and the
Naive Bayes algorithm, basic clustering algorithms and classifier performance measures. This
course helps the students to provide machine learning based solutions to real world problems.
Prerequisite: Basic understanding of probability theory, linear algebra and Python Programming
Course Outcomes: After the completion of the course the student will be able to
CO1 Illustrate Machine Learning concepts and basic parameter estimation methods.
(Cognitive Knowledge Level: Apply)
CO2 Demonstrate supervised learning concepts (regression, linear classification).
(Cognitive Knowledge Level: Apply)
CO3 Illustrate the concepts of Multilayer neural network and Support Vector Machine
(Cognitive Knowledge Level: Apply)
CO4 Describe unsupervised learning concepts and dimensionality reduction techniques.
(Cognitive Knowledge Level: Apply)
CO5 Solve real life problems using appropriate machine learning models and evaluate the
performance measures (Cognitive Knowledge Level: Apply)
PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12
CO1
CO2
CO3
CO4
CO5
PO4 - Conduct investigations of complex problems        PO10 - Communication
Assessment Pattern
Bloom's Category    Continuous Assessment Tests          End Semester Examination Marks (%)
                    Test 1 (%)       Test 2 (%)
Remember                30               30                              30
Understand              30               30                              30
Apply                   40               40                              40
Analyze
Evaluate
Create
Mark Distribution
Total Marks    CIE Marks    ESE Marks    ESE Duration
    150            50          100          3 hours
Syllabus
Module-1 (Overview of machine learning)
Machine learning paradigms - supervised, semi-supervised, unsupervised, reinforcement learning.
Basics of parameter estimation - maximum likelihood estimation (MLE) and maximum a posteriori
estimation (MAP). Introduction to Bayesian formulation.
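As an illustrative aside (not part of the prescribed syllabus text), the contrast between MLE and MAP estimation listed above can be sketched in Python for a Bernoulli parameter with a Beta prior; the observations and the Beta(a, b) hyperparameters below are made-up values.

```python
import numpy as np

# Minimal sketch: MLE vs. MAP for a Bernoulli parameter theta.
# The observations and Beta(a, b) hyperparameters are made-up values.
x = np.array([1, 0, 1, 1, 0, 1, 1, 1])     # 8 trials, 6 successes
a, b = 2.0, 2.0                            # assumed Beta prior hyperparameters

n, k = len(x), x.sum()
theta_mle = k / n                          # maximiser of the likelihood
theta_map = (k + a - 1) / (n + a + b - 2)  # mode of the Beta(k + a, n - k + b) posterior

print(f"MLE estimate: {theta_mle:.3f}")    # 0.750
print(f"MAP estimate: {theta_map:.3f}")    # 0.700
```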
Module-2 (Supervised Learning)
Regression - Linear regression with one variable, linear regression with multiple variables,
solution using the gradient descent algorithm and the matrix method, basic idea of overfitting in
regression. Linear Methods for Classification - Logistic regression, Naive Bayes, Decision tree
algorithm ID3.
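A minimal Python sketch of the gradient-descent solution for single-variable linear regression mentioned above; the training data and learning rate are made up for illustration.

```python
import numpy as np

# Sketch: batch gradient descent for h(x) = theta0 + theta1*x,
# minimising J = (1/2m) * sum_i (h(x_i) - y_i)^2.  Data is made up.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 4.1, 6.2, 8.1])

theta0, theta1 = 0.0, 0.0
alpha, m = 0.05, len(x)

for _ in range(2000):
    err = theta0 + theta1 * x - y        # residuals h(x_i) - y_i
    theta0 -= alpha * err.mean()         # dJ/dtheta0 = (1/m) * sum(err)
    theta1 -= alpha * (err * x).mean()   # dJ/dtheta1 = (1/m) * sum(err * x_i)

print(theta0, theta1)                    # slope close to 2, intercept close to 0
```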
Module-3 (Neural Networks (NN) and Support Vector Machines (SVM))
Perceptron, Neural Network - Multilayer feed forward network, Activation functions (Sigmoid,
ReLU, Tanh), Backpropagation algorithm.
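For reference, a short Python sketch of the three activation functions named above, together with the derivatives that the backpropagation algorithm uses (vectorised with NumPy).

```python
import numpy as np

# Sketch of the Sigmoid, Tanh and ReLU activations and the derivatives
# used by the backpropagation algorithm.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def d_tanh(z):                      # tanh itself is np.tanh
    return 1.0 - np.tanh(z) ** 2

def relu(z):
    return np.maximum(0.0, z)

def d_relu(z):
    return (z > 0).astype(float)

z = np.linspace(-3.0, 3.0, 7)
print(sigmoid(z), np.tanh(z), relu(z), sep="\n")
```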
Reference Books
1. Christopher Bishop, Neural Networks for Pattern Recognition, Oxford University Press, 1995.
2. Kevin P. Murphy, Machine Learning: A Probabilistic Perspective, MIT Press, 2012.
3. Trevor Hastie, Robert Tibshirani, Jerome Friedman, The Elements of Statistical Learning, Second Edition, Springer, 2007.
4. P. Langley, Elements of Machine Learning, Morgan Kaufmann, 1995.
5. Richert and Coelho, Building Machine Learning Systems with Python.
6. Davy Cielen, Arno D. B. Meysman and Mohamed Ali, Introducing Data Science: Big Data, Machine Learning, and More, Using Python Tools, Dreamtech Press, 2016.
2. Suppose you have a three class problem where the class label y ∈ {0, 1, 2} and each
training example X has 3 binary attributes X1, X2, X3 ∈ {0, 1}. How many parameters
would you need to estimate to train a Naive Bayes classifier for this problem?
3. Describe how Support Vector Machines can be extended to make use of kernels.
Illustrate with reference to the Gaussian kernel K(x, y) = e^(−z), where z = (x − y)^2.
4. Briefly explain one way in which using tanh instead of logistic activations makes
optimization easier.
5. ReLU activation functions are more commonly used in neural networks than the tanh
activation function. Draw both activation functions and give a) an advantage of the
ReLU function compared to the tanh function, and b) a disadvantage of the ReLU
function compared to the tanh function.
Assume that k = 3 and that initially the points are assigned to clusters as follows:
C1 = {x1, x2, x3}, C2 = {x4, x5, x6}, C3 = {x7, x8}. Apply the k-means algorithm until
convergence, using the Manhattan distance.
4. Cluster the following eight points representing locations into three clusters: A1(2, 10),
A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9).
Initial cluster centers are: A1(2, 10), A4(5, 8) and A7(1, 2).
The distance function between two points a = (x1, y1) and b = (x2, y2) is defined as D(a, b)
= |x2 – x1| + |y2 – y1|
Use k-Means Algorithm to find the three cluster centers after the second iteration.
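A Python sketch of this k-means exercise, assuming the common textbook convention of Manhattan-distance assignment with mean (centroid) updates; it runs the two iterations the question asks for.

```python
import numpy as np

# Sketch for the k-means question above: eight points, Manhattan (L1) distance,
# initial centers A1(2,10), A4(5,8), A7(1,2), two iterations.
points = np.array([[2, 10], [2, 5], [8, 4], [5, 8],
                   [7, 5], [6, 4], [1, 2], [4, 9]], dtype=float)
centers = points[[0, 3, 6]].copy()                  # A1, A4, A7

for _ in range(2):
    # L1 distance from every point to every center -> shape (8, 3)
    d = np.abs(points[:, None, :] - centers[None, :, :]).sum(axis=2)
    labels = d.argmin(axis=1)                       # nearest-center assignment
    for k in range(3):
        centers[k] = points[labels == k].mean(axis=0)   # recompute centers

print(centers)
```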
Model Question Paper
QP CODE:
PART A
5. Suppose that you have a linear support vector machine (SVM) binary classifier.
Consider a point that is currently classified correctly, and is far away from the
decision boundary. If you remove the point from the training set, and re-train the
classifier, will the decision boundary change or stay the same? Justify your answer.
6. Mention the primary motivation for using the kernel trick in machine learning
algorithms.
9. Classifier A attains 100% accuracy on the training set and 70% accuracy on the
test set. Classifier B attains 70% accuracy on the training set and 75% accuracy on
the test set. Which one is the better classifier? Justify your answer.
10. How does the bias-variance trade-off affect machine learning algorithms?
(10x3=30)
Part B
(Answer any one question from each module. Each question carries 14 Marks)
11. (a) Suppose that X is a discrete random variable with the following probability
mass function, where 0 ≤ θ ≤ 1 is a parameter, and that the following 10 independent
observations are drawn from this distribution. (7)
(b) Suppose you have a three class problem where the class label y ∈ {0, 1, 2} and each training example X has 3 binary attributes X1, X2, X3 ∈ {0, 1}. How many parameters would you need to estimate to train a Naive Bayes classifier for this problem? (7)
OR
12. (a) Consider the geometric distribution, which has p.m.f. P(X = k) = (1 − θ)^(k−1) θ. (7)
Assume that n i.i.d. data points are drawn from that distribution.
i. Write an expression for the log-likelihood of the data as a function of the
parameter θ.
ii. Find the maximum likelihood estimate for θ.
iii. Let θ have a Beta prior distribution. What is the posterior distribution of θ?
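A worked sketch for parts (i) and (ii) of 12(a), assuming the observations are k1, ..., kn:

```latex
\ell(\theta) = \sum_{i=1}^{n}\bigl[(k_i - 1)\log(1-\theta) + \log\theta\bigr]
             = n\log\theta + \Bigl(\sum_{i=1}^{n} k_i - n\Bigr)\log(1-\theta),
\qquad
\frac{d\ell}{d\theta} = \frac{n}{\theta} - \frac{\sum_i k_i - n}{1-\theta} = 0
\;\Longrightarrow\;
\hat{\theta}_{\mathrm{MLE}} = \frac{n}{\sum_{i=1}^{n} k_i}.
```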
(b) Find the maximum a posteriori (MAP) estimator for the mean of a univariate (7)
normal distribution. Assume that we have N samples, x1, ..., xN, independently
drawn from a normal distribution with known variance σ² and unknown mean
µ, and that the prior distribution for the mean is itself a normal distribution with
mean ν and variance β².
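A sketch of the standard closed form for this MAP estimate (a Gaussian likelihood with a Gaussian prior gives a Gaussian posterior, whose mode equals its mean):

```latex
\hat{\mu}_{\mathrm{MAP}}
  = \frac{\frac{1}{\sigma^{2}}\sum_{i=1}^{N} x_i + \frac{\nu}{\beta^{2}}}
         {\frac{N}{\sigma^{2}} + \frac{1}{\beta^{2}}}
  = \frac{\beta^{2}\sum_{i=1}^{N} x_i + \sigma^{2}\nu}{N\beta^{2} + \sigma^{2}},
```

which shrinks the sample mean toward the prior mean ν; as N grows it approaches the MLE (the sample mean).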
13. (a) Consider the hypothesis for linear regression h_θ(x) = θ0 + θ1·x, and the cost (7)
function J(θ0, θ1) = (1/2m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))², where m is the number of training
examples. Given the following set of training examples.
(c) Explain the significance of regularization. How does Ridge regularization differ from (4)
Lasso regularization?
OR
14. (a) The following dataset can be used to train a classifier that determines whether (7)
a given person is likely to own a car or not. There are three features: education
level (primary, secondary, or university); residence (city or country); gender
(female, male).
Use the ID3 algorithm to find the best attribute at the root level of the tree.
(b) Consider a linear regression problem y = w1·x + w0, with a training set having (7)
m examples (x1, y1), ..., (xm, ym). Suppose that we wish to minimize the mean
5th-degree error (loss function) given by (1/m) Σ_{i=1}^{m} (y_i − w1·x_i − w0)^5.
1. Calculate the gradient with respect to the parameter w1.
2. Write down pseudo-code for on-line gradient descent on w1.
3. Give one reason in favor of on-line gradient descent compared to batch-
gradient descent, and one reason in favor of batch over on-line.
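A hedged Python sketch for 14(b): part 1 follows from the chain rule, dL/dw1 = −(5/m) Σ_i x_i (y_i − w1·x_i − w0)^4, and part 2 updates w1 after every single example (the data and learning rate below are made up).

```python
import numpy as np

# On-line (stochastic) gradient descent on w1 for the mean 5th-degree loss.
# Per-example gradient w.r.t. w1:
#   d/dw1 (y - w1*x - w0)^5 = -5 * x * (y - w1*x - w0)^4
def online_gd_w1(xs, ys, w0=0.0, w1=0.0, lr=1e-4, epochs=10):
    for _ in range(epochs):
        for x, y in zip(xs, ys):          # one training example at a time
            r = y - w1 * x - w0           # residual for this example
            w1 -= lr * (-5.0 * x * r ** 4)
    return w1

xs = np.array([0.5, 1.0, 1.5, 2.0])       # made-up data
ys = np.array([1.1, 2.0, 3.2, 4.1])
print(online_gd_w1(xs, ys))
```

For part 3, on-line descent is cheap per update and scales to large datasets, while batch descent gives a less noisy, more stable gradient estimate per step.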
15. (a) Consider a support vector machine whose input space is 2-D, and the inner (8)
products are computed by means of the kernel K(x, y) = (x·y + 1)^2 − 1, where
x·y denotes the ordinary inner product. Show that the mapping to feature
space that is implicitly defined by this kernel is the mapping to 5-D given by
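As a derivation sketch for x = (x1, x2) and y = (y1, y2) (the specific 5-D mapping from the original question is not reproduced above), expanding the kernel gives one standard feature map:

```latex
K(x,y) = (x_1y_1 + x_2y_2 + 1)^2 - 1
       = x_1^2y_1^2 + x_2^2y_2^2 + 2x_1x_2\,y_1y_2 + 2x_1y_1 + 2x_2y_2
       = \varphi(x)\cdot\varphi(y),
\qquad
\varphi(x) = \bigl(x_1^2,\; x_2^2,\; \sqrt{2}\,x_1x_2,\; \sqrt{2}\,x_1,\; \sqrt{2}\,x_2\bigr).
```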
(b) Consider a neuron with four inputs, where the weights of the edges connecting the inputs (3)
are 1, 2, 3 and 4. Let the bias of the node be zero and the inputs be 2, 3, 1 and 4. If
the activation function is linear, f(x) = 2x, compute the output of the neuron.
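A quick arithmetic check for 15(b) in Python: the weighted sum is 1·2 + 2·3 + 3·1 + 4·4 = 27, and with f(x) = 2x the output is 54.

```python
weights = [1, 2, 3, 4]
inputs  = [2, 3, 1, 4]
bias = 0

z = sum(w * x for w, x in zip(weights, inputs)) + bias  # weighted sum = 27
output = 2 * z                                          # linear activation f(x) = 2x -> 54
print(z, output)
```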
OR
16. (a) State the mathematical formulation to express Soft Margin as a constrained (10)
optimization problem.
17. (a) Suppose that we have the following data (one variable). Use single-linkage (8)
agglomerative clustering to identify the clusters.
Data: (2, 5, 9, 15, 16, 18, 25, 33, 33, 45).
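A Python sketch of 17(a) using SciPy's hierarchical-clustering routines; cutting the dendrogram into three clusters is an illustrative choice, since the question does not fix the number of clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Single-linkage agglomerative clustering of the one-dimensional data above.
data = np.array([2, 5, 9, 15, 16, 18, 25, 33, 33, 45], dtype=float).reshape(-1, 1)

Z = linkage(data, method="single")                # merge history (dendrogram)
labels = fcluster(Z, t=3, criterion="maxclust")   # cut into 3 clusters (illustrative)
print(labels)
```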
(b) Given two objects represented by the tuples (22, 1, 42, 10) and (20, 0, 36, 8): (6)
(i) Compute the Euclidean distance between the two objects.
(ii) Compute the Manhattan distance between the two objects.
(iii) Compute the Minkowski distance between the two objects, using p = 3
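A short Python check of 17(b); the absolute coordinate differences are (2, 1, 6, 2).

```python
import numpy as np

# Euclidean, Manhattan and Minkowski (p = 3) distances between the two tuples.
a = np.array([22, 1, 42, 10], dtype=float)
b = np.array([20, 0, 36, 8], dtype=float)
d = np.abs(a - b)                         # (2, 1, 6, 2)

euclidean  = np.sqrt((d ** 2).sum())      # sqrt(45)    ~ 6.708
manhattan  = d.sum()                      # 2+1+6+2     = 11
minkowski3 = (d ** 3).sum() ** (1 / 3)    # 233^(1/3)   ~ 6.153
print(euclidean, manhattan, minkowski3)
```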
OR
19. (a) Suppose a dataset of 10000 images contains 9700 cancer-free images and 300 (7)
images showing cancer, with the classifier results summarized in the confusion matrix
below. Find the precision, recall and accuracy. Is it a good classifier? Justify.
Actual class \ Predicted class    cancer = yes    cancer = no    Total
cancer = yes                            90             210          300
cancer = no                            140            9560         9700
Total                                  230            9770        10000
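A short Python check of 19(a), reading the counts off the confusion matrix with cancer = yes as the positive class.

```python
# True/false positives and negatives taken from the table above.
TP, FN = 90, 210        # actual yes: predicted yes / predicted no
FP, TN = 140, 9560      # actual no : predicted yes / predicted no

precision = TP / (TP + FP)                    # 90 / 230     ~ 0.391
recall    = TP / (TP + FN)                    # 90 / 300     = 0.300
accuracy  = (TP + TN) / (TP + FP + FN + TN)   # 9650 / 10000 = 0.965
print(precision, recall, accuracy)
```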
(b) What is Principal Component Analysis (PCA)? Which eigenvalue indicates (7)
the direction of largest variance?
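An illustrative Python sketch of PCA via eigendecomposition of the sample covariance matrix; the data is randomly generated, and the point of the sketch is that the eigenvector paired with the largest eigenvalue gives the direction of largest variance.

```python
import numpy as np

# PCA sketch: centre the data, form the covariance matrix, eigendecompose it.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # made-up data: 100 samples, 3 features
Xc = X - X.mean(axis=0)                       # centre each feature

cov = np.cov(Xc, rowvar=False)                # 3 x 3 sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues in ascending order

first_pc = eigvecs[:, np.argmax(eigvals)]     # direction of largest variance
print(eigvals, first_pc)
```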
OR
20. (a) Assume you have a model with a high bias and a low variance. What (6)
are the characteristics of such a model?
(b) What are ROC space and ROC curve in machine learning? In ROC space, (8)
which points correspond to perfect prediction, always positive prediction and
always negative prediction? Why?
Teaching Plan
No.    Contents    No. of Lecture Hours (37 hrs)
Module -1 (Overview of machine learning) (7 hours)
1.1 Supervised, semi-supervised, unsupervised learning, reinforcement learning (Text Book (TB) 1: Chapter 1) 1 hour
1.2 Maximum likelihood estimation(MLE) (TB 1: Section 4.2) 1 hour
1.3 Maximum likelihood estimation (MLE)- example (TB 1: Section 4.2) 1 hour
1.4 Maximum a posteriori estimation(MAP) (TB 4: Section 6.2) 1 hour
1.5 Maximum a posteriori estimation(MAP)-example (TB 4: Section 6.2) 1 hour
1.6 Bayesian formulation (TB 1: Section 14.1, 14.2) 1 hour
1.7 Bayesian formulation -example (TB 1: Section 14.1, 14.2) 1 hour
Module-2 (Supervised Learning) (7 hours)
2.1 Linear regression with one variable (TB 1: Section 2.6) 1 hour
2.2 Multiple variables, solution using gradient descent algorithm and matrix method (No derivation required) (TB 1: Section 5.8) 1 hour
2.3 Overfitting in regression, Lasso and Ridge regularization 1 hour
2.4 Logistic regression 1 hour
2.5 Naive Bayes (TB 2: Section 18.2) 1 hour
2.6 Decision trees (TB 2: Chapter 19) 1 hour
2.7 Decision trees- ID3 algorithm (TB 2: Chapter 19) 1 hour
Module-3 (Neural Networks and Support Vector Machines) (9 hours)
3.1 Perceptron, Perceptron Learning 1 hour
3.2 Multilayer Feed forward Network, Activation Functions (Sigmoid, ReLU, Tanh) 1 hour
3.3 Back Propagation Algorithm 1 hour
3.4 Illustrative Example for Back Propagation 1 hour
3.5 Introduction, Maximum Margin Hyperplane, 1 hour
3.6 Mathematics behind Maximum Margin Classification 1 hour
3.7 Formulation of maximum margin hyperplane and solution 1 hour
3.9 Non-linear SVM, Kernels for learning non-linear functions, Examples - Linear, RBF, Polynomial 1 hour
Module-4 (Unsupervised Learning) (7 hours)
5.6 Face detection (TB 3: Chapter 5, Section Application: A Face Detection Pipeline) 1 hour
5.7 Face detection (TB 3: Chapter 5, Section Application: A Face Detection Pipeline) 1 hour