Machine Learning


COMPUTER SCIENCE AND ENGINEERING

CST413 MACHINE LEARNING
Category: PEC    L-T-P: 2-1-0    Credit: 3    Year of Introduction: 2019

Preamble: This course enables the learners to understand the advanced concepts and algorithms
in machine learning. The course covers the standard and most popular supervised learning
algorithms such as linear regression, logistic regression, decision trees, Bayesian learning and the
Naive Bayes algorithm, basic clustering algorithms and classifier performance measures. This
course helps the students to provide machine learning based solutions to real world problems.
Prerequisite: Basic understanding of probability theory, linear algebra and Python Programming

Course Outcomes: After the completion of the course the student will be able to

CO1 Illustrate Machine Learning concepts and basic parameter estimation methods.
(Cognitive Knowledge Level: Apply)
CO2 Demonstrate supervised learning concepts (regression, linear classification).
(Cognitive Knowledge Level: Apply)
CO3 Illustrate the concepts of Multilayer neural network and Support Vector Machine
(Cognitive Knowledge Level: Apply)
CO4 Describe unsupervised learning concepts and dimensionality reduction techniques.
(Cognitive Knowledge Level: Apply)
CO5 Solve real life problems using appropriate machine learning models and evaluate the
performance measures (Cognitive Knowledge Level: Apply)

Mapping of course outcomes with program outcomes

(The CO-PO mapping table, indicating which of PO1-PO12 each of CO1-CO5 maps to, is not reproduced here.)

Abstract POs defined by National Board of Accreditation

PO# Broad PO PO# Broad PO

PO1 Engineering Knowledge PO7 Environment and Sustainability

PO2 Problem Analysis PO8 Ethics

PO3 Design/Development of solutions PO9 Individual and team work

PO4 Conduct investigations of complex problems PO10 Communication

PO5 Modern tool usage PO11 Project Management and Finance

PO6 The Engineer and Society PO12 Life long learning

Assessment Pattern

Bloom's Category   Continuous Assessment Test 1 (%)   Continuous Assessment Test 2 (%)   End Semester Examination Marks (%)

Remember           30                                 30                                 30
Understand         30                                 30                                 30
Apply              40                                 40                                 40
Analyze            -                                  -                                  -
Evaluate           -                                  -                                  -
Create             -                                  -                                  -

Mark Distribution

Total Marks: 150    CIE Marks: 50    ESE Marks: 100    ESE Duration: 3 hours

Continuous Internal Evaluation Pattern:


Attendance: 10 marks
Continuous Assessment Tests (Average of Internal Tests 1 & 2): 25 marks
Continuous Assessment Assignment: 15 marks


Internal Examination Pattern
Each of the two internal examinations has to be conducted out of 50 marks. The first series test shall
preferably be conducted after completing the first half of the syllabus, and the second series test shall
preferably be conducted after completing the remaining part of the syllabus. There will be two
parts: Part A and Part B. Part A contains 5 questions (preferably, 2 questions each from the
completed modules and 1 question from the partly completed module), each carrying 3 marks,
adding up to 15 marks for Part A. Students should answer all questions from Part A. Part
B contains 7 questions (preferably, 3 questions each from the completed modules and 1 question
from the partly completed module), each carrying 7 marks. Out of the 7 questions, a student should
answer any 5.
End Semester Examination Pattern:
There will be two parts: Part A and Part B. Part A contains 10 questions with 2 questions from
each module, each carrying 3 marks. Students should answer all questions. Part B
contains 2 full questions from each module, of which the student should answer any one. Each
question can have a maximum of 2 sub-divisions and carries 14 marks.

Syllabus
Module-1 (Overview of machine learning)
Machine learning paradigms-supervised, semi-supervised, unsupervised, reinforcement learning.
Basics of parameter estimation - maximum likelihood estimation (MLE) and maximum a posteriori
estimation (MAP). Introduction to Bayesian formulation.
Module-2 (Supervised Learning)
Regression - Linear regression with one variable, Linear regression with multiple variables,
solution using gradient descent algorithm and matrix method, basic idea of overfitting in
regression. Linear Methods for Classification- Logistic regression, Naive Bayes, Decision tree
algorithm ID3.
Module-3 (Neural Networks (NN) and Support Vector Machines (SVM))
Perceptron, Neural Network - Multilayer feed forward network, Activation functions (Sigmoid,
ReLU, Tanh), Backpropagation algorithm.

SVM - Introduction, Maximum Margin Classification, Mathematics behind Maximum Margin
Classification, Maximum Margin linear separators, soft margin SVM classifier, non-linear SVM,
Kernels for learning non-linear functions, polynomial kernel, Radial Basis Function (RBF).
Module-4 (Unsupervised Learning)
Clustering - Similarity measures, Hierarchical Agglomerative Clustering, K-means partitional
clustering, Expectation maximization (EM) for soft clustering. Dimensionality reduction –
Principal Component Analysis.
Module-5 (Classification Assessment)
Classification Performance measures - Precision, Recall, Accuracy, F-Measure, Receiver
Operating Characteristic Curve (ROC), Area Under Curve (AUC), Bootstrapping, Cross Validation,
Ensemble methods, Bias-Variance decomposition. Case Study: Develop a classifier for face
detection.
Text Book
1. Ethem Alpaydin, Introduction to Machine Learning, 2nd edition, MIT Press 2010.
2. Mohammed J. Zaki and Wagner Meira, Data Mining and Analysis: Fundamental Concepts
and Algorithms, Cambridge University Press, First South Asia edition, 2016.
3. Jake VanderPlas, Python Data Science Handbook, O'Reilly Media, 2016.
4. Tom Mitchell, Machine Learning, McGraw-Hill, 1997.

Reference Books
1. Christopher Bishop, Neural Networks for Pattern Recognition, Oxford University Press,
1995.
2. Kevin P. Murphy. Machine Learning: A Probabilistic Perspective, MIT Press 2012.
3. Trevor Hastie, Robert Tibshirani, Jerome Friedman, The Elements of Statistical Learning,
Second edition, Springer, 2007.
4. P. Langley, Elements of Machine Learning, Morgan Kaufmann, 1995.
5. Richert and Coelho, Building Machine Learning Systems with Python.
6. Davy Cielen, Arno DB Meysman and Mohamed Ali, Introducing Data Science: Big Data,
Machine Learning, and More, Using Python Tools, Dreamtech Press, 2016.

Course Level Assessment Questions


Course Outcome 1 (CO1):
1. A coin is tossed 100 times and lands heads 62 times. What is the maximum
likelihood estimate for θ, the probability of heads?
2. Suppose data x1, ..., xn are independent and identically distributed, drawn from an
exponential distribution exp(λ). Find the maximum likelihood estimate for λ.
3. Suppose x1, ..., xn are independent and identically distributed (iid) samples from a
distribution with a given density (expression not reproduced here). Find the
maximum likelihood estimate (MLE) for θ.
4. Find the maximum likelihood estimator (MLE) and maximum a posteriori (MAP)
estimator for the mean of a univariate normal distribution. Assume that we have N
samples, x1, ..., xN, independently drawn from a normal distribution with known
variance σ² and unknown mean µ, and that the prior distribution for the mean is
itself a normal distribution with mean ν and variance β². What happens to the MLE
and MAP estimators as the number of samples goes to infinity?
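
The following minimal Python sketch (NumPy only, synthetic data) illustrates question 4 above: the MLE of the mean is the sample mean, and the MAP estimate under the normal prior N(ν, β²) uses the standard closed form; the data values and prior parameters below are illustrative assumptions, not part of the question.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0          # known standard deviation (variance sigma^2)
mu_true = 5.0        # unknown true mean, used only to generate synthetic data
nu, beta = 0.0, 1.0  # prior mean and prior standard deviation (variance beta^2)

for N in (10, 100, 10_000):
    x = rng.normal(mu_true, sigma, size=N)
    mu_mle = x.mean()  # MLE of the mean is the sample mean
    mu_map = (beta**2 * x.sum() + sigma**2 * nu) / (N * beta**2 + sigma**2)
    print(f"N={N:>6}  MLE={mu_mle:.3f}  MAP={mu_map:.3f}")
# As N grows, the prior's influence vanishes and both estimates approach the true mean.
```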

Course Outcome 2(CO2):


1. Suppose that you are asked to perform linear regression to learn the function that
outputs y, given the D-dimensional input x. You are given N independent data
points, and all the D attributes are linearly independent. Assuming that D is
around 100, would you prefer the closed-form solution or gradient descent to
estimate the regressor?

2. Suppose you have a three-class problem where the class label y ∈ {0, 1, 2} and each
training example X has 3 binary attributes X1, X2, X3 ∈ {0, 1}. How many parameters
(probability distributions) do you need to know to classify an example using the
Naive Bayes classifier?
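
A minimal sketch (synthetic data, NumPy assumed) comparing the two options in question 1: for D around 100 the closed-form normal-equation solution is inexpensive, while gradient descent needs a step size and many iterations. The sizes, learning rate and iteration count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 500, 100                       # D around 100, as in the question
X = rng.normal(size=(N, D))
w_true = rng.normal(size=D)
y = X @ w_true + 0.1 * rng.normal(size=N)

# Closed form (normal equation): solve (X^T X) w = X^T y without an explicit inverse.
w_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Batch gradient descent on the mean squared error, for comparison.
w_gd, lr = np.zeros(D), 0.01
for _ in range(2000):
    w_gd -= lr * (2.0 / N) * X.T @ (X @ w_gd - y)

print(np.max(np.abs(w_closed - w_gd)))  # small once gradient descent has converged
```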

Course Outcome 3(CO3):


1. What are support vectors? List any three properties of the support vector
classifier solution.
2. Why do you use kernels to model a projection from attributes into a feature space,
instead of simply projecting the dataset directly?

3. Describe how Support Vector Machines can be extended to make use of kernels.
Illustrate with reference to the Gaussian kernel K(x, y) = e^(−z), where z = (x − y)².
4. Briefly explain one way in which using tanh instead of logistic activations makes
optimization easier.
5. ReLU activation functions are used more often in neural networks than the tanh
activation function. Draw both activation functions and give a) an advantage of the
ReLU function compared to the tanh function, and b) a disadvantage of the ReLU
function compared to the tanh function.
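
A minimal sketch comparing the derivatives of the logistic (sigmoid), tanh and ReLU activations discussed in questions 4 and 5; the printed values illustrate the saturation of sigmoid/tanh and the constant gradient of ReLU for positive inputs. The grid of inputs is an illustrative assumption.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5.0, 5.0, 11)
print("sigmoid'(z):", sigmoid(z) * (1.0 - sigmoid(z)))  # at most 0.25; saturates for large |z|
print("tanh'(z):   ", 1.0 - np.tanh(z) ** 2)            # zero-centred output; derivative at most 1
print("relu'(z):   ", (z > 0).astype(float))            # 1 for z > 0, 0 otherwise; cheap, but zero for z <= 0
```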

Course Outcome 4 (CO4):


1. Which similarity measure could be used to compare feature vectors of two images? Justify
your answer.
2. Illustrate the strengths and weaknesses of the k-means algorithm.
3. Suppose you want to cluster the eight points shown below (data not reproduced
here) using k-means. Assume that k = 3 and that initially the points are assigned to
clusters as follows: C1 = {x1, x2, x3}, C2 = {x4, x5, x6}, C3 = {x7, x8}. Apply the
k-means algorithm until convergence, using the Manhattan distance.
4. Cluster the following eight points representing locations into three clusters: A1(2, 10),
A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9).
Initial cluster centers are: A1(2, 10), A4(5, 8) and A7(1, 2).
The distance function between two points a = (x1, y1) and b = (x2, y2) is defined as D(a, b)
= |x2 – x1| + |y2 – y1|
Use k-Means Algorithm to find the three cluster centers after the second iteration.
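
A minimal sketch for question 4, assuming one common reading of the question: points are assigned with the Manhattan (L1) distance and centres are updated as coordinate-wise means. Only the points and initial centres stated in the question are used.

```python
import numpy as np

points = np.array([[2, 10], [2, 5], [8, 4], [5, 8],
                   [7, 5], [6, 4], [1, 2], [4, 9]], dtype=float)   # A1..A8
centers = np.array([[2, 10], [5, 8], [1, 2]], dtype=float)         # A1, A4, A7

for iteration in range(2):
    # Assign each point to the nearest centre under the Manhattan (L1) distance.
    dists = np.abs(points[:, None, :] - centers[None, :, :]).sum(axis=2)
    labels = dists.argmin(axis=1)
    # Update each centre as the coordinate-wise mean of its assigned points.
    centers = np.array([points[labels == k].mean(axis=0) for k in range(3)])
    print(f"centers after iteration {iteration + 1}:\n{centers}")
```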

Course Outcome 5(CO5):


1. What is ensemble learning? Can ensemble learning using linear classifiers learn
classification of linearly non-separable sets?
2. Describe boosting. What is the relation between boosting and ensemble learning?
3. Classifier A attains 100% accuracy on the training set and 70% accuracy on the test set.
Classifier B attains 70% accuracy on the training set and 75% accuracy on the test set.
Which one is a better classifier? Justify your answer.
4. What are ROC space and ROC curve in machine learning? In ROC space, which points
correspond to perfect prediction, always positive prediction and always negative
prediction? Why?
5. Suppose there are three classifiers A,B and C. The (FPR, TPR) measures of the three
classifiers are as follows – A (0, 1), B (1, 1) , C (1,0.5). Which can be considered as a
perfect classifier? Justify your answer.
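
A minimal sketch for question 5: the three classifiers are plotted as (FPR, TPR) points in ROC space and compared by their distance to the ideal corner (FPR = 0, TPR = 1); classifier A sits exactly at that corner. The distance measure is only an illustrative way to rank the points.

```python
# Each classifier is a point (FPR, TPR) in ROC space; (0, 1) is the perfect corner.
classifiers = {"A": (0.0, 1.0), "B": (1.0, 1.0), "C": (1.0, 0.5)}

for name, (fpr, tpr) in classifiers.items():
    dist = ((fpr - 0.0) ** 2 + (tpr - 1.0) ** 2) ** 0.5   # distance to the ideal point (0, 1)
    print(f"{name}: FPR={fpr}, TPR={tpr}, distance to (0, 1) = {dist:.2f}")
```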

Model Question Paper

QP CODE:

Reg No: _______________

Name: _________________ PAGES : 4

APJ ABDUL KALAM TECHNOLOGICAL UNIVERSITY

SEVENTH SEMESTER B.TECH DEGREE EXAMINATION, MONTH & YEAR

Course Code: CST413

Course Name: Machine Learning

Max. Marks : 100 Duration: 3 Hours

PART A

Answer All Questions. Each Question Carries 3 Marks

1. Define supervised learning. Name special cases of supervised learning depending
on whether the inputs/outputs are categorical or continuous.

2. Differentiate between Maximum Likelihood estimation (MLE) and Maximum a
Posteriori (MAP) estimation.

3. What is overfitting and why is it a problem?

4. Specify the basic principle of gradient descent algorithm.

5. Suppose that you have a linear support vector machine(SVM) binary classifier.
Consider a point that is currently classified correctly, and is far away from the
decision boundary. If you remove the point from the training set, and re-train the
classifier, will the decision boundary change or stay the same? Justify your answer.

6. What is the primary motivation for using the kernel trick in machine learning
algorithms?

7. Expectation maximization (EM) is designed to find a maximum likelihood setting
of the parameters of a model when some of the data is missing. Does the algorithm
converge? If so, do you obtain a locally or globally optimal set of parameters?

8. Illustrate the strengths and weaknesses of the k-means algorithm.

9. Classifier A attains 100% accuracy on the training set and 70% accuracy on the
test set. Classifier B attains 70% accuracy on the training set and 75% accuracy on
the test set. Which one is a better classifier? Justify your answer.

10. How does the bias-variance trade-off affect machine learning algorithms?
(10x3=30)

Part B
(Answer any one question from each module. Each question carries 14 Marks)

11. (a) Suppose that X is a discrete random variable with a given probability (7)
mass function (not reproduced here), where 0 ≤ θ ≤ 1 is a parameter. The
following 10 independent observations were taken from such a distribution:
(3, 0, 2, 1, 3, 2, 1, 0, 2, 1). What is the maximum likelihood estimate of θ?

(b) Suppose you have a three-class problem where the class label y ∈ {0, 1, 2} (7)
and each training example X has 3 binary attributes X1, X2, X3 ∈ {0, 1}.
How many parameters (probability distributions) do you need to know to
classify an example using the Naive Bayes classifier?

OR

12. (a) Consider the geometric distribution, which has p.m.f. P(X = k) = (1 − θ)^(k−1) θ. (7)
Assume that n i.i.d. data points are drawn from that distribution.
i. Write an expression for the log-likelihood of the data as a function of the
parameter θ.
ii. Find the maximum likelihood estimate for θ.
iii. Let θ have a beta prior distribution. What is the posterior distribution of θ?

(b) Find the maximum a posteriori (MAP) estimator for the mean of a univariate (7)
normal distribution. Assume that we have N samples, x1, ..., xN, independently
drawn from a normal distribution with known variance σ² and unknown mean
µ, and that the prior distribution for the mean is itself a normal distribution with
mean ν and variance β².
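
A minimal numerical check for 12(a), using synthetic data: the geometric log-likelihood is maximised at θ = n / Σ kᵢ, i.e. the reciprocal of the sample mean. The true θ, sample size and grid search below are illustrative assumptions; the Beta-posterior part of the question is not covered.

```python
import numpy as np

rng = np.random.default_rng(0)
k = rng.geometric(p=0.3, size=1000)   # synthetic draws; the true theta is 0.3

def log_likelihood(theta, k):
    # l(theta) = sum_i [(k_i - 1) log(1 - theta) + log(theta)]
    return np.sum((k - 1) * np.log(1.0 - theta) + np.log(theta))

grid = np.linspace(0.01, 0.99, 981)
theta_numeric = grid[np.argmax([log_likelihood(t, k) for t in grid])]
theta_closed = 1.0 / k.mean()          # closed-form MLE: n / sum(k)
print(theta_numeric, theta_closed)     # both should be close to 0.3
```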

13. (a) Consider the hypothesis for linear regression hθ(x) = θ0 + θ1x, and the cost (7)
function J(θ0, θ1) = (1/2m) Σ (hθ(x(i)) − y(i))², summed over the m training
examples. Given the following set of training examples (table not reproduced
here), answer the following questions:
1) Find the value of hθ(2) if θ0 = 0 and θ1 = 1.5
2) Find the value of J(0, 1)
3) Suppose the value of J(θ0, θ1) = 0. What can be inferred from this?

(b) Assume we have a classification problem involving 3 classes: professors, (3)
students, and staff members. There are 750 students, 150 staff members
and 100 professors. All professors have blond hair, 50 staff members have
blond hair, and 250 students have blond hair. Compute the information
gain of the test "hair color = blond" that returns true or false.

(c) Explain the significance of regularization. How does Ridge regularization (4)
differ from Lasso regularization?
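
A minimal sketch for 13(b): the information gain of the blond-hair test is the class entropy before the split minus the weighted entropy of the two branches, using only the counts stated in the question.

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# Class counts in the order (students, staff, professors).
everyone  = (750, 150, 100)
blond     = (250, 50, 100)    # test "hair color = blond" returns true
not_blond = (500, 100, 0)     # test returns false

h_before = entropy(everyone)
h_after = (sum(blond) / 1000) * entropy(blond) + (sum(not_blond) / 1000) * entropy(not_blond)
print("information gain:", h_before - h_after)   # roughly 0.14 bits
```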

OR

14. (a) The following dataset (not reproduced here) can be used to train a classifier that (7)
determines whether a given person is likely to own a car or not. There are three
features: education level (primary, secondary, or university); residence (city or
country); gender (female, male).
Use the ID3 algorithm to find the best attribute at the root level of the tree.

(b) Consider a linear regression problem y = w1x + w0, with a training set having (7)
m examples (x1, y1), ..., (xm, ym). Suppose that we wish to minimize the mean
5th degree error (loss function) given by (1/m) Σ (yi − w1xi − w0)⁵, summed over
the m examples.
1. Calculate the gradient with respect to the parameter w1.
2. Write down pseudo-code for on-line gradient descent on w1.
3. Give one reason in favor of on-line gradient descent compared to batch
gradient descent, and one reason in favor of batch over on-line.
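
A minimal sketch of the two computational parts of 14(b): the per-example gradient with respect to w1 and an on-line gradient-descent loop built on it. The tiny dataset and learning rate are illustrative assumptions, and the odd-power loss itself has no lower bound, so the code only demonstrates the update rule.

```python
# For one example (x, y), d/dw1 of (y - w1*x - w0)**5 is -5 * x * (y - w1*x - w0)**4.
def grad_w1(x, y, w1, w0):
    return -5.0 * x * (y - w1 * x - w0) ** 4

def online_gd_w1(data, w1=0.0, w0=0.0, lr=1e-4, epochs=10):
    """On-line gradient descent: update w1 after every single example."""
    for _ in range(epochs):
        for x, y in data:
            w1 -= lr * grad_w1(x, y, w1, w0)
    return w1

# Illustrative run only, on a tiny synthetic set generated from y = 2x.
print(online_gd_w1([(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]))
```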

15. (a) Consider a support vector machine whose input space is 2-D, and the inner (8)
products are computed by means of the kernel K(x, y) = (x·y + 1)² − 1, where
x·y denotes the ordinary inner product. Show that the mapping to feature
space that is implicitly defined by this kernel is the mapping to 5-D given
below (mapping not reproduced here).
(b) Consider a neuron with four inputs, where the weights of the edges connecting (3)
the inputs are 1, 2, 3 and 4. Let the bias of the node be zero and the inputs be
2, 3, 1, 4. If the activation function is linear, f(x) = 2x, compute the output of
the neuron.

(c) Compare the ReLU and Sigmoid activation functions. (3)
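
A minimal sketch for 15(b): the neuron's output is the weighted sum of the inputs plus the bias, passed through the linear activation f(x) = 2x, using only the numbers given in the question.

```python
weights = [1, 2, 3, 4]
inputs  = [2, 3, 1, 4]
bias    = 0

pre_activation = sum(w * x for w, x in zip(weights, inputs)) + bias  # 1*2 + 2*3 + 3*1 + 4*4 = 27
output = 2 * pre_activation                                          # linear activation f(x) = 2x
print(output)                                                        # 54
```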

OR

16. (a) State the mathematical formulation that expresses the Soft Margin SVM as a (10)
constrained optimization problem.

(b) What is the basic idea of the backpropagation algorithm? (4)

17. (a) Suppose that we have the following data (one variable). Use single linkage (8)
Agglomerative clustering to identify the clusters.
Data: (2, 5, 9, 15, 16, 18, 25, 33, 33, 45).

(b) Given two objects represented by the tuples (22, 1, 42, 10) and (20, 0, 36, 8): (6)
(i) Compute the Euclidean distance between the two objects.
(ii) Compute the Manhattan distance between the two objects.
(iii) Compute the Minkowski distance between the two objects, using p = 3
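
A minimal sketch covering both parts of question 17, assuming SciPy is available; cutting the single-linkage dendrogram into three clusters is an illustrative choice, not part of the question.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# 17(b): distances between the two 4-dimensional objects.
a = np.array([22, 1, 42, 10], dtype=float)
b = np.array([20, 0, 36, 8], dtype=float)
print("Euclidean:", np.linalg.norm(a - b))                        # sqrt(45), about 6.71
print("Manhattan:", np.abs(a - b).sum())                          # 11
print("Minkowski (p=3):", (np.abs(a - b) ** 3).sum() ** (1 / 3))  # about 6.15

# 17(a): single-linkage agglomerative clustering of the one-dimensional data.
data = np.array([2, 5, 9, 15, 16, 18, 25, 33, 33, 45], dtype=float).reshape(-1, 1)
Z = linkage(data, method="single")
print(fcluster(Z, t=3, criterion="maxclust"))   # labels when the dendrogram is cut into 3 clusters
```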

OR

18. (a) Suppose that we have the following data: (8)
(2, 0), (1, 2), (2, 2), (3, 2), (2, 3), (3, 3), (2, 4), (3, 4), (4, 4), (3, 5)
Identify the clusters by applying the k-means algorithm, with k = 2. Try using
initial cluster centers that are as far apart as possible.

(b) Describe the EM algorithm for Gaussian Mixtures. (8)
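
A minimal sketch for 18(a), assuming scikit-learn is available; the two mutually farthest data points are used as the initial centres, as the question suggests.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([(2, 0), (1, 2), (2, 2), (3, 2), (2, 3),
              (3, 3), (2, 4), (3, 4), (4, 4), (3, 5)], dtype=float)

# (2, 0) and (3, 5) are the two points farthest apart, so use them as initial centres.
init = np.array([[2, 0], [3, 5]], dtype=float)
km = KMeans(n_clusters=2, init=init, n_init=1).fit(X)
print("labels: ", km.labels_)
print("centres:", km.cluster_centers_)
```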

19. (a) Suppose a dataset of 10000 images contains 9700 cancer-free images, and a (7)
classifier produces the confusion matrix below. Find the precision, recall and
accuracy. Is it a good classifier? Justify.

Actual class \ Predicted class   cancer = yes   cancer = no   Total
cancer = yes                               90            210     300
cancer = no                               140           9560    9700
Total                                     230           9770   10000

(b) What is Principal Component Analysis (PCA)? Which eigenvalue indicates (7)
the direction of largest variance?
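
A minimal sketch for 19(a): reading true/false positives and negatives from the confusion matrix (with "cancer = yes" as the positive class) and computing precision, recall and accuracy.

```python
# Treat "cancer = yes" as the positive class and read the counts off the matrix.
tp, fn = 90, 210      # actual yes: predicted yes / predicted no
fp, tn = 140, 9560    # actual no:  predicted yes / predicted no

precision = tp / (tp + fp)                     # 90 / 230, about 0.391
recall    = tp / (tp + fn)                     # 90 / 300 = 0.300
accuracy  = (tp + tn) / (tp + fp + fn + tn)    # 9650 / 10000 = 0.965
print(precision, recall, accuracy)
# Accuracy is high mainly because the data is heavily imbalanced; the low recall
# (most cancer cases are missed) is why accuracy alone is misleading here.
```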

OR

20. (a) Assume you have a model with a high bias and a low variance. What (6)
are the characteristics of such a model?

(b) What are ROC space and ROC curve in machine learning? In ROC space, (8)
which points correspond to perfect prediction, always positive prediction and
always negative prediction? Why?

Teaching Plan (37 hours)

No      Contents                                                          No. of Lecture Hours

Module-1 (Overview of machine learning) (7 hours)
1.1 Supervised, semi-supervised, unsupervised learning, reinforcement learning (Text Book (TB) 1: Chapter 1) 1 hour
1.2 Maximum likelihood estimation(MLE) (TB 1: Section 4.2) 1 hour
1.3 Maximum likelihood estimation (MLE)- example (TB 1: Section 4.2) 1 hour
1.4 Maximum a posteriori estimation(MAP) (TB 4: Section 6.2) 1 hour
1.5 Maximum a posteriori estimation(MAP)-example (TB 4: Section 6.2) 1 hour
1.6 Bayesian formulation (TB 1: Section 14.1, 14.2) 1 hour
1.7 Bayesian formulation -example (TB 1: Section 14.1, 14.2) 1 hour
Module-2 (Supervised Learning) (7 hours)
2.1 Linear regression with one variable (TB 1: Section 2.6) 1 hour

2.2 Multiple variables, Solution using gradient descent algorithm and matrix method (No derivation required) (TB 1: Section 5.8) 1 hour
2.3 Overfitting in regression, Lasso and Ridge regularization 1 hour
2.4 Logistic regression 1 hour
2.5 Naive Bayes (TB 2: Section 18.2) 1 hour
2.6 Decision trees (TB 2: Chapter 19) 1 hour
2.7 Decision trees- ID3 algorithm (TB 2: Chapter 19) 1 hour
Module-3 (Neural Networks and Support Vector Machines) (9 hours)
3.1 Perceptron, Perceptron Learning 1 hour

3.2 Multilayer Feed forward Network, Activation Functions (Sigmoid, ReLU, Tanh) 1 hour
3.3 Back Propagation Algorithm 1 hour
3.4 Illustrative Example for Back Propagation 1 hour
3.5 Introduction, Maximum Margin Hyperplane, 1 hour
3.6 Mathematics behind Maximum Margin Classification 1 hour
3.7 Formulation of maximum margin hyperplane and solution 1 hour

3.8 Soft margin SVM, Solution of Soft margin SVM 1 hour

3.9 Non-linear SVM, Kernels for learning non-linear functions, Examples - Linear, RBF, Polynomial 1 hour
Module-4 (Unsupervised Learning) (7 hours)

4.1 Similarity measures - Minkowski distance measures (Manhattan, Euclidean), Cosine Similarity 1 hour
4.2 Clustering - Hierarchical Clustering (TB 2: Chapter 14) 1 hour
4.3 K-means partitional clustering (TB 2: Chapter 13) 1 hour
4.4 Expectation maximization (EM) for soft clustering (TB 2: Chapter 13) 1 hour
4.5 Expectation maximization (EM) for soft clustering (TB 2: Chapter 13) 1 hour

4.6 Dimensionality reduction – Principal Component Analysis (TB 1: Section 6.3) 1 hour
4.7 Dimensionality reduction – Principal Component Analysis (TB 1: Section 6.3) 1 hour
Module-5 (Classification Assessment) (7 hours)

5.1 Performance measures - Precision, Recall, Accuracy, F-Measure, ROC, AUC (TB 2: Chapter 22.1) 1 hour
5.2 Bootstrapping, Cross validation 1 hour
5.3 Ensemble methods- bagging, boosting 1 hour
5.4 Bias-Variance decomposition (TB 2: Chapter 22.3) 1 hour
5.5 Bias-Variance decomposition (TB 2: Chapter 22.3) 1 hour

5.6 Face detection (TB 3: Chapter 5, Section Application: A Face Detection Pipeline) 1 hour
5.7 Face detection (TB 3: Chapter 5, Section Application: A Face Detection Pipeline) 1 hour
