CSCE 5063-001: Assignment 2

1 Implementation of SVM Via Gradient Descent
You will use the SVM to predict whether a person does not have diabetes (negative class) or has
diabetes (positive class), where the data come from the National Institute of Diabetes and Digestive
and Kidney Diseases. A total of 8 features will be used, including Preg (number of times
pregnant), Pres (diastolic blood pressure), Mass (body mass index), etc. There are 760 patients
in total in the data.
The dataset is given in the file data.txt, which contains the data description and a matrix with
9 columns and 760 rows. Columns 1-8 are the input features, and column 9 is the class label.
In this assignment, you do not need to perform normalization or training/testing splitting.
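If it helps, loading the matrix can look roughly like the Python sketch below; the number of description lines to skip and the delimiter depend on how data.txt is actually laid out, so treat those settings (and the variable names X, y) as placeholders rather than part of the assignment.

```python
import numpy as np

# Hypothetical loader: data.txt is assumed to hold a description block
# followed by a whitespace-delimited 760 x 9 matrix. Adjust skiprows
# (and delimiter) to match the actual file layout.
N_HEADER_LINES = 0                      # placeholder: number of description lines
data = np.loadtxt("data.txt", skiprows=N_HEADER_LINES)

X = data[:, :8]                         # columns 1-8: the input features
y = data[:, 8]                          # column 9: the class label
# If the labels are stored as 0/1, map them to -1/+1 for the hinge loss:
# y = 2 * y - 1
```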
You will implement the soft margin SVM using different gradient descent methods as we
learned in class. To recap, to estimate the w, b of the soft margin SVM, we can minimize
the cost function:
J(w, b) = \frac{1}{2} \sum_{j=1}^{n} (w_j)^2 + C \sum_{i=1}^{m} \max\left\{ 0,\ 1 - y^{(i)} \left( \sum_{j=1}^{n} w_j x_j^{(i)} + b \right) \right\}.    (1)
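For concreteness, Eq. (1) can be evaluated in a few lines of Python/NumPy. The sketch below is only an illustration: it assumes X is the m×n feature matrix, y holds labels in {−1, +1}, and C is the regularization constant; none of these names are prescribed by the assignment.

```python
import numpy as np

def svm_cost(w, b, X, y, C):
    """Soft-margin SVM cost J(w, b) from Eq. (1).

    Assumes X has shape (m, n) and y has shape (m,) with values in {-1, +1}.
    """
    margins = y * (X @ w + b)               # y^(i) (w . x^(i) + b) for every example
    hinge = np.maximum(0.0, 1.0 - margins)  # max(0, 1 - margin), elementwise
    return 0.5 * np.sum(w ** 2) + C * np.sum(hinge)
```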
In order to minimize the function, we first obtain the gradient with respect to w_j, the j-th element of the vector w, as follows:
\nabla_{w_j} J(w, b) = \frac{\partial J(w, b)}{\partial w_j} = w_j + C \sum_{i=1}^{m} \frac{\partial L(x^{(i)}, y^{(i)})}{\partial w_j},    (2)
where:

\frac{\partial L(x^{(i)}, y^{(i)})}{\partial w_j} =
\begin{cases}
0 & \text{if } y^{(i)} (x^{(i)} \cdot w + b) \ge 1 \\
-y^{(i)} x_j^{(i)} & \text{otherwise.}
\end{cases}
We should also compute the gradient with respect to b, which needs to be derived by yourself.
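The w-part of this gradient can be vectorized as in the sketch below; it reuses the illustrative names from the cost sketch above and deliberately leaves out the b-gradient, since deriving that is part of question (a).

```python
import numpy as np

def svm_grad_w(w, b, X, y, C):
    """Gradient of J(w, b) with respect to w, following Eq. (2).

    The gradient with respect to b is intentionally omitted: deriving it
    is part of question (a).
    """
    margins = y * (X @ w + b)
    # Indicator of examples whose hinge term is active (margin < 1).
    active = (margins < 1).astype(float)
    # sum_i dL/dw = - sum over active examples of y^(i) * x^(i)
    dL_dw = -(X * (active * y)[:, None]).sum(axis=0)
    return w + C * dL_dw
```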
Now, we will implement and compare the following gradient descent techniques.
Algorithm 1: Batch Gradient Descent: Iterate through the entire dataset and update the parameters as follows:

    k = 0;
    while convergence criteria not reached do
        for j = 1, . . . , n do
            Update w_j ← w_j − η ∇_{w_j} J(w, b);
        Update b ← b − η ∇_b J(w, b);
        Update k ← k + 1;
where,
k is the number of iterations,
n is the dimension of w,
η is the learning rate of the gradient descent, and
∇_{w_j} J(w, b) is the value computed from Eq. (2) above, and ∇_b J(w, b) is the value obtained from your answer to question (a) below.
The convergence criterion for the above algorithm is \Delta_{\%cost} < \varepsilon, where

\Delta_{\%cost} = \frac{|J_{k-1}(w, b) - J_k(w, b)| \times 100}{J_{k-1}(w, b)},    (3)
where,
J_k(w, b) is the value of Eq. (1) at the k-th iteration, and
∆%cost is computed at the end of each iteration of the while loop.
For this method, it is recommended to use η = 0.000000001 and ε = 0.04, or you can
adjust these hyperparameters yourself until you obtain reasonable results.
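One possible shape of the batch gradient descent loop is sketched below in Python. It reuses svm_cost and svm_grad_w from the earlier sketches, while grad_b is a hypothetical placeholder standing in for your answer to question (a); the hyperparameter defaults simply mirror the recommendations above.

```python
import numpy as np

def batch_gradient_descent(X, y, C=10.0, eta=1e-9, eps=0.04, max_iters=10000):
    """Batch GD for the soft-margin SVM (Algorithm 1). Illustrative sketch only."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    costs = [svm_cost(w, b, X, y, C)]           # J_0(w, b)

    for k in range(1, max_iters + 1):
        gw = svm_grad_w(w, b, X, y, C)          # Eq. (2), uses w, b from iteration k-1
        gb = grad_b(w, b, X, y, C)              # hypothetical: your answer to question (a)
        w, b = w - eta * gw, b - eta * gb       # simultaneous update with old values
        costs.append(svm_cost(w, b, X, y, C))   # J_k(w, b)

        # Eq. (3): percentage change of the cost between consecutive iterations.
        delta_pct = abs(costs[-2] - costs[-1]) * 100.0 / costs[-2]
        if delta_pct < eps:
            break
    return w, b, costs
```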
Algorithm 2: Stochastic Gradient Descent (SGD): Iterate through the training examples one at a time, updating the parameters after each example using w_j ← w_j − η ∇_{w_j} J^{(i)}(w, b) and b ← b − η ∇_b J^{(i)}(w, b), where,
k is the number of iterations,
m is the number of examples in the training dataset,
n is the dimension of w,
i is the index of the training example currently used for computing the gradient,
η is the learning rate of the gradient descent, and
∇_{w_j} J^{(i)}(w, b) is defined for a single training example as follows:

\nabla_{w_j} J^{(i)}(w, b) = \frac{\partial J^{(i)}(w, b)}{\partial w_j} = w_j + C \frac{\partial L(x^{(i)}, y^{(i)})}{\partial w_j}.

Note that you will also have to derive ∇_b J^{(i)}(w, b), but it should be similar to your solution
to question (a) below.
The convergence criterion here is \Delta_{cost}^{(k)} < \varepsilon, where

\Delta_{cost}^{(k)} = 0.5 \cdot \Delta_{cost}^{(k-1)} + 0.5 \cdot \Delta_{\%cost},
where,
k is the iteration number, and
∆%cost is the same as above (Eq. (3)).
Calculate \Delta_{cost}^{(k)} and \Delta_{\%cost} at the end of each iteration of the while loop.
Initialize \Delta_{cost}^{(0)} = 0, w = 0, b = 0 and compute J_0(w, b) with these values.
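A possible shape of the SGD loop is sketched below, assuming one training example is processed per iteration of the while loop and examples are visited cyclically; grad_b_single is a hypothetical placeholder for your single-example b-gradient, and the η and ε values shown are placeholders, not values prescribed by the assignment.

```python
import numpy as np

def stochastic_gradient_descent(X, y, C=10.0, eta=1e-4, eps=0.001, max_iters=100000):
    """SGD sketch: one training example per iteration of the while loop."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    costs = [svm_cost(w, b, X, y, C)]           # J_0(w, b)
    delta_cost = 0.0                            # Delta_cost^(0) = 0
    i, k = 0, 0

    while k < max_iters:
        xi, yi = X[i], y[i]
        active = 1.0 if yi * (xi @ w + b) < 1 else 0.0
        gw = w + C * (-yi * xi) * active        # single-example gradient w.r.t. w
        gb = grad_b_single(w, b, xi, yi, C)     # hypothetical: your single-example b-gradient
        w, b = w - eta * gw, b - eta * gb
        k += 1
        i = (i + 1) % m                         # cycle through the examples

        costs.append(svm_cost(w, b, X, y, C))
        delta_pct = abs(costs[-2] - costs[-1]) * 100.0 / costs[-2]
        delta_cost = 0.5 * delta_cost + 0.5 * delta_pct
        if delta_cost < eps:
            break
    return w, b, costs
```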
Algorithm 3: Mini-Batch Gradient Descent: Update the parameters using the gradient computed over one batch of batch_size training examples at a time, where,
l is the index of the current batch (starting from l = 0),
η is the learning rate of the gradient descent, and
∇_{w_j} J_l(w, b) is defined for a batch of training examples as follows:
\nabla_{w_j} J_l(w, b) = \frac{\partial J_l(w, b)}{\partial w_j} = w_j + C \sum_{i = l \times batch\_size + 1}^{\min(m,\ (l+1) \times batch\_size)} \frac{\partial L(x^{(i)}, y^{(i)})}{\partial w_j}.
The convergence criterion here is \Delta_{cost}^{(k)} < \varepsilon, where

\Delta_{cost}^{(k)} = 0.5 \cdot \Delta_{cost}^{(k-1)} + 0.5 \cdot \Delta_{\%cost},
where,
k is the iteration number, and
∆%cost is the same as above (Eq. (3)).
Calculate \Delta_{cost}^{(k)} and \Delta_{\%cost} at the end of each iteration of the while loop.
Initialize \Delta_{cost}^{(0)} = 0, w = 0, b = 0 and compute J_0(w, b) with these values.
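The batch-indexed gradient above can be implemented as a small variation of the full-dataset gradient; the sketch below is one way to do it, using 0-based array slicing in place of the 1-based indices in the formula, with batch_size left as a parameter.

```python
import numpy as np

def svm_grad_w_minibatch(w, b, X, y, C, l, batch_size):
    """Gradient of J_l(w, b) w.r.t. w over batch l (0-based slicing)."""
    start = l * batch_size
    end = min(X.shape[0], (l + 1) * batch_size)   # min(m, (l+1) * batch_size)
    Xb, yb = X[start:end], y[start:end]

    margins = yb * (Xb @ w + b)
    active = (margins < 1).astype(float)          # batch examples violating the margin
    dL_dw = -(Xb * (active * yb)[:, None]).sum(axis=0)
    return w + C * dL_dw
```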
Question (a)
Notice that we have not given you the equation for ∇b J(w, b).
Task: What is the expression for ∇_b J(w, b) used in the Batch Gradient Descent algorithm?
(Hint: It should be very similar to ∇wj J(w, b).)
Question (b)
Task: Implement the SVM algorithm for the above mentioned gradient descent techniques.
Use C = 10 for all the techniques. Note: update w in iteration i + 1 using the values
computed in iteration i. Do not update using values computed in the current iteration!
Task: Plot the value of the cost function Jk (w, b) vs. the number of iterations (k). Report
the total time taken for convergence by each of the gradient descent techniques.
The plot should show the curves from all three techniques on the same axes.
For debugging, Batch GD should converge within 100 iterations, SGD within 5,000 iterations,
and Mini-Batch GD somewhere in between.
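For the plot and the timing report, something along the lines of the sketch below is sufficient; time.perf_counter and matplotlib are one reasonable choice, the three training functions are the earlier sketches (minibatch_gradient_descent stands for your own mini-batch loop, which is not shown here), and X, y are assumed to be already loaded.

```python
import time
import matplotlib.pyplot as plt

results = {}
for name, method in [("Batch GD", batch_gradient_descent),
                     ("SGD", stochastic_gradient_descent),
                     ("Mini-Batch GD", minibatch_gradient_descent)]:
    start = time.perf_counter()
    _, _, costs = method(X, y, C=10.0)
    elapsed = time.perf_counter() - start
    results[name] = costs
    print(f"{name}: {len(costs) - 1} iterations, {elapsed:.2f} s to converge")

# All three cost curves on the same plot, as required.
for name, costs in results.items():
    plt.plot(range(len(costs)), costs, label=name)
plt.xlabel("iteration k")
plt.ylabel("J_k(w, b)")
plt.legend()
plt.savefig("cost_vs_iterations.png")
```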
What to submit
2. Plot of Jk (w, b) vs. the number of iterations (k).