Logistic Regression
Department of ECE
University of California, San Diego
La Jolla, 92093
{sic046, yuw176}@ucsd.edu
1 Introduction
where x_ij is the value of the jth feature of the ith training example. The gradient-based update of parameter β_j is
$$\beta_j := \beta_j + \lambda \frac{\partial}{\partial \beta_j}\, \mathrm{LCL} \tag{7}$$
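As a minimal sketch of this update for a single training example (in Python with NumPy, which we assume throughout; the function and variable names are illustrative, not taken from our implementation):

    import numpy as np

    def sga_step(beta, z_i, y_i, lam):
        """One stochastic gradient-ascent step on LCL for a single training example.

        beta: (d+1,) parameter vector; z_i: (d+1,) features with a leading 1;
        y_i: label in {0, 1}; lam: learning rate.
        """
        p_i = 1.0 / (1.0 + np.exp(-np.dot(z_i, beta)))  # p(y_i = 1 | x_i; beta)
        # For logistic regression the partial derivative of LCL for one example
        # is (y_i - p_i) * z_ij, so Equation (7) becomes:
        return beta + lam * (y_i - p_i) * z_i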
Data structure. Training data, validation data, and test data are stored as matrices to speed up computation, because matrix operations can be evaluated quickly in Python.
Every data sample x_i has d dimensions. The conditional probability p(y_i | x_i; β) in Equation 4 can be calculated in matrix form as
$$p(y_i \mid x_i; \beta) = \frac{1}{1 + \exp\left(-(z_i \cdot \beta)\right)} \tag{17}$$
where z_i = [1, x_i1, x_i2, ..., x_id] is the expansion of x_i with an extra leading 1, and β = [β_0, β_1, ..., β_d]^t.
Therefore, the training data is an n_training × (d + 1) matrix, where n_training is the number of training examples. Each row z_i holds the features of the ith example, with an extra 1 in the first column. Similarly, the test data is an n_test × (d + 1) matrix, where n_test is the number of test examples. The labels of the training data form an n_training × 1 matrix, and the labels of the test data form an n_test × 1 matrix.
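A brief sketch of this layout and of the vectorized form of Equation (17), assuming NumPy arrays (names are illustrative):

    import numpy as np

    def build_design_matrix(X):
        """X is an (n, d) feature matrix; return the (n, d+1) matrix Z with a leading column of ones."""
        return np.hstack([np.ones((X.shape[0], 1)), X])

    def predict_proba(Z, beta):
        """Equation (17), evaluated for every row z_i of Z at once."""
        return 1.0 / (1.0 + np.exp(-Z.dot(beta)))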
where mean(x_j) is the mean of the jth feature over all training examples, and std(x_j) is the standard deviation of the jth feature over all training examples.
Note that validation data should not be used when computing these normalization statistics.
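A minimal sketch of this normalization step, computing the statistics on the training examples only and then applying them to every split (assuming NumPy; names are illustrative):

    import numpy as np

    def fit_normalizer(X_train):
        """Per-feature mean and standard deviation, computed on training data only."""
        return X_train.mean(axis=0), X_train.std(axis=0)

    def normalize(X, mean, std):
        """Apply x_j := (x_j - mean(x_j)) / std(x_j) using the training statistics."""
        return (X - mean) / std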
$$\lambda(t) = \frac{2}{t^{1.4}} + \lambda_0 \tag{19}$$
where t is the index of the gradient-based update, λ_0 is the hyperparameter chosen by grid search, and λ(t) is the learning rate used in the tth gradient-based update.
The advantage of a decaying learning rate is that large steps early in training give fast initial progress, while smaller steps later allow the parameters to settle near the optimum instead of oscillating around it.
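A one-line sketch of the schedule in Equation (19) (Python; the function name is illustrative):

    def learning_rate(t, lambda_0):
        """Equation (19): learning rate for the t-th gradient-based update (t >= 1)."""
        return 2.0 / t ** 1.4 + lambda_0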
Batch. The advantage of SGA is that it greatly reduces the computing time needed to evaluate derivatives. However, approximating the gradient with one sample at a time can occasionally give a poor approximation, when the chosen example is not representative. This slows down convergence.
A compromise is to use a batch of m examples to approximate the gradient. In this way, a single outlier will not affect the approximation significantly. The update rule with an m-sized batch is
$$\beta_j := \beta_j + \lambda_t \left[ \frac{1}{m} \sum_{i=1}^{m} \left( y_{t_i} - p_{t_i} \right) x_{t_i j} - 2\mu \beta_j \right] \tag{20}$$
where (t_1, t_2, ..., t_m) are the indices of the training examples chosen to form the batch.
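A sketch of the update in Equation (20), vectorized over all components of β (assuming NumPy; names are illustrative, and the bias β_0 is penalized here too, exactly as the equation is written):

    import numpy as np

    def minibatch_step(beta, Z_batch, y_batch, lam_t, mu):
        """One m-sized-batch gradient-ascent step with the L2 penalty of Equation (20).

        Z_batch: (m, d+1) rows z_{t_i} of the batch; y_batch: (m,) labels in {0, 1};
        lam_t: learning rate lambda(t); mu: penalty strength.
        """
        p = 1.0 / (1.0 + np.exp(-Z_batch.dot(beta)))      # p_{t_i} for every batch example
        grad = Z_batch.T.dot(y_batch - p) / len(y_batch)  # (1/m) * sum_i (y_{t_i} - p_{t_i}) z_{t_i}
        return beta + lam_t * (grad - 2.0 * mu * beta)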
Model averaging. Note that we randomize the data before each model-training process, so the validation set is different every time. Therefore, every model we train contains some random noise. A good way to reduce this noise is to average several models to obtain the final model. The advantage of this approach can also be explained in terms of how efficiently the data is used: a single model is trained on only part of the available training data (in our implementation, 75% of the training data), while by combining several models, more of the training data is used, without compromising the role of the validation set.
The modified model is built from K models trained by SGA. The parameters of the kth model are β_k, k = 1, 2, ..., K. In the new model, for every test example x_i, the probability of its label being 1 is defined as
$$p(y_i \mid x_i; \beta_1, \ldots, \beta_K) = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{1 + \exp\left(-(z_i \cdot \beta_k)\right)} \tag{24}$$
This is the average of the probabilities p(y_i | x_i; β_k) of the K models. The decision rule remains the same.
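A sketch of this averaged prediction (assuming NumPy; betas is a list holding the K trained parameter vectors, and the usual 0.5 threshold stands in for the decision rule):

    import numpy as np

    def predict_proba_averaged(Z, betas):
        """Equation (24): average the K per-model probabilities for every row of Z."""
        probs = [1.0 / (1.0 + np.exp(-Z.dot(beta_k))) for beta_k in betas]
        return np.mean(probs, axis=0)

    def classify(Z, betas, threshold=0.5):
        """Same decision rule as for a single model."""
        return (predict_proba_averaged(Z, betas) >= threshold).astype(int)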
3.4 L-BFGS
Apart from SGA, we also use the L-BFGS optimization method for comparison. We use the open-source SciPy implementation of L-BFGS. We need to supply the loss function, the gradient of the loss function, and the penalty term µ.
The loss function is
$$L = -\sum_{i=1}^{n} \log p(y_i \mid x_i; \beta) + \mu \|\beta\|_2^2 \tag{25}$$
The best penalty term µ is selected from a finite candidate set, which consists of an exponentially growing sequence.
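A sketch of this setup using SciPy's L-BFGS interface; we show scipy.optimize.minimize with method="L-BFGS-B" purely for illustration, and the exact call in our implementation may differ:

    import numpy as np
    from scipy.optimize import minimize

    def fit_lbfgs(Z, y, mu):
        """Minimize Equation (25): negative log conditional likelihood plus mu * ||beta||^2."""
        def loss(beta):
            p = 1.0 / (1.0 + np.exp(-Z.dot(beta)))
            p = np.clip(p, 1e-12, 1.0 - 1e-12)            # avoid log(0)
            return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)) + mu * beta.dot(beta)

        def grad(beta):
            p = 1.0 / (1.0 + np.exp(-Z.dot(beta)))
            return -Z.T.dot(y - p) + 2.0 * mu * beta      # gradient of the loss above

        result = minimize(loss, np.zeros(Z.shape[1]), jac=grad, method="L-BFGS-B")
        return result.x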
4 Design of experiments
4.1 Data
We use the Gender Recognition dataset to train and test the model. The training dataset has 559 examples, 75% of which are used as training data and the remaining 25% as validation data.
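A sketch of this randomized split (assuming NumPy; names are illustrative):

    import numpy as np

    def split_train_validation(Z, y, train_fraction=0.75, rng=None):
        """Shuffle the examples, then split them into training and validation sets."""
        if rng is None:
            rng = np.random.default_rng()
        idx = rng.permutation(len(y))
        n_train = int(train_fraction * len(y))
        return Z[idx[:n_train]], y[idx[:n_train]], Z[idx[n_train:]], y[idx[n_train:]]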
Checking convergence. To verify that the model converges by the end of the training process, we train 5 models with the selected pair (λ_0, µ) and plot the value of the objective function versus the epoch number. The value of the objective function should increase at first and then, beyond a certain point, remain roughly constant or fluctuate around a constant value.
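One plausible form of the curve being plotted, assuming the objective maximized by SGA is the penalized log conditional likelihood (assuming NumPy; names are illustrative):

    import numpy as np

    def objective(Z, y, beta, mu):
        """Penalized LCL: sum of log p(y_i | x_i; beta) minus mu * ||beta||^2."""
        p = 1.0 / (1.0 + np.exp(-Z.dot(beta)))
        p = np.clip(p, 1e-12, 1.0 - 1e-12)
        return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)) - mu * beta.dot(beta)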
4.3 L-BFGS
After deciding all parameters, we finally build the model and test the perfor-
mance of the model on the test data.
For L-BFGS, a single parameter vector β is trained and used as the final model.
For SGA, we use model averaging: the final model is built from 5 models trained by SGA, and this averaged model is tested on the test data.
[Figure 1: computing time (seconds) v.s. batch size 2^n]
Fig. 2. Error bar: Mean error rate on validation set v.s. batch size
Grid search. For the candidate sets λ_0 = 10^k, k ∈ {−5, −4, ..., 0} and µ = 5^l, l ∈ {−7, −6, ..., 1}, we train models for each pair and repeat the whole process 5 times. The average error rate for each pair of parameters is calculated. The error rate follows a very similar trend in each experiment, so here we only show the average error rate, in Figure 3.
The best results on the validation set are obtained when (λ_0, µ) lie in a range of values. We choose (λ_0 = 0.001, µ = 0.04), since it is in the middle of the desired range; in this way, we expect more stable results across random training sets.
[Figure 3: average error rate on validation set as a function of λ_0 (10^−n) and µ (5^−n)]
5.2 L-BFGS
[Figure 4: value of the objective function v.s. epoch number]

Penalty term µ. Figure 5 shows how the error rate on the validation set changes with the penalty term µ. As is shown, the error rate does not change much when µ is smaller than 1, and then goes up when µ becomes too large. The reason is that when µ is too large, the restriction it imposes on the model is so strong that the trained β is too small.
We choose µ = 5^{−2} = 0.04, because it is in the middle of the desired region.
Fig. 5. Error bar: Mean error rate on validation set v.s. µ (L-BFGS)
Space complexity. Training data and test data are stored as matrices. This requires (N + n) × (d + 2) units of space, where N is the number of training examples and n is the number of test examples. The normalization step requires 2 × d units of space to store the means and standard deviations. The randomization step requires N × (d + 2) units to store the temporarily shuffled data. The gradient calculation requires (d + 1) units to store the gradient. The classification step requires n units to store the classification results. Therefore, the space complexity is O((N + n) × d).
time. The choice of validation set may introduce noise, which can be alleviated by model averaging. Batch-based learning may even be faster than pure SGA; it trades gradient accuracy against computational cost.