
Decorrelated Weight Initialization by Backpropagation

Alexander Kovalenko[0000−0002−7194−1874] and Pavel Kordík[0000−0002−7194−1874]

Faculty of Information Technology, Czech Technical University in Prague


Prague, Czech Republic
{alexander.kovalenko, pavel.kordik}@fit.cvut.cz

Abstract. A hybrid, trainable weight initialization method for neural networks is proposed to address training issues caused by weight symmetry. By pre-optimizing randomly initialized weights using backpropagation, the method enhances parameter diversity. This efficient approach, applicable to any neural network architecture, reduces symmetry and decorrelates weights, thus achieving good performance with fewer trainable parameters.

Introduction. The stochastic nature of neural networks' weight initialization [4] imposes certain restrictions on model performance. In particular, it can result in symmetrical activation patterns that limit the network's representational capacity. This is especially critical in resource-constrained models, where such patterns can significantly degrade performance.
Small, resource-constrained neural networks are indispensable in a wide range of applications that require limited computational resources, real-time processing, or online retraining. Furthermore, a constrained number of parameters encourages a neural network to learn general rules rather than merely fit the training dataset. Therefore, given the same performance, a smaller model potentially has better generalization capability.
Weight initialization algorithms are typically non-deterministic and contain an element of randomness, so symmetry in the distribution of weights is a matter of chance. The similarity of the initial weights also influences the network's symmetry, which in turn affects the network's dynamics during training. This is especially crucial in resource-constrained neural networks, where weight redundancy is low. To address this issue, various methodologies have been developed. For instance, orthogonal initialization has been reported to reduce overfitting and improve system stability in recurrent neural networks [6, 8]. Numerous initialization techniques have been developed [2, 5], ranging from random weight initialization and the widely used Kaiming initialization [3] to unsupervised pre-training with stacked autoencoders [1].
In this work, we present an efficient solution to decorrelate weights using backpropagation. The approach is based on a trainable Gram matrix of the model's layer weights. The Gram matrix [7], a specific type of covariance matrix that arises in the context of inner product spaces, has a determinant that provides a measure of the spread, or volume, of multivariate data. Leveraging the fully differentiable nature of neural networks, we employ data-independent weight decorrelation with backpropagation.

Our Contribution. In this work, we propose a hybrid approach to weight initialization that is both stochastic and trainable. This technique is suitable for neural networks of any architecture and activation function. The method is based on pre-optimizing stochastically initialized weights to enhance diversity among the network parameters. It is architecture-agnostic, data-independent, and computationally cheap. As a result, models with asymmetric initialization require far fewer trainable parameters to achieve optimal performance due to the decorrelated weights and lower symmetry.

Decorrelation by Backpropagation. Prior to model training, an annealing process is employed. This process applies a data-independent loss function, backpropagated through the neural network, to decorrelate the weights. We quantify weight diversity D via the cosine distance between the rows of weight matrices in feed-forward layers and the average cosine distance between individual kernels in convolutional layers; weight symmetry is inversely related to D. To reduce this symmetry, a straightforward approach is used that penalizes the Euclidean dot products between individual rows of feed-forward weight matrices and the average Euclidean dot product between convolutional kernels:
$$\frac{1}{D} = \sum_{i}\sum_{j} \left( w_i \cdot w_j \right), \qquad \mathcal{L} = \frac{1}{D} + \alpha\,\overline{W}\left(1 - \sigma(W)\right) \qquad (1)$$
where D is the diversity, $w_i$ is the $i$-th row of the weight matrix of a feed-forward layer, $W$ is the layer weight tensor with mean $\overline{W}$ and standard deviation $\sigma(W)$, and $\alpha$ is a regularization-strength parameter. As seen from the equation above, the first term penalizes weight similarity (symmetry) and therefore decorrelates the weights, while the second term tackles abnormalities in the weight distribution, i.e., it penalizes a weight mean and standard deviation that deviate strongly from those of normally distributed weights. Even though the strength of the distribution penalty can easily be adjusted by changing the value of $\alpha$, all experiments in the present work were performed with $\alpha = 1$.
When the reciprocal diversity and the deviation from normally distributed weights are penalized, it becomes possible to effectively enforce asymmetric initial weights. This contributes to faster and more efficient neural network training, especially when the number of parameters is constrained and the dataset variance is high.
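As a rough illustration, the annealing step can be written in a few lines of PyTorch. The sketch below is not the authors' implementation: it applies the loss of Eq. (1) to every Linear and Conv2d weight (rows for linear layers, flattened kernels for convolutional layers) and pre-optimizes the weights for a fixed number of steps before regular training; the optimizer choice, step count, and learning rate are assumptions.

# Minimal sketch of the data-independent annealing step (assumed hyperparameters).
import torch
import torch.nn as nn

def decorrelation_loss(weight: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """Eq. (1): 1/D = sum_{i,j} w_i . w_j, plus the distribution penalty."""
    w = weight.flatten(1)               # rows for Linear, flattened kernels for Conv2d
    inv_diversity = (w @ w.t()).sum()   # sum of all pairwise dot products
    dist_penalty = alpha * weight.mean() * (1.0 - weight.std())
    return inv_diversity + dist_penalty

def anneal_weights(model: nn.Module, steps: int = 200, lr: float = 1e-2, alpha: float = 1.0) -> None:
    """Pre-optimize randomly initialized weights before normal training (no data needed)."""
    weights = [m.weight for m in model.modules() if isinstance(m, (nn.Linear, nn.Conv2d))]
    optimizer = torch.optim.Adam(weights, lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = sum(decorrelation_loss(w, alpha) for w in weights)
        loss.backward()
        optimizer.step()

The same loop could be restricted to selected layers, and the number of annealing steps traded off against initialization time.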

Decorrelation with a Trainable Gram Matrix at Initialization. The main drawback of the above method is the pairwise comparison of the weights, which becomes expensive when the weight matrix is large. Therefore, here we describe another method that decorrelates the weights by maximizing the logarithm of the determinant of the Gram matrix. The determinant of a covariance matrix is related to the degree of correlation between the variables: if the determinant is close to zero, some of the variables are highly correlated, which implies that the data contain redundant information. This can be formalized as follows.
Let the linear layer weight matrix $W = (a_{ij})_{1 \le i \le m,\, 1 \le j \le n}$ be of size $m \times n$. If the columns of $W$ are linearly dependent, there exists a nonzero vector $\vec{x} \in \mathbb{R}^{n}$ such that $W\vec{x} = 0$. Then, if $m > n$:

$$(W^{\top} W)\,\vec{x} = W^{\top}(W\vec{x}) = W^{\top} \cdot 0 = 0 \qquad (2)$$

alternatively, if $m < n$:

$$(W W^{\top})\,\vec{x} = W(W^{\top}\vec{x}) = W \cdot 0 = 0 \qquad (3)$$


Therefore, in this case, by penalizing the normalized logarithmic determinant of the Gram matrix $G = W^{\top} W$, weight decorrelation can be achieved without pairwise weight comparison. As in the previous case, abnormalities of the weight distribution with respect to $\mathcal{N}(0, 1)$ are penalized:

$$\mathcal{L} = \log\left(\det(G)\right)/n + \alpha\,\overline{W}\left(1 - \sigma(W)\right) \qquad (4)$$

where $G$ is the Gram matrix, $n$ is the dimensionality of the Gram matrix, $W$ is the layer weight tensor, and $\alpha$ is a regularization-strength parameter.
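A corresponding sketch for the Gram-matrix variant is shown below; again, this is an illustrative approximation rather than the published code. Because the text describes maximizing the log-determinant, the sketch minimizes its negative (drop the minus sign to follow Eq. (4) literally), and a small diagonal jitter, which is an added assumption, keeps the log-determinant numerically stable when G is nearly singular.

import torch

def gram_logdet_loss(weight: torch.Tensor, alpha: float = 1.0, eps: float = 1e-6) -> torch.Tensor:
    """Gram-matrix decorrelation loss in the spirit of Eq. (4)."""
    w = weight.flatten(1)                               # rows / flattened kernels
    # Use the smaller of the two Gram matrices (W^T W or W W^T) so the determinant is well defined.
    g = w.t() @ w if w.shape[0] > w.shape[1] else w @ w.t()
    n = g.shape[0]
    g = g + eps * torch.eye(n, device=w.device, dtype=w.dtype)  # numerical jitter (assumption)
    logdet_term = -torch.logdet(g) / n                  # maximize log det => minimize its negative
    dist_penalty = alpha * weight.mean() * (1.0 - weight.std())
    return logdet_term + dist_penalty

This loss can be swapped into the anneal_weights loop above in place of decorrelation_loss, avoiding the explicit pairwise comparison for large weight matrices.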

Accuracy on Benchmark Datasets. We conducted experiments on two standard image datasets, CIFAR10 and CIFAR100, using resource-constrained neural networks with two convolutional layers followed by two linear layers. To investigate the impact of the convolutional layers on the number of Multiply-and-Accumulate (MAC) operations¹, we varied the number of convolutional kernels in the models. The fully connected hidden layers in all models consisted of 128 and 64 neurons. In total, we evaluated three configurations with varying numbers of parameters: (1) 4 and 8 convolutional kernels of size 3 × 3 with default padding, stride = 1, and ReLU activation (408 trainable parameters, 0.204 · 10⁶ MACs); (2) 8 and 16 kernels with the same settings as above (1392 trainable parameters, 0.496 · 10⁶ MACs); (3) 16 and 32 convolutional kernels with the same settings as above (5088 trainable parameters, 1.37 · 10⁶ MACs).
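For concreteness, configuration (1) could look roughly as follows in PyTorch. The pooling layers, padding, and the arrangement of the classification head are assumptions made to obtain a runnable sketch; the paper does not fully specify them.

# Illustrative sketch of configuration (1): 4 and 8 kernels of size 3x3, stride 1, ReLU,
# followed by fully connected layers with 128 and 64 neurons (head layout assumed).
import torch.nn as nn

def make_small_cnn(c1: int = 4, c2: int = 8, num_classes: int = 10) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(3, c1, kernel_size=3, stride=1, padding=1),
        nn.BatchNorm2d(c1),
        nn.ReLU(),
        nn.MaxPool2d(2),                        # 32x32 -> 16x16 (pooling is an assumption)
        nn.Conv2d(c1, c2, kernel_size=3, stride=1, padding=1),
        nn.BatchNorm2d(c2),
        nn.ReLU(),
        nn.MaxPool2d(2),                        # 16x16 -> 8x8
        nn.Flatten(),
        nn.Linear(c2 * 8 * 8, 128),
        nn.ReLU(),
        nn.Linear(128, 64),
        nn.ReLU(),
        nn.Linear(64, num_classes),
    )

Configurations (2) and (3) would correspond to make_small_cnn(8, 16) and make_small_cnn(16, 32), and the annealing step sketched above would be applied to such a model before training, e.g. anneal_weights(make_small_cnn()). MAC counts can be estimated with the flops-counter.pytorch tool cited in the footnote (its ptflops package exposes get_model_complexity_info for this purpose).
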
To demonstrate the robustness of the proposed method, we applied batch normalization after each layer in the model. It is worth mentioning that the proposed method remains applicable alongside batch normalization layers and still improves their performance.
The results are presented in Table 1. The decorrelated weight initialization approach led to a significant relative improvement of up to 11.8% in validation accuracy, which diminished as the model size increased. This phenomenon can be attributed to the ratio of effective parameters: for overparametrized models, the number of effective parameters responsible for correct predictions is sufficient even with symmetrical weight initialization, which is also why pruning methods are effective. However, for resource-limited models, an excessive number of symmetrical weights decreases the number of effective parameters, leading to decreased performance.

¹ https://github.com/sovrasov/flops-counter.pytorch

Model   MACs          Kaiming acc. (%)   Ours acc. (%)   Rel. improvement (%)

CIFAR10
1       0.204 · 10⁶   57.85              62.72           8.4
2       0.496 · 10⁶   62.37              66.87           7.2
3       1.37 · 10⁶    65.16              68.33           4.9

CIFAR100
1       0.204 · 10⁶   28.40              31.76           11.8
2       0.496 · 10⁶   31.24              33.92           11.7
3       1.37 · 10⁶    33.65              35.52           5.6

Table 1. Validation accuracy of the models on the CIFAR10 and CIFAR100 datasets with Kaiming initialization and the proposed (Ours) decorrelated initialization.

References
1. Erhan, D., Courville, A., Bengio, Y., Vincent, P.: Why does unsupervised pre-training help deep learning? In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 201–208. JMLR Workshop and Conference Proceedings (2010)
2. Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256. JMLR Workshop and Conference Proceedings (2010)
3. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
4. Kumar, S.K.: On weight initialization in deep neural networks. arXiv preprint arXiv:1704.08863 (2017)
5. Narkhede, M.V., Bartakke, P.P., Sutaone, M.S.: A review on weight initialization strategies for neural networks. Artificial Intelligence Review 55(1), 291–322 (2022)
6. Rodríguez, P., Gonzalez, J., Cucurull, G., Gonfaus, J.M., Roca, X.: Regularizing CNNs with locally constrained decorrelations. arXiv preprint arXiv:1611.01967 (2016)
7. Sreeram, V., Agathoklis, P.: On the properties of Gram matrix. IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications 41(3), 234–237 (1994)
8. Vorontsov, E., Trabelsi, C., Kadoury, S., Pal, C.: On orthogonality and learning recurrent networks with long term dependencies. In: International Conference on Machine Learning, pp. 3570–3578. PMLR (2017)
