
The Softmap Algorithm

1998, Neural Processing Letters


Neural Processing Letters 8: 181–192, 1998. © 1998 Kluwer Academic Publishers. Printed in the Netherlands.

STEVEN RAEKELBOOM and MARC M. VAN HULLE
Laboratorium voor Neuro- en Psychofysiologie, K.U. Leuven, Campus Gasthuisberg, Herestraat 49, B-3000 Leuven, Belgium. E-mail: marc@neuro.kuleuven.ac.be

Abstract. A new unsupervised competitive learning rule is introduced, called the Self-organizing free-topology map (Softmap) algorithm, for nonparametric density estimation. The receptive fields of the formal neurons are overlapping, radially-symmetric kernels, the radii of which are adapted to the local input density together with the weight vectors which define the kernel centers. A fuzzy code membership function is introduced in order to encompass, in a novel way, the presence of overlapping receptive fields in the competitive learning scheme. Furthermore, a computationally simple heuristic is introduced for determining the overall degree of smoothness of the resulting density estimate. Finally, the density estimation performance is compared to that of the variable kernel method, VBAR and Kohonen's SOM algorithm.

Key words: nonparametric density estimation, vector quantization, unsupervised competitive learning, variable kernel method

1. Introduction

The Self-Organizing (feature) Map (SOM) algorithm [1, 2] was originally conceived for nonparametric regression analysis, whereby the converged topographic map was intended to capture the principal dimensions of the input space, but it has also been regarded as a discrete, nonparametric model of the input probability density ([2], p. 152). However, even though unsupervised competitive learning (UCL) algorithms such as the SOM were not intended to model the fine structure of the density distribution, there is a more fundamental problem with them: Ritter and Schulten [3] have shown that the converged weight vectors undersample high-probability regions and oversample low-probability regions of the density distribution. In other words, the neuron weights do not perform an equiprobabilistic quantization of the input space. Instead, these learning rules are aimed at minimizing the average distortion due to quantization, usually the mean squared error (MSE). Adding a 'conscience' that makes frequently winning neurons feel 'guilty' and reduces their winning rate [4], rather than adding a neighborhood function as done for the SOM algorithm, may be a good heuristic for achieving an equiprobabilistic quantization; however, it is no more than that since, also in this case, equiprobabilistic quantization is in general not equivalent to MSE minimization [5].

A more direct way to density estimation is to optimize an information-theoretic criterion, instead of a distortion criterion, when adapting the neuron weights. Following this lead, the Vectorial Boundary Adaptation Rule (VBAR) [6, 7, 8] was introduced for developing lattices with regular and fixed topologies, just like the SOM algorithm, but for which a formal guarantee exists that an equiprobabilistic weight distribution is achieved at convergence. A drawback of this rule is that the lattice dimension should match that of the input space in which it is developed. Furthermore, the rule is far too complicated to be practical for building quantizers in higher-dimensional spaces.
In the present article, we will introduce a new unsupervised competitive learning rule, called the Self-organizing free-topology map (Softmap) algorithm, for nonparametric modeling of stationary density distributions of arbitrary dimensionality. To each neuron a kernel is associated, the radius of which is adapted so as to model the local input density (equiprobabilistic quantization). We will also introduce a technique for determining the overall degree of smoothing performed by the kernels; the technique is different from, and more practical than, Parzen's [9, 10].

2. Softmap Algorithm

Consider a network $A$ consisting of $N$ formal neurons, which quantize the input space $V \subseteq \Re^d$, and a set $\mathcal{V} = \{v^m\}$ of $M$ input samples, $M > N$, drawn from a probability density function $f(v)$, with $v \in V$. To each of the $N$ neurons corresponds a circular (or hyperspherical, in general) Receptive Field (RF) region $S_i$ centered at $w_i = [w_{i1}, \ldots, w_{id}]$ and with radius $\sigma_i$ (Figure 1). We associate with each $S_i$ a code membership function

$$\mathbb{1}_i(v) = \begin{cases} 1 & \text{if } v \in S_i \\ 0 & \text{if } v \notin S_i. \end{cases} \qquad (1)$$

Depending on this function, the weights and radii are adapted in an iterative learning scheme in such a way that an equitable distribution of neuron weights is obtained at convergence and for which, in addition, all neurons have an equal probability to be active (equiprobabilistic quantization). This will be done by using two learning rules which form the basis of the Softmap algorithm.

Figure 1. Neurons $i$ and $j$ have circular receptive field regions $S_i$ and $S_j$ (circles drawn with continuous lines) centered at $w_i$ and $w_j$ (small open dots) in 2-dimensional input space $V$. The fuzzy means of the input samples falling in $S_i$ and $S_j$ are $M_i$ and $M_j$, respectively (small filled dots). The arrow indicates the update of $w_i$ given the present fuzzy mean $M_i$ (not to scale). The dashed circles indicate the update of the receptive field regions given their respective distributions of input samples falling inside and outside $S_i$ and $S_j$ (also not to scale). Note that the updated circles should also shift since their centers are updated, but this is not shown for clarity's sake. The shaded area indicates the intersection between regions $S_i$ and $S_j$ (overlap).

The purpose of the radii updates is to achieve an equiprobabilistic quantization in $V$-space: $P_i = \frac{1}{N}$, $\forall i \in A$, with $P_i$ the probability that receptive field region $S_i$ is activated given the training set $\mathcal{V}$. Before deriving the learning rule for the radii, consider first the following reasoning. Assume that, for each neuron, we determine for all $M$ samples of $\mathcal{V}$ the Euclidean distance to the neuron's receptive field center $w_i$, and rank these distances using a sorting algorithm such as Quicksort. We then determine the radius $\sigma_i^{M/N}$ of $S_i$ for which the corresponding code membership function satisfies $\lfloor \tfrac{M}{N} \rfloor \le \sum_{v^m \in \mathcal{V}} \mathbb{1}_i(v^m) \le \lfloor \tfrac{M}{N} \rfloor + 1$, where $\lfloor . \rfloor$ denotes the integer part of its argument, so that we have for $P_i$: $\tfrac{1}{M}\lfloor \tfrac{M}{N} \rfloor \le P_i \le \tfrac{1}{M}\left(\lfloor \tfrac{M}{N} \rfloor + 1\right)$. In an iterative learning scheme, we update the $\sigma_i$'s after each epoch, given the present $\sigma_i$. An equiprobabilistic quantization is achieved by minimizing the energy function*:

$$E_\sigma = \sum_{i=1}^{N} \left| \sigma_i - \sigma_i^{M/N} \right|, \qquad (2)$$

i.e., a positive definite function, with $\sigma_i$ the present radius of neuron $i$ and $\sigma_i^{M/N}$ the desired one. (* From this point on, we shall ignore whether or not $M/N$ is an integer.) The batch update rule is then obtained by performing gradient descent on $E_\sigma$:

$$\Delta\sigma_i = -\eta\,\frac{\partial E_\sigma}{\partial \sigma_i} = -\eta\,\mathrm{sgn}\!\left(\sigma_i - \sigma_i^{M/N}\right), \quad \forall i \in A, \qquad (3)$$

with $\mathrm{sgn}(.)$ the sign function, and $\eta$ the learning rate (a sufficiently small, positive number).
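To make the ranking argument concrete: the target radius $\sigma_i^{M/N}$ is essentially the distance from $w_i$ to its $\lfloor M/N \rfloor$-th nearest training sample. A minimal NumPy sketch of that computation follows; the function name and array layout are our own illustration, not part of the paper.

```python
import numpy as np

def target_radius(w_i, samples, N):
    """Radius sigma_i^(M/N): the receptive field S_i centered at w_i then
    captures about M/N of the M training samples (equiprobable target)."""
    M = len(samples)
    # Euclidean distances from all M samples to the receptive field center,
    # ranked with np.sort (an introsort/quicksort, as in the text).
    dists = np.sort(np.linalg.norm(samples - w_i, axis=1))
    # A radius equal to the floor(M/N)-th smallest distance encloses exactly
    # floor(M/N) samples, so the membership count lies in the stated bounds.
    k = max(M // N, 1)
    return dists[k - 1]
```

Determining this radius for every neuron after every epoch is exactly the sorting effort that the batch rule derived next avoids.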
Since it is cumbersome to determine $\sigma_i^{M/N}$, we consider the following analogy:

$$\mathrm{sgn}\!\left(\sigma_i - \sigma_i^{M/N}\right) = \mathrm{sgn}\!\left(P_i - \frac{1}{N}\right), \qquad (4)$$

which holds except when $P_i = \frac{1}{N}$ while $\sigma_i \ne \sigma_i^{M/N}$. This poses no problem, since that case still represents an equiprobabilistic quantization for neuron $i$: we only have to define the sign function in such a way that $\mathrm{sgn}(0) = 0$, so that no update is performed when this case occurs. Hence, the batch update rule for the radii becomes

$$\Delta\sigma_i = -\eta\,\mathrm{sgn}\!\left(P_i - \frac{1}{N}\right), \quad \forall i \in A. \qquad (5)$$

The receptive field centers $w_i$ should be updated in such a way that an equitable distribution of them is obtained at convergence. This can be achieved by reducing the overlap between the RF regions. To this end, we define a fuzzy code membership function $\Xi_i$ [11]:

$$\Xi_i(v) = \frac{\mathbb{1}_i(v)}{\sum_{k \in A} \mathbb{1}_k(v)}, \quad \forall i \in A, \qquad (6)$$

and use this function to obtain the fuzzy mean of the inputs that activate $S_i$:

$$M_i = \frac{\sum_{v^m \in \mathcal{V}} v^m\, \Xi_i(v^m)}{\sum_{v^m \in \mathcal{V}} \Xi_i(v^m)}, \qquad (7)$$

if the denominator is non-zero, else $w_i$ is left unchanged. This formulation enables us to compile the energy function for the weights:

$$E_w = \sum_{i=1}^{N} \left| M_i - w_i \right|, \qquad (8)$$

with $w_i$ the present RF center and $M_i$ the desired one, given the current distribution of RF regions. The batch update rule for the RF centers becomes

$$\Delta w_i = -\eta\,\frac{\partial E_w}{\partial w_i} = \eta\,\mathrm{Sgn}(M_i - w_i), \quad \forall i \in A, \qquad (9)$$

with $\mathrm{Sgn}(.)$ the sign function taken componentwise. Since an input shared by different neurons will lead to smaller weight updates, there will be a competitive element in the learning process and the RF centers will tend to be 'pulled' apart by the unbalanced weight update strengths (Figure 1). Note that, since in general the RF regions overlap, $\Xi_i$ is a measure for the probability that neuron $i$ belongs to the subset of the activated neurons, rather than the probability that neuron $i$ is active.

The two rules, Equations (5) and (9), define the Softmap algorithm. As is usually the case with learning rates, an 'optimal' value is not always readily available. However, simulations have shown that, by taking them equal, the weights and radii converge in a satisfactory manner. Furthermore, the required training set size $M$ depends on the size of the network $N$. Again empirically, we found that by taking $M = O(10N)$, or higher, satisfactory quantization results are obtained.
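Read operationally, Equations (5)–(9) amount to the following batch epoch. This is a minimal sketch under our own naming and array-layout assumptions, not the authors' reference implementation; in particular, the ordering of the two updates within an epoch and the small floor on the radii are our own choices.

```python
import numpy as np

def softmap_epoch(W, sigma, samples, eta):
    """One batch epoch of the Softmap rules (5) and (9).
    W: (N, d) receptive-field centers, sigma: (N,) radii, samples: (M, d)."""
    N, M = len(W), len(samples)
    # Hard code membership 1_i(v^m): sample m falls inside receptive field i.
    dist = np.linalg.norm(samples[None, :, :] - W[:, None, :], axis=2)  # (N, M)
    member = (dist <= sigma[:, None]).astype(float)
    # Fuzzy membership Xi_i(v^m): share each sample among the fields covering it.
    cover = member.sum(axis=0, keepdims=True)                           # (1, M)
    fuzzy = member / np.maximum(cover, 1.0)
    # Radius update (5): grow the field if P_i < 1/N, shrink it if P_i > 1/N.
    P = member.sum(axis=1) / M
    sigma = np.maximum(sigma - eta * np.sign(P - 1.0 / N), 1e-6)
    # (the floor keeps radii positive; a numerical guard, not part of rule (5))
    # Weight update (9): move each center toward the fuzzy mean of its inputs,
    # componentwise sign, and leave it unchanged if no sample activates it.
    mass = fuzzy.sum(axis=1)
    for i in range(N):
        if mass[i] > 0:
            M_i = fuzzy[i] @ samples / mass[i]
            W[i] = W[i] + eta * np.sign(M_i - W[i])
    return W, sigma
```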
3. Nonparametric Density Estimation

Since the Softmap algorithm is aimed at producing an equiprobabilistic quantization of the input space in terms of the RF regions $S_i$, the volumes of the hyperspheres with radii $\sigma_i$ yield local density estimates at the hypersphere centers $w_i$. For example, for the 2-dimensional case:

$$\hat{f}(w_i) = \frac{1}{N \pi \sigma_i^2}, \quad \forall i \in A. \qquad (10)$$

In order to obtain an estimate at any position $v \in V$, we use the previous estimates $\hat{f}(w_i)$ for adjusting the radii (bandwidths), as done in the variable kernel density estimation technique (VK) of Breiman et al. [12], a widely-used nonparametric density estimation technique [10]. The basic idea is to place unit-volume, radially-symmetric kernels $K$ at a discrete number of 'observations' $w_i$ and to use the initial, local density estimates $\hat{f}(w_i)$ for obtaining the bandwidths $\lambda_i$ of these kernels. The major difference with classic VK is that we only place kernels at a restricted number of sites instead of at every input sample, $N \ll M$, whereas for VK $N \equiv M$ (also with respect to the formulas in this article). Note that, in order to obtain a local density estimate for every input sample, VK relies on another density estimation technique, such as the histogram-based estimator [10]. The kernel-based density estimate then becomes, for the $d$-dimensional case,

$$\hat{f}_h(v) = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{(h\lambda_i)^d}\, K\!\left(\frac{v - w_i}{h\lambda_i}\right), \qquad (11)$$

with $h$ a global smoothing parameter, $K$ the standard normal distribution, and

$$\lambda_i = \left(\frac{g}{\hat{f}(w_i)}\right)^{\alpha}, \qquad g = \left(\prod_{i=1}^{N} \hat{f}(w_i)\right)^{1/N}, \qquad (12)$$

with $g$ the geometric mean, and $\alpha \in [0, 1]$ a sensitivity parameter. Several authors [12, 13] have shown that an empirical and theoretical justification exists for taking $\alpha = \frac{1}{d}$.

Although the smoothing parameter has a large effect on the shape of the density estimate, its choice is not straightforward. Silverman [10] discusses a range of techniques for determining $h$, such as the subjective graph method, the least-squares or likelihood cross-validation method, the test graph method, and so on. For the likelihood cross-validation method, computations easily get out of hand since, for each $h$-estimate, on the order of $M^2$ kernel evaluations have to be performed. Perhaps Parzen's estimation of the density roughness [9] is the most widely-used technique. It is based on the minimization of the MSE at position $v$, for an optimal overall degree of smoothing $h_P(v)$. For example, in the univariate case:

$$h_P(v) = C(K)\, N^{-\frac{1}{5}}\, |f''(v)|^{-\frac{2}{5}}\, f(v)^{\frac{1}{5}}, \qquad (13)$$

with $C(K)$ a constant depending only on the type of kernel used. Unfortunately, this equation depends on the availability of the density estimate as well as its second derivative.

The alternative we propose is a relatively simple and computationally efficient heuristic. It also relies on the minimization of an MSE value, and it also starts from the same premises as the ones which led to Equation (13), but instead of minimizing the MSE by referring to the unknown exact density $f$, we select the overall degree of smoothing in such a way that the MSE between the kernel-based density estimate $\hat{f}_h$, located at the $w_i$'s, and the initial density estimate $\hat{f}$ of Equation (10) (i.e., without using kernels), at the same loci, is minimized:

$$h^* = \arg\min_h \mathrm{MSE}(\hat{f}_h, \hat{f}) = \arg\min_h \frac{1}{N} \sum_{i=1}^{N} \left(\hat{f}_h(w_i) - \hat{f}(w_i)\right)^2. \qquad (14)$$

The underlying hypothesis is that

$$h^* = \arg\min_h \mathrm{MSE}(\hat{f}_h, \hat{f}) \approx h_{opt} = \arg\min_h \mathrm{MSE}(\hat{f}_h, f), \qquad (15)$$

at the observation sites $w_i$, with $h_{opt}$ the theoretically optimal $h$-value. The viability of this hypothesis depends on the quality of the initial density estimates. However, since in our case the initial estimates are expected to be more accurate than those used in VK, because Softmap uses more samples for determining each one of them, the proposed approach seems feasible. We will test this point experimentally in the next section.
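The following is a minimal sketch of Equations (10)–(12) and the search (14) for Gaussian kernels in $d$ dimensions; the function names, the grid of candidate $h$-values and the array layout are our own illustrative assumptions.

```python
import numpy as np

def local_density(sigma, N):
    """Equation (10), 2-dimensional case: initial density estimate at each
    receptive-field center w_i from the converged radius sigma_i."""
    return 1.0 / (N * np.pi * sigma ** 2)

def kernel_estimate(v, W, f_hat, h, alpha=None):
    """Variable-kernel estimate (11)-(12) at the points v (rows), with Gaussian
    kernels at the N centers W and initial estimates f_hat = f^(w_i)."""
    N, d = W.shape
    alpha = 1.0 / d if alpha is None else alpha           # alpha = 1/d [12, 13]
    g = np.exp(np.mean(np.log(f_hat)))                    # geometric mean of f^(w_i)
    lam = (g / f_hat) ** alpha                            # per-kernel bandwidth factors
    hl = h * lam                                          # (N,) effective bandwidths
    diff = v[:, None, :] - W[None, :, :]                  # (P, N, d) displacements
    K = np.exp(-0.5 * np.sum((diff / hl[None, :, None]) ** 2, axis=2)) \
        / (2.0 * np.pi) ** (d / 2.0)                      # standard normal kernel
    return np.mean(K / hl[None, :] ** d, axis=1)          # Equation (11)

def select_h(W, f_hat, h_grid):
    """Heuristic (14): the h minimizing the MSE between the kernel-based
    estimate at the centers w_i and the initial estimates f^(w_i)."""
    mses = [np.mean((kernel_estimate(W, W, f_hat, h) - f_hat) ** 2) for h in h_grid]
    return h_grid[int(np.argmin(mses))]
```

A grid such as `np.arange(1, 81) * 0.00375` would match the step size $\Delta h = 0.00375$ used in the simulations below.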
4. Simulations

As an example density function, we consider a juxtaposition of two bivariate and truncated exponential distributions (Figures 2A and 3A). This distribution is particularly difficult due to the presence of discontinuities and the fast-rising slopes. On the other hand, for reasons of comparison with other methods, we have introduced a bounded support for the original exponential distribution by truncating and renormalizing it. The resulting density function $f$ is, analytically,

$$f(v) = I(v_1)\,\frac{\gamma\beta \exp(-\gamma v_1 - \beta v_2)}{\gamma_1 \beta_1} + \left(1 - I(v_1)\right)\frac{\gamma\beta \exp(-\gamma v_1 - \beta(1 - v_2))}{\gamma_2 \beta_1}, \qquad (16)$$

with $v = [v_1, v_2]$, $\gamma_1 = 2(1 - \exp(-\gamma/2))$, $\gamma_2 = 2(\exp(-\gamma/2) - \exp(-\gamma))$, $\beta_1 = 1 - \exp(-\beta)$, and $I(v_1) = 1$ if $v_1 < 0.5$, else $I(v_1) = 0$. For $\gamma$ and $\beta$, the usual parameters of an exponential distribution along the $v_1$ and $v_2$ axes, we take 3 and 1.5, respectively.

Figure 2. Greyscale representation of density function Equation (16) (A), and distribution of neuron weights (crosses) of a 15 × 15 lattice trained with Kohonen's SOM algorithm, with vanishing and non-vanishing neighborhood range (termed SOM$_0$ and SOM$_1$) (B, C), VBAR (D), and Softmap (E).

Before a nonparametric model of $f$ can be obtained, given a finite set of samples drawn from it, initial density estimates are needed. For the sake of comparison, we first implement classic VK, thus with $N \equiv M$, and determine the initial density estimates by using the histogram-based technique: (1) the unit square $[0, 1]^2$ is discretized by a uniform grid of 40 × 40 quadrilaterals, and (2) the proportion of the $M$ samples falling in each quadrilateral, divided by the quadrilateral's surface area, yields the initial density estimate for each one of the samples falling in that quadrilateral. The initial density estimates are then used for selecting the bandwidths of the $M$ kernels in classic VK. The overall degree of smoothing is determined by using our technique, Equation (14) (step size $\Delta h = 0.00375$). Since VK is computationally intensive, because $M$ kernels have to be evaluated, we increase $M$ until the MSE performance in Equation (14) stabilizes, which occurs for $M$ around 10,000, so that all results shown will be for $M = 10{,}000$ samples (results labeled VK).

We also determine the initial density estimates using Softmap, for $N \ll M$, as well as with two other network-based techniques: VBAR [6, 7, 8] and the SOM algorithm [1, 2]. For all three techniques, we use the same training sets $\mathcal{V}$, consisting of $M = 45{,}000$ samples, for training the lattices sized $N = 15 \times 15$ neurons until $t_{\max} = 200$ epochs have elapsed. (Note that the samples used in classic VK were also drawn from this training set.) In addition, only for Softmap, a lattice sized $N = 10 \times 10$ neurons is also trained, but now by using $M = 20{,}000$ samples. For all three network-based techniques, the neuron weights are initialized on a uniform grid in the unit square $[0, 1]^2$, since Softmap is not topology-preserving; the radii for Softmap are initialized as $\frac{1}{2\sqrt{N}}$. For VBAR and the SOM algorithm, the learning rates are linearly decreased, starting from 0.00035 and 0.001, respectively, until zero is reached at $t_{\max}$. For Softmap, the learning rates of the weights and the radii start at 0.025 and are also linearly decreased.
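For reference, the analytical density of Equation (16), with $\gamma = 3$ and $\beta = 1.5$ as above, can be evaluated directly; it is needed for the $\mathrm{MSE}(\cdot, f)$ error metrics reported below. A small sketch (the function name is ours):

```python
import numpy as np

def f_test(v, gamma=3.0, beta=1.5):
    """Analytical test density of Equation (16) on the unit square:
    a juxtaposition of two truncated bivariate exponential distributions."""
    v1, v2 = v[..., 0], v[..., 1]
    gamma1 = 2.0 * (1.0 - np.exp(-gamma / 2.0))
    gamma2 = 2.0 * (np.exp(-gamma / 2.0) - np.exp(-gamma))
    beta1 = 1.0 - np.exp(-beta)
    left = gamma * beta * np.exp(-gamma * v1 - beta * v2) / (gamma1 * beta1)
    right = gamma * beta * np.exp(-gamma * v1 - beta * (1.0 - v2)) / (gamma2 * beta1)
    return np.where(v1 < 0.5, left, right)   # I(v1) selects the left/right half
```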
Furthermore, only for the SOM algorithm, we use a Gaussian neighborhood function $\Lambda$, expressed in lattice coordinates, and decrease its range $\sigma_\Lambda$ in the following way:

$$\sigma_\Lambda(t) = \max\!\left(\sigma_{\Lambda\min},\ \sigma_{\Lambda 0} \exp\!\left(-2\,\sigma_{\Lambda 0}\,\frac{t}{t_{\max}}\right)\right), \qquad (17)$$

with $t$ the present epoch, $t_{\max}$ the maximum number of epochs, $\sigma_{\Lambda 0}$ the range spanned by the neighborhood function at $t = 0$, with $\sigma_{\Lambda 0} = 6$, and $\sigma_{\Lambda\min}$ the minimum neighborhood range. For reasons to be explained, two implementations of the SOM are considered, namely with $\sigma_{\Lambda\min} = 0$ and $1$, termed SOM$_0$ and SOM$_1$, respectively.

The converged neuron weights of the 15 × 15 lattices are shown in Figures 2B–E. The initial density estimates are obtained as follows. For Softmap, we use the radii as indicated in Equation (10). For VBAR and the SOM algorithm, no such direct density measure is available except for the distribution of the neuron weights themselves, which is difficult to quantify. In order to solve this problem, we determine the surface area of each quadrilateral in the trained lattice and locate the corresponding density estimate at the average of all input samples falling in that lattice quadrilateral. The final density estimates are then obtained, for all three network-based techniques, by locating adaptive kernels at the sites of the initial density estimates; the optimal degree of smoothing performed by these kernels is determined as indicated in Equation (14) ($\Delta h = 0.00375$).

In order to quantify the density estimation performance of these algorithms, we consider the following error metrics (Table I). We consider the MSE between the initial density estimate $\hat{f}$ and the final one $\hat{f}_{h^*}$, i.e. $\mathrm{MSE}(\hat{f}_{h^*}, \hat{f})$, cf. Equation (14), and between the final density estimate and the analytical function $f$ at the same loci as for the final density estimate, $\mathrm{MSE}(\hat{f}_{h^*}, f)$. Furthermore, we also determine the MSE between the final density estimate and $f$ when we would use the optimal degree of smoothing $h_{opt}$ instead of $h^*$, $\mathrm{MSE}(\hat{f}_{h_{opt}}, f)$. The corresponding kernels are evaluated on a uniform 20 × 20 grid within the unit square $[0, 1]^2$, so as to yield the nonparametric models shown in Figures 3B–E, by using the $h^*$-value which yielded the lowest $\mathrm{MSE}(\hat{f}_{h^*}, \hat{f})$ (Table I). For the sake of comparison, we also show the nonparametric model obtained with VK (Figure 3F).

Figure 3. Density function Equation (16) (A), and nonparametric models obtained by allocating adaptive kernels at the weights shown in Figures 2B–E (B–E), and by using the variable kernel method (F). See text for details.
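Putting the pieces together, the Softmap branch of this experiment roughly amounts to the following run. It is an illustrative sketch only: it assumes the helper functions sketched earlier in this article (`softmap_epoch`, `local_density`, `kernel_estimate`, `select_h`, `f_test`), and the rejection sampler and grid initialization are our own reading of the setup, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, t_max, eta0 = 15 * 15, 45000, 200, 0.025

# Draw M training samples from f (Equation (16)) by rejection sampling;
# f attains its maximum at the origin (about 3.73 for gamma = 3, beta = 1.5).
f_max = f_test(np.zeros((1, 2)))[0]
samples = np.empty((0, 2))
while len(samples) < M:
    cand = rng.random((4 * M, 2))
    keep = rng.random(4 * M) * f_max < f_test(cand)
    samples = np.vstack([samples, cand[keep]])
samples = samples[:M]

# Weights initialized on a uniform grid in the unit square, radii at 1/(2*sqrt(N)).
side = int(np.sqrt(N))
g = (np.arange(side) + 0.5) / side
W = np.stack(np.meshgrid(g, g), axis=-1).reshape(-1, 2)
sigma = np.full(N, 1.0 / (2.0 * np.sqrt(N)))

# Train for t_max epochs with a linearly decreasing learning rate.
for t in range(t_max):
    W, sigma = softmap_epoch(W, sigma, samples, eta0 * (1.0 - t / t_max))

f_hat = local_density(sigma, N)                             # Equation (10)
h_star = select_h(W, f_hat, np.arange(1, 81) * 0.00375)     # Equation (14)
fh = kernel_estimate(W, W, f_hat, h_star)                   # Equation (11)
print(h_star, np.mean((fh - f_hat) ** 2), np.mean((fh - f_test(W)) ** 2))
```

The two printed MSE values correspond, up to these implementation assumptions, to the $\mathrm{MSE}(\hat{f}_{h^*}, \hat{f})$ and $\mathrm{MSE}(\hat{f}_{h^*}, f)$ columns of Table I.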
Table I. Density function estimation using four different methods (rows): the variable kernel method (VK), Kohonen's SOM algorithm, VBAR and Softmap. The term between brackets [.] indicates the lattice size. See text for explanation.

method           h*        MSE(f^_h*, f^)   MSE(f^_h*, f)   h_opt     MSE(f^_h_opt, f)
VK               0.01875   0.0135           0.0594          0.03000   0.0413
SOM_0 [15^2]     0.02625   0.0211           0.1635          0.03750   0.1225
SOM_1 [15^2]     0.03000   0.0059           0.1246          0.02625   0.1183
VBAR [15^2]      0.04500   0.1723           0.0688          0.03750   0.0653
Softmap [10^2]   0.11250   0.1311           0.1939          0.11250   0.1939
Softmap [15^2]   0.07125   0.1482           0.1361          0.07500   0.1356

From Table I we observe that VK and VBAR yield superior $\mathrm{MSE}(\hat{f}_{h^*}, f)$ results compared to Softmap and the SOM algorithm (SOM$_0$ and SOM$_1$). On the other hand, Softmap yields a smoother nonparametric model which better captures the overall shape of the density function (Figure 3E). This is even more the case when comparing the density models generated by Softmap and the SOM$_0$ implementation (Figure 3B): due to the occurrence of a series of phase transitions in SOMs with short neighborhood range [14], the quality of the probability density estimate ensuing from the converged lattices is considerably deteriorated [7]. However, if we allow for a non-vanishing neighborhood range, these phase transitions can be avoided, and a smoother weight distribution will be obtained [15]. This is confirmed by the SOM$_1$ result shown in Figure 2C. The corresponding density model (Figure 3C) is now smoother, and the $\mathrm{MSE}(\hat{f}_{h^*}, f)$ improves (Table I), but it suffers from border effects due to the reduced ability of the weight distribution to fully cover the input space (compare Figures 2B and C).

Compared to Softmap, the nonparametric density function obtained with VK is supported by many more kernels but, on the other hand, as mentioned above, its initial density estimates are expected to be less accurate than those obtained by Softmap. When comparing the weight distributions (Figures 2B–E), we observe that VBAR, unlike Softmap, yields initial estimates around the discontinuity region in $f$, which in turn results in lower MSE values. Owing to its heuristic nature, Softmap attracts proportionally more neuron weights to high-density areas. As a result, the final density estimate is not as well defined in the neighborhood of the discontinuities as with VBAR. This results in a higher degree of smoothing $h^*$ (Table I) and, hence, a smoother density model. We also observe that our technique for determining the overall degree of smoothing works well, since for almost all cases $h^*$ is quite close to the optimal value $h_{opt}$ (Table I). The MSE as a function of the overall degree of smoothing is plotted for Softmap in Figure 4.

Figure 4. Mean squared error (MSE) as a function of the overall degree of smoothing $h$, plotted for a 10 × 10 (thin and thick continuous lines) and a 15 × 15 lattice (thin and thick dashed lines) trained with Softmap when 200 epochs have elapsed. The thin continuous and thin dashed lines correspond to the MSE between the final density estimate $\hat{f}_h$ and the theoretical function $f$; the thick continuous and thick dashed lines correspond to the MSE between the final and the initial density estimates, $\hat{f}_h$ and $\hat{f}$, respectively.

Finally, in light of these results, we can conclude that, despite its heuristic nature, Softmap yields an acceptable nonparametric density model. The advantage of Softmap lies in the simplicity of its design, its computational efficiency, and its ability to model stationary density functions of arbitrary dimensionality.

Acknowledgements

M.M.V.H. is a research associate of the Fund for Scientific Research – Flanders (Belgium) and is supported by research grants received from the Fund for Scientific Research (G.0185.96), the National Lottery (9.0185.96), the Research Fund of the K.U. Leuven (F/95/138), the Flemish Regional Ministry of Education (Belgium) (GOA 95/99–06), and the European Commission (ECVnet EP8212).

References

1. T. Kohonen, "Self-organized formation of topologically correct feature maps", Biol. Cybern., Vol. 43, pp. 59–69, 1982.
2. T. Kohonen, Self-Organizing Maps, Springer: Heidelberg, 1995.
3. H. Ritter and K. Schulten, "On the stationary state of Kohonen's self-organizing sensory mapping", Biol. Cybern., Vol. 54, pp. 99–106, 1986.
4. S. Grossberg, "Adaptive pattern classification and universal recoding: I. Parallel development and coding of neural feature detectors", Biol. Cybern., Vol. 23, pp. 121–134, 1976.
5. M.M. Van Hulle and D. Martinez, "On an unsupervised learning rule for scalar quantization following the maximum entropy principle", Neural Computation, Vol. 5, pp. 939–953, 1993.
6. M.M. Van Hulle, "Nonparametric density estimation and -regression achieved with a learning rule for equiprobabilistic topographic map formation", in Proc. IEEE NNSP96, pp. 33–41, Seika, Kyoto, 1996.
7. M.M. Van Hulle, "Topographic map formation by maximizing unconditional entropy: a plausible strategy for 'on-line' unsupervised competitive learning and non-parametric density estimation", IEEE Transactions on Neural Networks, Vol. 7(5), pp. 1299–1305, 1996.
8. M.M. Van Hulle, "Topology-preserving map formation achieved with a purely local unsupervised competitive learning rule", Neural Networks, Vol. 10(3), pp. 431–446, 1997.
9. E. Parzen, "On estimation of a probability density function and mode", Ann. Math. Statist., Vol. 33, pp. 1065–1076, 1962.
10. B.W. Silverman, Density Estimation for Statistics and Data Analysis, Chapman and Hall: London, 1992.
11. L.A. Zadeh, "Fuzzy sets", Information and Control, Vol. 8, pp. 338–353, 1965.
12. L. Breiman, W. Meisel and E. Purcell, "Variable kernel estimates of multivariate densities", Technometrics, Vol. 19, pp. 135–144, 1977.
13. I.S. Abramson, "On bandwidth variation of kernel estimates – a square root law", Ann. Statist., Vol. 10, pp. 1217–1223, 1982.
14. R. Der and M. Herrmann, "Instabilities in self-organized feature maps with short neighborhood range", in M. Verleysen (ed.), Proc. of the European Symposium on Artificial Neural Networks – ESANN'94, pp. 271–276, Brussels, Belgium, 1994.
15. F. Mulier and V. Cherkassky, "Self-organization as an iterative kernel smoothing process", Neural Computation, Vol. 7, pp. 1165–1177, 1995.