Kohonen SOM with R and Tanagra
1 Introduction
Representation of data using a Kohonen map, followed by a cluster analysis, with the R software (kohonen package) and Tanagra (KOHONEN-SOM component).
This tutorial complements the course material on the Kohonen map, or self-organizing map ([SOM 1], June 2017). First, we highlight two important aspects of the approach: its ability to summarize the available information in a two-dimensional space; and its combination with a cluster analysis method, which connects the topological representation (and the reading one can make of it) to the interpretation of the groups obtained from the clustering algorithm. We use the R software and the “kohonen” package (Wehrens and Buydens, 2007). Second, we carry out a comparative study of the quality of the partitioning against the one obtained with the K-means algorithm. We use an external evaluation, i.e. we compare the clustering results with pre-established classes. This procedure is often used in research to evaluate the performance of clustering methods. It is meaningful when applied to artificial data where the true class memberships are known. We use the K-MEANS and KOHONEN-SOM components of Tanagra.
This tutorial is based on Shane Lynn's article on the R-bloggers website (Lynn, 2014). I have completed it by introducing the intermediate calculations, which clarify the meaning of the charts, and by conducting the comparative study.
We use the famous WAVEFORM dataset (Breiman et al., 1984), which is available in the UCI repository1. We have 21 descriptors. Two modifications are introduced in this part of the tutorial: because we are in an unsupervised learning task, we have removed the class attribute; and we have drawn a random sample of 1500 cases from the 5000 available instances. This dataset is especially interesting because (1) we know the true number of clusters, and (2) we also know the relevant variables (the first and last variables play no role in the discrimination between the classes).
1 https://archive.ics.uci.edu/ml/datasets/Waveform+Database+Generator+(Version+1)
Configuration and launching of the learning process. We must first install and load the “kohonen” package before starting the learning process with the som() function.
#kohonen library
library(kohonen)
#SOM on a 15 x 10 hexagonal grid (seed fixed for reproducibility)
set.seed(100)
carte <- som(Z, grid = somgrid(15, 10, "hexagonal"))
We requested a hexagonal grid of size 15 x 10. For this kind of grid, a circular neighborhood shape is used (Figure 1).
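The matrix Z contains the standardized descriptors. As a minimal sketch, assuming the sampled dataset was loaded into a data frame named D (the file name below is hypothetical), Z can be obtained with scale():
#hypothetical preparation of Z: import the sample, then standardize the 21 descriptors
D <- read.table("waveform_sample.txt", header = TRUE, sep = "\t")
Z <- scale(D)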
Structure of the grid. We must access the properties of the object to obtain details about the structure of the grid.
#structure of the grid
print(carte$grid)
$xdim
[1] 15
$ydim
[1] 10
$topo
[1] "hexagonal"
$n.hood
[1] "circular"
attr(,"class")
[1] "somgrid"
Figure 2 – Numbers (in red) and coordinates (in blue) of the cells in the grid
The property “$pts” draws our attention. The cells are numbered from 1 to 150 (15 x 10 grid). A pair of (x, y) coordinates is associated with each cell (Figure 2). It is therefore possible to calculate descriptive statistics relating to the cells by positioning them in the representation space. Even if it is not very common, one could, for example,
calculate a variable factor map to identify correlations in the topological map. We will see that below.
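For instance, we can display the coordinates of the first cells:
#coordinates (x, y) of the first cells of the grid
print(head(carte$grid$pts))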
Graphical representations are one of the attractive aspects of Kohonen maps. We detail
some of them in this section.
Progression of the learning process. This graph enables us to assess the convergence of the algorithm. It shows the evolution of the mean distance to the closest unit of the map. For our dataset (Figure 3), after a steep decrease, there is no significant improvement beyond (roughly) 60 iterations. By default, the procedure performs RLEN = 100 iterations. We should increase RLEN if the curve were still decreasing. This is not required here.
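The chart of Figure 3 corresponds to the type “changes” of the plot() command; RLEN is the rlen argument of som(). As a sketch:
#progression of the learning process (Figure 3)
plot(carte, type = "changes")
#if needed, increase the number of iterations, e.g.
#carte <- som(Z, grid = somgrid(15, 10, "hexagonal"), rlen = 200)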
Count plots. The number of instances in each cell is used to identify high-density areas. The “kohonen” package allows us to specify the color function to use. We define the function degrade.bleu() [range of blues].
#color range for the cells of the map: n blues of increasing opacity
degrade.bleu <- function(n){
  return(rgb(0, 0.4, 1, alpha = seq(0, 1, length.out = n)))
}
The function takes as input “n”, the number of colors to be produced. We generate a set of more or less opaque blues using rgb(): the higher the rank in the sequence, the darker (more opaque) the color. Then, we plot the map with the plot() command by specifying the right options (Figure 4).
#count plot
plot(carte,type="count",palette.name=degrade.bleu)
Ideally, the distribution should be homogeneous. The size of the map should be reduced if
there are many empty cells. Conversely, we must increase it if areas of very high density
appear (Lynn, 2014).
Details of the results can be accessed through the properties of the kohonen object. The property “$unit.classif” gives, for each individual, the index of the cell to which it is assigned.
# cell membership for each individual
print(carte$unit.classif)
The first instance is assigned to cell #83, the second one to cell #6, …, the last one to cell #74.
[1]    83   6  26  53  20  86  99 120  52   3 146  82  81 120  68 145 112  21 109  21 139 148  82  36 123  89
[27]   24  46  52  57   1 127  99 114 148  16 150  41  21 104 102 126  11  59  46 102  64 130  64  77 118  25
...
[1457] 58  78 107  26   7 144  47  57 127  43  83 141  78  19 102  62   7  76 126   7  58  69 125  76  43   6
[1483] 12  27 144 102  13 144  98  36 103 101  93 106 137  24  42 146 138  74
We can calculate the number of instances in each cell using the table() command.
# number of instances assigned to each node
nb <- table(carte$unit.classif)
print(nb)
There are 11 instances in cell #1, 12 in cell #2, …, 12 in cell #150.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
11 12 8 16 12 14 12 9 8 10 10 10 9 8 6 14 11 9 12
20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38
7 13 9 10 12 12 12 11 9 6 11 14 7 8 8 9 14 12 9
39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57
7 5 12 9 16 9 10 12 17 6 7 15 10 9 9 5 10 11 10
58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76
11 10 9 16 11 12 14 11 8 8 12 8 13 6 9 10 13 13 13
77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
9 9 7 10 8 12 10 6 14 12 5 10 8 8 5 11 13 13 12
96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114
6 6 8 12 10 12 11 13 8 10 12 14 12 9 7 9 6 11 10
115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133
10 7 6 10 7 12 7 11 7 8 5 11 18 10 13 14 9 8 5
134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150
8 13 8 7 7 10 7 11 7 11 15 9 8 11 13 7 12
We can also check whether there are empty nodes by counting the number of values in the vector generated by the table() command; a length of 150 means that no cell is empty.
#check if there are empty nodes
print(length(nb))
Neighbour distance plot. Called the “U-Matrix” (unified distance matrix), this chart represents the self-organizing map (SOM) with the Euclidean distances between the codebook vectors of neighboring neurons depicted in a range of colors.
According to the package documentation, the nodes that form the same group tend to be close, while border areas are marked by nodes that are far from each other. In our chart (Figure 5), the nodes which are close to their neighbors are dark-colored. We observe that they are concentrated at the ends of the map. This gives hope for a good separation of the groups in the final clustering.
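This chart is obtained with the type “dist.neighbours” of the plot() command:
#U-Matrix: summed distances to the neighboring nodes
plot(carte, type = "dist.neighbours")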
A separate section is devoted below to charts that establish the role of the variables in the definition of the different areas composing the topological map. This is important for the interpretation of the results, especially when we combine the new data representation with a partition provided by a clustering algorithm (section 2.7).
Codebook vectors. This chart represents the weight vector (codebook) of each cell of the map as a small segment (pie-like) chart.
#codebooks – nodes pattern
plot(carte,type="codes",codeRendering = "segments")
It enables us to distinguish at a glance the nature of the different areas of the map with regard to the variables (Figure 6). We note that the southwestern part is rather characterized by high values of the first variables (in green); the north is rather associated with the intermediate variables (in yellow); the southeastern part relates to the last variables (in pink). We detail the codebooks for the first two cells.
#codebooks values for the two first cells
print(carte$codes[1:2,])
The values of the variables are comparable from one node to another. Moreover, because we have standardized the variables, they are also comparable within each node.
V1 V2 V3 V4 V5 V6 V7
[1,] -1.402508 0.7030893 1.028195 1.112326 1.69281 1.979410 1.201450
[2,] -1.317920 0.2778695 1.100380 1.466021 1.23953 1.316825 1.205971
V8 V9 V10 V11 V12 V13
[1,] 1.1301033 1.0421593 -0.04541570 -1.156651 -1.7364859 -1.899917
[2,] 0.8252693 0.3755839 -0.03180923 -0.887046 -0.8118423 -1.406320
V14 V15 V16 V17 V18 V19
[1,] -1.9378001 -0.9810732 -1.031163 -0.8476767 -0.8018206 -0.01620569
[2,] -0.8989741 -0.9614553 -0.966065 -0.6904285 -0.7548960 -0.53788446
V20 V21
[1,] -0.7937134 0.2506806
[2,] -0.7006180 -2.0250856
It appears that the southwest part of the map - cells #1 and #2, starting from the bottom left (Figure 2) - is characterized by high values for V5, V6 and V7, and low values for V12, V13 and V14.
Heatmaps. Attractive as it is, the “codebook” chart becomes difficult to read when the number of nodes and variables increases. Rather than making a single chart for all the variables, we can make a graph for each variable, highlighting the contrasts between the high and low value areas. This univariate description is easier to understand. We use the values of the codebooks to build the graphs, which we gather into one frame. The color scheme associates red (respectively blue) with high values (respectively low values). We use the coolBlueHotRed() function for that (Wehrens, 2015).
#color function for the charts
coolBlueHotRed <- function(n, alpha = 1) {
  rainbow(n, end = 4/6, alpha = alpha)[n:1]
}
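The heatmaps of Figure 7 can then be produced with one “property” plot per variable; a sketch, assuming as elsewhere in this tutorial that carte$codes is a matrix:
#one heatmap per variable, gathered into one frame
par(mfrow = c(3, 7))
for (j in 1:ncol(carte$codes)) {
  plot(carte, type = "property", property = carte$codes[, j],
       main = colnames(carte$codes)[j], palette.name = coolBlueHotRed)
}
par(mfrow = c(1, 1))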
Figure 7 – Areas for high values (red) and low values (blue) for each variable
We can distinguish roughly 3 areas, which will be confirmed by the cluster analysis below (section 2.7): southwest, characterized by high values of V3 to V8; north, by V10 to V12; southeast, by V14 to V18.
Variable factor map. This kind of chart is very popular in principal component analysis (PCA), but it is not usual in the Kohonen map context. However, we have all the resources needed for the calculations: the coordinates (x, y) of each cell (Figure 2), their weights (numbers of instances, Figure 4), and the value of each variable in the codebooks. The function below computes, for one variable, the weighted correlations with the two axes of the map.
#weighted correlation of a variable v with the (x, y) coordinates of the cells
#(the function header was truncated in the original listing and is reconstructed here)
correlation <- function(v, grille, w){
  x <- grille$grid$pts[, "x"]
  y <- grille$grid$pts[, "y"]
  #weighted means
  mx <- weighted.mean(x, w)
  my <- weighted.mean(y, w)
  mv <- weighted.mean(v, w)
  #weighted covariances and normalization terms
  numx <- sum(w * (x - mx) * (v - mv))
  denomx <- sqrt(sum(w * (x - mx)^2)) * sqrt(sum(w * (v - mv)^2))
  numy <- sum(w * (y - my) * (v - mv))
  denomy <- sqrt(sum(w * (y - my)^2)) * sqrt(sum(w * (v - mv)^2))
  #correlation for the two axes
  res <- c(numx / denomx, numy / denomy)
  return(res)
}
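A usage sketch: we apply the function to each column of the codebook matrix, with the cell sizes nb as weights, then we draw the correlation circle (the object name coord is an assumption):
#coordinates of the variables on the factor map
coord <- t(apply(carte$codes, 2, correlation, grille = carte, w = as.numeric(nb)))
#correlation circle
plot(coord, type = "n", xlim = c(-1, 1), ylim = c(-1, 1), xlab = "x", ylab = "y")
text(coord, labels = colnames(carte$codes), cex = 0.75)
abline(h = 0, v = 0, lty = 2)
symbols(0, 0, circles = 1, inches = FALSE, add = TRUE)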
The chart (Figure 8) is very similar to the variable factor map obtained with a principal component analysis (PCA) on the same dataset. The chart looks like a heart (inverted here). I have always wondered whether the authors (Breiman et al., 1984) did this intentionally when generating the data.
Admittedly, the variable factor map is above all an indicative tool in this framework. The calculations have no strict mathematical justification, especially since the topological map is supposed to be able to represent nonlinear patterns. Nevertheless, this kind of synthetic chart is much easier to inspect than the codebooks (Figure 6) and heatmaps (Figure 7) when the number of variables and nodes increases. In our example, the conclusions of the variable factor map are consistent with the previous representations.
Two characteristics allow us to measure the relevance of the variables numerically: because the variables are standardized, their influences are directly comparable; and because the codebook vectors of the cells correspond to estimates of the conditional averages, calculating their dispersion for each variable amounts to estimating the between-node variability of that variable, and hence its relevance. We calculate, for each variable, the standard deviation of the codebooks, weighted by the number of instances within each cell.
#spread of the codebooks
#standard deviation weighted by the cell size
sigma <- sqrt(apply(carte$codes, 2, function(x, effectif){
  sum(effectif * (x - weighted.mean(x, effectif))^2) / (sum(effectif) - 1)
}, effectif = as.numeric(nb)))
#sort in decreasing order
print(sort(sigma, decreasing = TRUE))
The relevant variables (those inducing the strongest contrasts between the cells of the map) appear first:
V15 V7 V17 V8 V6 V13 V11 V14
0.8900360 0.8797254 0.8603220 0.8540305 0.8538536 0.8527706 0.8518798 0.8518771
V16 V5 V10 V9 V12 V18 V4 V19
0.8508026 0.8461657 0.8437845 0.8435556 0.8198914 0.8161934 0.8141638 0.8067165
V1 V21 V2 V20 V3
0.8010057 0.7907470 0.7904047 0.7895682 0.7713664
The extreme variables (V1 to V3 and V19 to V21) are the least influential, i.e. their conditional averages are homogeneous across the whole map. These results confirm what we found in the various graphical representations seen earlier. But numerical indicators are more convenient when processing large datasets.
Clustering of nodes. The Kohonen map is itself a kind of cluster analysis, where adjacent cells have similar codebooks. We can therefore start from the preclusters (the cells of the map) provided by the SOM algorithm to perform a cluster analysis. HAC (hierarchical agglomerative clustering) is often used in this context.
Below, we calculate the pairwise distances between the cells (using the codebooks) [dist], then we perform a HAC [hclust] using Ward's method [method = “ward.D2”].
#distance matrix between the cells
dc <- dist(carte$codes)
The option “members” is crucial. Indeed, the number of instances attached to each node is not the same, i.e. the nodes do not have identical weights. We must take this into account when computing Ward's criterion.
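The calls themselves are not reproduced above; a sketch consistent with the text (the object names arbre and groupes are assumptions, groupes being reused below):
#HAC with Ward's method, weighting the nodes by their size
arbre <- hclust(dc, method = "ward.D2", members = as.numeric(nb))
plot(arbre)
#cut the dendrogram into 3 clusters
groupes <- cutree(arbre, k = 3)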
Figure 9 - Dendrogram
The variable “groupes” enables us to identify the cluster membership of each node of the Kohonen map.
[1] 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 2 2
[42] 2 2 2 2 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 1 1 1 1 1 1 3 3 3 2 2 2 2 2 2 1 1 1 1 3 1 3
[83] 3 3 3 2 2 2 2 2 1 1 1 3 3 3 3 3 3 3 2 2 2 2 2 1 1 3 3 3 3 3 3 3 3 3 3 2 2 2 1 3 3
[124] 3 3 3 3 3 3 3 3 3 3 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2
Node #1 belongs to cluster 1, node #2 as well, …, the last node of the map (#150) belongs to cluster 2.
The three areas identified in the various previous charts are clearly highlighted. And we now know how to interpret them.
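The clusters can be displayed on the map itself, with the boundaries between them; add.cluster.boundaries() comes with the “kohonen” package (the colors are arbitrary):
#display the clusters of nodes on the topological map
plot(carte, type = "mapping", bgcol = c("steelblue1", "sienna1", "yellowgreen")[groupes])
add.cluster.boundaries(carte, groupes)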
Note: adjacent nodes are grouped in the same cluster. This seems normal considering the properties of the map (adjacent nodes have similar codebooks). But anomalies may sometimes appear, mainly because of the constraints inherent to the clustering algorithm.
Cluster membership of the individuals. The cells of the map are now associated with clusters. But, usually, we are interested in the cluster membership of the individuals. We obtain it in a two-step process: we first assign each individual to a cell, then we identify the cluster associated with this cell.
#assign each instance to its cluster
ind.groupe <- groupes[carte$unit.classif]
print(ind.groupe)
Individual #1 is associated with cluster 3, #2 with cluster 1, …, #1500 with cluster 2.
[1] 3 1 2 1 1 2 3 2 1 1 3 3 1 2 3 3 3 1 3 1 3 3 3 1 3 2 2 1 1 2 1
...
[1477] 2 3 3 1 2 1 2 2 3 2 2 3 3 1 2 2 1 1 3 2 2 3 3 2
Visualizing the clusters in the original representation space. Since we know the cluster memberships of the individuals and the variables explaining the grouping, it is possible to represent them in a three-dimensional space formed by the main variables defining the 3 areas identified previously (sections 2.5 and 2.6), i.e. V7, V11 and V15. By means of the “plot3Drgl” package (by the author of “plot3D”), we can generate an interactive graphical representation.
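plotrgl() replays the most recent “plot3D” chart in an interactive rgl window, so a static chart must be drawn first. A minimal sketch, assuming Z is the standardized data matrix and ind.groupe the vector of cluster memberships:
#static 3D scatter plot of the individuals, colored by cluster
library(plot3D)
scatter3D(Z[, "V7"], Z[, "V11"], Z[, "V15"], colvar = ind.groupe,
          col = c("steelblue1", "sienna1", "yellowgreen"), pch = 19,
          xlab = "V7", ylab = "V11", zlab = "V15", colkey = FALSE)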
#animated 3D graphical representation
library(plot3Drgl)
plotrgl(lighting = TRUE)
A specific window appears. I admit to having had a lot of fun turning the graphic in all directions and playing with the zoom (Figure 12). As we can see, the possibilities for data representation, and therefore for data analysis, are attractive.
3 Kohonen map with Tanagra
We go back to basics in this section. The implementation of Kohonen maps under Tanagra is described in a previous tutorial (SOM 2, 2009). The reader may refer to it for the step-by-step construction and settings of the data mining diagram.
3.1 Dataset
We use all 5000 instances of the WAVEFORM dataset here (waveform_som_2.xls). The class attribute WAVE (last column) is not used for the construction of the clusters, but is used for their evaluation. Because WAVEFORM is an artificial dataset, it is appropriate for this type of evaluation scheme.
We import the dataset2, then we define the input variables with the DEFINE STATUS tool (INPUT: V1 to V21). We add the KOHONEN-SOM component into the diagram. We set the size of the map to 15 rows and 10 columns (1), and we standardize the variables (2).
2 http://data-mining-tutorials.blogspot.fr/2010/08/tanagra-add-in-for-office-2007-and.html
Tanagra displays the number of instances in each cell. It also provides an indication of the quality of the map: the proportion of explained variance. We have 59.92% for our analysis.
To launch the cluster analysis (HAC) from the pre-clusters defined by the topological map, we add the HAC component into the diagram, specifying the variables used with a second DEFINE STATUS component: TARGET corresponds to the pre-clusters of KOHONEN-SOM (CLUSTER_SOM_1), INPUT corresponds to the descriptors (V1 to V21). We ask for 3 clusters, since we know the true solution in advance. In any case, if we had let the tool detect the number of clusters automatically, it would have proposed the same number for this dataset.
We then cross the resulting cluster variable with the class attribute (TARGET = WAVE, INPUT = CLUSTER_HAC_1) to evaluate the agreement. We obtain V = 0.5029. This is a reference value. Let us see if the K-Means algorithm can do better.
We add the K-MEANS component into the diagram with the same settings (INPUT: V1 to V21; 3 clusters). Again, we compare the computed cluster variable with the class attribute WAVE (TARGET = WAVE, INPUT = CLUSTER_KMEANS_1).
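For reference, assuming V denotes Cramér's v, the same indicator can be computed in R from a cross table (a sketch, independent of the Tanagra diagram):
#Cramer's v between the computed clusters and the pre-established classes
cramer.v <- function(y, x){
  K <- table(y, x)
  chi2 <- suppressWarnings(chisq.test(K)$statistic)
  as.numeric(sqrt(chi2 / (sum(K) * (min(dim(K)) - 1))))
}
#e.g. cramer.v(wave, clusters) where both vectors cover the 5000 instances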
4 Conclusion
Kohonen maps are at once a technique for dimensionality reduction, visualization, and clustering (at the very least, they establish a preclustering of the data). In this tutorial, I have tried to highlight the advantages of the approach, emphasizing the different possibilities of interpretation offered by the various maps, and showing the power of its association with hierarchical agglomerative clustering (HAC) in an unsupervised learning process.
5 References
(CAH, 2017) Rakotomalala R., “Hierarchical Agglomerative Clustering (slides)”, Tanagra
Tutorial, June 2017.
(Lynn, 2014) Lynn S., “Self-Organising Maps for Customer Segmentation using R”, R-
bloggers, February 2014.
(SOM 1, 2017) Rakotomalala R., “Self-organizing map (slides)”, Tanagra Tutorial, June 2017.
(SOM 2, 2009) Rakotomalala R., “Self-organizing map (SOM)”, Tanagra Tutorial, July 2009.
(Wehrens and Buydens, 2007) Wehrens R., Buydens L., “Self- and Super-organizing Maps in R: The kohonen Package”, Journal of Statistical Software, 21(5), 2007.