Self Organizing Maps for the Visual Analysis of Pitch Contours
Dominik Sacha Yuki Asano Christian Rohrdantz Felix Hamborg
Daniel Keim
Bettina Braun
Miriam Butt
Data Analysis and Visualization Group & Department of Linguistics
University of Konstanz
We present a novel interactive approach
for the visual analysis of intonation contours. Audio data are processed algorithmically and presented to researchers
through interactive visualizations. To this
end, we automatically analyze the data
using machine learning in order to find
groups or patterns. These results are visualized with respect to meta-data. We
present a flexible, interactive system for
the analysis of prosodic data. Using real-world application examples, one containing preprocessed, the other raw data, we
demonstrate that our system enables researchers to interact dynamically with the
data at several levels and by means of different types of visualizations, thus arriving
at a better understanding of the data via a
cycle of hypothesis generation and testing
that takes full advantage of our visual processing abilities.
1 Introduction and Related Work
Traditionally, linguistic research on F0 contours
has been conducted by manually annotating the
data using an agreed-upon set of pitch accents and
boundary tones such as the ToBI system (Beckman et al., 2005). However, the manual categorization of F0 contours is open to subjectiveness
in decision making. To overcome this disadvantage, recent research has focused on functional
data analysis of F0 contour data (Gubian et al.,
2013). The F0 contours are smoothed and normalized, resulting in comparable pitch vectors for different utterances of the same structure. However,
this method abstracts away from the original underlying data, which
cannot be easily accessed (or visualized) for individual analysis.
One of the typical tasks in prosodic research is
to determine specific F0 contours that signal certain functions. State-of-the-art analysis is time-intensive
and not ideal, because statistics or projections are applied to the data, which can lose important aspects of the original data. To
overcome these problems, we offer a visual analytics system that allows for the use of preprocessed
F0 pitch vectors in data analysis as well as the
ability to work with the original, individual data
points. Moreover, the linguistic researcher is interactively involved in the visual analytics process
by guiding the machine learning and by interacting with the visualization according to the visual
analytics mantra “Analyze first, Show the Important, Zoom, filter and analyze further, Details on
demand” (Keim et al., 2008).
Our system consists of three components. The
first is the Data Input, where all input files are read and converted into the internal data model. The second
part covers Machine Learning where we make use
of Self Organizing Maps (SOM) in order to find
clusters of similar pitch contours. The visualization based on the SOM result is realized within our
last component, the Interactive Visualization. The
researcher can interpret the data directly via this
visualization, but may also interact with the system in order to steer the underlying model. The
overall work flow is illustrated in Figure 1. This
combination of human knowledge and reasoning
with automated computational processing is the
key idea of visual analytics (Thomas and Cook,
2006) and supports human knowledge generation
processes (Sacha et al., 2014). Our contribution
builds on existing previous work on SOM based
visual analysis (Vesanto, 1999; Moehrmann et al.,
2011), but also on previous attempts to visually
investigate data from the domain of prosodic research (Ward and Mccartney, 2010; Ward, 2014).
Furthermore, we profit from approaches to analyze speech using the SOM algorithm (Mayer et
al., 2009; Silva et al., 2011; Tadeusiewicz et al.,
1999), but open up a new domain within this field
as we allow for a visualization of pitch contours
directly on a SOM-grid. Moreover, we do not
just produce one SOM, but also compute and visually present several dependent/derivative SOMs.
Figure 1: Work flow in four steps. A-Data Input, B-Configuration, C-Training, D-Visualization.
2 System
The system pipeline consists of three main components: 1) Data-Input; 2) Machine-Learning; 3)
Interactive Visualizations.
2.1 Data Input
Our system is able to process and visualize any
kind of data that satisfies the following restrictions. The data set needs to consist of a list of
data items, where each item contains a set of key-value pairs, also called data attributes. The value
of a data attribute must be a primitive, i.e., either a
number, text string, or an array consisting of primitives. Except for primitive-arrays we do not allow
nested data, thus we flatten the input data if necessary. Overall, data items should be comparable
and contain attributes with equal keys (and different values).
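The input restrictions above can be illustrated with a minimal sketch. The function name `flatten_item`, the dotted-key convention for lifted attributes, and the sample item are assumptions made for illustration; they are not taken from the system itself.

```python
from typing import Any, Dict

PRIMITIVES = (int, float, str)


def flatten_item(item: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested mappings into dotted keys; arrays of primitives stay intact."""
    flat: Dict[str, Any] = {}
    for key, value in item.items():
        name = prefix + key
        if isinstance(value, dict):
            # Nested data is not allowed, so it is lifted to the top level.
            flat.update(flatten_item(value, prefix=name + "."))
        elif isinstance(value, (list, tuple)):
            if not all(isinstance(v, PRIMITIVES) for v in value):
                raise ValueError(f"{name!r} must be an array of primitives")
            flat[name] = list(value)
        elif isinstance(value, PRIMITIVES):
            flat[name] = value
        else:
            raise ValueError(f"{name!r} has unsupported type {type(value).__name__}")
    return flat


# Hypothetical data item: an F0 vector plus nested speaker meta data.
raw_item = {
    "f0": [110.0, 112.5, 108.0],
    "word": "sumimasen",
    "speaker": {"l1": "German", "level": 2},
}
flat_item = flatten_item(raw_item)
```

After flattening, every item exposes the same flat set of keys, so any attribute holding an array of numbers can be offered as a candidate Input Vector.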
The system also expects comparable feature
vectors to which a distance measure can be applied. Furthermore, additional (meta) data can be
part of the input. In the use cases presented here,
each F0 contour is connected with speaker information such as the native language of the speaker, the
level of second language (L2) proficiency and the
context the data was produced in.
Vector Preprocessing After having loaded in
the data, our system allows for the inspection of
data prior to the actual analysis. Figure 1-A shows
the inspection view that is typically used
first in the work flow. As part of the configuration
work flow, the user selects an attribute as the Input Vector (Figure 1-B). This forms the basis of
the machine learning component.
Before entering the training phase of the machine learning component, our system performs a validation of the
Input Vector and allows for its adjustment if necessary. Whereas normalized and smoothed data,
i.e., data items with vectors of equal length, can be
processed directly, our system also offers the functionality to perform basic preprocessing of raw Input Vectors. If it is found that not all vectors have
equal length, we offer several preprocessing techniques from which one can be chosen: Besides
simple approaches of adding mean values (mean-padding) or 0s (zero-padding), we also offer an approach that makes use of linear interpolation (pairwise). If time and landmark information is available, it is also possible to divide the vectors into
parts and adjust each of the parts separately. As
a result, all the parts have equal length and are
therefore better suited for comparison. The Input
Vector values can be normalized using semitone normalization. The mean value can also be subtracted from each contour, in order to minimize
gender effects.
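The length-adjustment options can be sketched as follows. This is only an illustration under stated assumptions: padding is appended at the end of a vector, the interpolation variant resamples a contour to a target length, and landmarks are given as sample indices. All function names are hypothetical.

```python
from typing import List, Sequence


def zero_pad(vec: Sequence[float], length: int) -> List[float]:
    """Append zeros until the vector has the requested length."""
    return list(vec) + [0.0] * (length - len(vec))


def mean_pad(vec: Sequence[float], length: int) -> List[float]:
    """Append the vector's own mean until it has the requested length."""
    mean = sum(vec) / len(vec)
    return list(vec) + [mean] * (length - len(vec))


def resample_linear(vec: Sequence[float], length: int) -> List[float]:
    """Resample to `length` points by interpolating between neighboring samples."""
    if length == 1 or len(vec) == 1:
        return [vec[0]] * length
    step = (len(vec) - 1) / (length - 1)
    out = []
    for i in range(length):
        pos = i * step
        lo = int(pos)
        hi = min(lo + 1, len(vec) - 1)
        frac = pos - lo
        out.append(vec[lo] * (1 - frac) + vec[hi] * frac)
    return out


def resample_parts(vec: Sequence[float], boundaries: Sequence[int],
                   points_per_part: int) -> List[float]:
    """Resample each landmark-delimited part separately to an equal length."""
    out: List[float] = []
    for start, end in zip(boundaries, boundaries[1:]):
        out.extend(resample_linear(vec[start:end], points_per_part))
    return out


# Toy contours of unequal length, brought to the length of the longest one.
contours = [[100.0, 120.0], [100.0, 110.0, 120.0, 130.0]]
target = max(len(c) for c in contours)
equalized = [resample_linear(c, target) for c in contours]
```

Padding keeps the original samples untouched but adds artificial values, whereas resampling changes the sampling but preserves the overall shape; which is preferable depends on the analysis task.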
In sum, we offer a very flexible preprocessing
functionality for the Input Vectors. The available
techniques can be combined flexibly and dynamically according to what is most suitable for the
analysis task at hand. However, there are still
methods that could be added. For example, one
could additionally enhance the vector processing
by a stronger leveraging of the time information
in order to prepare the data for duration focused
analysis tasks.
2.2 Machine Learning
We make use of Machine Learning (ML) for the
detection of groups/clusters that are present in the
data based on the Input Vectors. Additionally, the
system detects correlations to the meta data. In
our use cases this included information about the
native language of the speakers and the level of
their language proficiency.
Figure 2: SOM-Training illustrated by 4 steps. For each cluster the prototype and the distances between adjacent cells are visualized by black lines in between. In step 4 the training has finished and the
dedicated F0 contours are also drawn in each cell.
In principle, any distance function, projection or
clustering method could be applied in our extensible framework. The central problem that needs to
be resolved is that the high dimensional data from
the Input Vectors needs to be reduced to a two-dimensional visualization that can be rendered on
a computer screen or a piece of paper. We experimented with several different methods and found
that SOMs, also known as Kohonen Maps (Kohonen, 2001), match the demands of this task best.
SOMs are a well established ML technique that
can be used for clustering or as a classifier based
on feature vectors. SOMs are very suitable for
our purpose for several reasons. First we can use
SOMs as an unsupervised ML-technique to find a
fixed number of clusters subsequent to a training
phase. SOMs also provide a topology where similar clusters are adjacent. Finally, the algorithm
adapts to the given input data depending on the
number of desired clusters and the amount of data.
Furthermore, in our system, the clustering and
dimensionality reduction are integrated in one
step. This stands in contrast to other clustering and
dimensionality reduction techniques like Multi Dimensional Scaling (MDS), Principal Component
Analysis (PCA) or Non-negative Matrix Factorization (NMF). A disadvantage found with these
other methods is that they tend to lead to clutter
in the two-dimensional space (when there is a high
degree of overlap in the data). It is also unclear
when to perform the clustering: in the high-dimensional space before projection or in the two-dimensional space afterwards.
Our system proceeds as follows. First, the
SOM-grid is initialized with random cluster centroids, which are feature vector prototypes for
each cluster. Afterwards each feature vector is
used to train the SOM in a random order. For
each vector the SOM algorithm determines the
best matching unit (BMU) and adjusts the BMU
and the prototypes of adjacent clusters based on the input vector. This process is repeated n times until
the SOM is in a stable state (Figure 2, steps A-C).
After the training phase the resulting grid can be
used for clustering. Each vector is assigned to the
cluster with the least distance to the cluster prototype (BMU). In Step D of Figure 2 each cell represents a cluster containing the cluster prototype
(black vector) and the cluster members (colored vectors).
Note that we did not rely on existing software
libraries like the SOM-toolbox, but instead implemented the algorithm from scratch. The reason for
this is that we aim at being able to visualize and
steer the algorithm at every step (see Section 2.4).
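The training procedure described in this section can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used in the system: the Gaussian neighborhood function, the linearly decaying learning rate and radius, and all names and parameter defaults are assumptions.

```python
import math
import random
from typing import Dict, List, Sequence, Tuple

Cell = Tuple[int, int]


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def best_matching_unit(grid, vec) -> Cell:
    """Return the (row, col) of the prototype closest to vec."""
    cells = ((r, c) for r in range(len(grid)) for c in range(len(grid[0])))
    return min(cells, key=lambda rc: sq_dist(grid[rc[0]][rc[1]], vec))


def train_som(data: List[List[float]], rows: int, cols: int,
              epochs: int = 50, lr0: float = 0.5, seed: int = 0):
    rng = random.Random(seed)
    dim = len(data[0])
    lo = min(min(v) for v in data)
    hi = max(max(v) for v in data)
    # Random initial prototypes (cluster centroids) for every grid cell.
    grid = [[[rng.uniform(lo, hi) for _ in range(dim)]
             for _ in range(cols)] for _ in range(rows)]
    radius0 = max(rows, cols) / 2
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1 - frac)                    # assumed linear decay
        radius = max(radius0 * (1 - frac), 0.5)  # assumed shrinking neighborhood
        order = list(data)
        rng.shuffle(order)                       # present vectors in random order
        for vec in order:
            br, bc = best_matching_unit(grid, vec)
            for r in range(rows):
                for c in range(cols):
                    grid_d2 = (r - br) ** 2 + (c - bc) ** 2
                    h = math.exp(-grid_d2 / (2 * radius ** 2))
                    proto = grid[r][c]
                    for i in range(dim):
                        # Pull the BMU and its grid neighbors toward the input.
                        proto[i] += lr * h * (vec[i] - proto[i])
    return grid


def assign_clusters(grid, data) -> Dict[Cell, List[int]]:
    """Assign each vector (by index) to the cell of its best matching unit."""
    clusters: Dict[Cell, List[int]] = {}
    for idx, vec in enumerate(data):
        clusters.setdefault(best_matching_unit(grid, vec), []).append(idx)
    return clusters


# Toy data: two flat and two falling contours (mean-centered semitone values).
toy_contours = [[0.0, 0.0, 0.0], [0.1, 0.0, -0.1],
                [5.0, 0.0, -5.0], [5.1, 0.1, -5.2]]
som_grid = train_som(toy_contours, rows=2, cols=3)
clusters = assign_clusters(som_grid, toy_contours)
```

Because every step is an explicit loop, intermediate grids can be captured after each epoch, which is what makes it possible to animate or interrupt the training.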
2.3 Visualizations
We build on Schreck et al.’s work on SOM-based
visual analysis (Schreck, 2010). Within the basic SOM-grid, we provide several different ways
of visualizing the information of interest to the researcher. As shown in Figure 3-A, we provide an
overview visualization which shows the SOM-grid
filled with the clustered pitch contours.
The individual cells also show the cluster centroid
and the vectors (contours) that belong to that cell
in relation to the centroid (Figure 3-F). We also visualize the training history of a cluster in the background in each cell (Figure 3-A) in order to keep
track of the training phase.
Beyond the clustered contours, we furthermore
provide optional visualizations, which add simple highlighters or
bar charts to the SOM result (Figure 3-C). We also
experimented with heatmaps,1 which turned out
to be good for visualizing the distribution of data
attributes across the SOM-grid (Figure 3-G). The color intensity of a node depends on the number of data
items it contains; the more data items, the stronger
the intensity.
1 In our approach a color overlay for the SOM grid
Figure 3: Different approaches to visualize SOM-results according to available meta data. (A) Grid
visualization, (B) word cloud, (C) bar charts, (D) mixed color cells, (E) ranked group clusters, (F) one
single cell that visualizes contained vectors and the cluster prototype, (G) separated heatmaps for all
values of a categorical attribute.
We offer several normalization options. One
approach takes the global maximum (of all
groups/grids), whereas the other one takes a local maximum for each single group/grid. Different
kinds of normalizations can also be chosen in order to handle outliers or small variabilities in the
data. Depending on the underlying data, an adequate normalization technique is needed to obtain
visible patterns in the data.
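The difference between the two normalization options can be illustrated with a short sketch. The function name, the dictionary layout of the counts, and the toy numbers are hypothetical and serve only as an example.

```python
from typing import Dict, List

CountGrid = List[List[int]]


def heatmap_intensities(counts: Dict[str, CountGrid],
                        mode: str = "global") -> Dict[str, List[List[float]]]:
    """Map per-cell item counts to color intensities in [0, 1]."""
    global_max = max(max(max(row) for row in grid) for grid in counts.values())
    result = {}
    for group, grid in counts.items():
        if mode == "global":
            ref = global_max                       # one scale for all grids
        else:
            ref = max(max(row) for row in grid)    # separate scale per grid
        ref = ref or 1                             # avoid division by zero
        result[group] = [[c / ref for c in row] for row in grid]
    return result


# Toy counts per SOM cell for two speaker groups (made-up numbers).
cell_counts = {"JN": [[8, 0], [2, 0]], "GL": [[1, 2], [0, 1]]}
global_view = heatmap_intensities(cell_counts, mode="global")
local_view = heatmap_intensities(cell_counts, mode="local")
```

With the global maximum the grids remain comparable with each other, but sparsely populated groups fade out; with the local maximum the pattern inside each grid becomes visible, at the cost of comparability across grids.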
A drawback of the heatmaps is that it is not
easy to detect if cells are homogeneous or heterogeneous. That means that it is hard to determine
whether a cell contains only vectors of a specific
group (e.g., in our use cases just native Japanese or
native Germans) or if it is a mixed cell. For that
reason we also offer another visualization. For
each cell we derive the color depending on the
number of members of each group. To this end, we assign
a color (e.g., red vs. blue) to each group and mix
them accordingly. As a result, homogeneous (red
or blue) and heterogeneous (purple) clusters are
easy to detect (see Figure 3-D, where “GL” stands
for “German learner” and “JN” for “Japanese native”). Finally, we also offer word cloud visualizations for each cell (Figure 3-B). These allow the
user an overview of the values contained in a cell if
the selected attribute has many categories/values.
Each of these visualizations offers a different perspective on the data, and the user is able to interact
dynamically with each of the different visualization possibilities.
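The mixed cell colors can be sketched as a weighted average of the group colors. Linear mixing in RGB space, the white color for empty cells, and the function name are assumptions made for this example.

```python
from typing import Dict, Tuple

RGB = Tuple[int, int, int]


def mix_cell_color(member_counts: Dict[str, int],
                   group_colors: Dict[str, RGB]) -> RGB:
    """Blend group colors in proportion to the number of group members in a cell."""
    total = sum(member_counts.values())
    if total == 0:
        return (255, 255, 255)  # assumed: empty cells are drawn in white
    mixed = [0.0, 0.0, 0.0]
    for group, n in member_counts.items():
        for i, channel in enumerate(group_colors[group]):
            mixed[i] += channel * n / total
    return (round(mixed[0]), round(mixed[1]), round(mixed[2]))


group_colors = {"JN": (255, 0, 0), "GL": (0, 0, 255)}
homogeneous = mix_cell_color({"JN": 5, "GL": 0}, group_colors)
heterogeneous = mix_cell_color({"JN": 1, "GL": 3}, group_colors)
```

A cell that contains only one group keeps that group's pure color, while mixed cells move toward purple in proportion to their composition.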
2.4 Interaction
The system offers various possibilities for interaction: 1) Configuration/Encoding Interactions; 2)
SOM Interactions; 3) Selection Interactions.
Configuration/Encoding Interactions: The
algorithm and the visualization techniques offer
many possibilities for individualized configuration, e.g., the grid dimensions of the SOM or the
normalization techniques that are applied by the
visualization techniques. Furthermore, the cell layout can be toggled interactively from the SOM-grid to a grouped alignment. An advantage of the
grouped alignment is that the typical feature clusters for each group can be determined by their position. In combination with our coloring approach,
the analysts are thus able to locate the top group
clusters and detect if they are homogeneous or heterogeneous (Figure 3-E). Users may also define
and change visual mappings like the colors that
are assigned to the attribute values.
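One way to derive the grouped alignment is to rank, for each group, the cells by the number of members they contain. The sketch below assumes this ranking criterion; the function name and the toy labels are hypothetical.

```python
from typing import Dict, List, Tuple

Cell = Tuple[int, int]


def top_cells_per_group(cell_members: Dict[Cell, List[str]],
                        top_k: int = 3) -> Dict[str, List[Tuple[Cell, int]]]:
    """For each group, list the cells holding most of its members, largest first."""
    counts: Dict[str, Dict[Cell, int]] = {}
    for cell, labels in cell_members.items():
        for label in labels:
            per_group = counts.setdefault(label, {})
            per_group[cell] = per_group.get(cell, 0) + 1
    return {
        group: sorted(per_cell.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
        for group, per_cell in counts.items()
    }


# Toy clustering result: the group label of every contour assigned to a cell.
members = {
    (0, 0): ["GL", "GL", "GL", "JN"],
    (0, 1): ["GL"],
    (1, 0): ["JN", "JN"],
    (1, 1): ["GL", "GL"],
}
ranking = top_cells_per_group(members, top_k=2)
```

Ties are broken by cell position here so that the layout stays stable between redraws.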
SOM Interactions: We incorporate the idea
that the analyst should be able to steer the training phase of the algorithm as well (Schreck et al.,
2009). The analyst is able to enter into an iterative
process that refines the analysis in each step. In
each step the SOM result can be manipulated and
serves as an input for the next iteration. For one, it
is possible to delete cells directly on the grid. Another interactive possibility is to move cells to a
desired position and to “pin” them to this position.
That means that for the next SOM training this cell
is fixed. We make use of these interactions to steer
the SOM-algorithm to deliver visually similar outputs. For example, if we fix a cell near the upper right corner, in the next round of training this
cell and the cells similar to it will be in the same
corner (e.g., in Figure 4-E the two gray cells are
fixed). Finally, it is possible to break off the current training process and to restart or to investigate the current state in more detail if the analyst
already perceives a pattern or discovers problems.
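How a pinned cell might influence the next training run can be sketched as a single update step that leaves pinned prototypes untouched. This is one possible reading of "fixed", not a description of the actual implementation; the function name, the Gaussian neighborhood, and the parameters are assumptions.

```python
import math
from typing import FrozenSet, List, Sequence, Tuple

Cell = Tuple[int, int]


def update_step(grid: List[List[List[float]]], vec: Sequence[float],
                lr: float, radius: float,
                pinned: FrozenSet[Cell] = frozenset()) -> Cell:
    """One SOM update; pinned cells can win as BMU but are never modified."""
    rows, cols = len(grid), len(grid[0])
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    br, bc = min(cells, key=lambda rc: sum(
        (p - x) ** 2 for p, x in zip(grid[rc[0]][rc[1]], vec)))
    for r, c in cells:
        if (r, c) in pinned:
            continue  # the analyst fixed this cell, so its prototype stays as is
        h = math.exp(-((r - br) ** 2 + (c - bc) ** 2) / (2 * radius ** 2))
        grid[r][c] = [p + lr * h * (x - p) for p, x in zip(grid[r][c], vec)]
    return (br, bc)


# A 1x2 grid whose left cell is pinned; the input is closest to the pinned cell.
demo_grid = [[[0.0, 0.0], [1.0, 1.0]]]
winner = update_step(demo_grid, [0.2, 0.2], lr=0.5, radius=1.0,
                     pinned=frozenset({(0, 0)}))
```

Since neighbors of a pinned best matching unit are still pulled toward the input, similar contours tend to settle around the pinned cell, which is consistent with the behavior described above.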
Selection Interactions: These interactions help
to filter and select the data during the analysis process. The data that are contained in the current
SOM visualization serve as input for the next iteration of the analysis work flow. Besides removing data elements directly on the SOM grid, data
can be selected to be removed directly in the attribute table (Figure 4-D). This feature allows the
analyst to drill down into selected data subspaces.
Details on Demand operations also enable the user
to inspect subsets of clusters. Furthermore, single
cells can be selected and investigated in a separate
linked detail view.
By enabling these interactions we present the
analyst with flexible possibilities for an iterative analysis process. The system first provides an
overview of the data; the analyst is then able to interact
with the data in iterations of hypothesis formation
and testing. The hypothesis testing can be done
with respect to the entire data set, or with respect
to a selected subset. In order to keep track of the
various visualizations and interactions conducted
by the analyst, we offer a visualization history that
displays the developed SOM grids next to one another (e.g., Figure 5). Clicking on one of these
grids will automatically bring the selected SOM
to the front of the screen.
3 Use Cases
We demonstrate the added value that our approach
brings to prosodic research with respect to two linguistic experiments that were originally conducted
independently of this work. We take a “paired analytics” approach for an evaluation of the potential
of our system (Arias-Hernandez et al., 2011). In
this approach, an expert for visual analytics collaborates with a domain expert. The domain expert places their focus on tasks, hypotheses and
ideas while an analysis expert operates the system.
Figure 4: Interaction techniques that enable
an iterative data exploration. Configuration Interactions can be used to define parameters like the
grid dimensions or visual mappings (e.g., selecting the attribute colors). SOM Interactions include
the direct manipulation of the SOM-visualization
(move, delete, or pin cells, begin or stop SOM
training). Selection Interactions enable the analyst
to dismiss data in each step in order to drill down
into interesting data subspaces.
We are well aware that the standards for evaluation in natural language processing are quantitative in nature. There is an inherent conflict between quantitative evaluation and the rationale for
using a visual analytics system in the first place.
Visual analytics has the overall aim of allowing
an interactive, exploratory access to an underlying
complex data set. It is very difficult to quantify
data exploration and cycles of hypothesis testing
in the absence of a benchmark or gold standard.
This is a known problem within visual analytics
(Keim et al., 2010; Sacha et al., 2014), but one
which cannot be addressed within the scope of this
paper. The two use cases presented here should
be seen as an initial test as to the added value of
our system. An application to other scenarios and
other use cases is planned as future work.
The use cases discussed below consist of experiments that were concerned with whether linguistic structures of a native language (henceforth L1)
influence second language (henceforth L2) learning. The experiments involved Japanese native
speakers vs. German learners of Japanese. The latter group had varying degrees of L2 competence.
The data set consists of F0 contours and meta data
about the speakers.
3.1 Experiment 1
The first experiment investigated how native
speakers of an intonation language (German) produce attitudinal differences in an L2 that has lexically specified pitch movement (Japanese).
15 Japanese native speakers and 15 German native speakers, who were proficient in the respective
languages, participated in the experiment. They
produced the German word Entschuldigung and
the Japanese word sumimasen, which both mean
‘excuse me’. The Japanese word contains a lexically specified pitch fall associated with the penultimate mora in the word, /se/. Materials were presented with descriptions of short scenes. The task
was to produce the target word three times in order to attract an imaginary waiter’s attention in a
crowded and noisy bar.
Our hypotheses were that Japanese native
speakers would not change the F0 contours across
the three attempts, because the Japanese falling
pitch accent is lexically fixed. German learners
would change them, because German F0 can be
changed in order to convey attitude or emotion.
Segmental boundary annotation was carried out
on the recorded raw data using Praat (Boersma and
Weenink, 2011) as the first step. In Experiment
1, segmental boundaries were put between
morae, the smallest segmental units of Japanese, which
resulted in |su|mi|ma|se|n| (the straight
lines signal the segmental boundaries). Then, F0
contours were computed from the annotated data
using the F0 tracking algorithm in the Praat toolkit
with the default range of 70-350 Hz for males and
100-500 Hz for females. Following the procedures of Functional Data Analysis (Ramsay and
Silverman, 2009), we first smoothed the sampled
F0 contours into a continuous curve represented by
a mathematical function of time f(t) adopting B-splines (de Boor, 2001). Values of F0 were expressed in semitones (st) and the mean value was
subtracted from each value, in order to minimize
gender effects. After smoothing the curves we
automatically carried out landmark registration in
order to align corresponding segmental boundaries
in time (Gubian et al., 2013; Ramsay et al., 2009).
After these steps, the smoothed F0 data all had the
same duration.
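The semitone conversion and mean subtraction can be illustrated as follows. The reference frequency of 100 Hz and the function names are assumptions made for this sketch; since the mean is subtracted afterwards, the choice of reference cancels out.

```python
import math
from typing import List, Sequence


def hz_to_semitones(f0_hz: Sequence[float], ref_hz: float = 100.0) -> List[float]:
    """Convert F0 values in Hz to semitones relative to ref_hz (assumed reference)."""
    return [12.0 * math.log2(f / ref_hz) for f in f0_hz]


def subtract_mean(contour: Sequence[float]) -> List[float]:
    """Center a contour on zero so that overall pitch level no longer matters."""
    mean = sum(contour) / len(contour)
    return [v - mean for v in contour]


# Toy example: the same octave rise produced at a lower and a higher pitch level.
low_voice = subtract_mean(hz_to_semitones([100.0, 200.0]))
high_voice = subtract_mean(hz_to_semitones([200.0, 400.0]))
```

Both toy contours end up identical after centering, which is the sense in which the procedure minimizes gender effects. Unvoiced frames (F0 of 0 Hz) would have to be removed or interpolated before the conversion, because the logarithm is undefined there.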
The analysis process for Experiment
1 is shown in Figure 5. The first SOM offers an overview of the whole dataset. The
word cloud visualization additionally shows the
utterances that occur in the cells (sumimasen,
Entschuldigung). In a next step the data set was
filtered to show only the data for sumimasen (Figure 5-A) and a second SOM with only this data
was trained. In the second SOM in Figure 5 the
cells are colored according to the number of
members of each speaker group in each cell. Our analyst was
able to discover different pitch contours per group
(blue-German cells on the left-hand side and red-Japanese cells on the right-hand side).
Figure 5: Experiment 1 work flow history: An
overview is shown first. In the following steps the data
are filtered and the analysis is refined stepwise toward
an interesting subspace. First, only the utterances
sumimasen ‘excuse me’ are selected (A). These
are then further subdivided according to speaker
group (B/C): Japanese Native (JN) vs. German
Figure 6: Experiment 1: Heatmaps for the repetition attribute for each speaker group. German
learner contours clearly include more variations
compared to native speaker contours.
In order to get more details we decided to train
an additional SOM for each speaker group. We
simply added the relevant filters and began a new
SOM training for each group (Figure 5-B/C). As a
result the two visualizations now clearly show that
the F0 contours produced by the groups look different. For
further analysis, we also opened a heatmap visualization for another attribute for each group based
on the SOM-grids B and C. In Figure 6 the repetitions (1st, 2nd, or 3rd) are shown for each group.
One can clearly see that the Japanese native
speakers’ (top) F0 contours rarely vary in comparison with the German speakers (bottom).
3.2 Experiment 2
In Experiment 1 we were able to determine that
German learners did not produce typical Japanese
F0 contours, namely flat F0 followed by a drastic
pitch fall, just on the basis of unannotated F0 data.
The second experiment examined whether German learners can produce this typical Japanese F0
phonetic form in an imitation experiment. The experiment was originally conducted independently
of Experiment 1.
24 Japanese native speakers and 48 German learners were asked to imitate Japanese disyllabic nonwords consisting of three morae (/CV:CV/) with a
long-vowel. All stimuli were recorded either with
a flat pitch (high-high, HH) or with a falling pitch
(high-low, HL) that occurs after the long-vowel.
Japanese native speakers are expected to imitate the stimuli correctly by
realizing the typical phonetic form of a Japanese
pitch accent, namely a drastic pitch fall preceded
by a flat F0. In contrast, as per the results of Experiment 1, German learners are expected not to
produce this phonetic form.
In analogy to Experiment 1, segmental annotation was carried out. Segmental boundaries were
put between consonants and vowels, which resulted in |c|v|(c)c|(v)v|. Then, F0 contours
were computed as in Experiment 1. The data contained the raw Hertz values of F0; additional information included data about segments, speaker
information, time and landmark information for
the produced pitch contour. In total 2393 data
records were put into the SOM system.
The analysis workflow for Experiment 2 is shown
in Figure 7. The first SOM offers an overview for
the whole dataset. This overview clearly shows
two clusters for flat and falling F0 contours (“HH”-blue and “HL”-red). In the lower right corner, there is a red cell in the blue cluster. This type
of pattern could be indicative of an error or noise
in the data set.
Note that the SOM system did not know which
experimental conditions the data contained. Without any information about the experimental variables, the SOM detected differences across conditions. Furthermore, to our knowledge, no other current analysis technique enables an overview of F0 data in this manner. Since we were interested in the phonetic realization
of the Japanese pitch accent, we further analyzed only the data of the falling F0 condition.
As a consequence, a second SOM containing
only the “HL” contours was trained (Figure 7-A).
The next step was to remove the noise from the
data (Figure 7-2nd SOM). In the second SOM
we discovered one cell that contains non falling
F0 contours (lower left corner). We deleted this
cell and fixed/pinned the other corner cells in order to steer the SOM algorithm to produce a similar SOM in the next training phase (Figure 7-B). In the next SOM the cells are colored according to the number of members of each speaker group in each cell
(blue-German, red-Japanese). The three cells in
the lower left corner were the most frequent F0
contours produced only by German learners of
Japanese. To analyze this further, we also changed
the grid based layout to the ranked group layout to
show the three most frequent F0 contours in each
language group (Figure 7-C). As a result, the last
SOM visualization now clearly shows that the F0
contours produced by the groups look different: Japanese
native speakers produced typical Japanese F0 contours consisting of a flat F0 before a drastic F0 fall
(Gussenhoven, 2004). The third cells from above
in both of the language groups show the same F0
forms, suggesting that some German native speakers produced F0 contours that were very similar to
those of Japanese native speakers. Note however,
that the most frequent contours produced by German learners clearly differed from the Japanese
contours. Finally, one of the most important contributions of the SOM system was that it delivered
the findings without requiring us to first
manually annotate a large amount of data, saving
personnel costs.
Figure 7: Experiment 2 work flow history: An overview is shown first. Then only “HL” F0 contours
(red colored cells) are selected (A). The researcher interacted directly with the second SOM: The noise
cluster was removed (bottom left corner) and the other corner cells were fixed in order to steer the SOM
algorithm to produce a visually similar SOM for the next training (B). The resulting SOM reveals blue
clusters on the left hand side. Changing the layout to the top clusters per group (C) allows for a better
4 Conclusion
We provide an interactive system for the analysis of prosodic feature vectors. To complement
other state-of-the-art techniques we make use of
machine learning in combination with interactive
visualizations. We implemented an iterative process using chains of SOM-trainings for a step-by-step refinement of the analysis. We show with
real experiment data that the system supports linguistic research. Importantly, the analysis allows
for a clustering of F0 contours that works without time-intensive and possibly subjective manual intonational analysis. The clustered contours
can be subjected to intensive phonological analysis and furthermore allow the potential detection
of more fine-grained phonetic differences across
conditions. The analyses hence provide an important first step that the linguist can then build on in
subsequent analysis. For example, it is very easy
to filter the data (e.g., examine only a subset of the
data) or to adjust the grid size. More importantly,
the approach is advantageous for an analysis of L2
data, since the learners’ language has a dynamic
character (Selinker, 1972) and it is difficult to determine intonational categories beforehand. Our
SOM approach is generalizable to all kinds of data
for which feature vectors can be derived, including
other linguistic features such as intensity, amplitude or
We learned that the visualization of F0 contours
provides the most intuitive access for an understanding of the underlying data. One reason is that
the F0 contour can be visually inspected and directly related to meta data (e.g., through colors).
Even without time-intensive manual annotation of
F0 contours, we could clearly see the differences
between L1 and L2 performance despite the different characteristics of the two experimental data
sets. We visualized and animated the SOM training phase and presented this to the researcher as
well. This may seem unnecessary, but experience
has shown that it helps users that are not experienced with ML to better understand the processes.
We applied our technique to two different
datasets. A comparison of the achieved results
shows that our approach works very well “out of
the box” with preprocessed data and also with data that required less
preprocessing effort. To handle less preprocessed data, we added
simple preprocessing methods that turned out to be sufficient in
order to reveal new insights. The system helped us
to handle unexpected outliers or noise in the data.
All the F0 contours that do not match the major
clusters of the SOM-algorithm are assigned to a
few single cells. The data in these cells could easily be removed.
We plan to make the system available for other
researchers in the future and are considering several expansions as well. For one, other machine
learning and visualization techniques could be
added for additional or further tasks. We also
could try to support the user more in detecting
interesting subspaces in the data. It is possible,
for instance, to visualize an overview of attribute heatmaps that enables the user to detect patterns
in each iteration.
In sum, this paper has presented an innovative
and promising new approach for the automatic
analysis of prosodic data. The key idea is
that prosodic data are translated into vectors that can
be processed and analyzed further by SOM techniques and presented to the user in an interactive
visual analytics system.
We gratefully acknowledge the funding for this
work, which comes from the Research Initiative
LingVisAnn and the research project “Visual Analytics of Text Data in Business Applications” of
the University of Konstanz.
