
Modelling the mental lexicon using a cortical-like Recurrent Neural Network


Jaime A. Riascos Salas
October 12, 2018

Abstract
Since the introduction of the term mental lexicon, several efforts have been made to describe, develop and understand this system responsible for the mental representations of words. Such efforts have followed several approaches, from theoretical and empirical methods up to computational models. Recently, Recurrent Neural Networks (RNNs) have received particular attention due to their capacity in several applications, including language processing; likewise, realistic models and networks based on neurophysiological data have contributed to the understanding of the brain's actions and behaviors during language processing. This work aims to review and present some considerations on mental lexicon modelling, as well as the current approaches based on RNNs, their alternatives and a cortical-like RNN model.

1 Mental Lexicon
In this section, several considerations related to modelling the mental lexicon are presented, following the work done by Weronika Szubko-Sitarek [1], where a full review and discussion can be found.
Since the first discussions regarding the mental lexicon, psycholinguists have questioned many aspects of the nature of this system, which is responsible for the mental representation of words and their meanings [2]. One of the main discussions has been the representation of lexical entries, namely, their internal structure. Levelt [3] proposed that such items should be separated into two components: a semantic component (the lemma) and a formal component (the lexeme). The first is related to the word's meaning and its syntax; the second includes its morphology, phonology and orthography. The community has widely accepted this theory, but there are still discussions around semantic representations and how meaning is represented [4]. Likewise, issues remain regarding the number of lexicons and the modality of input and output (listening and reading).
On the other hand, discussions about how lexical entries should be stored were also raised. Two main hypotheses were built to face this problem: the Full Listing Hypothesis [5] and the Decompositional Hypothesis [3]. In the first case, words are seen as whole and independent lexical entities, as in a written dictionary; there is no relation between a word and its variations (go and goes are stored separately). Unlike this approach, the Decompositional Hypothesis establishes that words are stored as bundles of morphemes (the smallest meaningful units of language). This theory has received experimental support from priming tasks, lexical decision tasks, speech error analysis and experiments with brain-damaged subjects.
Semantic representation has also been discussed in the literature. In order to explain how conceptual features are stored and retrieved from memory, several models have been postulated:

1. Hierarchical Network model [6]: This model establishes a network relationship between word meanings, creating a categorization where more general words are placed higher in the network and, consequently, more specific words are placed lower in the hierarchy. This model was criticized due to its inconsistencies with the typicality effect, since all words from the same level are considered equally similar. Likewise, the outcomes of verification tasks contradicted the model's predictions.
2. The Spreading Activation model [7]: Unlike the previous model, this model no longer uses a hierarchy. Instead, it uses a network of semantic relations where the accessibility of the nodes depends on the frequency of use and the word's typicality. Moreover, the distance between nodes is based on structural characteristics (taxonomic relations and contextual terms). A minimal sketch of this spreading mechanism is given after this list.
3. The Semantic Feature model [8]: This model (also known as the Componential Approach) suggests that words can be decomposed into fundamental semantic elements. Such elements are semantic features of two kinds: defining features and characteristic features. Thus, each word has defining features that incorporate the characteristic features associated only with it. The theory is in line with two important categorization theories, and this is where the main discussion of the model arises. On one side stands the classical view, coming from ancient Greece, namely the Aristotelian model of categorization (essence and accidents); on the other side stands the prototype theory derived from cognitivism, which suggests that people acquire the meaning of words by reference to a highly typical example. Several authors agree with this theory, since it has empirical evidence (for a complete discussion see [1]).
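
To make the spreading activation mechanism of model 2 above concrete, the following minimal Python sketch propagates activation from a primed word through a toy semantic network. The network, the node names and the decay parameter are illustrative assumptions only; they are not taken from [7].

    # Minimal sketch of spreading activation over a toy semantic network.
    # The network, node names and decay parameter are illustrative only.
    network = {
        "canary": {"bird": 0.9, "yellow": 0.7},
        "bird":   {"canary": 0.9, "animal": 0.8, "wings": 0.8},
        "animal": {"bird": 0.8, "dog": 0.6},
        "yellow": {"canary": 0.7},
        "wings":  {"bird": 0.8},
        "dog":    {"animal": 0.6},
    }

    def spread_activation(source, steps=2, decay=0.5):
        """Activate `source` and let activation spread along weighted links."""
        activation = {node: 0.0 for node in network}
        activation[source] = 1.0
        for _ in range(steps):
            new_activation = dict(activation)
            for node, act in activation.items():
                if act == 0.0:
                    continue
                for neighbour, weight in network[node].items():
                    # A neighbour receives a decayed share of the sender's activation.
                    new_activation[neighbour] += decay * weight * act
            activation = new_activation
        return activation

    if __name__ == "__main__":
        # Priming "canary" raises the accessibility of related nodes such as "bird".
        for node, act in sorted(spread_activation("canary").items(), key=lambda kv: -kv[1]):
            print(f"{node:7s} {act:.2f}")

In this toy run, nodes closer to the primed word end up with higher activation, which is the intuition behind accessibility depending on link strength and distance.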

Another key concept within the mental lexicon is lexical access, the mechanism responsible for searching for and recognizing a word very quickly (about 200 ms after its onset). Moreover, its complexity depends on the modality of the word (phonological or orthographic) that has to be recognized so that its meaning can be accessed. In the opposite direction, the production of language starts from the meaning of the desired concept, which has to be translated into a phonological or orthographic representation. Several models have been proposed to explain and analyze lexical access:

1. Serial search model: In this model, lexical access is proposed to occur by sequentially scanning lexical entries one at a time. In Forster's model [9], following the comparison of the lexicon to a library, access information (orthographic, phonological and semantic/syntactic) is used to localize the lexical item. Independently of the modality (visual, auditory), each item can be accessed one at a time. Once the word's location is found (based on its access information), the search for the word entry proceeds until the correct lexical entry is retrieved from the lexicon (the master lexicon in Forster's model).

2. The Logogen model [10]: Unlike the previous model, the Logogen model uses a threshold for accessing words instead of a fixed location. That means a word is activated when its detector receives enough activation (reaching the threshold) to access the lexical entry. Similarly to Forster's model, the input can come from orthographic, phonological or semantic information. Morton suggests that each entry is associated with a detector (logogen) that shares features with a targeted stimulus. Thus, when the input arrives, the number of matching features for each logogen is summed up, and if the threshold is reached, the word is activated. This model has received special attention because it describes how the nervous system responsible for lexical processing works, integrating both acoustic and visual inputs.
3. Cohort model: Inspired by the Spreading Activation model, Marslen-Wilson [11] proposes that, instead of words with similar semantics being primed, the priming is done by words with a similar sound. Comprising three stages, Marslen-Wilson's model starts with the activation of a set of words (the cohort) with a similar sound (access stage); afterwards, the lexical items in the cohort are progressively suppressed based on their inconsistency with the context (selection stage), until a single item remains (integration stage). A minimal sketch of this narrowing process is given after this list.
4. Computational models: So far, all of the models presented here have been determined by "high-level theoretical principles," that is, without caring about what is going on inside the boxes. In contrast, computational reading models can explain and simulate realistic data from the traditional experiments (lexical decision, masked priming, eye movements). Thanks to current computing power, such models can perform large-scale simulations using a huge number of words; of course, it is necessary to highlight that each model only tackles a limited range of phenomena or tasks to be simulated. Several models have been proposed depending on the task, word frequency, letter order, the word-superiority effect, RT distributions, among others. They are discussed in [1] and are visualized in Figure 1.
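
As a concrete illustration of the cohort model's selection stage (model 3 above), the sketch below incrementally narrows a candidate set as more of a spoken word arrives. The tiny lexicon and the letter-by-letter coding of the input are illustrative assumptions and are not part of [11].

    # Minimal sketch of the cohort model's selection stage: the candidate set
    # ("cohort") shrinks as more of the spoken input arrives. The tiny lexicon
    # and letter-based segmentation are illustrative only.
    lexicon = ["captain", "capture", "captive", "cap", "cat"]

    def recognize(spoken_word):
        """Incrementally narrow the cohort with each new segment of input."""
        cohort = list(lexicon)                    # access stage: all candidates active
        for i in range(1, len(spoken_word) + 1):
            prefix = spoken_word[:i]
            cohort = [w for w in cohort if w.startswith(prefix)]   # selection stage
            print(f"after '{prefix}': {cohort}")
            if len(cohort) == 1:                  # integration stage: unique item left
                return cohort[0]
        return cohort[0] if cohort else None

    if __name__ == "__main__":
        recognize("captain")

The candidate set shrinks at each segment until a single item remains, mirroring the access, selection and integration stages described above.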

Finally, in order to understand some aspects of the mental lexicon related to linguistic storage and how neural interactions are organized during language processing, two models are described as answers to this discussion:

1. Modularity Theory: As its name suggests, Modularity Theory [12] assumes that mental processes are divided into separate components or modules. Indeed, the language faculty is seen as a fully autonomous module that incorporates many components communicating with other cognitive structures. Fodor postulated nine characteristic features associated with the language system. Two of the most characteristic are domain specificity, which establishes that each module is only capable of processing specific linguistic information, and informational encapsulation, which treats each intramodular process as a separate operating system that does not affect any non-linguistic cognitive processes. The latter is one of the most controversial topics within Fodor's theory, because there is substantial evidence supporting the opposite of this postulate.
2. Connectionism: This theory (also known as interactive network models) adopts a "brain metaphor." That means that, based on neurophysiological activity in the brain, the mental lexicon is seen as a network of interconnected units with various degrees of activation and connection. It is worth highlighting that this paradigm seeks to explain language processing using models of information processing that employ the strength of connections between nodes rather than rule-based connections. One of the best-known implementations based on this theory is the model for lexical entry recognition by Seidenberg and McClelland [13]. There are two main connectionist approaches, localist and distributed, whose difference primarily lies in the representation of words: localist models assume a one-to-one correspondence between lexical units and their mental representations, whereas distributed models use a set of weights on connections between processing units and semantic properties. Likewise, it is necessary to point out that these models fit neither the traditional view of the mental lexicon (as described before) nor the notion of lexical access; here, the information is simultaneously and independently processed on different levels. Hence, the models presented in the next section follow the connectionist theory.

Summarizing, Figure 1 shows a conceptual diagram where each of the items discussed above is presented. For a full discussion and review of each of them, see [1].

Figure 1: Mental lexicon considerations. Source: the author.

2 Recurrent Neural Networks


In this section, Recurrent Neural Networks are presented, from their first model and later modifications to their applications in language modelling. The introduction was extracted from a previous work published by the author [14].

2.1 Introduction
Recurrent Neural Networks (RNNs) are obtained from feed-forward networks by connecting the neurons' outputs to their inputs [15]. Their name comes from the fact that they perform the same task for every element of a sequence; thus, the output of the network depends on the previous computations. The short-term time dependency is modelled by the hidden-to-hidden connections without using any time-delay taps. RNNs are usually trained iteratively via a procedure known as back-propagation through time (BPTT). Figure 2 shows an RNN being unfolded into a full network [16]. In other words, the current state (s) expands out over time into a sequence of an n-layer neural network. Thus, for each input x at time step t, there is a state s and an output o for the same time step. The fact that RNNs share the parameters (U, V, W) at each layer reflects that each step carries out the same task with different inputs.
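
To make this unfolding concrete, the following minimal NumPy sketch steps a vanilla RNN through a sequence while reusing the shared parameters U, W and V at every step. The layer sizes, the tanh nonlinearity and the random inputs are illustrative assumptions only.

    import numpy as np

    # Minimal sketch of the vanilla RNN recurrence: the same parameters (U, V, W)
    # are reused at every time step, and the state s_t carries information from
    # previous inputs. Sizes are arbitrary.
    rng = np.random.default_rng(0)
    n_in, n_hidden, n_out = 4, 8, 3
    U = rng.standard_normal((n_hidden, n_in)) * 0.1       # input-to-hidden
    W = rng.standard_normal((n_hidden, n_hidden)) * 0.1   # hidden-to-hidden (recurrence)
    V = rng.standard_normal((n_out, n_hidden)) * 0.1      # hidden-to-output

    def rnn_forward(inputs):
        """Unfold the network over the input sequence, returning all outputs."""
        s = np.zeros(n_hidden)                 # initial state
        outputs = []
        for x in inputs:                       # one step per element of the sequence
            s = np.tanh(U @ x + W @ s)         # s_t = tanh(U x_t + W s_{t-1})
            o = V @ s                          # o_t = V s_t (e.g. pre-softmax scores)
            outputs.append(o)
        return outputs

    if __name__ == "__main__":
        sequence = [rng.standard_normal(n_in) for _ in range(5)]
        print(len(rnn_forward(sequence)))      # one output per time step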
Because RNNs share parameters at each layer when unfolded in time, training them with BPTT suffers from the problem of vanishing gradients. Long short-term memory (LSTM) [17] was proposed to resolve this problem. An LSTM memory cell is composed of four main elements: an input gate and an output gate that control the impact of the input and output values on the state of the memory cell; a self-recurrent connection that controls the evolution of the state of the memory cell; and a forget gate that determines how much of the prior memory value should be passed into the next time step. Depending on the states of these gates, an LSTM can represent long-term or short-term dependencies of sequential data.
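
A minimal sketch of one LSTM step with these gates is shown below. The stacked weight layout, the absence of bias terms and the sizes are simplifications assumed for illustration, not the exact formulation of [17].

    import numpy as np

    # Minimal sketch of a single LSTM step with input, forget and output gates
    # acting on a cell state. Weight shapes are a simplification for illustration.
    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x, h_prev, c_prev, Wx, Wh):
        """One LSTM time step; Wx and Wh each stack the four gate weight matrices."""
        z = Wx @ x + Wh @ h_prev
        n = h_prev.shape[0]
        i = sigmoid(z[0*n:1*n])        # input gate: how much new input enters the cell
        f = sigmoid(z[1*n:2*n])        # forget gate: how much prior memory is kept
        o = sigmoid(z[2*n:3*n])        # output gate: how much of the cell is exposed
        g = np.tanh(z[3*n:4*n])        # candidate cell update
        c = f * c_prev + i * g         # new cell state
        h = o * np.tanh(c)             # new hidden state
        return h, c

    if __name__ == "__main__":
        rng = np.random.default_rng(1)
        n_in, n_hidden = 4, 6
        Wx = rng.standard_normal((4 * n_hidden, n_in)) * 0.1
        Wh = rng.standard_normal((4 * n_hidden, n_hidden)) * 0.1
        h = c = np.zeros(n_hidden)
        for _ in range(3):
            h, c = lstm_step(rng.standard_normal(n_in), h, c, Wx, Wh)
        print(h.shape, c.shape)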

Figure 2: RNN unfolding over time. Source: [16]

Recently, LSTM RNNs have won several international competitions and set numerous benchmark records. A stack of bidirectional LSTM RNNs broke a famous TIMIT speech (phoneme) recognition record, achieving a 17.7% test-set error rate [18]. LSTM-based systems also set benchmark records in language identification [19], medium-vocabulary speech recognition [20], and text-to-speech synthesis [21]. LSTMs are also applicable to handwriting recognition [22], voice activity detection [23], and machine translation [24].
Moreover, RNNs have been widely used for modelling various neurobiological phenomena, considering anatomical, electrophysiological and computational constraints. The computational power of RNNs comes from the fact that a neuron's activity is affected not only by the current stimulus (input) to the network but also by the current state of the network, which means that the network keeps traces of past inputs [25]. Thus, RNNs are ideally suited for computations that unfold over time, such as holding items in working memory or accumulating evidence for decision-making.
For example, Barak [25] presents RNNs as a versatile tool to explain neural phenomena under several constraints. He shows how combining trained RNNs with reverse engineering can represent an alternative framework for neuroscience modelling. Moreover, Rajan et al. [26] show RNN models in which the neural sequences underlying memory and decision-making tasks are generated by minimally structured networks. They suggest that neural sequence activation may provide a dynamic mechanism for short-term memory, arising from mostly unstructured network architectures. In the same way, Güçlü and van Gerven [27] show that RNNs are a well-suited tool for modelling the dynamics of human brain activity. In their approach, they investigated how the internal memories of RNNs can be used to predict feature-evoked response sequences, which are commonly measured using fMRI. Likewise, Sussillo et al. [28] use RNNs to generate muscle activity signals (electromyography, EMG) to explain how neural responses work in the motor cortex. They started from the hypothesis that motor cortex activity reflects a dynamical system used for generating temporal commands. Thus, RNNs are used to transform simple inputs into temporally and spatially complex patterns of muscle activity.
The previous works show how RNNs are a powerful tool for modelling several
neural dynamics in the brain. The next section will focus on the use of recurrent
networks in language modelling.

2.2 RNN in language modelling
Since the first implementation of the Simple Recurrent Network (SRN) [29] and its later modification [30], several updates and new approaches have been created. In this first model, there are three simple layers (input, hidden and output) and the task was to predict the next letter of a sequence (one letter presented at a time). Likewise, this model has a contextual representation, which keeps the activation pattern of the hidden layer generated at the previous step. Thus, the model was able to create symbol sequences based on the previous (n-1) stimuli and the context data (a minimal sketch of this architecture is given below). These pioneering studies showed that the SRN is a suitable model for artificial grammar learning (AGL) using sequential data as input. Consequently, newer approaches include some of the SRN's features; one of these is its capability of creating grammatical and semantic categories as well as relationships between them. In effect, as Elman points out [31], there is no mental lexicon in the SRN (in the sense presented before). Instead, lexical knowledge is implicit in the words' internal states (defined by the dynamics of the network).
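
The sketch below illustrates the SRN's context mechanism: the previous hidden activations are copied into a context layer and fed back as additional input when predicting the next letter. The toy alphabet, layer sizes and untrained random weights are illustrative assumptions, not Elman's original configuration.

    import numpy as np

    # Minimal sketch of Elman's simple recurrent network (SRN): a context layer
    # holds a copy of the previous hidden activations and is fed back as extra
    # input, so the network can predict the next symbol in a sequence.
    letters = "abcd"
    n_in = n_out = len(letters)
    n_hidden = 10
    rng = np.random.default_rng(2)
    W_in = rng.standard_normal((n_hidden, n_in)) * 0.1       # input -> hidden
    W_ctx = rng.standard_normal((n_hidden, n_hidden)) * 0.1  # context -> hidden
    W_out = rng.standard_normal((n_out, n_hidden)) * 0.1     # hidden -> output

    def one_hot(letter):
        v = np.zeros(n_in)
        v[letters.index(letter)] = 1.0
        return v

    def srn_predict(sequence):
        """Feed a letter sequence one symbol at a time; return next-letter scores."""
        context = np.zeros(n_hidden)            # context layer starts empty
        predictions = []
        for letter in sequence:
            hidden = np.tanh(W_in @ one_hot(letter) + W_ctx @ context)
            predictions.append(W_out @ hidden)  # scores over the next possible letter
            context = hidden.copy()             # context keeps the current hidden state
        return predictions

    if __name__ == "__main__":
        print(len(srn_predict("abcabc")))       # one prediction per presented letter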
Alternative methods based on the SRN (or at least on its recurrent definition) have been proposed. Fitz [32] describes a Liquid State Machine (LSM), a recurrent neural network inspired by characteristics of information processing in the cerebellum. The model uses sparse ("liquid") and random connections of neuron-like units which turn a time-varying input signal into a spatio-temporal pattern; it is a working memory model that processes past inputs through transient states (like the SRN). The author emphasizes the robustness of this implementation, mainly against small changes in the parameters. He also used the model to make novel predictions for different conditions (more frames in the languages and a considerable distance between dependencies), demonstrating a U-shaped pattern for the first condition (shifted towards lower variability) and an improvement in performance for the three-frames, one-filler condition. A minimal reservoir-style sketch of this idea follows below.
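
The sketch below shows the general reservoir principle behind liquid-state approaches: a fixed, sparse, random recurrent layer turns an input stream into transient states, and only a simple linear readout is trained on those states. It uses rate units rather than spiking neurons, and all sizes and targets are placeholders, so it illustrates the principle rather than the model in [32].

    import numpy as np

    # Minimal reservoir sketch: a fixed, sparse random recurrent layer maps an
    # input stream into transient states; only a linear readout is trained.
    rng = np.random.default_rng(5)
    n_in, n_res = 3, 50
    W_in = rng.standard_normal((n_res, n_in)) * 0.5
    W_res = rng.standard_normal((n_res, n_res)) * (rng.random((n_res, n_res)) < 0.1)
    W_res *= 0.9 / max(abs(np.linalg.eigvals(W_res)))    # keep the dynamics stable

    def collect_states(inputs):
        """Run the fixed reservoir over the input sequence and record its states."""
        state = np.zeros(n_res)
        states = []
        for u in inputs:
            state = np.tanh(W_in @ u + W_res @ state)    # transient, input-driven state
            states.append(state.copy())
        return np.array(states)

    if __name__ == "__main__":
        inputs = rng.standard_normal((100, n_in))
        targets = rng.standard_normal(100)               # placeholder supervised targets
        X = collect_states(inputs)
        # Train only the readout, e.g. by least squares on the collected states.
        readout, *_ = np.linalg.lstsq(X, targets, rcond=None)
        print(readout.shape)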
Furthermore, the same author developed a dual-path model [33] which uses an SRN as a sequencing pathway in order to learn syntactic representations from two category layers (CCOMPRESS and COMPRESS); instead of producing a single prediction, this model computes an error (the difference between the actual next word and the model's prediction) to adjust the weights in the learning algorithm and thus make more accurate predictions. Meanwhile, the meaning pathway encodes the target message in fast-changing links between the Role and Concept layers; the Role layer is composed of four thematic role variables (Action, Agent, Theme, Recipient). The two pathways converge at the hidden layer and again at the output layer, and thus the model can learn which word is associated with a specific concept. This model presents several advantages, for instance the encoding of complex utterances, a realistic input distribution and high sentence accuracy.
Despite their versatility and wide use, RNNs still have learning issues due to the limitations of back-propagation (as explained in the previous section). As an alternative, Self-Organizing Feature Maps (SOFMs) were proposed for language modelling. The DevLex-II word-learning model introduced by Li et al. [34] uses three local maps to process and organize linguistic information: auditory, semantic and articulatory maps. The first takes the phonological information; the semantic map is responsible for organizing the meaning representations; and, finally, the articulatory map integrates the output phonemic sequences of words. These maps are interconnected with associative links trained by Hebbian learning, which allows the network to usefully model comprehension and production processes (a minimal sketch of this map-plus-Hebbian-links idea is given below). Ferro et al. [35] developed an improved model called Temporal Hebbian Self-Organizing Maps (THSOMs). This approach keeps the serial temporal information of the network using a predictive activation chain which encodes both spatial and temporal information of the input. The model tackles some issues related to lexical organization and morphological processing. Likewise, this model has revealed dynamics linking short-term memory (activation), long-term memory (learning) and the morphological organization of stored word forms (topology).
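
The following sketch illustrates the general principle behind map-based models such as DevLex-II: two self-organizing maps (here a "phonological" and a "semantic" map) are updated toward their inputs, and Hebbian links are strengthened between the units that win for the same word. It is an illustration of the principle only, not the published DevLex-II implementation; all sizes, rates and the winner-only update are assumptions made for brevity.

    import numpy as np

    # Minimal sketch of two self-organizing maps linked by Hebbian associations.
    # Sizes, learning rates and the winner-only update are illustrative only.
    rng = np.random.default_rng(3)
    map_size, phon_dim, sem_dim = 16, 12, 20
    phon_map = rng.random((map_size, phon_dim))
    sem_map = rng.random((map_size, sem_dim))
    assoc = np.zeros((map_size, map_size))      # Hebbian links between the two maps

    def best_unit(som, x):
        """Index of the map unit whose weight vector is closest to the input."""
        return int(np.argmin(np.linalg.norm(som - x, axis=1)))

    def train_word(phon_vec, sem_vec, lr=0.2, hebb=0.1):
        # Update each map toward its input (winner-only update for brevity).
        p = best_unit(phon_map, phon_vec)
        s = best_unit(sem_map, sem_vec)
        phon_map[p] += lr * (phon_vec - phon_map[p])
        sem_map[s] += lr * (sem_vec - sem_map[s])
        # Hebbian learning: strengthen the link between co-active winners, which
        # later supports comprehension (form -> meaning) and production.
        assoc[p, s] += hebb

    if __name__ == "__main__":
        for _ in range(100):                    # pseudo-words with random form/meaning
            train_word(rng.random(phon_dim), rng.random(sem_dim))
        print(assoc.max())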
Nevertheless, none of the previous models has network dynamics consistent with a range of neurophysiological findings in learning, language and memory tasks. Thus, the next section presents a hybrid, cortical-like network which combines features of both previous approaches (RNN and SOM), called the Self-Organizing Recurrent Neural Network (SORN).

2.3 Self-Organizing Recurrent Neural Network (SORN)


The Self-Organizing Recurrent Network (SORN) model [36] consists of a pool of excitatory and inhibitory units with binary threshold neurons and a readout layer. The novelty of this network is that it integrates three plasticity mechanisms in its learning step: Spike-Timing-Dependent Plasticity (STDP), which modulates the connection strength among excitatory neurons; Synaptic Normalization (SN), which acts as a homeostatic mechanism that proportionally adjusts the strength of the incoming connections to a neuron; and Intrinsic Plasticity (IP), which regulates the firing rate of excitatory neurons by updating their thresholds. The input is a sequence of binary vectors u(t) connected to a random subset of the excitatory units in the recurrent layer. Meanwhile, the output consists of a readout layer that is trained with supervised learning methods, namely linear regression. Training in this model consists of two phases: first, the input is processed with plasticity enabled; then plasticity is disabled, and the network's firing patterns are used to train the readout layer. Figure 3 shows the architecture of the SORN model, with its neural units (excitatory and inhibitory) and its input and output layers. A minimal sketch of one plastic update step is given after the figure.

Figure 3: SORN’s architecture. Source: [36]
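
The sketch below performs one SORN-style update followed by the three plasticity rules described above (STDP, synaptic normalization and intrinsic plasticity). Network sizes, rates, the input coding and the all-to-input wiring are illustrative simplifications with respect to [36], and the readout training of the second phase is only indicated in a comment.

    import numpy as np

    # Minimal sketch of a SORN-style update with STDP, synaptic normalization and
    # intrinsic plasticity. Sizes, rates and input coding are illustrative only.
    rng = np.random.default_rng(4)
    nE, nI = 40, 8                               # excitatory / inhibitory units
    W_ee = rng.random((nE, nE)) * (rng.random((nE, nE)) < 0.1)   # sparse E->E
    W_ei = rng.random((nE, nI)) * 0.5            # I -> E (inhibition)
    W_ie = rng.random((nI, nE)) * 0.5            # E -> I
    T_e = rng.random(nE)                         # excitatory thresholds
    x = np.zeros(nE)                             # binary excitatory state
    y = np.zeros(nI)                             # binary inhibitory state

    def step(u, eta_stdp=0.001, eta_ip=0.001, target_rate=0.1):
        """One network update followed by the three plasticity rules."""
        global x, y, W_ee, T_e
        x_prev = x.copy()
        # Binary threshold dynamics: excitation minus inhibition plus external input.
        x = ((W_ee @ x_prev - W_ei @ y + u - T_e) > 0).astype(float)
        y = ((W_ie @ x_prev - 0.5) > 0).astype(float)
        # STDP: strengthen j->i when j fired before i, weaken the reverse order.
        W_ee += eta_stdp * (np.outer(x, x_prev) - np.outer(x_prev, x))
        W_ee = np.clip(W_ee, 0.0, None)
        # Synaptic normalization: incoming E->E weights of each neuron sum to one.
        sums = np.maximum(W_ee.sum(axis=1, keepdims=True), 1e-12)
        W_ee = W_ee / sums
        # Intrinsic plasticity: move thresholds so firing approaches the target rate.
        T_e += eta_ip * (x - target_rate)
        return x

    if __name__ == "__main__":
        # Phase 1: drive the plastic network with random binary input vectors.
        states = [step((rng.random(nE) < 0.05).astype(float)).copy() for _ in range(200)]
        # Phase 2 (not shown): freeze plasticity, fit a linear readout on `states`.
        print(np.mean(states))                   # average firing rate after plasticity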

SORN is prominent due to its simplicity and the biological plausibility of its recurrent plastic layer. Indeed, the authors report a log-normal distribution of the network's synaptic weights, matching experimental findings [37]. Likewise, with the three plasticity mechanisms it is possible to create an adequate representation of the input, allowing the network to outperform randomly initialized non-plastic networks (like the SRN). Additionally, fluctuation patterns of the connection strengths were consistent with those found in the dynamics of dendritic spines in rat hippocampus. Thus, SORN offers the possibility of studying the plasticity mechanisms present in the brain with simple and manageable networks [37]. SORN has been successfully applied to prediction tasks [36], recall and non-linear computation [38], and artificial grammar learning [39].
Undoubtedly, SORN is a feasible and comprehensive tool for assessing and modelling some or all aspects of the mental lexicon. Using a network that exhibits neurophysiologically plausible behavior can be useful for understanding language processing phenomena.

References
[1] Weronika Szubko-Sitarek. Modelling the Lexicon: Some General Consider-
ations, pages 33–66. Springer Berlin Heidelberg, Berlin, Heidelberg, 2015.
[2] John Field. Psycholinguistics: The Key Concepts. Routledge, London, United Kingdom, 2004.
[3] W.J.M. Levelt. Lexical Access in Speech Production. Blackwell, Oxford, United Kingdom, 1993.
[4] Manfred Bierwisch and Robert Schreuder. From concepts to lexical items.
Cognition, 42(1):23 – 60, 1992.
[5] B Butterworth. Language Production Volume 2: Development, Writing
and Other Language Processes. Academic Press, United Kingdom, London,
1983.
[6] Allan M. Collins and M. Ross Quillian. Does category size affect catego-
rization time? Journal of Verbal Learning and Verbal Behavior, 9(4):432 –
438, 1970.
[7] A. M. Collins and E. F. Loftus. A spreading activation theory of semantic
processing. Psychological Review, 82:407–428, 1975.

[8] E. E. Smith, E. J. Shoben, and L. J. Rips. Structure and process in semantic memory: A featural model for semantic decisions. Psychological Review, 81:214–241, 1974.
[9] K. I. Forster. Accessing the Mental Lexicon. In New Approaches to Lan-
guage Mechanisms. A Collection of Psycholinguistic Studies, pages 257–
287. North-Holland Publishing Company, 1976.
[10] J. Morton and K. Patterson. A new attempt at an interpretation, or an attempt at a new interpretation. In Deep Dyslexia, pages 91–118. Routledge, London, 1998.
[11] William D. Marslen-Wilson. Functional parallelism in spoken word-
recognition. Cognition, 25(1):71 – 102, 1987. Special Issue Spoken Word
Recognition.
[12] J Fodor. The Modularity of Mind: an Essay on Faculty Psychology. MIT
Press, Cambridge, MA, 1983.

[13] M. S. Seidenberg and J. L. McClelland. A distributed developmental model of word recognition and naming. Psychological Review, 86(4):527, 1989.
[14] Rafael T. Gonzalez, Jaime A. Riascos, and Dante A. C. Barone. How
artificial intelligence is supporting neuroscience research: A discussion
about foundations, methods and applications. In Dante Augusto Couto
Barone, Eduardo Oliveira Teles, and Christian Puhlmann Brackmann, ed-
itors, Computational Neuroscience, pages 63–77, Cham, 2017. Springer In-
ternational Publishing.

[15] Michael Husken and Peter Stagge. Recurrent neural networks for time
series classification. Neurocomputing, 50:223 – 235, 2003.
[16] WildML: Artificial Intelligence, Deep Learning and NLP. Recurrent neural networks tutorial, part 1 – introduction to RNNs.
[17] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neu-
ral Comput., 9(8):1735–1780, November 1997.
[18] Alex Graves, Abdel-rahman Mohamed, and Geoffrey E. Hinton. Speech
recognition with deep recurrent neural networks. CoRR, abs/1303.5778,
2013.
[19] J. Gonzalez-Dominguez, I. Lopez-Moreno, H. Sak, J. Gonzalez-Rodriguez, and P. J. Moreno. Automatic language identification using long short-term memory recurrent neural networks, 2014.
[20] Jürgen T. Geiger, Zixing Zhang, Felix Weninger, Björn Schuller, and Ger-
hard Rigoll. Robust speech recognition using long short-term memory re-
current neural networks for hybrid acoustic modelling, 2014.
[21] Y. Fan, Y. Qian, F. Xie, and F. K. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks, 2014.
[22] Alex Graves and Jürgen Schmidhuber. Offline handwriting recognition
with multidimensional recurrent neural networks. In D. Koller, D. Schuur-
mans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information
Processing Systems 21, pages 545–552. Curran Associates, Inc., 2009.
[23] F. Eyben, F. Weninger, S. Squartini, and B. Schuller. Real-life voice ac-
tivity detection with lstm recurrent neural networks and an application to
hollywood movies. In 2013 IEEE International Conference on Acoustics,
Speech and Signal Processing, pages 483–487, May 2013.
[24] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learn-
ing with neural networks. CoRR, abs/1409.3215, 2014.
[25] Omri Barak. Recurrent neural networks as versatile tools of neuroscience
research. Current Opinion in Neurobiology, 46:1 – 6, 2017. Computational
Neuroscience.
[26] Kanaka Rajan, Christopher D. Harvey, and David W. Tank. Recurrent
network models of sequence generation and memory. Neuron, 90(1):128 –
142, 2016.
[27] Umut Güçlü and Marcel A. J. van Gerven. Modeling the dynamics of
human brain activity with recurrent neural networks. Frontiers in Compu-
tational Neuroscience, 11:7, 2017.
[28] David Sussillo, Mark M. Churchland, Matthew T. Kaufman, and Kr-
ishna V. Shenoy. A neural network that finds a naturalistic solution for the
production of muscle activity. Nature Neuroscience, 18:1025–1033, 2015.
[29] Jeffrey L. Elman. Finding structure in time. Cognitive Science, 14(2):179
– 211, 1990.

[30] Jeffrey L. Elman. Distributed representations, simple recurrent networks,
and grammatical structure. Machine Learning, 7(2):195–225, Sep 1991.

[31] Jeffrey L. Elman. An alternative view of the mental lexicon. In Trends in


Cognitive Sciences, pages 301–306, 2004.
[32] H. Fitz. A liquid-state model of variability effects in learning nonadja-
cent dependencies. In Proceedings of the 33rd Annual Conference of the
Cognitive Science Society, pages 897–902, 2011.

[33] Hartmut Fitz and Franklin Chang. Meaningful questions: The acquisi-
tion of auxiliary inversion in a connectionist model of sentence production.
Cognition, 166:225 – 250, 2017.
[34] Ping Li, Xiaowei Zhao, and Brian MacWhinney. Dynamic self-organization and early lexical development in children. Cognitive Science, 31(4):581–612.

[35] Marcello Ferro, Giovanni Pezzulo, and Vito Pirrelli. Morphology, memory
and the mental lexicon. 2012.
[36] Andreea Lazar, Gordon Pipa, and Jochen Triesch. Sorn: a self-organizing
recurrent neural network. Frontiers in Computational Neuroscience, 3:23,
2009.
[37] Witali Aswolinskiy and Gordon Pipa. Rm-sorn: a reward-modulated self-
organizing recurrent neural network. Frontiers in Computational Neuro-
science, 9:36, 2015.
[38] Hazem Toutounji and Gordon Pipa. Spatiotemporal computations of an
excitable and plastic brain: Neuronal plasticity leads to noise-robust and
noise-constructive computations. PLOS Computational Biology, 10(3):1–
20, 03 2014.
[39] Renato Carlos Farinha Duarte, Peggy Seriès, and Abigail Morrison. Self-
organized artificial grammar learning in spiking neural networks. In CogSci,
2014.
