
Hindawi

Scientific Programming
Volume 2022, Article ID 6028693, 8 pages
https://doi.org/10.1155/2022/6028693

Research Article
Natural Language Processing with Improved Deep Learning
Neural Networks

YiTao Zhou
Hubei Research Center for Language and Intelligent Information Processing, Wuhan University, Wuhan 430072, China

Correspondence should be addressed to YiTao Zhou; [email protected]

Received 10 October 2021; Revised 15 December 2021; Accepted 21 December 2021; Published 7 January 2022

Academic Editor: Rahman Ali

Copyright © 2022 YiTao Zhou. This is an open access article distributed under the Creative Commons Attribution License, which
permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

As one of the core tasks in the field of natural language processing, syntactic analysis has always been a hot topic for researchers, and it underpins tasks such as Question Answering (Q&A), search string comprehension, semantic analysis, and knowledge base construction. This paper studies the application of deep learning and neural networks to natural language syntactic analysis, which has significant research and application value. The paper first studies a transition-based dependency parser that uses a feed-forward neural network as its classifier. By analyzing this model, we tune its parameters meticulously to improve its performance. The paper then proposes a dependency parsing model based on a long short-term memory neural network. This model builds on the feed-forward neural network model described above, which is used as a feature extractor. After the feature extractor is pretrained, we use a long short-term memory neural network as the classifier of the transition actions, taking the features extracted by the syntactic analyzer as its input and training the recurrent classifier with sentence-level optimization. The classifier can exploit not only the features of the current configuration but also richer information such as the history of analysis states. The model therefore models the analysis process of the entire sentence in syntactic analysis, replacing the approach of modeling each analysis step independently. The experimental results show that the model achieves a clear performance improvement over the baseline methods.

1. Introduction

The study of grammar in computational linguistics refers to the study of the specific structures and rules contained in language, such as finding the rules governing the order of words in sentences and classifying words [1]. Linear regularities in language can be expressed using methods such as language models and part-of-speech tagging. For the nonlinear information in a sentence, we can use the syntactic structure or the dependency relations between the words of the sentence. Although this analysis and representation of sentence structure may not be the ultimate goal of natural language processing, it is often an important step toward solving the problem [2], and it has important applications in tasks such as search query understanding [3], question answering (QA) [4], and semantic parsing. Therefore, as one of the key technologies in many natural language application tasks, syntactic parsing [5] has always been a hot issue in natural language processing research, and it has significant research significance and application value.

Syntactic analysis is mainly divided into two types: syntactic structure parsing and dependency parsing [2]. The main purpose of syntactic structure analysis is to obtain the parse tree of a sentence, so it is often referred to as full syntactic parsing, or simply full parsing. The main purpose of dependency syntax analysis is to obtain a tree-structured representation of the dependency relationships between the words in a sentence, which is called a dependency tree.

In the 1940s, researchers introduced the term "neural network" to describe biological information processing systems [6]. The simplest one, the feed-forward neural network, also known as the multilayer perceptron, has achieved good results in many application tasks, but due to the high computational complexity of the model, training is difficult. With the continuous improvement of computer performance, it has become possible to train large-scale
and deep neural networks. As a result, the deep learning method has made huge breakthroughs in multiple fields of machine learning. Deep learning learns intricate structural representations from large-scale data. This learning is achieved by adjusting the network parameters of the different layers of an artificial neural network with error-driven optimization algorithms through backpropagation. In recent years, deep convolution networks have made great breakthroughs in graphics and image processing, video and audio processing, and other fields. At the same time, recurrent networks have also achieved good results on sequence data such as text and speech [7].

The recurrent neural network initially achieved good results in handwriting recognition [8]. The well-known word vector algorithm Word2Vec was originally obtained from a language model learned with an RNN [9]. Due to the vanishing gradient defect of the recurrent neural network (RNN), Long Short-Term Memory (LSTM) was proposed [10]. With the recent popularity of deep learning methods, LSTM has also been applied to work such as dialogue systems [11] and language models [12]. The recently proposed neural network models with attention mechanisms [13] have attracted the attention of researchers. The attention mechanism has been successfully applied to machine translation [14] and text summarization [15] and has achieved certain results.

The main contributions of this paper are the following:

(i) We propose a feed-forward neural network in which the parameters propagate unidirectionally
(ii) We use a neural network model as a classifier and use the backpropagation algorithm as the learning algorithm
(iii) We propose a well-organized dataset to evaluate the proposed framework

The rest of the paper is structured as follows: Section 2 describes related work and critically analyzes and compares the work done so far. Section 3 presents the proposed methodology, describing the materials and methods adopted in this study. Section 4 covers the validation of the proposed methodology, the experimentation, and the discussion of the results produced. The work is finally concluded in Section 5.

2. Related Work

Concepts such as neural networks originated in the 1940s. After the 1980s, backpropagation was successfully applied to neural networks. In 1989, the backpropagation algorithm was successfully applied to the training of a convolutional neural network, and from 2006 the graphics processing unit was used for training convolution neural networks. As a result, a new upsurge of neural network research was set off. The early neural network models of the 1940s were very simple, usually had only one layer, and could not be learned. It was not until the 1960s that early neural networks were used for supervised learning, and the models became slightly more complicated, with a multilayer structure. In 1979, Fukushima [16] first proposed the concepts of convolution neural networks and deep networks. After that, pooling and other related methods were proposed one after another. In 1986, the backpropagation algorithm was proposed by Rumelhart et al. [17], which greatly promoted the development of neural network research. The second factor is the emergence of several public datasets, which made the neural network no longer a toy model. In the field of computer vision, there is the famous ImageNet [18]; in the field of natural language processing, there are the dataset published by Twitter and the Weibo data in the Chinese field.

Bengio et al. [19] proposed the use of a neural network to build a language model. The model learns a distributed representation for each word while also modeling the word sequence. This model achieved better results in experiments than the best n-gram models of the same period and can use more contextual information. Bordes et al. [20] proposed a method for learning structured embeddings using neural networks and a knowledge base; the experimental results of this method on WordNet and Freebase show that it can embed structured information. Mikolov et al. [21] proposed the continuous bag-of-words (CBOW) model, which predicts a word in a sentence using the notion of word position; the same work also proposes the skip-gram model, which uses a word at a certain position in a sentence to predict the words around it. Based on these two models, Mikolov et al. [21] open-sourced the tool word2vec to train word vectors, which has been widely used. Kim [22] introduced the convolution neural network to the sentence classification task of natural language processing. This work uses a convolution neural network with two channels to extract features from sentences and finally classifies the extracted features. The experimental results show that the convolution neural network has a significant effect on feature extraction for natural language. Similarly, Lauriola et al. [23] critically studied and analyzed the use of deep learning in Natural Language Processing (NLP) and summarized the models, techniques, and tools used so far. Fathi and Shoja [24] also discuss the application of deep neural networks to natural language processing.

Tai et al. [25] proposed a tree-structured long short-term memory neural network. Traditional recurrent neural networks are usually used to process linear sequences, and for data types with internal structure, such as natural language, this linear model may lose some information. Therefore, this model applies long short-term memory neural networks over the parse tree and has achieved good results in sentiment analysis.

In summary, the key limitations of existing deep learning-based approaches to natural language processing include the following: deep neural network models are difficult to train because they need large amounts of data and training requires powerful, expensive graphics cards; there is no uniform representation method for different forms of data, such as text and images; and ambiguity must be resolved in natural language text at the word, phrase, and sentence levels. Moreover, deep learning algorithms are not good at
inference and decision making and cannot directly handle symbols; they are data-hungry and not suitable for small data sizes; they have difficulty handling long-tail phenomena; the black-box nature of the models makes them difficult to understand; and the computational cost of the learning algorithms is high. Apart from these limitations, the strengths of deep neural networks include the following: efficiency in pattern recognition, a data-driven approach, high performance on many problems, little or no domain knowledge needed in system construction, the feasibility of cross-modal processing, and gradient-based learning.

3. Material and Method

In this section, we discuss the proposed recurrent neural network-based model.

3.1. Feed-Forward Neural Network. As the first proposed neural network structure, the feed-forward neural network is the simplest kind of neural network. Inside it, the parameters propagate unidirectionally from the input layer to the output layer, as shown in Figure 1, which is a schematic diagram of a four-layer feed-forward neural network.

Figure 1: Schematic diagram of feed-forward neural network.
3.2. Recurrent Neural Network. Recurrent neural networks have been a hot topic in neural network research in recent years. The reason why recurrent neural networks have become a research hotspot is that the feed-forward neural network, or multilayer perceptron, cannot handle data with time-series relationships well. The time-recursive structure of the recurrent neural network permits it to learn the time-series information in the data, so it can solve this kind of task well (see Figure 2).

Figure 2: Schematic diagram of recurrent neural network.

For each moment, the activation values of the hidden layers are calculated recursively as follows (t from 1 to T, n from 2 to N, where N is the number of hidden layers):

h_t^1 = \sigma(W_{ih^1} x_t + W_{h^1 h^1} h_{t-1}^1 + b_h^1),
h_t^n = \sigma(W_{ih^n} x_t + W_{h^{n-1} h^n} h_t^{n-1} + W_{h^n h^n} h_{t-1}^n + b_h^n).   (1)

Among them, W is a parameter matrix (for example, W_{ih^n} represents the connection weight matrix from the input layer to the nth hidden layer), b is a bias vector, and σ is the activation function.

Given the output sequence of the hidden layers, the output sequence of the network is computed as

\hat{y}_t = b_y + \sum_{n=1}^{N} W_{h^n y} h_t^n,
y_t = \mathcal{Y}(\hat{y}_t).   (2)

The output vector y_t is used to estimate the probability distribution Pr(x_{t+1} | y_t) of the input x_{t+1} at the next moment. The loss function L(X) of the entire network is expressed by the following formula:

L(X) = -\sum_{t=1}^{T} \log \Pr(x_{t+1} \mid y_t).   (3)

Similar to the feed-forward neural network, the partial derivatives of the loss function with respect to the network parameters can be obtained by using backpropagation through time, and the gradient descent method is used to learn the parameters of the network; the unfolded network is shown in Figure 3.

Figure 3: Schematic diagram of unfolded recurrent neural network.

Due to the advantages of recurrent neural networks on time series, in recent years many researchers in the field of natural language processing have applied recurrent neural networks to research such as machine translation, language model learning, semantic role labeling, and part-of-speech tagging and achieved good results.
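The recurrence and loss in equations (1)–(3) can be sketched roughly as follows for a single hidden layer; the softmax output, the one-hot next-word targets, and all dimensions are assumptions made for illustration, and in practice the gradients would be obtained with backpropagation through time in an automatic-differentiation framework.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_ih, W_hh, b_h):
    # Equation (1) for a single hidden layer: the new state depends on the
    # current input and the previous hidden state.
    return np.tanh(W_ih @ x_t + W_hh @ h_prev + b_h)

def rnn_language_model_loss(xs, W_ih, W_hh, b_h, W_hy, b_y):
    """Run the recurrence over a sequence and accumulate the loss of
    equation (3): minus the log-probability assigned to each next input."""
    h = np.zeros(b_h.shape)
    loss = 0.0
    for x_t, x_next in zip(xs[:-1], xs[1:]):
        h = rnn_step(x_t, h, W_ih, W_hh, b_h)
        scores = W_hy @ h + b_y                   # equation (2), pre-softmax scores
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()                      # softmax output layer (assumed)
        loss -= np.log(probs[np.argmax(x_next)])  # assumes one-hot next-word targets
    return loss
```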
3.3. Realization of the Learning Algorithm and Classification Model. As an essential part of the syntactic analyzer, the role of the classification model is to predict the analysis action. The role of the learning algorithm is to learn the parameters of the model from training data. In this model, we use a neural network model as the classifier and naturally use the backpropagation algorithm as the learning algorithm. In this section, the precise implementation of the classification model is introduced, and some details of the model learning are described afterwards.

The role of the embedding layer of the network is to convert the sparse representation of the features into a dense representation. The embedding layer is divided into three parts: a word embedding layer, a part-of-speech embedding layer, and a dependency arc embedding layer. The three embedding layers obtain their input from the three corresponding kinds of features in the input layer. It is worth noting that, compared with the size of the dictionary, the value sets of parts of speech and dependency arcs are relatively small, so the dimensions of the part-of-speech and arc embeddings are smaller than the dimension of the word embedding. Specifically, a word feature in the analysis configuration c is mapped to a d_w-dimensional vector e_w \in R^{d_w}, and the embedding matrix is E_w \in R^{d_w \times N_w}, where N_w is the dictionary size. Similarly, part-of-speech features and dependency arc features are mapped to e_p \in R^{d_p} and e_l \in R^{d_l}. After the conversion is completed, the layer outputs 48 dense features, each of which is a real vector.

The hidden layer of the model connects the 48 output features x_h of the embedding layer end-to-end to form a feature vector and performs a linear and a nonlinear transformation on it. Specifically, the nonlinear transformation is a cubic activation function:

h = (W_1 x_h + b_1)^3.   (4)

Among them, W_1 \in R^{d_h \times d_{x_h}} is the parameter matrix of the hidden layer, d_{x_h} = 18 d_w + 18 d_p + 12 d_l, and b_1 is the bias vector.

The last layer of the network is the softmax layer, whose role is to predict the probability distribution over analysis actions:

o = W_2 h + b_2,
p_a = \exp(o_a) / \sum_{a' \in \tau} \exp(o_{a'}).   (5)

Among them, W_2 is the parameter matrix of the softmax layer, b_2 is the bias vector, and τ is the set of all actions in the dependency syntax analysis system.

After obtaining the probability distribution over analysis actions predicted by the model, the loss function of the network can be calculated. As in a general multiclass classification problem, we use the cross-entropy loss function:

C = -\frac{1}{n} \sum_i [y_i \log p_i + (1 - y_i) \log(1 - p_i)].   (6)

In fact, the classification task is to select one correct action from multiple analysis actions, so the loss function is simplified as follows:

L(\Theta) = -\sum_{i \in A} \log p_i + \frac{\lambda}{2} \|\Theta\|^2,   (7)

where A is the set of correct analysis actions in the batch, λ is the regularization parameter, and Θ denotes the model parameters.

The classifier in the dependency syntax analyzer is a neural network classifier, and its learning algorithm is the same as the general neural network learning algorithm, namely backpropagation. Using the backpropagation algorithm, the gradients of the loss function with respect to the parameters can be obtained, and then the gradient descent method is used to update the parameters of the model.
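A rough sketch of the classifier's forward pass described in this section, combining the three embedding lookups with the cubic activation of equation (4) and the softmax of equation (5): the feature counts (18 word, 18 part-of-speech, and 12 arc features) follow the text, while the vocabulary sizes, embedding and hidden dimensions, number of actions, and the randomly chosen feature indices are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not the paper's exact settings).
N_w, N_p, N_l = 10000, 50, 40          # vocabulary, POS tag, and arc label counts
d_w, d_p, d_l = 50, 20, 20             # embedding dimensions (word > POS/arc)
d_h, n_actions = 200, 3                # hidden size and number of transition actions

E_w = rng.normal(scale=0.01, size=(N_w, d_w))   # word embedding matrix
E_p = rng.normal(scale=0.01, size=(N_p, d_p))   # part-of-speech embedding matrix
E_l = rng.normal(scale=0.01, size=(N_l, d_l))   # dependency-arc embedding matrix

d_x = 18 * d_w + 18 * d_p + 12 * d_l            # as in the text: 48 features in total
W1 = rng.normal(scale=0.01, size=(d_h, d_x))
b1 = np.zeros(d_h)
W2 = rng.normal(scale=0.01, size=(n_actions, d_h))
b2 = np.zeros(n_actions)

def action_distribution(word_ids, pos_ids, arc_ids):
    """Predict a distribution over transition actions for one configuration."""
    x_h = np.concatenate([E_w[word_ids].ravel(),     # 18 word features
                          E_p[pos_ids].ravel(),      # 18 part-of-speech features
                          E_l[arc_ids].ravel()])     # 12 dependency-arc features
    h = (W1 @ x_h + b1) ** 3                         # cubic activation, equation (4)
    o = W2 @ h + b2
    o -= o.max()                                     # numerical stability
    return np.exp(o) / np.exp(o).sum()               # softmax, equation (5)

# Example with arbitrary feature indices for one analysis configuration.
print(action_distribution(rng.integers(N_w, size=18),
                          rng.integers(N_p, size=18),
                          rng.integers(N_l, size=12)))
```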
4. Experiments and Discussion

In this section, we discuss the dataset and the experimental setup and evaluate the framework.

4.1. Long Short-Term Memory Neural Network. The recurrent neural network is used to translate an input sequence into an output sequence, as in sequence labeling or sequence prediction problems. However, many practical tasks expose the difficulty of training recurrent neural networks, because the sequences in these problems often span a long time interval. As Bengio et al. observed, since the gradient of the recurrent neural network ultimately "vanishes," it is difficult for a recurrent neural network to learn long-distance memory, as shown in Figure 4.

To solve this problem, Hochreiter and Schmidhuber [10] proposed Long Short-Term Memory (LSTM). In this model, the concept of a "gate" is added so that the network can choose when to "forget" and when to add new "memory."

As a variant of the recurrent neural network, the long short-term memory neural network is designed to solve the vanishing gradient problem of ordinary recurrent neural networks. An ordinary recurrent neural network reads an input vector x_t from a vector sequence (x_1, x_2, ..., x_n) and calculates a new hidden layer state h_t. However, the vanishing gradient problem means that an ordinary recurrent neural network cannot model long-distance dependencies. Long short-term memory neural networks introduce a "memory cell" and three "control gates," which control when to "memorize" and when to "forget."
simplified as follows: “Forget.”
Figure 4: Ordinary recurrent neural networks cannot handle long-distance dependencies.

Specifically, the long short-term memory neural network uses an input gate, a forget gate, and an output gate. Among them, the input gate determines the proportion of the current input that can enter the memory cell, and the forget gate controls the proportion of the current memory that should be forgotten.

For example, at time t, given the input x_t, the long short-term memory neural network is updated in the following way. First, the value of the input gate i_t, the forget gate f_t, and the candidate memory \tilde{C}_t are calculated according to the following formula:

i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i),
\tilde{C}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c),   (8)
f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f),

where σ is the component-wise logistic function and ⊙ is the component-wise product. At the same time, the value of the new memory cell and the output value are given as follows:

C_t = i_t \odot \tilde{C}_t + f_t \odot C_{t-1},
o_t = \sigma(W_o x_t + U_o h_{t-1} + V_o C_t + b_o),   (9)
h_t = o_t \odot \tanh(C_t).
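A minimal sketch of the single-step update in equations (8) and (9); the parameter container and the dimensions are assumptions, and in a full parser the gates would be trained jointly with the rest of the model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, p):
    """One LSTM update following equations (8) and (9).

    p is a dict of parameter matrices/vectors: W_*, U_* (and V_o), b_*.
    """
    i_t = sigmoid(p["W_i"] @ x_t + p["U_i"] @ h_prev + p["b_i"])         # input gate
    C_tilde = np.tanh(p["W_c"] @ x_t + p["U_c"] @ h_prev + p["b_c"])     # candidate memory
    f_t = sigmoid(p["W_f"] @ x_t + p["U_f"] @ h_prev + p["b_f"])         # forget gate
    C_t = i_t * C_tilde + f_t * C_prev                                   # new memory cell
    o_t = sigmoid(p["W_o"] @ x_t + p["U_o"] @ h_prev
                  + p["V_o"] @ C_t + p["b_o"])                           # output gate
    h_t = o_t * np.tanh(C_t)                                             # new hidden state
    return h_t, C_t
```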
4.2. Experimental Data. Since batch training is required and the analysis sequences of sentences of different lengths are not the same length, we adopt a mask method for training. Even so, because some sentences are too long, the other sentences in a batch may already have been processed and be left waiting for the long sentence to finish. Therefore, to train the model more quickly, we removed sentences with more than 70 words from the training process. There are 76 such sentences in total, accounting for 0.2% of the number of sentences in the training dataset, and we believe that this does not affect the quality of the final model. After removing this part of the training and validation data, the data actually used are shown in Table 1.

Table 1: Statistics of the data used in this article.

Data set          A       B       C       D    E (%)
Training set      33288   33251   99.89   76   99.8
Development set   1850    1848    99.89   1    99.9
Test set          1850    1848    99.89   —    100

A is the total number of sentences; B is the number of projective sentences; C is the percentage of projective sentences; D is the number of sentences with more than 70 words; E is the percentage of the projective sentences that were actually used.
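As a concrete illustration of the preprocessing described in Section 4.2, the sketch below removes over-length sentences and pads a batch with a 0/1 mask; the 70-word threshold follows the text, while the padding value and the data layout are assumptions.

```python
MAX_LEN = 70  # sentences longer than this are removed from training, as described above

def filter_long_sentences(sentences):
    """Keep only sentences with at most MAX_LEN words."""
    return [s for s in sentences if len(s) <= MAX_LEN]

def pad_batch(batch, pad_id=0):
    """Pad variable-length id sequences to a common length and return a
    0/1 mask that marks the real (non-padding) positions."""
    width = max(len(seq) for seq in batch)
    padded = [seq + [pad_id] * (width - len(seq)) for seq in batch]
    mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in batch]
    return padded, mask

# Example: a batch of three id sequences of different lengths.
padded, mask = pad_batch([[5, 2, 9], [7, 1], [3, 4, 6, 8]])
print(padded)
print(mask)
```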
4.3. Evaluation Index. Phrase structure analysis is usually evaluated with accuracy, recall, and the F1 value:

(1) Accuracy. The accuracy rate in phrase structure analysis refers to the percentage of correct phrases in the analysis result relative to the total number of phrases in the analysis result:

P = (number of correct phrases in the analysis result) / (total number of phrases in the analysis result).   (10)

(2) Recall Rate. The recall rate in phrase structure analysis refers to the percentage of correct phrases in the analysis result relative to the total number of phrases in the test set:

R = (number of correct phrases in the analysis result) / (total number of phrases in the test set).   (11)

(3) F1 value.

F1 = (2 × P × R) / (P + R).   (12)
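The three measures in equations (10)–(12) can be computed directly from phrase counts, as in the sketch below; representing phrases as (label, start, end) spans is an assumption about the data structure, not something specified in the text, and the paper's "accuracy" corresponds to the precision P here.

```python
def prf1(predicted_phrases, gold_phrases):
    """Accuracy/precision, recall, and F1 over phrase sets, equations (10)-(12)."""
    predicted, gold = set(predicted_phrases), set(gold_phrases)
    correct = len(predicted & gold)                    # phrases that match the test set
    p = correct / len(predicted) if predicted else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Example with (label, start, end) phrase spans.
pred = [("NP", 0, 2), ("VP", 2, 5), ("PP", 3, 5)]
gold = [("NP", 0, 2), ("VP", 2, 5), ("NP", 3, 5)]
print(prf1(pred, gold))
```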
4.4. Experimental Results and Analysis. In addition to the comparison with the baseline method, this work is also compared with two other classic dependency parsers: Malt Parser and MST Parser. For Malt Parser, we used the stackproj and nivreeager options for training, which correspond to the arc-standard and arc-eager analysis algorithms, respectively. For MST Parser, we report the results given in Chen and Manning (2014). The test results are shown in Table 2.

Table 2: Test results on WSJ.

Analyzer                   Dev UAS   Dev LAS   Test UAS   Test LAS
Malt: standard             90.5      88.9      89.2       87.4
Malt: eager                90.2      88.8      89.4       87.5
MST parser                 91.4      88.1      90.7       87.6
Baseline method            91.2      89.9      90.1       88.3
Greedy feature extractor   91.4      89.8      90.2       88.5
This model                 91.9      90.5      90.7       89.0

It can be seen from the table that the dependency syntax analyzer based on the long short-term memory neural network achieves a clear benefit from modeling the analysis sequence of the whole sentence. The model achieves 91.9% UAS and 90.5% LAS on the development set of the Penn Treebank, which is about a 0.7% improvement over the greedy neural network dependency parser of the baseline method. On the test set, our model achieves a UAS of 90.7% and an LAS of 89.0%, which is about a 0.6% improvement over the greedy neural network dependency parser of the baseline method. Compared with Malt Parser, the most representative transition-based dependency parser, our method obtains an improvement of about 1.4%; compared with the well-known graph-based MST Parser, our model obtains a 0.5% improvement on the development set, a comparable UAS on the test set, and a 1.4% improvement in LAS.

The experimental results show that, compared with the greedy feed-forward neural network, the dependency syntax analysis model based on the long short-term memory neural network performs better. Different from the greedy model, this model uses the long short-term memory neural network to model the entire sentence and can use historical analysis information and historical configuration information to help classify the analysis actions, thereby improving the performance of the dependency syntax analyzer.

The results of testing on the Penn Treebank are shown in Table 3. In the testing process, this article uses the beam search technique, with a beam size of 12.

Table 3: Test results on WSJ23.

Model              Accuracy   Recall rate   F1 value   Effective output   Word count mismatch   Output structure error
Single attention   0.610      0.606         0.608      934                607                   99
Dual attention     0.827      0.826         0.827      1347               274                   19
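Beam search with a fixed beam size, as used in the tests above, can be sketched roughly as follows; next_token_scores is a hypothetical stand-in for the model's per-step output distribution, and the start and end markers are assumptions.

```python
def beam_search(next_token_scores, start, end, beam_size=12, max_len=100):
    """Keep the beam_size best partial output sequences at each step.

    next_token_scores(seq) is assumed to return a list of (token, log_prob)
    pairs for the next position given the partial output sequence seq.
    """
    beam = [([start], 0.0)]            # (sequence, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beam:
            for token, logp in next_token_scores(seq):
                candidates.append((seq + [token], score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = []
        for seq, score in candidates[:beam_size]:
            (finished if seq[-1] == end else beam).append((seq, score))
        if not beam:                   # every surviving hypothesis has ended
            break
    return max(finished + beam, key=lambda c: c[1])[0]
```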

It can be seen from the data in the table that the dual attention mechanism can effectively reduce the number of errors in the output. Among the effective outputs, the F1 value of the model reaches 0.827; its change during training is shown in Figure 5, and the way the various errors change during training is shown in Figure 6.

Figure 5: F1 value changes with the training process (horizontal axis: training steps in units of ten thousand; vertical axis: F1 value).

By linearizing the phrase structure tree of natural language, the phrase structure analysis task is transformed into a sequence-to-sequence conversion task. A simple implementation of the sequence-to-sequence model was carried out, and it was found that end-to-end analysis still needs rule restrictions on the decoder side. To this end, we propose a dual attention mechanism model, that is, a sequence-to-sequence model that introduces attention mechanisms over both the input and the output at the same time. Experiments show that after the introduction of the dual attention mechanism
model, the performance of the model on the test set is greatly improved. The symbols and abbreviations used in this paper are listed in Table 4.

Figure 6: Variation of the errors with the training process (wrong word count, tree structure errors, and total number of errors; horizontal axis: training steps in units of ten thousand; vertical axis: number of errors).

Table 4: Symbols used in the paper.

Symbol/abbreviation   Definition and meaning
RNN                   Recurrent neural network
LSTM                  Long short-term memory
MST parser            Maximum spanning tree parser
CBOW                  Continuous bag of words

5. Conclusions

Syntactic analysis is an indispensable part of tasks such as question answering systems, search string comprehension, semantic analysis, and knowledge base construction. This paper studies a neural network model for transition-based dependency syntactic analysis. This model uses a feed-forward neural network as the classifier in the dependency syntax analyzer and adjusts its parameters by analyzing the model to achieve better results. The experimental results show that, after this improvement, the performance of the model increases by 0.1 to 0.2 percentage points. We then propose a dependency syntax analysis model based on a long short-term memory neural network. This model builds on the feed-forward neural network model, which is used as a feature extractor. Specifically, the model relies on the characteristics of the long short-term memory neural network and uses it to memorize the analysis state and analysis history in the transition-based dependency parsing process, so that the model can capture and utilize more historical information. In addition, the model models the analysis process of the entire sentence in dependency syntax analysis and improves on the greedy model, which models each analysis state independently. The experimental results show that, compared with the baseline method, the model obtains an improvement of 0.6 to 0.7 percentage points. Through this work and the error analysis, the dependency syntax analysis model based on the long short-term memory neural network can be studied further, and we found that an attention mechanism can be introduced into the model.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The author declares no conflicts of interest regarding the publication of this paper.

References

[1] C. D. Manning and H. Schutze, Foundations of Statistical Natural Language Processing, MIT Press, Cambridge, MA, USA, 1999.
[2] C. Q. Zong, Statistical Natural Language Processing, Tsinghua University Press, Beijing, China, 2008.
[3] J. Liu, P. Pasupat, and Y. Wang, "Query understanding enhanced by hierarchical parsing structures," in Proceedings of the Automatic Speech Recognition and Understanding (ASRU) 2013 IEEE Workshop, pp. 72–77, IEEE, Okinawa, Japan, December 2017.
[4] K. Tymoshenko and A. Moschitti, "Assessing the impact of syntactic and semantic structures for answer passages reranking," in Proceedings of the 24th ACM International Conference on Information and Knowledge Management, Melbourne, Australia, October 2015.
[5] W. Monroe and Y. Wang, Dependency Parsing Features for Semantic Parsing, Stanford University, Stanford, CA, USA, 2014.
[6] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics), Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.
[7] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[8] A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, and J. Schmidhuber, "A novel connectionist system for unconstrained handwriting recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, pp. 855–868, 2009.
[9] T. Mikolov, W. t. Yih, and G. Zweig, "Linguistic regularities in continuous space word representations," in Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 746–751, Association for Computational Linguistics, Atlanta, GA, USA, June 2013.
[10] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[11] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Proceedings of the 27th International Conference on Neural Information Processing Systems, pp. 3104–3112, Curran Associates, Inc., Montreal, Canada, December 2014.
[12] M. Sundermeyer, R. Schlüter, and H. Ney, "LSTM neural networks for language modeling," ISCA Archive, Portland, OR, USA, 2012.
[13] V. Mnih, N. Heess, and A. Graves, "Recurrent models of visual attention," in Proceedings of the 27th International Conference on Neural Information Processing Systems, pp. 2204–2212, Curran Associates, Inc., Montreal, Canada, December 2014.
[14] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," 2014, https://arxiv.org/abs/1409.0473.
[15] A. M. Rush, S. Chopra, and J. Weston, "A neural attention model for abstractive sentence summarization," 2015, https://arxiv.org/abs/1509.00685.
[16] K. Fukushima, "Neocognitron: a self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position," Biological Cybernetics, vol. 36, no. 4, pp. 193–202, 1980.
[17] D. E. Rumelhart and J. L. McClelland, "Parallel distributed processing: explorations in the microstructure of cognition," Language, vol. 22, no. 4, pp. 98–108, 1986.
[18] O. Russakovsky, J. Deng, H. Su et al., "ImageNet large scale visual recognition challenge," International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
[19] Y. Bengio, R. Ducharme, and P. Vincent, "A neural probabilistic language model," Journal of Machine Learning Research, vol. 3, pp. 1137–1155, 2003.
[20] A. Bordes, J. Weston, and R. Collobert, "Learning structured embeddings of knowledge bases," in Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 2011.
[21] T. Mikolov, K. Chen, and G. Corrado, "Efficient estimation of word representations in vector space," 2013, https://arxiv.org/abs/1301.3781.
[22] Y. Kim, "Convolutional neural networks for sentence classification," 2014, https://arxiv.org/abs/1408.5882.
[23] I. Lauriola, A. Lavelli, and F. Aiolli, "An introduction to deep learning in natural language processing: models, techniques, and tools," Neurocomputing, vol. 470, 2021.
[24] E. Fathi and B. M. Shoja, "Deep neural networks for natural language processing," Handbook of Statistics, vol. 38, pp. 229–316, 2018.
[25] K. S. Tai, R. Socher, and C. D. Manning, "Improved semantic representations from tree-structured long short-term memory networks," 2015, https://arxiv.org/abs/1503.00075.
