Scientific Programming
Volume 2022, Article ID 6028693, 8 pages
https://doi.org/10.1155/2022/6028693
Research Article
Natural Language Processing with Improved Deep Learning
Neural Networks
YiTao Zhou
Hubei Research Center for Language and Intelligent Information Processing, Wuhan University, Wuhan 430072, China
Received 10 October 2021; Revised 15 December 2021; Accepted 21 December 2021; Published 7 January 2022
Copyright © 2022 YiTao Zhou. This is an open access article distributed under the Creative Commons Attribution License, which
permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
As one of the core tasks in the field of natural language processing, syntactic analysis has always been a hot topic for researchers; it underpins tasks such as question answering (Q&A), search query comprehension, semantic analysis, and knowledge base construction. This paper studies the application of deep learning and neural networks to natural language syntactic analysis, which has significant research and application value. The paper first studies a transition-based dependency parser that uses a feed-forward neural network as its classifier. By analyzing this model, we carefully tuned its parameters to improve its performance. The paper then proposes a dependency syntactic analysis model based on a long short-term memory (LSTM) neural network. This model builds on the feed-forward neural network described above, which serves as a feature extractor. After the feature extractor is pretrained, we use an LSTM network as the classifier of transition actions and feed it the features extracted by the parser, training a recurrent classifier that is optimized over whole sentences. The classifier can exploit not only the features of the current configuration but also rich information such as the history of analysis states. Therefore, the model captures the analysis process of the entire sentence during syntactic analysis, replacing the approach of modeling each analysis state independently. The experimental results show that the model achieves a clear performance improvement over the baseline methods.
and deep neural networks. As a result, the deep learning method has made huge breakthroughs in multiple fields of machine learning research. Deep learning learns intricate structural representations from large-scale data. This learning is achieved by adjusting the network parameters of the different layers of an artificial neural network with error-driven optimization algorithms via backpropagation. In recent years, deep convolution networks have made great breakthroughs in graphics and image processing, video and audio processing, and other fields. At the same time, recurrent networks have also achieved good results on sequence data such as text and voice [7].

The recurrent neural network initially achieved good results in handwriting recognition [8]. The well-known word vector algorithm Word2Vec was originally derived from a language model learned with an RNN [9]. Due to the vanishing gradient defect of the recurrent neural network (RNN), Long Short-Term Memory (LSTM) was proposed [10]. With the recent popularity of deep learning methods, LSTM has also been applied to work such as dialogue systems [11] and language models [12]. The recently proposed neural network models with an attention mechanism [13] have attracted the attention of researchers. This attention mechanism has been successfully applied to machine translation [14] and text summarization [15] and has achieved certain results.

The main contributions of this paper are the following:
(i) We propose a feed-forward neural network in which information propagates unidirectionally
(ii) We use a neural network model as a classifier and use the backpropagation algorithm as the learning algorithm
(iii) We prepare a well-organized dataset to evaluate the proposed framework

The rest of the paper is structured as follows: Section 2 describes related work and critically analyzes and compares the work done so far. Section 3 describes the proposed methodology, including the materials and methods adopted in this study. Section 4 presents the validation of the proposed methodology, the experiments, and the discussion of the results produced. The work is finally concluded in Section 5.

2. Related Work

Concepts such as neural networks originated in the 1940s. After the 1980s, backpropagation was successfully applied to neural networks, and in 1989 the backpropagation algorithm was successfully applied to the training of a convolutional neural network. Starting in 2006, graphics processing units were used in the training of convolution neural networks. As a result, a new upsurge of neural network research was set off. The early neural network models of the 1940s were very simple: they usually had only one layer and could not be trained. It was not until the 1960s that early neural networks were used for supervised learning, and the models became slightly more complicated, with a multilayer structure. In 1979, Fukushima [16] first proposed the concepts of convolution neural networks and deep networks. After that, related pooling and other methods were proposed one after another. In 1986, the backpropagation algorithm was proposed by Rumelhart et al. [17], which greatly promoted the development of neural network research. A second driving factor is the emergence of several public datasets, which mean that neural networks are no longer toy models. In the field of computer vision, there is the famous ImageNet [18]. In the field of natural language processing, there are the dataset published by Twitter and the Weibo data in the Chinese field.

Bengio et al. [19] proposed the use of a recurrent neural network to build a language model. The model uses the recurrent neural network to learn a distributed representation for each word while also modeling the word sequence. This model achieved better results in experiments than the best n-gram model of the same period and can use more contextual information. Bordes et al. [20] proposed a method for learning structured embeddings using neural networks and a knowledge base. The experimental results of this method on WordNet and Freebase show that it can embed structured information. Mikolov et al. [21] proposed the continuous bag of words (CBOW) model: to predict a word in a sentence, the model uses the words around its position in the sentence; the same work also proposes a skip-gram model, which uses a word at a certain position in a sentence to predict the words around it. Based on these two models, Mikolov et al. [21] open-sourced the tool word2vec to train word vectors, which has been widely used. Kim [22] introduced the convolution neural network to the sentence classification task of natural language processing. This work uses a convolution neural network with two channels to extract features from sentences and finally classifies the extracted features. The experimental results show that the convolution neural network has a significant effect on the feature extraction of natural language. Similarly, Lauriola et al. [23] critically studied and analyzed the use of deep learning in Natural Language Processing (NLP) and summarized the models, techniques, and tools used so far. Fathi and Shoja [24] also discuss the application of deep neural networks to natural language processing.

Tai et al. [25] proposed a tree-structured long short-term memory neural network. Traditional recurrent neural networks are usually used to process linear sequences, and for data with internal structure, such as natural language, this linear model may lose some information. Therefore, this model applies long short-term memory neural networks over the parse tree and has achieved good results in sentiment analysis.

In summary, the key limitations of existing deep learning-based approaches to natural language processing include the following: deep neural network models are difficult to train because they need large amounts of data; training requires powerful, expensive graphics cards; there is no uniform representation method for different forms of data, such as text and images; and ambiguity in natural language text must be resolved at the word, phrase, and sentence level. Moreover, deep learning algorithms are not good at
Figure: a recurrent neural network unrolled over time; the repeated cell A reads inputs x0, x1, x2, ..., xt and produces hidden states h0, h1, h2, ..., ht.
the hidden layer, and $d_{xh} = 18 d_w + 18 d_p + 12 d_l$; $b_1$ is the bias vector.

The last layer of the network is the softmax layer, whose role is to predict the probability distribution over analysis actions:

\[ o = W_2 h + b_2, \qquad p_a = \frac{\exp(o_a)}{\sum_{a' \in \tau} \exp(o_{a'})}, \tag{5} \]

where $W_2$ is the parameter matrix of the softmax layer, $b_2$ is the bias vector, and $\tau$ is the set of all actions in the dependency syntax analysis system.

After obtaining the probability distribution over analysis actions predicted by the model, the loss function of the network can be calculated. As in a general multiclassification problem, we use the cross-entropy loss function:

\[ C = -\frac{1}{n} \sum_{i} \left[ y_i \log p_i + \left(1 - y_i\right) \log\left(1 - p_i\right) \right]. \tag{6} \]

In fact, the classification task is to select a single correct action from multiple candidate analysis actions, so the loss function simplifies to the negative log-likelihood of the correct action.
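To make equations (5) and (6) concrete, the following Python/NumPy sketch computes the softmax distribution over transition actions and the averaged negative log-likelihood of the gold actions. It is an illustrative sketch rather than the authors' implementation: the hidden dimension, the size of the action set, and all names other than W2 and b2 are assumptions.

import numpy as np

def softmax(o):
    # Subtract the maximum for numerical stability before exponentiating.
    z = np.exp(o - o.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def action_distribution(h, W2, b2):
    # Equation (5): o = W2 h + b2, followed by a softmax over all actions in tau.
    o = W2 @ h + b2
    return softmax(o)

def nll_loss(probs, gold_actions):
    # Cross-entropy with a single correct action per state reduces to the
    # average negative log-probability assigned to the gold action.
    n = len(gold_actions)
    return -np.mean(np.log(probs[np.arange(n), gold_actions] + 1e-12))

# Toy sizes (assumed): hidden dimension 200, three transition actions, four parser states.
rng = np.random.default_rng(0)
W2, b2 = rng.normal(size=(3, 200)), np.zeros(3)
h_batch = rng.normal(size=(4, 200))
probs = np.stack([action_distribution(h, W2, b2) for h in h_batch])
print(nll_loss(probs, gold_actions=np.array([0, 2, 1, 0])))

In a full parser, the softmax in equation (5) would typically be restricted to the transition actions that are legal in the current configuration; the sketch leaves that detail out.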
Recurrent neural networks can map an input sequence to an output sequence, such as in a sequence identification problem or a sequence forecasting problem. However, many practical tasks expose the difficulty of training recurrent neural networks, because the sequences in these problems often span a long time interval. As Bengio et al. pointed out, since the gradient of the recurrent neural network will ultimately "disappear," it is difficult for the recurrent neural network to learn long-distance memory, as shown in Figure 4.

Figure 4: a recurrent neural network unrolled over time; the repeated cell A produces hidden states h0, h1, h2, ..., ht, ht+1, ht+2.
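The vanishing-gradient effect can be illustrated with a toy calculation (an illustrative sketch, not taken from the paper; the recurrent weight 0.9 and the tanh-saturation factor 0.8 are arbitrary assumptions). The gradient that flows back over T time steps in a simple recurrent network is scaled by a product of T per-step factors, so it decays exponentially once those factors fall below one:

import numpy as np

# Toy illustration of the vanishing gradient in a simple (non-gated) RNN.
# With a recurrent weight w and tanh derivatives below 1, the gradient reaching
# time step t - T is proportional to the product of T per-step factors.
w = 0.9                      # assumed recurrent weight (|w| < 1)
per_step_factor = w * 0.8    # 0.8 stands in for a typical tanh'(.) value
for T in (1, 5, 10, 20, 50):
    print(T, per_step_factor ** T)
# The printed factors shrink toward zero, so long-distance dependencies contribute
# almost nothing to the parameter update; LSTM gating is designed to avoid this.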
To solve this problem, Hochreiter and Schmidhuber [10] proposed Long Short-Term Memory (LSTM). In this model, the concept of a "gate" is added so that the network can choose when to "forget" and when to add new "memories."

As a variant of the recurrent neural network, the long short-term memory neural network is designed to solve the gradient disappearance problem of ordinary recurrent neural networks. An ordinary recurrent neural network reads an input vector $x_t$ from a vector sequence $(x_1, x_2, \ldots, x_n)$ and calculates a new hidden layer state $h_t$. However, the problem of gradient disappearance means that an ordinary recurrent neural network cannot model long-distance dependence. Long short-term memory neural networks introduce a "memory cell" and three "control gates," which are used to control when to "remember" and when to "forget."

Specifically, the long short-term memory neural network uses an input gate, a forget gate, and an output gate. Among them, the input gate determines the proportion of the current input that can enter the memory unit, and the forget gate controls the proportion of the current memory that should be forgotten. At time $t$, given the input $x_t$, the values of the input gate $i_t$, the forget gate $f_t$, and the candidate memory $\tilde{C}_t$ are calculated according to the following formula:

\[ i_t = \sigma\left(W_i x_t + U_i h_{t-1} + b_i\right), \qquad \tilde{C}_t = \tanh\left(W_c x_t + U_c h_{t-1} + b_c\right), \qquad f_t = \sigma\left(W_f x_t + U_f h_{t-1} + b_f\right), \tag{8} \]

where $\sigma$ is the component-wise logistic function and $\odot$ is the component-wise product. At the same time, the value of the new memory cell and the output value are given as follows:

\[ C_t = i_t \odot \tilde{C}_t + f_t \odot C_{t-1}, \qquad o_t = \sigma\left(W_o x_t + U_o h_{t-1} + V_o C_t + b_o\right), \qquad h_t = o_t \odot \tanh\left(C_t\right). \tag{9} \]
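A minimal Python/NumPy sketch of one LSTM step following equations (8) and (9) is given below. The parameter names mirror the equations, including the $V_o C_t$ term in the output gate; the dimensions, initialization, and dictionary-based parameter layout are assumptions for illustration rather than the authors' implementation.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, C_prev, P):
    """One LSTM update following equations (8) and (9); P maps names to parameters."""
    i_t = sigmoid(P["Wi"] @ x_t + P["Ui"] @ h_prev + P["bi"])            # input gate
    C_tilde = np.tanh(P["Wc"] @ x_t + P["Uc"] @ h_prev + P["bc"])        # candidate memory
    f_t = sigmoid(P["Wf"] @ x_t + P["Uf"] @ h_prev + P["bf"])            # forget gate
    C_t = i_t * C_tilde + f_t * C_prev                                   # new memory cell
    o_t = sigmoid(P["Wo"] @ x_t + P["Uo"] @ h_prev + P["Vo"] @ C_t + P["bo"])  # output gate with Vo C_t term
    h_t = o_t * np.tanh(C_t)                                             # new hidden state
    return h_t, C_t

# Toy dimensions (assumed): input size 4, hidden/cell size 3.
rng = np.random.default_rng(1)
d_x, d_h = 4, 3
P = {k: rng.normal(scale=0.1, size=(d_h, d_x if k in ("Wi", "Wc", "Wf", "Wo") else d_h))
     for k in ("Wi", "Wc", "Wf", "Wo", "Ui", "Uc", "Uf", "Uo", "Vo")}
P.update({b: np.zeros(d_h) for b in ("bi", "bc", "bf", "bo")})
h, C = np.zeros(d_h), np.zeros(d_h)
for x in rng.normal(size=(5, d_x)):      # run over a short input sequence
    h, C = lstm_step(x, h, C, P)
print(h)

Because the forget gate $f_t$ can stay close to one, the memory cell $C_t$ can carry information across many time steps without the repeated shrinking factors shown in the toy gradient calculation above.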
4.2. Experimental Data. Since batch training is required, and the analysis sequence lengths of sentences of different lengths are not the same, we adopted a masking method for training. Even so, because some sentences are very long, the other sentences in a batch may already have been processed and then have to wait for the long sentence to finish. Therefore, to train the model more quickly, we removed sentences with more than 70 words from the training process. There are 76 such sentences, accounting for 0.2% of the sentences in the training dataset. We believe that this does not affect the performance of the final model. After removing part of the training data and validation data, the actual data used are shown in Table 1.

Table 1: Statistics of the data used in this article.

Data set          A       B       C (%)   D    E (%)
Training set      33288   33251   99.89   76   99.8
Development set   1850    1848    99.89   1    99.9
Test set          1850    1848    99.89   —    100

A: total number of sentences; B: number of projectable sentences; C: percentage of projectable sentences; D: number of sentences longer than 70 words (removed); E: percentage of the projectable sentences actually used.

4.3. Evaluation Index. The analysis of phrase structure usually uses precision, recall, and the F1 value for evaluation:

(1) Precision. The precision in phrase structure analysis refers to the ratio of the number of correct phrases in the analysis result to the total number of phrases in the analysis result:

\[ P = \frac{\text{number of correct phrases in the analysis result}}{\text{total number of phrases in the analysis result}}. \tag{10} \]

(2) Recall. The recall in phrase structure analysis refers to the ratio of the number of correct phrases in the analysis result to the total number of phrases in the test set:

\[ R = \frac{\text{number of correct phrases in the analysis result}}{\text{total number of phrases in the test set}}. \tag{11} \]

(3) F1 value:

\[ F_1 = \frac{2 \times P \times R}{P + R}. \tag{12} \]
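Equations (10)–(12) can be computed directly from the sets of predicted and gold phrases. The following sketch assumes each phrase is represented as a (label, start, end) tuple; this representation is an illustrative assumption, not something specified in the paper.

def phrase_prf(predicted, gold):
    """Precision, recall, and F1 over phrases, following equations (10)-(12).
    Each phrase is assumed to be a hashable (label, start, end) tuple."""
    pred_set, gold_set = set(predicted), set(gold)
    correct = len(pred_set & gold_set)                   # correct phrases in the analysis result
    p = correct / len(pred_set) if pred_set else 0.0     # equation (10)
    r = correct / len(gold_set) if gold_set else 0.0     # equation (11)
    f1 = 2 * p * r / (p + r) if p + r else 0.0           # equation (12)
    return p, r, f1

# Toy example: two of the three predicted phrases match the gold annotation.
predicted = [("NP", 0, 2), ("VP", 2, 5), ("PP", 3, 5)]
gold      = [("NP", 0, 2), ("VP", 2, 5), ("NP", 3, 5), ("S", 0, 5)]
print(phrase_prf(predicted, gold))   # -> (0.666..., 0.5, 0.571...)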
4.4. Experimental Results and Analysis. In addition to the comparison with the baseline method, this work is also compared with two other classic dependency parsers: Malt Parser and MST Parser. For Malt Parser, we used the stackproj and nivreeager options for training, which correspond to the arc-standard analysis algorithm and the arc-eager analysis algorithm, respectively. For MST Parser, we report the results given in Chen and Manning (2014). The test results are shown in Table 2.

It can be seen from the table that the dependency syntax analyzer based on the long short-term memory neural network achieves a clear effect in modeling the analysis sequence of sentences. This model achieves 91.9% UAS accuracy and 90.5% LAS accuracy on the development set of the Penn Treebank, an improvement of about 0.7% over the greedy neural network dependency parser of the baseline method. On the test set, our model achieves a UAS accuracy of 90.7% and an LAS accuracy of 89.0%, an improvement of about 0.6% over the greedy neural network dependency parser of the baseline method. Compared with the most representative transition-based dependency parser, Malt Parser, our method achieves a relative improvement of about 1.4%; compared with the well-known graph-based MST Parser, our model obtains a 0.5% improvement on the development set, a comparable UAS accuracy on the test set, and a 1.4% improvement in LAS accuracy.

The experimental results show that, compared with the greedy feed-forward neural network, the dependency syntax analysis model based on the long short-term memory neural network performs better. Different from the greedy model, this model uses long short-term memory neural networks to model the entire sentence and can use historical analysis information and historical pattern information to help classify analysis actions, thereby improving the performance of the dependency syntax analyzer.

The results of testing on the Penn Treebank are shown in Table 3. In the testing process, this article uses the beam search technique, and the corresponding beam size is 12.

It can be seen from the data in the table that the dual attention mechanism can effectively reduce the number of errors in the output results. In the effective output, the F1 value of the model reached 0.827, and its change with the training process is shown in Figure 5. The way the various errors change with the training process is shown in Figure 6.

Figure 5: F1 value changes with the training process (x-axis: training process, in units of ten thousand iterations; y-axis: F1 value).

Figure 6: Number of errors changes with the training process (x-axis: training times, in units of ten thousand iterations; y-axis: number of errors).

By linearizing the phrase structure tree of natural language, the phrase structure analysis task is transformed into a sequence-to-sequence conversion task. A simple implementation of the sequence-to-sequence model was carried out, and it was found that end-to-end analysis still needs rule restrictions on the decoder side. To this end, we propose a dual attention mechanism model, that is, a sequence-to-sequence model that introduces attention mechanisms on the input and the output at the same time. Experiments show that after the introduction of the dual attention mechanism model, the performance of the model on the test set is greatly improved, as shown in Table 4.
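As an illustration of the linearization step mentioned above, the short Python sketch below converts a phrase structure tree into a flat bracketed token sequence that a sequence-to-sequence model can consume. The nested-tuple tree representation and the bracket format are assumptions made for illustration; the paper does not spell out its exact linearization scheme.

# Linearize a phrase structure tree into a bracketed token sequence, so that
# constituency parsing can be treated as sequence-to-sequence transduction.
def linearize(tree):
    label, children = tree
    if isinstance(children, str):            # leaf: part-of-speech tag over a word
        return ["(", label, children, ")"]
    tokens = ["(", label]
    for child in children:
        tokens += linearize(child)
    return tokens + [")"]

# "The cat sleeps" as (S (NP (DT The) (NN cat)) (VP (VBZ sleeps)))
tree = ("S", [("NP", [("DT", "The"), ("NN", "cat")]),
              ("VP", [("VBZ", "sleeps")])])
print(" ".join(linearize(tree)))
# -> ( S ( NP ( DT The ) ( NN cat ) ) ( VP ( VBZ sleeps ) ) )

The reverse mapping, recovering a tree from a predicted token sequence, is where the decoder-side rule restrictions mentioned above come in: brackets must balance and labels must be well formed.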
Table 4: Symbols used in the paper.

Symbol/abbreviation    Definition and meaning
RNN                    Recurrent neural network
LSTM                   Long short-term memory
MST parser             Maximum spanning tree parser
CBOW                   Continuous bag of words

The proposed model models the analysis process of the entire sentence in dependency syntax analysis and improves on the greedy model, which models each analysis state independently. The experimental results show that, compared with the baseline method, the model obtains an improvement of 0.6 to 0.7 percentage points.

Based on this work and the error analysis, the dependency syntax analysis model based on the long short-term memory neural network can be studied further, and we found that an attention mechanism can be introduced into the model.
[4] K. Tymoshenko and A. Moschitti, "Assessing the impact of syntactic and semantic structures for answer passages reranking," in Proceedings of the 24th ACM International Conference on Information and Knowledge Management, Melbourne, Australia, October 2015.
[5] W. Monroe and Y. Wang, Dependency Parsing Features for Semantic Parsing, Stanford University, Stanford, CA, USA, 2014.
[6] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics), Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.
[7] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature,
vol. 521, no. 7553, pp. 436–444, 2015.
[8] A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke,
and J. Schmidhuber, “A novel connectionist system for un-
constrained handwriting recognition,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, vol. 31, no. 5,
pp. 855–868, 2009.
[9] T. Mikolov, W. t. Yih, and G. Zweig, “Linguistic regularities in
continuous space word representations,” in Proceedings of the
2013 Conference of the North American Chapter of the As-
sociation for Computational Linguistics: Human Language
Technologies, pp. 746–751, Association for Computational
Linguistics, Atlanta, Georgia, June 2013.
[10] S. Hochreiter and J. Schmidhuber, “Long short-term mem-
ory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[11] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to Sequence
Learning with Neural Networks,” in Proceedings of the 27th
International Conference on Neural Information Processing
Systems, pp. 3104–3112, Curran Associates, Inc., Montreal,
Canada, December 2014.
[12] M. Sundermeyer, R. Schlüter, and H. Ney, LSTM Neural Networks for Language Modeling, ISCA Archive, Portland, OR, USA, 2012.
[13] V. Mnih, N. Heess, and A. Graves, “Recurrent models of visual
attention,” in Proceedings of the 27th International Conference
on Neural Information Processing Systems, pp. 2204–2212,
Curran Associates, Inc., Montreal, Canada, December 2014.
[14] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine
translation by jointly learning to align and translate,” 2014,
https://arxiv.org/abs/1409.0473.
[15] A. M. Rush, S. Chopra, and J. Weston, “A neural attention
model for abstractive sentence summarization,” 2015, https://
arxiv.org/abs/1509.00685.
[16] K. Fukushima, “Neocognitron: a self-organizing neural net-
work model for a mechanism of pattern recognition unaf-
fected by shift in position,” Biological Cybernetics, vol. 36,
no. 4, pp. 193–202, 1980.
[17] D. E. Rumelhart and J. L. McClelland, "Parallel distributed processing: explorations in the microstructure of cognition," Language, vol. 22, no. 4, pp. 98–108, 1986.
[18] O. Russakovsky, J. Deng, H. Su et al., “ImageNet large scale
visual recognition challenge,” International Journal of Com-
puter Vision, vol. 115, no. 3, pp. 211–252, 2015.
[19] Y. Bengio, R. Ducharme, and P. Vincent, “A neural proba-
bilistic language model,” Journal of Machine Learning Re-
search, vol. 3, pp. 1137–1155, 2003.
[20] A. Bordes, J. Weston, and R. Collobert, “Learning structured
embeddings of knowledge bases,” in Proceedings of the
Conference on Artificial Intelligence, AAAI, San Francisco,
CA, USA, 2011.
[21] T. Mikolov, K. Chen, and G. Corrado, "Efficient estimation of word representations in vector space," 2013, https://arxiv.org/abs/1301.3781.
[22] Y. Kim, "Convolutional neural networks for sentence classification," 2014, https://arxiv.org/abs/1408.5882.
[23] I. Lauriola, A. Lavelli, and F. Aiolli, "An introduction to deep learning in natural language processing: models, techniques, and tools," Neurocomputing, vol. 470, 2021.
[24] E. Fathi and B. M. Shoja, "Deep neural networks for natural language processing," Handbook of Statistics, vol. 38, pp. 229–316, 2018.
[25] K. S. Tai, R. Socher, and C. D. Manning, "Improved semantic representations from tree-structured long short-term memory networks," 2015, https://arxiv.org/abs/1503.00075.