Research Article
A Short Text Classification Method Based on Convolutional Neural Network and Semantic Extension

* Corresponding author. E-mail: [email protected]
1. INTRODUCTION

In recent years, the rapid development of a new generation of information technology, represented by cloud computing and big data, has promoted the arrival of a new era of the Internet. While the Internet brings convenience to people's lives, a large amount of text data is also generated every day. Among these data, short texts in the form of comments, Weibo posts, and Q&A grow rapidly and in huge numbers. Such short texts usually have obvious limitations: they lack sufficient contextual information, they may contain polysemous words, and they sometimes contain spelling errors. How to quickly and effectively extract truly valuable information from massive short text data is precisely a problem that needs to be solved in the field of natural language processing (NLP), and solving it has far-reaching significance.

Traditional machine learning text classification approaches, such as Naive Bayes (NB) [1,2], Support Vector Machines (SVM) [3,4], K Nearest Neighbors (KNN) [5,6], and decision trees [7], often use the Bag of Words (BOW) [8] model to represent text and Term Frequency-Inverse Document Frequency (TF-IDF) to represent word weights. However, in the BOW model used by these classification approaches [9], sentences and documents are considered independent of each other, with no contextual relationship between them, so the fine-grained semantic information in the text may not be effectively extracted; moreover, the BOW model suffers from problems such as the curse of dimensionality and sparse data [10].

With the emergence of deep learning in recent years, models based on deep neural networks, such as the recurrent neural network (RNN) and the convolutional neural network (CNN), have attracted more and more researchers' attention in the field of NLP. Mikolov et al. [11] proposed the well-known RNN-based language model, which uses an RNN to take the expression of the whole sentence into consideration; this model can capture long-term dependencies and learn the meanings of words. Kim [12] proposed a CNN model with filters of multiple sizes to extract richer text semantic features; it produced good results on text classification tasks and became one of the most representative models in the field of NLP. Yang et al. [13] proposed a hierarchical attention network for document classification; the model applies the attention mechanism not only across the document hierarchy but also at the word and sentence levels, so that it can attend to the most important contents when constructing the document representation. Zhou et al. [14] proposed a network model based on a mixed attention mechanism for Chinese short text classification, which considers both word-level and character-level text features and extracts semantic features relevant to classification through the attention mechanism.

However, if a neural network model alone is used to extract abstract features from short text, the classification effect largely depends on the number of layers of the network, and increasing the depth causes a geometric increase in the number of parameters of the entire model, thereby significantly increasing its training time. Therefore, to overcome the lack of semantic information in short text, an external knowledge base can be used to expand the semantics of the text, thereby enriching the semantic features of the short text.

In summary, in this paper we propose a short text classification method based on CNN and semantic extension (SECNN). Keeping the neural network at a fixed, modest number of layers, we propose a novel method to find related words in short texts and perform semantic expansion at the sentence level and the related-word level of the short text at the same time, thereby improving the classification effect on short text.

Our contributions are as follows:

Firstly, in text preprocessing, we propose an improved Jaro–Winkler similarity to find possible spelling errors in short text, so that the coverage of the pretrained word vector table can be improved.

Secondly, we propose a CNN model based on an attention mechanism to find the related words of a short text, and then use an external knowledge base to conceptualize the short text and its related words respectively, thus expanding the semantic features of the short text.

Finally, we use the classic CNN model to extract short text features and complete the classification process.

The rest of this paper is organized as follows. Section 2 reviews related work on text classification. Section 3 presents the short text preprocessing method. Section 4 presents the short text classification method in detail. Section 5 describes the extensive experiments. Section 6 discusses the experimental results. Section 7 draws the conclusion.

2. RELATED WORK

Traditional text classification methods usually ignore the corresponding text semantic features during the classification process and cannot effectively extract the fine-grained semantic information in the text, resulting in low interpretability of the final classification results. To address these problems, Wei et al. [15] proposed text representation and feature selection strategies for Chinese text classification based on n-grams; in the feature selection strategy, preprocessing within classes is combined with feature selection between classes. Post et al. [16] proposed the use of part-of-speech (POS) tagging and tree kernel technology to extract the explicit and implicit features of text. Gautam et al. [17] proposed a unary word segmentation technique and semantic analysis to represent the features of the text. Song et al. [18] used a probabilistic knowledge base to conceptualize short text, thereby improving the understanding of text semantics during classification. Zhang et al. [19] proposed a short text classification method based on the latent Dirichlet allocation (LDA) topic model, which further addressed the problem of context dependence in short text. Although these methods can extract relatively rich text feature information, they also have some limitations.

Word embedding is currently the most commonly used, and also the most effective, word vector representation for retaining semantic and grammatical information. Word vector technology was first proposed by Hinton [20]. Collobert et al. [21] used pretrained word vectors and CNN technology to classify texts for the first time, demonstrating the effectiveness of CNNs in text processing. Mikolov et al. [22] used a neural network model to learn a new vector representation called the word vector, or word embedding, which contains the grammatical and semantic information of the word. Compared with the traditional bag-of-words representation, word vectors are characterized by low dimensionality, density, and continuity. We therefore also use word vector technology in text preprocessing.

In recent years, natural language modeling methods relying on CNNs to learn word embeddings have shown promising results, and our method likewise uses a CNN to automatically and effectively extract short text features. Sotthisopha et al. [23] used the clustering of word vectors to find semantic units and applied the Jaro–Winkler similarity during text preprocessing to find spelling errors in the text; however, the Jaro–Winkler similarity used in that method only considers the matching degree of common prefixes between strings and ignores suffix matching. Zhang et al. [24] proposed a character-level CNN text classification model (CharCNN), which can obtain more fine-grained semantic features, but the model ignores the word-level semantic features in the text.

Compared with traditional machine learning text classification methods, deep neural network models can effectively simulate the information processing of the human brain and extract more abstract semantic features from the input features, so the information on which the final classification depends is more reliable. However, short text usually lacks sufficient context information and semantic features, and relying only on a deeper neural network to improve the classification effect causes a geometric increase in the number of parameters of the entire model. Wang et al. [25] proposed a method for classifying short texts using external knowledge bases and CNNs; while conceptualizing short texts to enrich semantic features, it also captures finer-grained features at the character level. Wu et al. [26] proposed two methods (CNN-HE and CNN-VE) to combine word and contextual embeddings and then apply CNNs to capture semantic features. That work also uses an external knowledge base to conceptualize the target word, but it does not consider the semantic expansion of the entire sentence. It is the closest to the short text classification method proposed in this paper, which builds on and improves it.

With an attention mechanism, the main features of the text can be extracted dynamically instead of processing the information of the entire text directly, so this mechanism has been widely used [27–29]. Peng et al. [30] proposed a bidirectional long short-term memory (LSTM) neural network based on an attention mechanism to capture the most important semantic information in sentences and use it for relation classification. Zhang et al. [31] proposed bidirectional gated recurrent units that integrate a novel attention pooling mechanism with a max-pooling operation to force the model to pay attention to the keywords in a sentence and automatically retain the most meaningful information of the text. Wang et al. [32] proposed a novel CNN architecture for relation classification, which uses two levels of attention mechanisms to better identify the context. Qiao et al. [33] proposed a word-character attention model for Chinese text classification; this model integrates two levels of attention: a word-level attention model that captures salient words with a closer semantic relationship to the text meaning, and a character-level attention model that selects discriminative characters of the text.
When faced with the problem of insufficient semantic information in short texts, these methods do not fully consider semantic expansion at the sentence level and the word level at the same time. Therefore, building on a neural network with an attention mechanism, we propose a novel method to find related words in short texts and perform semantic expansion at the sentence level and the related-word level of the short text simultaneously; we also propose an improved Jaro–Winkler similarity to find possible spelling errors in short texts during preprocessing, thereby improving the coverage of the pretrained word vector table.

3. SHORT TEXT PREPROCESSING

This paper uses an external corpus and Word2vec technology to train a word vector table for short text. Because the accuracy of the generated word vectors affects the subsequent text feature extraction by the CNN, each word should, in the process of short text vectorization, match the words in the word vector table as closely as possible. However, due to the characteristics of short text, some words are often spelled incorrectly, which leads to a failure to find the corresponding word in the word vector table during subsequent word vectorization, thus ignoring the key features that the word may represent.
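For concreteness, a word vector table of this kind can be built with the gensim implementation of Word2vec roughly as follows. This is a minimal sketch: the toy corpus and all hyperparameter values are illustrative assumptions, since the excerpt does not specify the paper's actual training settings.

```python
from gensim.models import Word2Vec

# Illustrative corpus: one tokenized sentence (list of words) per item.
sentences = [
    ["short", "text", "classification", "with", "cnn"],
    ["word", "vectors", "capture", "semantic", "information"],
]

# Hyperparameters are assumptions for this sketch, not the paper's settings.
model = Word2Vec(
    sentences,
    vector_size=300,   # dimensionality of the word vectors
    window=5,          # context window size
    min_count=1,       # keep all words in this toy corpus
    sg=1,              # skip-gram training
)

vector = model.wv["text"]       # look up a word's vector
vocab = model.wv.key_to_index   # the "word vector table" of covered words
```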
In text preprocessing, we utilize the Jaro–Winkler similarity to find spelling errors: each word in the short text is compared with similar words in the word vector table, and if the two are partially different but highly similar, the word in the short text may be misspelled and is replaced by the corresponding word in the word vector table. The position of a misspelling in short text is quite random; it may lie in the second half of the word or in the first half. However, the Jaro–Winkler distance metric only considers the matching degree of the common prefix between strings, that is, spelling errors that appear in the second half of the word, and ignores suffix matching, that is, spelling errors that occur in the first half of the word. For example, when a word is misspelled near its beginning, the two strings share only a very short common prefix, so the Jaro–Winkler distance cannot reflect their matching degree well.

Based on the Jaro–Winkler distance, this paper proposes an improved Jaro–Winkler similarity with both prefix matching and suffix matching, defined in formula (1) as follows:

    sim_w = sim_j + (l + l′) p′ (1 − sim_j)    (1)

Since spelling errors in short text words are more common in the second half of the word, the common prefix matching of the two words is given more weight, and the similarity result must not exceed 1. Here l is the length of the common prefix of the two English words, with a maximum value of 3; l′ is the length of the common suffix of the two English words, with a maximum value of 2; and p′ is a scaling factor for how much the score is adjusted upward for having common prefixes and suffixes, whose value cannot exceed 0.2. sim_j is the Jaro similarity of the two English words, defined in formula (2); its result lies between 0 and 1. s_1 and s_2 are the two English words to be compared, |s_i| is the number of characters of the corresponding word (that is, the length of the string), m is the number of matched characters ignoring character order, and t is one-half of the number of character transpositions required to convert one word into the other.

    sim_j = 0                                            if m = 0
    sim_j = (1/3) (m/|s_1| + m/|s_2| + (m − t)/m)        if m ≠ 0    (2)

Therefore, in the vectorization preprocessing of short text, for each word that is not covered by the word vector table, any word in the table whose number of matching characters with it reaches a certain threshold is deemed a candidate match. All such candidates are collected, their similarities to the uncovered word are calculated according to formula (1), and the candidate with the highest similarity that also exceeds a minimum similarity threshold is selected; the uncovered word in the data set is then replaced with this word from the word vector table. After this preprocessing, spelling errors in the short texts of the data set can be found early, thereby improving the coverage of the Word2vec word vector table.
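The following Python sketch implements formulas (1) and (2) under the caps stated above (prefix length l ≤ 3, suffix length l′ ≤ 2). The Jaro matching window, the default scaling factor p′ = 0.1, and the minimum-similarity threshold in correct_word are standard or assumed values, not settings taken from the paper.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity, formula (2): (1/3)(m/|s1| + m/|s2| + (m - t)/m)."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    # Characters "match" if equal and within this distance of each other
    # (the usual Jaro window; assumed, since the excerpt does not define it).
    window = max(len(s1), len(s2)) // 2 - 1
    used = [False] * len(s2)
    matches1 = []
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not used[j] and s2[j] == c:
                used[j] = True
                matches1.append(c)
                break
    m = len(matches1)
    if m == 0:
        return 0.0
    matches2 = [s2[j] for j in range(len(s2)) if used[j]]
    # t is half the number of matched characters that are out of order.
    t = sum(a != b for a, b in zip(matches1, matches2)) / 2
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3

def improved_jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Improved similarity, formula (1): sim_j + (l + l')p'(1 - sim_j),
    rewarding common prefixes (l <= 3) and common suffixes (l' <= 2)."""
    sim_j = jaro(s1, s2)
    l = 0
    while l < min(3, len(s1), len(s2)) and s1[l] == s2[l]:
        l += 1
    l_suf = 0
    while l_suf < min(2, len(s1), len(s2)) and s1[-1 - l_suf] == s2[-1 - l_suf]:
        l_suf += 1
    return sim_j + (l + l_suf) * p * (1 - sim_j)

def correct_word(word: str, vocab: list, min_sim: float = 0.9) -> str:
    """Sketch of the replacement step: pick the most similar in-vocabulary
    word, provided it clears a minimum similarity threshold (0.9 assumed)."""
    best = max(vocab, key=lambda w: improved_jaro_winkler(word, w))
    return best if improved_jaro_winkler(word, best) >= min_sim else word

# A misspelling at the start of a word still scores well thanks to the
# suffix term, e.g. improved_jaro_winkler("nrgument", "argument").
```

Because the suffix term l′ rewards words that end alike, a misspelling near the beginning of a word can still be matched, which is exactly the case the plain Jaro–Winkler distance handles poorly.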
4. THE PRESENTED METHOD

As shown in Figure 1, the overall framework of our proposed short text classification method is composed of four main components. Firstly, we use the improved Jaro–Winkler similarity to find possible spelling errors during short text preprocessing. Secondly, the related words of the short text are found through a CNN model based on the attention mechanism. Thirdly, the external knowledge base Probase is used to conceptualize the short text and the related words separately, generating the corresponding word vector matrices. Finally, the classic CNN model is used to extract short text features and complete the classification process.

4.1. Find Related Words

In a short text, usually only a few words represent the semantics of the entire sentence, and most words do not contribute much to its semantic features. Because different words affect the classification result to different degrees, adding an attention mechanism to the neural network model enables the model, when establishing the relationship between the current word and its context, to find the words that really affect the semantics. This paper refers to these words as the related words of a short text.

To find the related words of a short text, this paper designs a CNN model consisting of a single convolution layer and a pooling layer with an attention mechanism, as shown in Figure 2.

The convolutional layer is composed of a series of filters with learnable parameters; in this layer, by changing the weight values of these filters, different semantic features can be extracted from the input.
The features produced by the convolution layer form the sequence

    S = [s_1, s_2, …, s_l]    (4)

Each feature s_i is scored through a hidden representation u_i against a learnable context vector w′, and the attention weights and the attention-weighted representation are computed as

    β_i = softmax(w′ ⋅ u_i)    (6)

    s_β = ∑_{i=1}^{l} β_i s_i    (7)

The weight of each concept must also be taken into account when generating the word vectors. The formula for this vectorization is defined as follows:

    W′_c = w′_1 v′_{c1} ⊕ w′_2 v′_{c2} ⊕ … ⊕ w′_k v′_{ck}    (9)

where W′_c represents the word vector matrix of the conceptualized related words, v′_{ci} represents the word vector corresponding to the concept c′_i, w′_i is the weight of concept c′_i, and ⊕ is the concatenation operation.

So far, the word vector matrix W_w of the short text, the word vector matrix W_c of the short text conceptualization, and the word vector matrix W′_c of the related word conceptualization have all been obtained.

In the pooling layer, max pooling is used; that is, the maximum value within an input region is taken as the output of that region. This reduces the number of parameters in the network, helps prevent overfitting, and improves the generalization ability of the model. Through max pooling, a fixed-length vector can be extracted from the feature map. The specific calculation is as follows:

    s_max = max(s_i)    (11)
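To make these computations concrete, the following NumPy sketch mirrors formulas (4), (6), (7), (9), and (11). The tanh projection that produces u_i, all dimensions, and the random toy inputs are illustrative assumptions rather than the paper's exact configuration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(S, W, b, w_prime):
    """Formulas (4), (6), (7): S stacks the sequence [s_1, ..., s_l] as an
    (l, d) matrix. We assume u_i = tanh(W s_i + b) as the hidden
    representation scored by the context vector w'."""
    U = np.tanh(S @ W + b)        # (l, d_a): hidden representations u_i
    beta = softmax(U @ w_prime)   # (l,): attention weights, formula (6)
    return beta @ S               # (d,): weighted sum s_beta, formula (7)

def conceptualize(concept_vecs, concept_weights):
    """Formula (9): W'_c = w'_1 v'_c1 ⊕ ... ⊕ w'_k v'_ck, with ⊕ read as
    concatenation of the weighted concept vectors (e.g., scored concepts
    returned by a knowledge base such as Probase)."""
    return np.concatenate([w * v for w, v in zip(concept_weights, concept_vecs)])

def max_pool(S):
    """Formula (11): max pooling keeps the maximum activation per feature,
    yielding a fixed-length vector from a variable-length feature map."""
    return S.max(axis=0)

# Toy usage with illustrative dimensions (l = 5 positions, d = 8 features).
rng = np.random.default_rng(0)
S = rng.normal(size=(5, 8))
s_beta = attention_pool(S, rng.normal(size=(8, 6)), np.zeros(6), rng.normal(size=6))
print(s_beta.shape, max_pool(S).shape)   # (8,) (8,)
```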
In this paper, the improved Jaro–Winkler similarity is used in text preprocessing to find possible spelling errors in short text and replace them, which improves the coverage of the word vector table on the data set.

We can see that CharCNN does not perform well on these short text data sets. The reason is that short text usually lacks sufficient semantic features, and extracting features only at the character level does not achieve a good classification effect. Compared with WCCNN, CNN-HE, and CNN-VE, SECNN not only uses short text conceptualization but also introduces related word conceptualization to further enrich the semantic information of short text; the problem of insufficient semantic information in short texts is thus substantially alleviated, and the classification effect is improved.

To better demonstrate the advantages of the short text classification method proposed in this paper, five other models were selected to perform multiple iteration experiments on the MR data set, and the results were compared, as shown in Figure 5. The abscissa of the graph is the number of CNN training epochs, and the ordinate is the accuracy of the model. Although the accuracy of each classification method gradually increases with the number of iterations, the advantages of SECNN and WCCNN are already visible at the 1st epoch; the accuracy peaks at the 5th epoch, and subsequent values are basically stable, indicating that the model has converged. It can be seen that the short text classification method proposed in this paper is also superior to the other classification methods in terms of stability.

The method in this paper also improves the other evaluation indicators to a certain extent. Next, the F1 value of the method proposed in this paper is compared with six comparison methods on the MR data set and the AG News data set. The experimental results are shown in Figures 6 and 7.

Figure 6: Comparison of F1 values of different classification methods on the MR data set.

As can be seen from Figures 6 and 7, the F1 value of WCCNN on the MR data set is significantly higher than that of classic text classification methods such as CNN-rand and CharCNN, while CNN-HE and CNN-VE are slightly below WCCNN's F1 value. From the results, we also find that SECNN's F1 value is slightly better than WCCNN's. This shows that the method proposed in this paper is feasible for the short text classification task, and the classification effect is significantly improved.

7. CONCLUSION

Aiming at the problems that traditional short text classification methods rely heavily on the number of neural network layers and perform poorly on short text due to data sparsity and insufficient semantic features, we propose a short text classification method based on CNN and semantic expansion. To improve the coverage of the pretrained word vector table in the process of short text vectorization, the improved Jaro–Winkler similarity is used during text preprocessing to find possible spelling errors in short text, so that words can be matched more accurately to the corresponding words in the corpus. At the same time, facing the limited semantic information that short text can provide, we introduce an external knowledge base to conceptualize the short text and its related words, extending the semantics of the short text. Experimental results demonstrate that the proposed method is feasible for the short text classification task and that its classification effectiveness is improved remarkably. When this classification method obtains the
related words of a short text, it involves a large number of word vector distance calculations, so its time consumption inevitably requires further study. In follow-up research, time complexity will be taken into account, and the method for obtaining related words in short text will be optimized to improve the classification efficiency for short text.

AUTHORS' CONTRIBUTIONS

Haitao Wang contributed to the conception of the study; Keke Tian performed the experiment, the data analyses, and wrote the manuscript; Zhengjiang Wu provided technical support; Lei Wang helped perform the analysis with constructive discussions and also helped check the grammar of the paper.

ACKNOWLEDGMENTS

This work is supported by the National Natural Science Foundation of China (Nos. 11601129 and 61503124), the Henan Science and Technology Key Project (No. 192102210280), the Fundamental Research Funds for the Universities of Henan Province, and the Doctor Foundation of Henan Polytechnic University (No. B2017-36).

REFERENCES

[1] J. Chen, H. Huang, S. Tian, Y. Qu, Feature selection for text classification with Naive Bayes, Expert Syst. Appl. 36 (2009), 5432–5435.
[2] S.B. Kim, H.C. Rim, S.H. Myaeng, K.S. Han, Some effective techniques for naive Bayes text classification, IEEE Trans. Knowl. Data Eng. 18 (2006), 1457–1466.
[3] M. Haddoud, A. Mokhtari, T. Lecroq, S. Abdeddaïm, Combining supervised term-weighting metrics for SVM text classification with extended term representation, Knowl. Inf. Syst. 49 (2016), 909–931.
[4] H. Kim, P. Howland, H. Park, Dimension reduction in text classification with support vector machines, J. Mach. Learn. Res. 6 (2005), 37–53.
[5] R. Li, Y. Hu, A density-based method for reducing the amount of training data in kNN text classification, J. Comput. Res. Dev. 41 (2004), 539–545.
[6] H. Wandabwa, D. Zhang, K. Sammy, Text categorization via attribute distance weighted k-nearest neighbor classification, in 2016 International Conference on Information Technology (ICIT), Bhubaneswar, India, 2016.
[7] Y. Wang, Z. Wang, Text categorization rule extraction based on fuzzy decision tree, Comput. App. 4 (2005), 1634–1637.
[8] S. Wang, C.D. Manning, Baselines and Bigrams: Simple, Good Sentiment and Topic Classification, Association for Computational Linguistics (ACL), Jeju Island, Korea, 2012.
[9] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, arXiv preprint arXiv:1301.3781 (2013).
[10] Y. Bengio, R. Ducharme, P. Vincent, C. Janvin, A neural probabilistic language model, J. Mach. Learn. Res. 3 (2003), 1137–1155.
[11] T. Mikolov, M. Karafiát, S. Khudanpur, Recurrent Neural Network Based Language Model, International Speech Communication Association, Prague, Czech Republic, 2010.
[12] Y. Kim, Convolutional Neural Networks for Sentence Classification, Association for Computational Linguistics (ACL), Doha, Qatar, 2014.
[13] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, E. Hovy, Hierarchical attention networks for document classification, in Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, 2016.
[14] Y. Zhou, J. Xu, J. Cao, B. Xu, C. Li, B. Xu, Hybrid attention networks for Chinese short text classification, Comp. y Sist. 21 (2017), 759–769.
[15] Z. Wei, D. Miao, J.H. Chauchat, R. Zhao, W. Li, N-grams based feature selection and text representation for Chinese text classification, Int. J. Comput. Int. Sys. 2 (2009), 365–374.
[16] M. Post, S. Bergsma, Explicit and Implicit Syntactic Features for Text Classification, Association for Computational Linguistics (ACL), Sofia, Bulgaria, 2013.
[17] G. Gautam, D. Yadav, Sentiment Analysis of Twitter Data Using Machine Learning Approaches and Semantic Analysis, Institute of Electrical and Electronics Engineers Inc., Noida, India, 2014.
[18] Y. Song, Z. Wang, H. Wang, Short Text Conceptualization Using a Probabilistic Knowledgebase, International Joint Conferences on Artificial Intelligence, Barcelona, Spain, 2011.
[19] Z. Zhang, D. Miao, C. Gao, Short text classification using latent Dirichlet allocation, J. Comput. App. 33 (2013), 1587–1590.
[20] G.E. Hinton, R.R. Salakhutdinov, Reducing the dimensionality of data with neural networks, Science 313 (2006), 504–507.
[21] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, P. Kuksa, Natural language processing (almost) from scratch, J. Mach. Learn. Res. 12 (2011), 2493–2537.
[22] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, J. Dean, Distributed Representations of Words and Phrases and their Compositionality, Neural Information Processing Systems Foundation, Lake Tahoe, NV, USA, 2013.
[23] N. Sotthisopha, P. Vateekul, Improving Short Text Classification Using Fast Semantic Expansion on Multichannel Convolutional Neural Network, Institute of Electrical and Electronics Engineers Inc., Busan, Korea, 2018.
[24] X. Zhang, J. Zhao, Y. LeCun, Character-Level Convolutional Networks for Text Classification, Neural Information Processing Systems Foundation, Montreal, Canada, 2015.
[25] J. Wang, Z. Wang, D. Zhang, J. Yan, Combining Knowledge with Deep Convolutional Neural Networks for Short Text Classification, International Joint Conferences on Artificial Intelligence, Melbourne, Australia, 2017.
[26] X. Wu, Y. Cai, Q. Li, J. Xu, H.F. Leung, Combining Contextual Information by Self-attention Mechanism in Convolutional Neural Networks for Text Classification, Springer Verlag, Dubai, UAE, 2018.
[27] J. Su, J. Zeng, D. Xiong, Y. Liu, M. Wang, J. Xie, A hierarchy-to-sequence attentional neural machine translation model, IEEE/ACM Trans. Audio Speech Language Process. 26 (2018), 623–632.
[28] Z. Yang, X. He, J. Gao, L. Deng, A. Smola, Stacked attention networks for image question answering, in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 2016.
[29] L. Gao, Z. Guo, H. Zhang, X. Xu, H.T. Shen, Video captioning with attention-based LSTM and semantic consistency, IEEE Trans. Multimedia 19 (2017), 2045–2055.
[30] Z. Peng, W. Shi, J. Tian, Z. Qi, B. Li, H. Hao, B. Xu, Attention-based bidirectional long short-term memory networks for relation classification, in Meeting of the Association for Computational Linguistics, Berlin, Germany, 2016.
[31] D. Zhang, M. Hong, L. Zou, F. Han, F. He, Z. Tu, Y. Ren, Attention pooling-based bidirectional gated recurrent units model for sentimental classification, Int. J. Comput. Int. Sys. 12 (2019), 723–732.
[32] L. Wang, Z. Cao, G. de Melo, Z. Liu, Relation classification via multi-level attention CNNs, in Meeting of the Association for Computational Linguistics, Berlin, Germany, 2016.
[33] X. Qiao, C. Peng, Z. Liu, Y. Hu, Word-character attention model for Chinese text classification, Int. J. Mach. Learn. Cybern. 10 (2019), 3521–3537.