
International Journal of Computational Intelligence Systems

Vol. 14(1), 2021, pp. 367–375


DOI: https://doi.org/10.2991/ijcis.d.201207.001; ISSN: 1875-6891; eISSN: 1875-6883
https://www.atlantis-press.com/journals/ijcis/

Research Article
A Short Text Classification Method Based on Convolutional
Neural Network and Semantic Extension

Haitao Wang1, Keke Tian1, Zhengjiang Wu1,*, Lei Wang2


1 Henan Polytechnic University, Jiaozuo City, Henan Province, 454003, China
2 Louisiana State University, Baton Rouge, Louisiana, 70803, United States
* Corresponding author. E-mail: [email protected]

ARTICLE INFO

Article History: Received 06 July 2020; Accepted 22 Nov 2020

Keywords: Short text; Classification; CNN; Semantic extension; Attention mechanism; Conceptualization

ABSTRACT

In order to solve the problem that traditional short text classification methods perform poorly on short text due to data sparsity and insufficient semantic features, we propose a short text classification method based on a convolutional neural network and semantic extension. Firstly, we propose an improved similarity measure to increase the coverage of the word vector table during short text preprocessing. Secondly, we propose a method for semantic expansion of short texts, which adds an attention mechanism to the neural network model to find related words in the short text; semantic expansion is then performed at the sentence level and the related-word level of the short text respectively. Finally, feature extraction from the short text is carried out by means of a classical convolutional neural network. The experimental results show that the proposed method is feasible for the short text classification task and that the classification effectiveness is significantly improved.

© 2021 The Authors. Published by Atlantis Press B.V.
This is an open access article distributed under the CC BY-NC 4.0 license (http://creativecommons.org/licenses/by-nc/4.0/).

1. INTRODUCTION

In recent years, the rapid development of a new generation of information technology represented by cloud computing and big data has promoted the arrival of a new era of the Internet. While the Internet brings convenience to people's lives, a large amount of text data is also generated every day. Among this data, short texts in the form of comments, Weibo posts, Q&A, and so on are growing rapidly and in huge numbers. These short texts usually have obvious limitations: they lack sufficient contextual information, may contain polysemous words, and sometimes contain spelling errors. How to quickly and effectively extract truly valuable information from massive short text data is precisely a problem that needs to be solved in the field of natural language processing (NLP), and it also has far-reaching significance.

Traditional machine learning text classification approaches such as Naive Bayes (NB) [1,2], Support Vector Machine (SVM) [3,4], K Nearest Neighbors (KNN) [5,6] and Decision Trees [7] often use the Bag of Words (BOW) [8] model to represent text and Term Frequency-Inverse Document Frequency (TF-IDF) to represent word weights. However, in the BOW model used by these classification approaches [9], sentences and documents are considered to be independent of each other and have no contextual relationship, so the fine-grained semantic information in the text may not be effectively extracted; in addition, the BOW model suffers from problems such as the curse of dimensionality and data sparsity [10].

With the emergence of deep learning in recent years, models based on deep neural networks, such as the recurrent neural network (RNN) and the convolutional neural network (CNN), have attracted more and more attention from researchers in the field of NLP. Mikolov et al. [11] proposed the well-known RNN-based language model, which uses an RNN to take the expression of the whole sentence into consideration; this model can capture long-term dependencies and learn the meaning of words. Kim [12] proposed a multi-size filter CNN model to extract richer text semantic features, which produced good results on text classification tasks and thus became one of the most representative models in the field of NLP. Yang et al. [13] proposed a hierarchical attention network for document classification; the model applies the attention mechanism not only at the document level but also at the word and sentence levels, so that it can attend to the most important content when constructing the document representation. Zhou et al. [14] proposed a network model based on a mixed attention mechanism for Chinese short text classification, which not only considers word-level and character-level features but also extracts classification-relevant semantic features through the attention mechanism.

However, if only the neural network model is used to extract the abstract features of short text semantic information, the classification effect will largely depend on the number of layers of the neural network; increasing the depth causes a geometric increase in the number of parameters of the entire model, thereby significantly increasing its training time. Therefore, in order to overcome the lack of short text semantic information, an external knowledge base can be used to expand the semantics of the text, thereby enriching the semantic features of the short text.

In summary, in this paper we propose a short text classification method based on CNN and semantic extension (SECNN). Under the condition that the neural network model has a given number of layers, we propose a novel method to find related words in short texts and perform semantic expansion at the sentence level and the related-word level of the short text at the same time, so that the classification effect on short text can be improved.

To sum up, our contributions are as follows:

Firstly, in the text preprocessing process, we propose an improved Jaro–Winkler similarity to find possible spelling errors in short text, so that the coverage of the pretrained word vector table can be improved.

Secondly, we propose a CNN model based on an attention mechanism to find related words of short text, and then use an external knowledge base to conceptualize the short text and the related words respectively, thus expanding the semantic features of the short text.

Finally, we use the classical CNN model to extract short text features and complete the classification process.

The rest of this paper is organized as follows. In Section 2, related work on text classification is reviewed. Section 3 presents the short text preprocessing method. In Section 4, we present the short text classification method in detail. Section 5 describes the experimental setup. Section 6 discusses the experimental results. The conclusion is drawn in Section 7.
2. RELATED WORK

In traditional text classification methods, the corresponding text semantic features are usually ignored during the classification process, and the fine-grained semantic information in the text cannot be effectively extracted, resulting in low interpretability of the final classification results. In order to address these problems, Wei et al. [15] proposed text representation and feature selection strategies for Chinese text classification based on n-grams; in the feature selection strategy, preprocessing within classes is combined with feature selection between classes. Post et al. [16] proposed the use of part-of-speech (POS) tagging and tree kernel technology to extract the explicit and implicit features of text. Gautam et al. [17] proposed a unary word segmentation technique and semantic analysis to represent the features of the text. Song et al. [18] used a probabilistic knowledge base to conceptualize short text, thereby improving the understanding of text semantics during the text classification process. Zhang et al. [19] proposed a short text classification method based on the latent Dirichlet allocation (LDA) topic model, which further alleviated the problem of context dependence in short text. Although these methods can extract relatively rich text feature information, they also have some limitations.

Word embedding is currently the most commonly used and most effective word vector representation for retaining semantic and grammatical information. Word vector technology was first proposed by Hinton [20]. Collobert et al. [21] used pretrained word vectors and CNN technology to classify texts for the first time, demonstrating the effectiveness of CNNs in text processing. Mikolov et al. [22] used a neural network model to learn a new vector representation called the word vector or word embedding, which contains the grammatical and semantic information of a word. Compared with the traditional bag-of-words representation, word vectors are characterized by low dimensionality, density and continuity. We therefore also use word vector technology in text preprocessing.

In recent years, natural language modeling methods have relied on CNNs to learn word embeddings and have shown promising results. Our method also uses a CNN to automatically and effectively extract short text features. Sotthisopha et al. [23] used the clustering of word vectors to find semantic units; at the same time, the Jaro–Winkler similarity was used during text preprocessing to find spelling errors in the text, but the Jaro–Winkler similarity used in that method only considers the matching degree of common prefixes between strings and ignores the suffix matching of strings. Zhang et al. [24] proposed a character-level CNN text classification model (CharCNN), which can obtain more fine-grained semantic features, but the model ignores the word-level semantic features in the text.

Compared with traditional machine learning text classification methods, the deep neural network model can effectively simulate the information processing of the human brain and can extract more abstract semantic features from the input features; thereby, the information the model depends on for the final classification is more reliable. However, short text usually lacks sufficient context information and semantic features, and if we rely only on increasing the number of neural network layers to improve the classification effect on short text, the number of parameters of the entire model grows geometrically. Wang et al. [25] proposed a method for classifying short texts using external knowledge bases and CNNs; while conceptualizing short texts enriches semantic features, it also captures finer-grained features at the character level. Wu et al. [26] proposed two methods (CNN-HE and CNN-VE) to combine word and contextual embeddings and then apply CNNs to capture semantic features. That paper also uses an external knowledge base to conceptualize the target word, but does not consider the semantic expansion of the entire sentence. This work is the closest to the short text classification method proposed in this paper, and our method is further studied and improved on the basis of it.

Based on the attention mechanism, the main features of the text can be extracted dynamically instead of directly processing the information of the entire text, so this mechanism has been widely used [27–29]. Peng et al. [30] proposed a bidirectional long short-term memory (LSTM) neural network based on the attention mechanism to capture the most important semantic information in sentences and use it for relation classification. Zhang et al. [31] proposed bidirectional gated recurrent units that integrate a novel attention pooling mechanism with a max-pooling operation to force the model to pay attention to the keywords in a sentence and automatically retain the most meaningful information of the text. Wang et al. [32] proposed a novel CNN architecture for relation classification, which uses two levels of attention mechanisms to better identify the context. Qiao et al. [33] proposed a word–character attention model for Chinese text classification; this model integrates two levels of attention models: the word-level attention model captures salient words that have a closer semantic relationship to the text meaning, and the character-level attention model selects discriminative characters of the text.

When faced with the problem of insufficient semantic information in short texts, these methods do not fully consider semantic expansion at the sentence level and the word level at the same time. Therefore, based on a neural network with an attention mechanism, we propose a novel method to find related words in short texts and perform semantic expansion at the sentence level and the related-word level of the short text simultaneously; we also propose an improved Jaro–Winkler similarity to find possible spelling errors in short texts during preprocessing, thereby improving the coverage of the pretrained word vector table.
thereby improving the coverage of the pretraining word vector table. ber of matching characters between the corresponding words in the
word vector table does not exceed a certain threshold, then the two
words deemed as match. Further count all the words in the word
3. SHORT TEXT PREPROCESSING vector table that match the uncovered word, calculate their similar-
ity according to formula (1) respectively, and finally select the word
This paper uses external corpus and Word2vec technology to train with the highest similarity and exceeding the minimum similarity
short text into a word vector table. Since the accuracy of the gener- threshold, then replace the uncovered words in the data set with the
ated word vector will affect the subsequent text feature extraction words in the corresponding word vector table. It can be seen from
effect of the CNN. Therefore, in the process of short text vectoriza- the above that after this preprocessing process, the spelling errors of
tion, each word should match the words in the word vector table as short text in the data set can be found as soon as possible, thereby
much as possible. However, due to the characteristics of short text, improving the coverage of the Word2vec word vector table.
some words are often spelled incorrectly, which will lead to the fail-
ure to find the corresponding word from the word vector table in
the subsequent word vectorization process, thus ignoring the key 4. THE PRESENTED METHOD
features that the word may represent.
As shown in Figure 1, the overall framework of our proposed short
In text preprocessing, we utilize the Jaro–Winkler similarity to find
text classification method is composed of four main components.
spelling errors in text, and carry out similarity comparison between
Firstly, we use the improved Jaro–Winkler similarity to find possi-
the words in the short text and the similar words in the word vec-
ble spelling errors in short texts in the preprocessing of short texts.
tor table, if the two are partially different but have a high similarity,
Secondly, related words of short text are found through a CNN
it means that the corresponding word in the short text may be mis-
model based on the attention mechanism. Thirdly, the external
spelled, then replace it according to the corresponding words in the
knowledge base Probase is used to conceptualize the short text and
word vector table. The misspelling position of short text is very ran-
the related words separately to generate the corresponding word
dom, it may be in the second half of the word, or occur in the first
vector matrix. Finally the classic CNN model is used to extract short
half of the word. However, the Jaro–Winkler distance metric only
text features to complete the classification process..
considers the matching degree of the common prefix between the
strings, that is, the spelling error of the word appears in the second
half. But at the same time, it also ignores the suffix matching of the
4.1. Find Related Word
string, that is, the spelling error occurs in the first half of the word.
For example, “argument” and “argument,” there is only one com- In short text, usually only a few words can represent the seman-
mon prefix, so the Jaro–Winkler distance cannot reflect the match- tics of the entire sentence, and most words do not contribute much
ing degree of these two strings well. to the semantic features of the short text. According to the differ-
Based on the Jaro–Winkler distance, this paper proposes an ent effects of different words on the classification effect, adding
improved Jaro–Winkler similarity with both prefix matching and an attention mechanism to the neural network model can enable
suffix matching, defined in formula (1) as follows: the model to find the words that really affect the semantics in the
context through the attention mechanism when establishing the
( ) ( )
simw = simj + l + l′ p′ 1 − simj (1) relationship between the current word and the context. This paper
refers to these words as related words in short text.
Since spelling errors in short text words are more common in the
In order to find related words of short text, this paper designs a CNN
second half of the word, the common prefix matching of the two
model, which consists of a single convolution layer and a pooling
words is given more weight, and the similarity result must not
layer with an attention mechanism, as shown in Figure 2.
exceed 1. So l is the length of the selected common prefix of two
English words, maximum value is 3, l′ is the length of the selected The convolutional layer is composed of a series of filters with learn-
common suffix of two English words, maximum value is 2, p′ is the able parameters. In this layer, by changing the weight values of these
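As an illustration of this preprocessing step, the Python sketch below implements the Jaro similarity of formula (2), the improved prefix-and-suffix Jaro–Winkler similarity of formula (1), and the replacement of uncovered data-set words by their most similar in-vocabulary counterparts. The scaling factor p = 0.1 and the minimum-similarity threshold of 0.9 are illustrative assumptions (the paper only bounds p′ by 0.2 and does not give a concrete threshold), and the candidate pre-filter on matching characters is folded into a brute-force scan of the vocabulary.

```python
def jaro(s1, s2):
    """Jaro similarity (formula 2): m matched characters within the usual
    matching window, t = half the number of transpositions."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1, matched2 = [False] * len(s1), [False] * len(s2)
    m = 0
    for i, ch in enumerate(s1):
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    t, k = 0, 0
    for i in range(len(s1)):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t /= 2
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3

def improved_jaro_winkler(s1, s2, p=0.1, max_prefix=3, max_suffix=2):
    """Formula (1): sim_w = sim_j + (l + l') * p' * (1 - sim_j), with the
    common-prefix length l capped at 3 and the common-suffix length l' at 2.
    p = 0.1 is an assumed value for the scaling factor p' (bounded by 0.2)."""
    sim_j = jaro(s1, s2)
    l = 0
    while l < min(len(s1), len(s2), max_prefix) and s1[l] == s2[l]:
        l += 1
    l_suf = 0
    while (l_suf < min(len(s1), len(s2), max_suffix)
           and s1[-1 - l_suf] == s2[-1 - l_suf]):
        l_suf += 1
    return sim_j + (l + l_suf) * p * (1.0 - sim_j)

def repair_vocabulary(dataset_words, vocab, min_similarity=0.90):
    """Map each data-set word missing from the pretrained vocabulary to its
    most similar in-vocabulary word, keeping the substitution only when the
    best score clears the minimum-similarity threshold (threshold assumed)."""
    replacements = {}
    for word in dataset_words:
        if word in vocab:
            continue
        best_word, best_sim = None, 0.0
        for candidate in vocab:
            sim = improved_jaro_winkler(word, candidate)
            if sim > best_sim:
                best_word, best_sim = candidate, sim
        if best_word is not None and best_sim >= min_similarity:
            replacements[word] = best_word
    return replacements
```

For example, `improved_jaro_winkler("arguement", "argument")` scores higher than the plain Jaro similarity because both the shared prefix and the shared suffix raise the score, which is the behavior the improved measure is designed to capture.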
4. THE PRESENTED METHOD

As shown in Figure 1, the overall framework of our proposed short text classification method is composed of four main components. Firstly, we use the improved Jaro–Winkler similarity to find possible spelling errors during short text preprocessing. Secondly, related words of the short text are found through a CNN model based on the attention mechanism. Thirdly, the external knowledge base Probase is used to conceptualize the short text and the related words separately and to generate the corresponding word vector matrices. Finally, the classic CNN model is used to extract short text features and complete the classification process.

Figure 1 Overview of short text classification method.
4.1. Find Related Word

In short text, usually only a few words represent the semantics of the entire sentence, and most words do not contribute much to the semantic features of the short text. Because different words affect the classification result to different degrees, adding an attention mechanism to the neural network model enables the model, when establishing the relationship between the current word and its context, to find the words that really affect the semantics. This paper refers to these words as the related words of the short text.

In order to find the related words of a short text, this paper designs a CNN model consisting of a single convolutional layer and a pooling layer with an attention mechanism, as shown in Figure 2.

Figure 2 Structure diagram of attention mechanism.

The convolutional layer is composed of a series of filters with learnable parameters. In this layer, by changing the weight values of these filters, the filters can obtain higher activation values for specific features; therefore, it is possible to extract higher-level short text semantic features. The width of the filter is fixed to the value m, which is the same as the dimension of the word vector, and the height of the filter is h. For example, the feature s_i can be extracted through a filter w ∈ R^(h×m), defined in formula (3) as follows:

s_i = f(w ⋅ [v_i : v_(i+h−1)] + b)    (3)

where f is a nonlinear function (in this paper we use ReLU), the operator (⋅) represents the convolution operation, b is a bias term, [v_i : v_(i+h−1)] represents a sequence of words of length h, and v_i represents a word.

Using filters of different heights in the convolution operation, a feature set S can be obtained by sliding the filter window, defined in formula (4) as follows:

S = [s_1, s_2, …, s_l]    (4)

where s_i is the feature vector generated by the convolution operation of each filter and l is the size of the set S.

By performing the tanh activation operation on the feature set S, the hidden representation u_i of the feature vector s_i can be obtained, as defined in formula (5). Then, performing the softmax operation on u_i, the attention weight β_i of the feature vector s_i is obtained, as defined in formula (6). Finally, each feature vector s_i is weighted and summed according to its attention weight β_i to obtain the pooled feature vector s_β, as defined in formula (7).

u_i = tanh(w ⋅ s_i + b)    (5)

β_i = softmax(w′ ⋅ u_i)    (6)

s_β = Σ_(i=1)^(l) β_i s_i    (7)

In the word vector space, semantically similar words usually lie at a similar distance. Therefore, the Euclidean distance between the feature vector s_β obtained through the attention mechanism and the word vector of each word of the short text is computed, and the closest word is taken as the related word of the short text.
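To make the attention pooling of formulas (5)-(7) and the related-word selection concrete, the NumPy sketch below computes the hidden representations, the attention weights and the pooled vector s_β, then picks the word whose embedding is closest to s_β. The parameter shapes (a hidden size for the projection) and the assumption that the pooled feature vector lives in the same space as the word vectors are illustrative readings of the description above, not details given in the original text.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_pool(S, W_att, b_att, w_prime):
    """Formulas (5)-(7).
    S       : (l, d) feature vectors s_i produced by the convolution filters
    W_att   : (d, h) projection weights, b_att: (h,), w_prime: (h,)
    Returns the pooled vector s_beta (d,) and the attention weights beta (l,)."""
    U = np.tanh(S @ W_att + b_att)      # formula (5): hidden representations u_i
    beta = softmax(U @ w_prime)         # formula (6): attention weights beta_i
    s_beta = beta @ S                   # formula (7): weighted sum of the s_i
    return s_beta, beta

def related_word(s_beta, words, vectors):
    """Pick the word of the short text whose embedding is closest to s_beta
    in Euclidean distance (the related word of Section 4.1)."""
    dists = np.linalg.norm(vectors - s_beta, axis=1)
    return words[int(np.argmin(dists))]

# Minimal usage with random toy values (weights would normally be learned).
rng = np.random.default_rng(0)
S = rng.normal(size=(6, 50))                       # 6 conv features of dim 50
s_beta, beta = attention_pool(S, rng.normal(size=(50, 20)),
                              np.zeros(20), rng.normal(size=20))
print(related_word(s_beta, ["cheap", "ticket", "flight"],
                   rng.normal(size=(3, 50))))
```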
4.2. Short Text Semantic Expansion

Short text usually lacks sufficient contextual information, sometimes does not follow the grammatical rules of natural language, and may contain polysemous words. This section uses the external knowledge base Probase to semantically extend the short text and generate concept vectors for it, which can effectively enrich the semantic features of the short text and thus achieve the purpose of semantic expansion.

By scanning the Probase knowledge base, we obtain a corresponding series of related concepts for each instance and then score the instances, the concepts and their relationships. For a given short text instance, we can acquire the corresponding conceptual relationships through the conceptual application programming interface (API) provided by Probase. Here we denote the concept vector as C = {<c_1, w_1>, <c_2, w_2>, …, <c_k, w_k>}, where c_i is a concept in the knowledge base and w_i is a weight representing the relevance of the short text to c_i.

After obtaining the concept sequence relationship of a short text instance, the next step is to generate the corresponding word vectors. In this paper, the Word2vec model is used to complete the pretraining vectorization operation. Since the conceptualization of the short text contains a series of concept weights, the corresponding weights must also be taken into account when generating word vectors. The formula for vectorization is defined in formula (8) as follows:

W_c = w_1 v_c1 ⊕ w_2 v_c2 ⊕ … ⊕ w_k v_ck    (8)

where W_c represents the word vector matrix of the conceptualized short text, w_i represents the weight of the degree of association between the short text and the concept c_i, v_ci represents the word vector corresponding to the concept c_i, and ⊕ is the concatenation operation.

The sentence-level concept sequence relationships obtained through short text conceptualization can extract richer short text semantic information. However, the word-level concept sequence relationship in short text is also important, because it can extract more fine-grained text semantic information. Therefore, on the basis of the sentence-level conceptualization of short text, this paper puts forward the conceptualization of related words.

After obtaining the related words in the short text by using the attention mechanism, the related words can be conceptualized to obtain the concept sequence C′ = {<c′_1, w′_1>, <c′_2, w′_2>, …, <c′_k, w′_k>}, where c′_i is a concept of the related word in the knowledge base and w′_i is the weight corresponding to this concept.

After the concept sequence relationship of the related word instances is obtained, the next step is to generate the corresponding word vectors. The Word2vec model is again used to complete the pretraining vectorization operation. Since the conceptualization of related words also includes a series of concept weights, the corresponding weights must be taken into account when generating the word vectors. The formula for vectorization is defined as follows:

W′_c = w′_1 v′_c1 ⊕ w′_2 v′_c2 ⊕ … ⊕ w′_k v′_ck    (9)

where W′_c represents the word vector matrix of the conceptualized related words, v′_ci represents the word vector corresponding to the concept c′_i, and ⊕ is the concatenation operation.

So far, the word vector matrix W_w of the short text, the word vector matrix W_c of the short text conceptualization, and the word vector matrix W′_c of the related word conceptualization have been obtained.
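As a sketch of this conceptualization step, the code below builds the weighted concept matrices of formulas (8) and (9) from a list of <concept, weight> pairs. The `probase_lookup` callable stands in for the Probase conceptualization API mentioned above; its name and interface are placeholders, and reading the ⊕ concatenation as stacking the weight-scaled concept embeddings row by row is our interpretation of the formulas rather than a detail stated in the paper.

```python
import numpy as np

def conceptualize(instance, probase_lookup, top_k=10):
    """Return the concept vector C = [<c_1, w_1>, ..., <c_k, w_k>] for an
    instance. probase_lookup is a placeholder for the Probase concept API:
    it should return an iterable of (concept, relevance_weight) pairs."""
    concepts = list(probase_lookup(instance))
    concepts.sort(key=lambda cw: cw[1], reverse=True)
    return concepts[:top_k]

def concept_matrix(concepts, word_vectors, dim):
    """Formulas (8)/(9): Wc = w_1 v_c1 (+) w_2 v_c2 (+) ... (+) w_k v_ck,
    built here as a (k, dim) matrix whose rows are the weight-scaled
    Word2vec vectors of the concepts (unknown concepts fall back to zeros)."""
    rows = [w * word_vectors.get(c, np.zeros(dim)) for c, w in concepts]
    return np.stack(rows) if rows else np.zeros((0, dim))
```

The same two functions would be applied once to the whole short text, yielding W_c of formula (8), and once to the related word found in Section 4.1, yielding W′_c of formula (9).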

4.3. CNN Short Text Classification

When using CNNs to classify short text, it is necessary to represent the short text as a matrix that serves as the input of the network model. Therefore, the word vector matrix W_w of the short text, the word vector matrix W_c of the short text conceptualization, and the word vector matrix W′_c of the related word conceptualization are cascaded to form the joint word vector matrix W of the short text, as shown in Figure 3; the corresponding formula is defined as follows:

W = W_w ⊕ W_c ⊕ W′_c    (10)

where W ∈ R^((n+2k)×m), n is the number of words in the short text, k represents the number of concepts obtained by conceptualizing the short text and the related words, and m is the dimension of the word vector.

Figure 3 Input layer joint word vector matrix.

Next, the classic CNN model is used to perform the convolution processing, pooling processing and fully connected softmax classification; its structure is shown in Figure 4.

Figure 4 Convolutional neural network (CNN) short text classification structure.

First, the joint word vector matrix is used as the input of the convolutional layer, and filters of several heights are used to perform the convolution operation, extract the features of the short text and generate a set of feature vectors.

In the pooling layer, max pooling is used, that is, the maximum value within a certain area is used as the output of that area. This reduces the number of parameters in the network, and it can also effectively prevent overfitting and improve the generalization ability of the model. Through max pooling, a fixed-length vector can be extracted from the feature maps. The specific calculation is as follows:

s_max = max(s_i)    (11)

where s_i represents a feature map formed by a filter performing the convolution operation on the short text, 0 < i ≤ M, and M is the number of feature maps. With max pooling, each feature map yields one maximum value; after pooling all the feature maps, the resulting values are concatenated to obtain the final feature vector of the pooling layer. At the same time, in order to reduce overfitting, dropout and L2 regularization are introduced in the hidden layer, with dropout randomly setting some feature values to zero.

As the last component of the entire CNN, the fully connected layer plays the role of the classifier; the classification model ultimately needs to complete the classification of the input short text. After the fixed-length feature vector is obtained from the convolutional and pooling layers, a fully connected softmax layer is introduced to complete the classification operation. The softmax layer converts a series of classification scores into classification probabilities: a larger classification score indicates a greater likelihood of belonging to the corresponding category, while a category with a smaller classification score has a lower probability.
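The classification network of this section can be sketched as a Kim-style CNN over the joint matrix W of formula (10). The paper does not name a deep learning framework, so the PyTorch implementation below, the class name, and the use of Adam with weight decay for the L2 term are assumptions; the filter heights (3, 4, 5), 100 filters per height, dropout rate of 0.5 and learning rate of 0.001 follow Section 5.2, while the embedding dimension of 300 is illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SECNNClassifier(nn.Module):
    """Sketch of the classification CNN: multi-height filters over the joint
    matrix W = Ww (+) Wc (+) W'c (formula 10), max pooling (formula 11),
    dropout and a fully connected softmax layer."""
    def __init__(self, embed_dim=300, num_classes=2,
                 filter_heights=(3, 4, 5), num_filters=100, dropout=0.5):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv2d(1, num_filters, kernel_size=(h, embed_dim))
            for h in filter_heights
        ])
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(num_filters * len(filter_heights), num_classes)

    def forward(self, joint_matrix):
        # joint_matrix: (batch, n + 2k, embed_dim), i.e. the rows of W
        x = joint_matrix.unsqueeze(1)                      # (batch, 1, n+2k, dim)
        feature_maps = [F.relu(conv(x)).squeeze(3) for conv in self.convs]
        # formula (11): one maximum value per feature map, then concatenate
        pooled = [F.max_pool1d(fm, fm.size(2)).squeeze(2) for fm in feature_maps]
        features = self.dropout(torch.cat(pooled, dim=1))  # fixed-length vector
        return F.log_softmax(self.fc(features), dim=1)     # class (log-)probabilities

# Illustrative training setup; the optimizer and weight_decay value are assumed,
# with weight_decay playing the role of the L2 regularization term.
model = SECNNClassifier(embed_dim=300, num_classes=2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
```

The joint matrix itself can be formed by row-wise concatenation of the three matrices (for example `torch.cat([Ww, Wc, Wc_rel], dim=0)` for a single sample) before batching, which mirrors formula (10).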
5. EXPERIMENTAL SETUP

To validate the classification results, we conduct extensive experiments on different short text data sets. The experimental configuration mainly includes an Intel(R) i7-7700 3.60 GHz processor, 16 GB of memory and a Python 3.7 programming environment.

5.1. Datasets

In order to demonstrate the effectiveness of the proposed short text classification method, we adopt classical short text data sets that have been widely used for text classification tasks in recent years. Basic information about the data sets is listed as follows:

• MR. The MR data set is an English movie review data set with a total of 10,662 samples and 2 categories, half positive and half negative. The average sentence length is 20.

• TREC. The TREC data set is a question answering data set with a total of 6,452 samples, including 5,952 training samples and 500 test samples, and 6 categories. The average sentence length is 10.

• AG News. The AG News data set is an English news article data set with a total of 127,600 samples, including 120,000 training samples and 7,600 test samples, and 4 categories. The average sentence length is 7.

• Twitter. The Twitter data set is an English sentiment classification data set with a total of 11,209 samples, including 8,204 training samples and 3,005 test samples. The number of categories is 3: positive, neutral and negative. The average sentence length is 19.

• SST-2. The SST-2 data set is an extension of the MR data set with a total of 9,613 samples. The number of categories is 2 and the average sentence length is 19.

5.2. Experimental Parameters

In our classification method, the Word2vec tool is used to train the word vectors on the data sets; the sizes of the convolution kernels are 3×dim, 4×dim and 5×dim, the number of convolution kernels is 100, the batch size is 64 and the learning rate is 0.001. To prevent overfitting, a dropout mechanism is introduced during training, with a dropout rate of 0.5.

The evaluation indicators used in the experiments are Accuracy and the F1 value, which measure the classification effect on short text; the formulas are defined in (12) and (13) as follows:

Accuracy = (TP + TN) / (P + N)    (12)

F1 = 2 × Precision × Recall / (Precision + Recall)    (13)

where TP is the number of samples that are actually positive and predicted to be positive, TN is the number of samples that are actually negative and predicted to be negative, and P and N represent the numbers of positive and negative samples, respectively.
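For completeness, the evaluation metrics of formulas (12) and (13) can be computed directly from confusion-matrix counts; the precision and recall definitions below are the standard ones that the F1 formula presupposes, and the numeric values in the usage lines are made-up illustrations rather than results from the paper.

```python
def accuracy(tp, tn, p, n):
    """Formula (12): proportion of correctly predicted samples."""
    return (tp + tn) / (p + n)

def f1(tp, fp, fn):
    """Formula (13): harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return (2 * precision * recall / (precision + recall)
            if (precision + recall) else 0.0)

# Toy example: 480 true positives and 450 true negatives out of 500 + 500 samples.
print(accuracy(tp=480, tn=450, p=500, n=500))   # 0.93
print(f1(tp=480, fp=50, fn=20))                 # ~0.932
```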
6. EXPERIMENTAL RESULTS AND ANALYSIS

In order to validate that the proposed short text classification method based on CNN and semantic extension (SECNN) has better classification effectiveness, we compare it against the CNN-rand and CNN-static models proposed by Kim [12], and against Zhang's character-level CNN text classification model (CharCNN) [24]. At the same time, we also compare our work with Wang's short text classification method based on conceptualization and convolutional neural networks (WCCNN) [25]. Finally, the methods most similar to our work (CNN-HE and CNN-VE), proposed by Wu [26], are compared with our classification method.

Among these, in the CNN-rand model all word vectors are randomly generated and trained as model parameters. In the CNN-static model, Word2vec pretrained vectors are used, and the word vectors are no longer updated during the training process; if there are words in the short text that are not in the pretrained dictionary, they are replaced by randomly generated vectors. In the WCCNN short text classification method, an external knowledge base is used to semantically extend the short text, and the word vector matrix of the short text and the conceptualized word vector matrix are combined as the input of the CNN. In CNN-HE, the word embedding matrix and the contextual embedding matrix are concatenated in the horizontal orientation to obtain the final embedding matrix, whereas in CNN-VE they are concatenated in the vertical orientation.

First, the classification accuracy of SECNN and the six comparison methods on the short text data sets MR, TREC, AG News, Twitter and SST-2 is tested through experiments. The experimental results are shown in Table 1.

Table 1 Accuracy comparison of different classification methods (%)

Methods      MR      TREC    AG News   Twitter   SST-2
CNN-rand     76.72   86.17   84.38     56.64     82.59
CNN-static   80.77   89.26   85.34     57.21     86.23
CharCNN      76.93   76.05   78.31     45.14     81.25
WCCNN        82.95   90.68   85.76     57.74     86.93
CNN-HE       82.29   91.28   85.84     56.91     87.16
CNN-VE       82.08   91.05   85.80     57.53     86.98
SECNN        83.89   91.34   86.02     57.93     87.37

As can be seen from Table 1, the classification method proposed by us achieves better accuracy than the other six methods. Among them, CNN-rand is close to CNN-static, with the latter higher than the former. This is because the former's word vectors are randomly initialized and modified during training, while the latter uses word vectors obtained by Word2vec training in advance, which can better express the text semantics. In this paper, the improved Jaro–Winkler similarity is used in text preprocessing to find possible spelling errors in short text and replace them, which improves the coverage of the word vector table on the data set.

We can also see that CharCNN does not perform well on these short text data sets. The reason is that short text usually lacks sufficient semantic features, and extracting features only at the character level cannot achieve a good classification effect. Compared with WCCNN, CNN-HE and CNN-VE, SECNN not only uses short text conceptualization but also proposes related-word conceptualization to further enrich the semantic information of short text; the problem of insufficient semantic information in short texts is thus better addressed, and the classification effect is also improved.

In order to better reflect the advantages of the proposed short text classification method, five other models were selected to perform multiple-iteration experiments on the MR data set, and the results were compared, as shown in Figure 5.

Figure 5 Comparison of accuracy under different epochs.

In the figure, the abscissa is the number of CNN training epochs and the ordinate is the accuracy of the model. It can be clearly seen that although the accuracy of each classification method gradually increases with the number of iterations, the advantages of SECNN and WCCNN are already apparent at the 1st epoch; by the 5th epoch the accuracy is at its highest, and the subsequent values are basically stable, indicating that the model has converged. It can be seen that the short text classification method proposed in this paper is also superior to the other classification methods in terms of stability.

To validate the other evaluation indicator, the F1 value of the method proposed in this paper is compared with the six comparison methods on the MR data set and the AG News data set. The experimental results are shown in Figures 6 and 7.

Figure 6 Comparison of F1 values of different classification methods on MR data set.

Figure 7 Comparison of F1 values of different classification methods on AG News data set.

As can be seen from Figures 6 and 7, the F1 value of WCCNN on the MR data set is significantly higher than that of classic text classification methods such as CNN-rand and CharCNN, while CNN-HE and CNN-VE are slightly below WCCNN's F1 value. From the results, we also find that SECNN's F1 value is slightly better than WCCNN's. This shows that the method proposed in this paper is feasible for the short text classification task, and the classification effect is significantly improved.
7. CONCLUSION

Aiming at the problems that traditional short text classification methods rely heavily on the number of neural network layers and do not perform well on short text due to data sparsity and insufficient semantic features, we propose a short text classification method based on CNN and semantic expansion. In order to improve the coverage of the pretrained word vector table in the process of short text vectorization, the improved Jaro–Winkler similarity is used to find possible spelling errors in short text during text preprocessing, so that words can be matched more accurately to the corresponding words in the corpus. At the same time, facing the problem of the limited semantic information that short text can provide, we introduce an external knowledge base to conceptualize the short text and its related words and thus extend the semantics of the short text. Experimental results demonstrate that the proposed method is feasible for the short text classification task, and the classification effectiveness is improved remarkably. However, when this classification method obtains the related words of a short text, it involves a large number of word vector distance calculations, so its time consumption inevitably requires further study. In future research, time complexity will be taken into account, and the method of obtaining related words in short text will be optimized to improve the classification efficiency of short text.

AUTHORS' CONTRIBUTIONS

Haitao Wang contributed to the conception of the study; Keke Tian performed the experiments and the data analyses and wrote the manuscript; Zhengjiang Wu provided technical support; Lei Wang helped perform the analysis with constructive discussions and also helped check the grammar of the paper.

ACKNOWLEDGMENTS

This work is supported by the National Natural Science Foundation of China (No. 11601129, 61503124), the Henan Science and Technology Key Project (No. 192102210280), the Fundamental Research Funds for the Universities of Henan Province, and the Doctor Foundation of Henan Polytechnic University (No. B2017-36).
REFERENCES

[1] J. Chen, H. Huang, S. Tian, Y. Qu, Feature selection for text classification with Naive Bayes, Expert Syst. Appl. 36 (2009), 5432–5435.
[2] S.B. Kim, H.C. Rim, S.H. Myaeng, K.S. Han, Some effective techniques for naive Bayes text classification, IEEE Trans. Knowl. Data Eng. 18 (2006), 1457–1466.
[3] M. Haddoud, A. Mokhtari, T. Lecroq, S. Abdeddaïm, Combining supervised term-weighting metrics for SVM text classification with extended term representation, Knowl. Inf. Syst. 49 (2016), 909–931.
[4] H. Kim, P. Howland, H. Park, Dimension reduction in text classification with support vector machines, J. Mach. Learn. Res. 6 (2005), 37–53.
[5] R. Li, Y. Hu, A density-based method for reducing the amount of training data in kNN text classification, J. Comput. Res. Dev. 41 (2004), 539–545.
[6] H. Wandabwa, D. Zhang, K. Sammy, Text categorization via attribute distance weighted k-nearest neighbor classification, in 2016 International Conference on Information Technology (ICIT), Bhubaneswar, India, 2016.
[7] Y. Wang, Z. Wang, Text categorization rule extraction based on fuzzy decision tree, Comput. App. 4 (2005), 1634–1637.
[8] S. Wang, C.D. Manning, Baselines and Bigrams: Simple, Good Sentiment and Topic Classification, Association for Computational Linguistics (ACL), Jeju Island, Korea, 2012.
[9] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, Comput. Sci. (2013).
[10] Y. Bengio, R. Ducharme, P. Vincent, C. Janvin, A neural probabilistic language model, J. Mach. Learn. Res. 3 (2003), 1137–1155.
[11] T. Mikolov, M. Karafiát, S. Khudanpur, Recurrent Neural Network Based Language Model, International Speech Communication Association, Prague, Czech Republic, 2010.
[12] Y. Kim, Convolutional Neural Networks for Sentence Classification, Association for Computational Linguistics (ACL), Doha, Qatar, 2014.
[13] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, E. Hovy, Hierarchical attention networks for document classification, in Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, 2016.
[14] Y. Zhou, J. Xu, J. Cao, B. Xu, C. Li, B. Xu, Hybrid attention networks for Chinese short text classification, Comp. y Sist. 21 (2017), 759–769.
[15] Z. Wei, D. Miao, J.H. Chauchat, R. Zhao, W. Li, N-grams based feature selection and text representation for Chinese text classification, Int. J. Comput. Int. Sys. 2 (2009), 365–374.
[16] M. Post, S. Bergsma, Explicit and Implicit Syntactic Features for Text Classification, Association for Computational Linguistics (ACL), Sofia, Bulgaria, 2013.
[17] G. Gautam, D. Yadav, Sentiment Analysis of Twitter Data Using Machine Learning Approaches and Semantic Analysis, Institute of Electrical and Electronics Engineers Inc., Noida, India, 2014.
[18] Y. Song, Z. Wang, H. Wang, Short Text Conceptualization Using a Probabilistic Knowledgebase, International Joint Conferences on Artificial Intelligence, Barcelona, Spain, 2011.
[19] Z. Zhang, D. Miao, C. Gao, Short text classification using latent Dirichlet allocation, J. Comput. App. 33 (2013), 1587–1590.
[20] G.E. Hinton, R.R. Salakhutdinov, Reducing the dimensionality of data with neural networks, Science. 313 (2006), 504–507.
[21] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, P. Kuksa, Natural language processing (almost) from scratch, J. Mach. Learn. Res. 12 (2011), 2493–2537.
[22] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, J. Dean, Distributed Representations of Words and Phrases and their Compositionality, Neural Information Processing Systems Foundation, Lake Tahoe, NV, USA, 2013.
[23] N. Sotthisopha, P. Vateekul, Improving Short Text Classification Using Fast Semantic Expansion on Multichannel Convolutional Neural Network, Institute of Electrical and Electronics Engineers Inc., Busan, Korea, 2018.
[24] X. Zhang, J. Zhao, Y. LeCun, Character-Level Convolutional Networks for Text Classification, Neural Information Processing Systems Foundation, Montreal, Canada, 2015.
[25] J. Wang, Z. Wang, D. Zhang, J. Yan, Combining Knowledge with Deep Convolutional Neural Networks for Short Text Classification, International Joint Conferences on Artificial Intelligence, Melbourne, Australia, 2017.
[26] X. Wu, Y. Cai, Q. Li, J. Xu, H.F. Leung, Combining Contextual Information by Self-attention Mechanism in Convolutional Neural Networks for Text Classification, Springer Verlag, Dubai, UAE, 2018.
[27] J. Su, J. Zeng, D. Xiong, Y. Liu, M. Wang, J. Xie, A hierarchy-to-sequence attentional neural machine translation model, IEEE/ACM Trans. Audio Speech Language Process. 26 (2018), 623–632.
[28] Z. Yang, X. He, J. Gao, L. Deng, A. Smola, Stacked attention networks for image question answering, in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 2015.
[29] L. Gao, Z. Guo, H. Zhang, X. Xu, H.T. Shen, Video captioning with attention-based LSTM and semantic consistency, IEEE Trans. Multimedia. 19 (2017), 2045–2055.
[30] Z. Peng, W. Shi, J. Tian, Z. Qi, B. Li, H. Hao, B. Xu, Attention-based bidirectional long short-term memory networks for relation classification, in Meeting of the Association for Computational Linguistics, Berlin, Germany, 2016.
[31] D. Zhang, M. Hong, L. Zou, F. Han, F. He, Z. Tu, Y. Ren, Attention pooling-based bidirectional gated recurrent units model for sentimental classification, Int. J. Comput. Int. Sys. 12 (2019), 723–732.
[32] L. Wang, Z. Cao, G. de Melo, Z. Liu, Relation classification via multi-level attention CNNs, in Meeting of the Association for Computational Linguistics, Berlin, Germany, 2016.
[33] X. Qiao, C. Peng, Z. Liu, Y. Hu, Word-character attention model for Chinese text classification, Int. J. Mach. Learn. Cybern. 10 (2019), 3521–3537.
