Neural Networks For Text Classification
1 Introduction
The aim of Natural Language Processing (NLP) is to analyze and extract information from textual data so that computers can understand language the way humans do. Unlike images, texts carry strong sequential structure, which makes their processing distinctive.
The level of processing ranges from the paragraph level through the sentence and word levels down to the character level. Deep neural network architectures have achieved state-of-the-art results in many areas such as speech recognition [1] and computer vision [2]. The use of neural networks in natural language processing can be traced back to [3], where the backpropagation algorithm was used to make networks learn familial relations. A major advance came when [4] applied neural networks to represent words in a distributed, compositional manner. [5] proposed two neural network models, CBoW and Skip-gram, for efficient distributed representation of words, a major breakthrough in NLP. Since then, neural network architectures have achieved state-of-the-art results in many NLP applications such as machine translation [6], text summarization [7] and conversation models [8].
Convolutional Neural Networks [9] were devised primarily for images and have shown remarkable results in computer vision [10,11]. Beyond image processing, their effectiveness in natural language processing has also been explored, with strong performance in sentence [12] and text classification [13].
The intuition behind Self-Normalizing Neural Networks (SNN) is to drive neuron activations across all layers towards zero mean and unit variance. This is achieved with the activation proposed for SNNs, SELU (scaled exponential linear units). SELUs replicate an effect similar to batch normalization, reducing the number of parameters while providing robust learning. A dedicated dropout and weight initialization scheme further support this learning, distinguishing SNNs from traditional neural networks. Since image inputs and text inputs differ in form and characteristics, in this paper we propose revisions to the SNN architecture to make it effective on text.
To explore the effectiveness of self-normalizing neural networks in text classification, we propose an architecture, the Self-Normalizing Convolutional Neural Network (SCNN), built upon convolutional neural networks. A thorough study of SCNNs on various benchmark text datasets is essential to ascertain the importance of SNNs in natural language processing.
2 Related Work
Prior to the success of deep learning, text classification heavily relied on good
feature engineering and various machine learning algorithms.
Convolutional Neural Networks [9] were devised primarily for images and have shown remarkable results in computer vision [10,11]. Beyond image processing, their effectiveness in natural language processing has also been explored and found to be strong. Kim [12] represented an input sentence as word embeddings stacked into a two-dimensional structure whose width corresponds to the embedding size and whose height corresponds to the sentence length. Processing this structure with kernel filters of fixed window size, followed by max pooling to capture the most salient information, yielded promising results on text classification. Additionally, very deep CNN architectures [13] have shown state-of-the-art results in text classification, significantly reducing error rates. Since CNNs are limited to fixed window sizes, [14] proposed a recurrent convolutional architecture that exploits recurrent structures to capture distant contextual information in ways fixed windows may not be able to.
Klambauer et al. [15] proposed Self-Normalizing Neural Networks (SNN) built upon feed-forward neural networks, which significantly outperformed standard FNN architectures on various machine learning tasks. Since then, the activation proposed in SNNs, SELU, has been widely studied in image processing [16,17,18], where it has been applied to CNNs to achieve better results. SELU's effectiveness has also been explored in text processing tasks [19,20,21]. However, these works [16,17,18,19,20,21] are limited to applying just the SELU activation in their respective architectures. In an SNN, to obtain a normalized output without requiring layers such as batch normalization, the inputs are normalized.
3.2 Initialization
LeCun normal initialization draws samples centered around 0 and with standard deviation

stddev = \sqrt{\frac{1}{in}}    (2)

where in and out denote the dimensions of the weight matrix, corresponding to the number of nodes in the previous and current layer respectively.
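As an illustration of this initialization scheme (a minimal NumPy sketch, not code from the original work; the fan_in and fan_out names are ours):

import numpy as np

def lecun_normal(fan_in: int, fan_out: int, rng=np.random.default_rng(0)):
    """Draw a (fan_in, fan_out) weight matrix from a normal distribution with
    zero mean and stddev = sqrt(1 / fan_in), as in equation (2)."""
    stddev = np.sqrt(1.0 / fan_in)
    return rng.normal(loc=0.0, scale=stddev, size=(fan_in, fan_out))

# Example: weights connecting a 300-node layer to a 128-node layer.
W = lecun_normal(300, 128)
print(W.mean(), W.std())  # close to 0 and sqrt(1/300) ≈ 0.0577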
3.3 SELU activations
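The SELU activation [15] scales an exponential linear unit by a fixed factor λ:

selu(x) = \lambda \begin{cases} x & \text{if } x > 0 \\ \alpha e^{x} - \alpha & \text{if } x \le 0 \end{cases}    (3)

where λ ≈ 1.0507 and α ≈ 1.6733 are constants derived in [15] so that zero mean and unit variance are preserved from layer to layer.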
4 Model
The SELU activation originally proposed for SNNs [15] preserves the properties of an SNN only if the inputs are normalized. When the inputs are normalized, applying SELU to the activations does not shift the mean. If the inputs are not normalized, however, the parameter λ in SELU scales the neuron outputs by a factor of λ, shifting the mean and variance away from the desired values. These values are then propagated to subsequent layers, shifting the mean and variance further and further. Since the input word embeddings cannot be normalized, as explained in section 4.1, we use the ELU activation [26] in the proposed SCNN model instead. The ELU activation function is defined as:
elu(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha e^{x} - \alpha & \text{if } x \le 0 \end{cases}    (4)
where α is a hyperparameter and e denotes the exponential function. The absence of the parameter λ in ELU prevents the additional scaling of neuron outputs. The ELU activation pushes the mean of the activations closer to zero even when the inputs are not normalized, which enables faster learning [26]. We compare the performance of SCNN with both SELU and ELU activations; the results are presented in Table 3 and Figure 2.
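The following NumPy sketch (illustrative only; λ and α for SELU are the constants from [15]) makes the difference concrete:

import numpy as np

def elu(x, alpha=1.0):
    # elu(x) = x for x > 0, alpha * e^x - alpha for x <= 0, as in equation (4)
    return np.where(x > 0, x, alpha * np.exp(x) - alpha)

def selu(x, lam=1.0507, alpha=1.6733):
    # selu(x) = lam * elu(x, alpha): the extra factor lam rescales every output
    return lam * np.where(x > 0, x, alpha * np.exp(x) - alpha)

# Unnormalized inputs (e.g. raw word-embedding values): the lam factor in SELU
# rescales the outputs, which is the mean/variance shift discussed above.
x = np.random.default_rng(0).normal(loc=0.5, scale=2.0, size=10_000)
print("elu  mean/var:", elu(x).mean(), elu(x).var())
print("selu mean/var:", selu(x).mean(), selu(x).var())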
Fig. 1: Architecture of the proposed SCNN model

The SCNN architecture is shown in Figure 1. Let V be the vocabulary size considered for each dataset and X ∈ R^{V×d} represent the word embedding matrix, where each word X_i is a d-dimensional word vector. Words present in the pre-trained word embeddings† are assigned their corresponding vectors; words that are not present are initialized to zeros. In our experiments, SCNN performed better when absent words were initialized to zeros rather than randomly. A maximum sentence length of N is considered per sentence or paragraph; if a sentence or paragraph is shorter than N, it is zero padded. Therefore, a matrix I ∈ R^{N×d} per sentence or paragraph is provided as input to the SCNN model.

† https://code.google.com/archive/p/word2vec/
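A minimal sketch of this input construction (the tokenization and the embedding lookup here are simplifying assumptions, not the exact preprocessing pipeline):

import numpy as np

def build_input(tokens, embeddings, N, d):
    """Build the N x d matrix I: words found in the pre-trained embeddings get
    their vectors, absent words stay zero, and short texts are zero padded."""
    I = np.zeros((N, d), dtype=np.float32)
    for i, tok in enumerate(tokens[:N]):      # truncate to at most N tokens
        vec = embeddings.get(tok)             # dict mapping word -> d-dim vector
        if vec is not None:
            I[i] = vec
    return I

# Hypothetical usage with 300-dimensional vectors and N = 60.
emb = {"good": np.ones(300, dtype=np.float32)}
x = build_input("this movie is good".split(), emb, N=60, d=300)
print(x.shape)  # (60, 300)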
A convolution operation is applied on I with kernels K ∈ R^{h×d} (h ∈ {3, 4, 5}), i.e. over the input vectors with a window of size h. The kernel weights are initialized using LeCun normal [25] and the biases are initialized to 0. A new feature vector C ∈ R^{(N−h+1)×1} is obtained after the convolution operation for each filter:
C = f(I \ast K)    (5)
where f is the activation function (ELU). The number of convolution filters varies with the dataset; Table 2 summarizes the number of parameters for all of our experiments. A max-pooling operation is applied over each feature vector C to obtain its maximum value, and the outputs of the max-pooling layer across all filters are concatenated. Alpha dropout [15] with a dropout rate of 0.5 is applied to the concatenated layer. The concatenated layer is densely connected to the output layer, whose activation is sigmoid for binary classification and softmax otherwise.
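A minimal Keras-style sketch of this architecture for the binary-classification case, assuming a TensorFlow/Keras version that provides AlphaDropout (the function name and the split of 70 filters per window size, which gives the 210 filters of Table 2, are our assumptions):

from tensorflow.keras import layers, models

def build_scnn(N=60, d=300, filters_per_window=70, window_sizes=(3, 4, 5)):
    """Sketch of SCNN: parallel ELU convolutions with windows h in {3, 4, 5},
    LeCun normal kernels, zero biases, max pooling, alpha dropout, dense output."""
    inputs = layers.Input(shape=(N, d))
    pooled = []
    for h in window_sizes:
        c = layers.Conv1D(filters_per_window, h,
                          activation="elu",
                          kernel_initializer="lecun_normal",
                          bias_initializer="zeros")(inputs)   # N - h + 1 positions per filter
        pooled.append(layers.GlobalMaxPooling1D()(c))          # max value of each filter
    concat = layers.Concatenate()(pooled)
    dropped = layers.AlphaDropout(0.5)(concat)                 # alpha dropout [15]
    outputs = layers.Dense(1, activation="sigmoid")(dropped)   # softmax for multi-class tasks
    return models.Model(inputs, outputs)

model = build_scnn()
model.summary()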
Table 2: Number of convolution filters and parameters for each dataset

                     No. of Conv. Filters               No. of Parameters
Datasets     SCNN and Short-CNN   Static CNN [12]   SCNN and Short-CNN   Static CNN [12]
MR                  210                300               ≈ 254k              ≈ 362k
SO                  210                300               ≈ 254k              ≈ 362k
IMDB                210                300               ≈ 254k              ≈ 362k
TREC                210                300               ≈ 254k              ≈ 362k
CR                   90                300               ≈ 108k              ≈ 362k
MPQA                 90                300               ≈ 108k              ≈ 362k
5.1 Datasets
We performed experiments on various benchmark text classification datasets. Summary statistics for the datasets are shown in Table 1.
MPQA: The dataset consists of 3311 positive and 7293 negative reviews; the task is binary classification of positive versus negative opinion [32].
SCNN with SELU activation: SNN was originally proposed with the SELU activation. We performed experiments on SCNN using SELU as the activation function in place of ELU.
Short CNN: Our SCNN model has fewer parameters than the Static CNN model [12]. To show the effectiveness of SCNN, we ran the Static CNN model with the same number of parameters as SCNN; we refer to this model as Short CNN.
5.4 Training
We process the datasets as follows: each sentence or paragraph is converted to lower case, and stop words are not removed. We choose a vocabulary size V for each dataset based on word counts. The IMDB and TREC datasets have predefined test sets; for the other datasets we use 10-fold cross validation. The chosen parameters vary with the dataset size; Table 2 shows the parameters of the SCNN model for all datasets. We use Adam [33] as the optimizer for training SCNN.
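A sketch of the 10-fold evaluation under these choices (the arrays, epoch count, and batch size are placeholders; build_scnn refers to the model sketch above):

import numpy as np
from sklearn.model_selection import StratifiedKFold

# Placeholder data: X holds N x d embedded inputs, y holds binary labels.
X = np.zeros((100, 60, 300), dtype=np.float32)
y = np.array([0, 1] * 50)

scores = []
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, test_idx in kfold.split(np.zeros(len(y)), y):   # stratify on labels
    model = build_scnn()
    model.compile(optimizer="adam",                            # Adam [33]
                  loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(X[train_idx], y[train_idx], epochs=5, batch_size=50, verbose=0)
    _, acc = model.evaluate(X[test_idx], y[test_idx], verbose=0)
    scores.append(acc)

print("10-fold cross-validation accuracy:", float(np.mean(scores)))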
6.2 Discussion
SCNN models against Short-CNN:
When we compare SCNN with SELU and SCNN against Short CNN, both SCNN variants outperform Short CNN on all datasets. This shows that SCNN models perform better than CNN models (Short CNN) with the same number of parameters, indicating better generalization. There is a significant improvement in accuracy and F1-score when SCNN models are used in place of CNN. We believe the use of the ELU and SELU activation functions in the SCNN models, as opposed to ReLU, is the leading factor behind this performance difference. In particular, the ReLU activation suffers from the dying ReLU problem‡: ReLU clamps negative values to 0, so negative values in the pre-trained word vectors are ignored and the information they carry is lost. ELU and SELU avoid this problem by producing non-zero activations for negative inputs. Compared to ReLU, ELU and SELU converge faster and more accurately during training, leading to better generalization performance.
‡ http://cs231n.github.io/neural-networks-1/
Table 3: Performance of the models on different datasets

                             Accuracy                      F1-Score
Model             MR       SO       IMDB     TREC      CR       MPQA
Short CNN         77.762   89.63    78.84    85.2      76.246   80.906
SCNN w/ SELU      80.266   91.99    80.664   89.6      77.166   84.062
SCNN              80.308   91.759   82.708   90.4      77.666   84.068
CNN-static [12]   81       93       78.692   92.8      76.852   82.584
7 Conclusion
We propose SCNN for text classification. Our observations indicate that SCNN performs comparably to the CNN (Static-CNN [12]) model with substantially fewer parameters. Moreover, SCNN performs significantly better than a CNN with an equal number of parameters. The experimental results demonstrate the effectiveness of self-normalizing neural networks in text classification. Currently, SCNN uses a relatively simple architecture; our work can be extended by experimenting with SCNN in deeper architectures. In addition, SNNs can be applied to recurrent neural networks (RNNs) and their performance analyzed.
References
1. Hinton, G., Deng, L., Yu, D., Dahl, G.E., Mohamed, A.r., Jaitly, N., Senior, A.,
Vanhoucke, V., Nguyen, P., Sainath, T.N., et al.: Deep neural networks for acoustic
modeling in speech recognition: The shared views of four research groups. IEEE
Signal processing magazine 29 (2012) 82–97
2. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep con-
volutional neural networks. In: Advances in neural information processing systems.
(2012) 1097–1105
3. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by back-
propagating errors. Nature 323 (1986) 533–536
4. Bengio, Y., Ducharme, R., Vincent, P., Jauvin, C.: A neural probabilistic language
model. Journal of machine learning research 3 (2003) 1137–1155
5. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word repre-
sentations in vector space. arXiv preprint arXiv:1301.3781 (2013)
6. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning
to align and translate. arXiv preprint arXiv:1409.0473 (2014)
7. Rush, A.M., Chopra, S., Weston, J.: A neural attention model for abstractive
sentence summarization. arXiv preprint arXiv:1509.00685 (2015)
8. Vinyals, O., Le, Q.: A neural conversational model. arXiv preprint
arXiv:1506.05869 (2015)
9. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to
document recognition. Proceedings of the IEEE 86 (1998) 2278–2324
10. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep
convolutional neural networks. Neural Information Processing Systems 25 (2012)
11. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale
image recognition. arXiv preprint arXiv:1409.1556 (2014)
12. Kim, Y.: Convolutional neural networks for sentence classification. arXiv preprint
arXiv:1408.5882 (2014)
13. Conneau, A., Schwenk, H., Barrault, L., Lecun, Y.: Very deep convolutional net-
works for text classification. arXiv preprint arXiv:1606.01781 (2016)
14. Lai, S., Xu, L., Liu, K., Zhao, J.: Recurrent convolutional neural networks for text
classification. In: AAAI. Volume 333. (2015) 2267–2273
15. Klambauer, G., Unterthiner, T., Mayr, A., Hochreiter, S.: Self-normalizing neu-
ral networks. In Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R.,
Vishwanathan, S., Garnett, R., eds.: Advances in Neural Information Processing
Systems 30. Curran Associates, Inc. (2017) 971–980
16. Lguensat, R., Sun, M., Fablet, R., Tandeo, P., Mason, E., Chen, G.: Eddynet: A
deep neural network for pixel-wise classification of oceanic eddies. In: IGARSS
2018-2018 IEEE International Geoscience and Remote Sensing Symposium, IEEE
(2018) 1764–1767
17. Zhang, J., Shi, Z.: Deformable deep convolutional generative adversarial network
in microwave based hand gesture recognition system. In: Wireless Communications
and Signal Processing (WCSP), 2017 9th International Conference on, IEEE (2017)
1–6
18. Goh, G.B., Hodas, N.O., Siegel, C., Vishnu, A.: Smiles2vec: An interpretable
general-purpose deep neural network for predicting chemical properties. arXiv
preprint arXiv:1712.02034 (2017)
19. Kumar, S.S., Kumar, M.A., Soman, K.: Sentiment analysis of tweets in Malayalam
using long short-term memory units and convolutional neural nets. In: Interna-
tional Conference on Mining Intelligence and Knowledge Exploration, Springer
(2017) 320–334
20. Rosá, A., Chiruzzo, L., Etcheverry, M., Castro, S.: RETUYT in TASS 2017: Sentiment
analysis for Spanish tweets using SVM and CNN. arXiv preprint arXiv:1710.06393
(2017)
21. Meisheri, H., Ranjan, K., Dey, L.: Sentiment extraction from consumer-generated
noisy short texts. In: Data Mining Workshops (ICDMW), 2017 IEEE International
Conference on, IEEE (2017) 399–406
22. Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by
reducing internal covariate shift. In: International Conference on Machine Learning. (2015) 448–456
23. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.:
Dropout: A simple way to prevent neural networks from overfitting. Journal of
Machine Learning Research 15 (2014) 1929–1958
24. Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward
neural networks. In Teh, Y.W., Titterington, M., eds.: Proceedings of the Thir-
teenth International Conference on Artificial Intelligence and Statistics. Volume 9
of Proceedings of Machine Learning Research., Chia Laguna Resort, Sardinia, Italy,
PMLR (2010) 249–256
25. LeCun, Y., Bottou, L., Orr, G.B., Müller, K.R.: Efficient backprop. In: Neural Net-
works: Tricks of the Trade, This Book is an Outgrowth of a 1996 NIPS Workshop,
London, UK, UK, Springer-Verlag (1998) 9–50
26. Clevert, D., Unterthiner, T., Hochreiter, S.: Fast and accurate deep network learn-
ing by exponential linear units (elus). CoRR abs/1511.07289 (2015)
27. Pang, B., Lee, L.: Seeing stars: Exploiting class relationships for sentiment cate-
gorization with respect to rating scales. In: Proceedings of the ACL. (2005)
28. Pang, B., Lee, L.: A sentimental education: Sentiment analysis using subjectivity
summarization based on minimum cuts. In: Proceedings of the ACL. (2004)
29. Maas, A.L., Daly, R.E., Pham, P.T., Huang, D., Ng, A.Y., Potts, C.: Learning
word vectors for sentiment analysis. In: Proceedings of the 49th Annual Meeting
of the Association for Computational Linguistics: Human Language Technologies,
Portland, Oregon, USA, Association for Computational Linguistics (2011) 142–150
30. Li, X., Roth, D.: Learning question classifiers. In: Proceedings of the 19th In-
ternational Conference on Computational Linguistics - Volume 1. COLING ’02,
Stroudsburg, PA, USA, Association for Computational Linguistics (2002) 1–7
31. Hu, M., Liu, B.: Mining and summarizing customer reviews. In: Proceedings of the
tenth ACM SIGKDD international conference on Knowledge discovery and data
mining, ACM (2004) 168–177
32. Wiebe, J., Wilson, T., Cardie, C.: Annotating expressions of opinions and emotions
in language. Language resources and evaluation 39 (2005) 165–210
33. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. CoRR
abs/1412.6980 (2014)