
Bag of Tricks for Efficient Text Classification

Armand Joulin Edouard Grave Piotr Bojanowski Tomas Mikolov


Facebook AI Research
{ajoulin,egrave,bojanowski,tmikolov}@fb.com

Abstract

This paper explores a simple and efficient baseline for text classification. Our experiments show that our fast text classifier fastText is often on par with deep learning classifiers in terms of accuracy, and many orders of magnitude faster for training and evaluation. We can train fastText on more than one billion words in less than ten minutes using a standard multicore CPU, and classify half a million sentences among 312K classes in less than a minute.

1 Introduction

Text classification is an important task in Natural Language Processing with many applications, such as web search, information retrieval, ranking and document classification (Deerwester et al., 1990; Pang and Lee, 2008). Recently, models based on neural networks have become increasingly popular (Kim, 2014; Zhang and LeCun, 2015; Conneau et al., 2016). While these models achieve very good performance in practice, they tend to be relatively slow both at train and test time, limiting their use on very large datasets.

Meanwhile, linear classifiers are often considered strong baselines for text classification problems (Joachims, 1998; McCallum and Nigam, 1998; Fan et al., 2008). Despite their simplicity, they often obtain state-of-the-art performance if the right features are used (Wang and Manning, 2012). They also have the potential to scale to very large corpora (Agarwal et al., 2014).

In this work, we explore ways to scale these baselines to very large corpora with a large output space, in the context of text classification. Inspired by the recent work in efficient word representation learning (Mikolov et al., 2013; Levy et al., 2015), we show that linear models with a rank constraint and a fast loss approximation can train on a billion words within ten minutes, while achieving performance on par with the state of the art. We evaluate the quality of our approach fastText (https://github.com/facebookresearch/fastText) on two different tasks, namely tag prediction and sentiment analysis.

2 Model architecture

A simple and efficient baseline for sentence classification is to represent sentences as bag of words (BoW) and train a linear classifier, e.g., a logistic regression or an SVM (Joachims, 1998; Fan et al., 2008). However, linear classifiers do not share parameters among features and classes. This possibly limits their generalization in the context of a large output space where some classes have very few examples. Common solutions to this problem are to factorize the linear classifier into low rank matrices (Schütze, 1992; Mikolov et al., 2013) or to use multilayer neural networks (Collobert and Weston, 2008; Zhang et al., 2015).

Figure 1 shows a simple linear model with rank constraint. The first weight matrix A is a look-up table over the words. The word representations are then averaged into a text representation, which is in turn fed to a linear classifier. The text representation is a hidden variable which can potentially be reused. This architecture is similar to the cbow model of Mikolov et al. (2013), where the middle word is replaced by a label. We use the softmax function f to compute the probability distribution over the predefined classes. For a set of N documents, this leads to minimizing the negative log-likelihood over the classes:

    -\frac{1}{N} \sum_{n=1}^{N} y_n \log\big( f(B A x_n) \big),
where x_n is the normalized bag of features of the n-th document, y_n the label, and A and B the weight matrices. This model is trained asynchronously on multiple CPUs using stochastic gradient descent and a linearly decaying learning rate.

Figure 1: Model architecture of fastText for a sentence with N ngram features x_1, ..., x_N. The features are embedded and averaged to form the hidden variable.
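To make the architecture concrete, here is a minimal NumPy sketch of such a rank-constrained linear model (our own illustration, not the released fastText code): an embedding look-up A, averaging into a hidden text vector, an output matrix B, a softmax, and stochastic gradient steps on the negative log-likelihood with a linearly decaying learning rate. Names such as BowLinearModel and train are ours; A, B, x_n and y_n mirror the notation above.

    import numpy as np

    def softmax(z):
        z = z - z.max()                     # shift for numerical stability
        e = np.exp(z)
        return e / e.sum()

    class BowLinearModel:
        """Average word/n-gram embeddings, then a linear layer and a softmax."""
        def __init__(self, vocab_size, hidden_dim, num_classes, seed=0):
            rng = np.random.default_rng(seed)
            self.A = rng.normal(scale=0.1, size=(vocab_size, hidden_dim))   # look-up table over features
            self.B = rng.normal(scale=0.1, size=(num_classes, hidden_dim))  # linear classifier

        def forward(self, feature_ids):
            hidden = self.A[feature_ids].mean(axis=0)   # text representation (hidden variable)
            return hidden, softmax(self.B @ hidden)     # distribution over the predefined classes

        def sgd_step(self, feature_ids, label, lr):
            """One stochastic gradient step on -log p(label | document)."""
            hidden, probs = self.forward(feature_ids)
            grad_out = probs.copy()
            grad_out[label] -= 1.0                      # gradient w.r.t. the scores B @ hidden
            grad_hidden = self.B.T @ grad_out
            self.B -= lr * np.outer(grad_out, hidden)
            np.add.at(self.A, feature_ids, -lr * grad_hidden / len(feature_ids))
            return -np.log(probs[label])

    def train(model, documents, labels, epochs=5, lr0=0.1):
        """Plain SGD with a learning rate that decays linearly to zero over training."""
        total, seen = epochs * len(documents), 0
        for _ in range(epochs):
            for doc, label in zip(documents, labels):
                model.sgd_step(doc, label, lr0 * (1.0 - seen / total))
                seen += 1

The actual model additionally relies on the hierarchical softmax and hashed n-gram features described next.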
2.1 Hierarchical softmax

When the number of classes is large, computing the linear classifier is computationally expensive. More precisely, the computational complexity is O(kh), where k is the number of classes and h the dimension of the text representation. In order to improve our running time, we use a hierarchical softmax (Goodman, 2001) based on the Huffman coding tree (Mikolov et al., 2013). During training, the computational complexity drops to O(h log2(k)).

The hierarchical softmax is also advantageous at test time when searching for the most likely class. Each node is associated with a probability that is the probability of the path from the root to that node. If the node is at depth l + 1 with parents n_1, ..., n_l, its probability is

    P(n_{l+1}) = \prod_{i=1}^{l} P(n_i).

This means that the probability of a node is always lower than the one of its parent. Exploring the tree with a depth first search and tracking the maximum probability among the leaves allows us to discard any branch associated with a small probability. In practice, we observe a reduction of the complexity to O(h log2(k)) at test time. This approach is further extended to compute the T-top targets at the cost of O(log(T)), using a binary heap.
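To illustrate the test-time search just described, here is a small Python sketch (our own illustration, assuming a binary tree whose internal nodes score a left/right decision with a sigmoid; this is not the authors' implementation). It multiplies probabilities along each path and prunes any branch whose running probability already falls below the best leaf found so far:

    import numpy as np
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Node:
        weights: Optional[np.ndarray] = None   # parameters of the binary decision (internal nodes)
        left: Optional["Node"] = None
        right: Optional["Node"] = None
        label: Optional[int] = None            # class id (leaves only)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def best_class(root, hidden):
        """Most likely leaf under the tree, pruning low-probability branches."""
        best = {"label": None, "prob": 0.0}

        def dfs(node, path_prob):
            if path_prob <= best["prob"]:      # probabilities only shrink deeper in the tree
                return
            if node.label is not None:         # leaf: path probability is the class probability
                best["label"], best["prob"] = node.label, path_prob
                return
            p_right = sigmoid(node.weights @ hidden)
            dfs(node.right, path_prob * p_right)
            dfs(node.left, path_prob * (1.0 - p_right))

        dfs(root, 1.0)
        return best["label"], best["prob"]

Computing the T most likely classes instead of the single best one amounts to replacing best with a bounded min-heap of size T, which corresponds to the binary-heap extension mentioned above.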
2.2 N-gram features

Bag of words is invariant to word order, but taking this order into account explicitly is often computationally very expensive. Instead, we use a bag of n-grams as additional features to capture some partial information about the local word order. This is very efficient in practice while achieving comparable results to methods that explicitly use the order (Wang and Manning, 2012).

We maintain a fast and memory efficient mapping of the n-grams by using the hashing trick (Weinberger et al., 2009) with the same hashing function as in Mikolov et al. (2011), and 10M bins if we only use bigrams, and 100M otherwise.
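For concreteness, the following sketch shows how bigrams can be hashed into a fixed number of bins and appended to the unigram features. The FNV-1a hash and the helper features_with_bigrams are our own illustrative choices, not necessarily the hashing function of Mikolov et al. (2011).

    NUM_BIGRAM_BINS = 10_000_000   # 10M bins when only bigrams are used (100M otherwise)

    def fnv1a(text):
        """Simple FNV-1a string hash, used here as a stand-in hashing function."""
        h = 2166136261
        for byte in text.encode("utf-8"):
            h = ((h ^ byte) * 16777619) & 0xFFFFFFFF
        return h

    def features_with_bigrams(tokens, vocab):
        """Unigram ids from the vocabulary, plus hashed bigram ids stored after them."""
        ids = [vocab[t] for t in tokens if t in vocab]
        for first, second in zip(tokens, tokens[1:]):
            bucket = fnv1a(first + " " + second) % NUM_BIGRAM_BINS
            ids.append(len(vocab) + bucket)    # bigram bins occupy extra rows of the look-up table A
        return ids

    # Example:
    # features_with_bigrams(["the", "movie", "was", "great"],
    #                       {"the": 0, "movie": 1, "was": 2, "great": 3})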
heap. --log multi
Model AG Sogou DBP Yelp P. Yelp F. Yah. A. Amz. F. Amz. P.
BoW (Zhang et al., 2015) 88.8 92.9 96.6 92.2 58.0 68.9 54.6 90.4
ngrams (Zhang et al., 2015) 92.0 97.1 98.6 95.6 56.3 68.5 54.3 92.0
ngrams TFIDF (Zhang et al., 2015) 92.4 97.2 98.7 95.4 54.8 68.5 52.4 91.5
char-CNN (Zhang and LeCun, 2015) 87.2 95.1 98.3 94.7 62.0 71.2 59.5 94.5
char-CRNN (Xiao and Cho, 2016) 91.4 95.2 98.6 94.5 61.8 71.7 59.2 94.1
VDCNN (Conneau et al., 2016) 91.3 96.8 98.7 95.7 64.7 73.4 63.0 95.7
fastText, h = 10 91.5 93.9 98.1 93.8 60.4 72.0 55.8 91.2
fastText, h = 10, bigram 92.5 96.8 98.6 95.7 63.9 72.3 60.2 94.6

Table 1: Test accuracy [%] on sentiment datasets. FastText has been run with the same parameters
for all the datasets. It has 10 hidden units and we evaluate it with and without bigrams. For char-CNN,
we show the best reported numbers without data augmentation.

Dataset      char-CNN (Zhang and LeCun, 2015)    VDCNN (Conneau et al., 2016)       fastText
             small            big                depth=9    depth=17    depth=29    h = 10, bigram
AG 1h 3h 24m 37m 51m 1s
Sogou - - 25m 41m 56m 7s
DBpedia 2h 5h 27m 44m 1h 2s
Yelp P. - - 28m 43m 1h09 3s
Yelp F. - - 29m 45m 1h12 4s
Yah. A. 8h 1d 1h 1h33 2h 5s
Amz. F. 2d 5d 2h45 4h20 7h 9s
Amz. P. 2d 5d 2h45 4h25 7h 10s

Table 2: Training time for a single epoch on sentiment analysis datasets compared to char-CNN and
VDCNN.

Results. We present the results in Table 1. We use 10 hidden units and run fastText for 5 epochs with a learning rate selected on a validation set from {0.05, 0.1, 0.25, 0.5}. On this task, adding bigram information improves the performance by 1-4%. Overall, our accuracy is slightly better than char-CNN and char-CRNN, and a bit worse than VDCNN. Note that we can increase the accuracy slightly by using more n-grams; for example, with trigrams the performance on Sogou goes up to 97.1%. Finally, Table 3 shows that our method is competitive with the methods presented in Tang et al. (2015). We tune the hyper-parameters on the validation set and observe that using n-grams up to 5 leads to the best performance. Unlike Tang et al. (2015), fastText does not use pre-trained word embeddings, which could explain the 1% difference in accuracy.
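The learning-rate selection just described amounts to a simple grid search over the candidate values on the validation set; a minimal sketch, assuming user-supplied train_fn and validation_accuracy callables (both hypothetical names):

    def select_learning_rate(train_fn, validation_accuracy, candidates=(0.05, 0.1, 0.25, 0.5)):
        """Train one model per candidate learning rate and keep the best on validation."""
        best_lr, best_acc = None, -1.0
        for lr in candidates:
            model = train_fn(lr)                 # e.g., 5 epochs of SGD at this learning rate
            acc = validation_accuracy(model)     # accuracy on the held-out validation set
            if acc > best_acc:
                best_lr, best_acc = lr, acc
        return best_lr, best_acc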

Model         Yelp’13   Yelp’14   Yelp’15   IMDB
SVM+TF        59.8      61.8      62.4      40.5
CNN           59.7      61.0      61.5      37.5
Conv-GRNN     63.7      65.5      66.0      42.5
LSTM-GRNN     65.1      67.1      67.6      45.3
fastText      64.2      66.2      66.6      45.2

Table 3: Comparison with Tang et al. (2015). The hyper-parameters are chosen on the validation set. We report the test accuracy.

Training time. Both char-CNN and VDCNN are trained on a NVIDIA Tesla K40 GPU, while our models are trained on a CPU using 20 threads. Table 2 shows that methods using convolutions are several orders of magnitude slower than fastText. While it is possible to have a 10× speed-up for char-CNN by using more recent CUDA implementations of convolutions, fastText takes less than a minute to train on these datasets. The GRNNs method of Tang et al. (2015) takes around 12 hours per epoch on a CPU with a single thread. Our speed-up compared to neural network based methods increases with the size of the dataset, going up to at least a 15,000× speed-up.

3.2 Tag prediction

Dataset and baselines. To test the scalability of our approach, further evaluation is carried out on the YFCC100M dataset (Thomee et al., 2016), which consists of almost 100M images with captions, titles and tags. We focus on predicting the tags according to the title and caption (we do not use the images). We remove the words and tags occurring less than 100 times and split the data into a train, validation and test set. The train set contains 91,188,648 examples (1.5B tokens). The validation set has 930,497 examples and the test set 543,424. The vocabulary size is 297,141 and there are 312,116 unique tags. We will release a script that recreates this dataset so that our numbers can be reproduced. We report precision at 1.
Input: taiyoucon 2011 digitals: individuals digital photos from the anime convention taiyoucon 2011 in mesa, arizona. if you know the model and/or the character, please comment.
Prediction: #cosplay
Tags: #24mm #anime #animeconvention #arizona #canon #con #convention #cos #cosplay #costume #mesa #play #taiyou #taiyoucon

Input: 2012 twin cities pride 2012 twin cities pride parade
Prediction: #minneapolis
Tags: #2012twincitiesprideparade #minneapolis #mn #usa

Input: beagle enjoys the snowfall
Prediction: #snow
Tags: #2007 #beagle #hillsboro #january #maddison #maddy #oregon #snow

Input: christmas
Prediction: #christmas
Tags: #cameraphone #mobile

Input: euclid avenue
Prediction: #newyorkcity
Tags: #cleveland #euclidavenue

Table 4: Examples from the validation set of YFCC100M dataset obtained with fastText with 200
hidden units and bigrams. We show a few correct and incorrect tag predictions.

We consider a frequency-based baseline which predicts the most frequent tag. We also compare with Tagspace (Weston et al., 2014), which is a tag prediction model similar to ours, but based on the Wsabie model of Weston et al. (2011). While the Tagspace model is described using convolutions, we consider the linear version, which achieves comparable performance but is much faster.
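For reference, precision at 1 and the frequency baseline can each be expressed in a few lines. This is our own sketch, assuming one set of gold tags per example; it is not code from the paper.

    from collections import Counter

    def most_frequent_tag(train_tag_sets):
        """Frequency baseline: always predict the tag seen most often at training time."""
        counts = Counter(tag for tags in train_tag_sets for tag in tags)
        return counts.most_common(1)[0][0]

    def precision_at_1(predictions, gold_tag_sets):
        """Fraction of examples whose top-ranked prediction appears among the gold tags."""
        hits = sum(1 for pred, gold in zip(predictions, gold_tag_sets) if pred in gold)
        return hits / len(gold_tag_sets)

    # Example with the frequency baseline:
    # baseline = most_frequent_tag(train_tags)
    # print(precision_at_1([baseline] * len(test_tags), test_tags))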
Results and training time. Table 5 presents a comparison of fastText and the baselines. We run fastText for 5 epochs and compare it to Tagspace for two sizes of the hidden layer, i.e., 50 and 200. Both models achieve a similar performance with a small hidden layer, but adding bigrams gives us a significant boost in accuracy. At test time, Tagspace needs to compute the scores for all the classes, which makes it relatively slow, while our fast inference gives a significant speed-up when the number of classes is large (more than 300K here). Overall, we are more than an order of magnitude faster to obtain a model with a better quality. The speed-up of the test phase is even more significant (a 600× speed-up). Table 4 shows some qualitative examples.

Model                        prec@1    Running time
                                       Train     Test
Freq. baseline               2.2       -         -
Tagspace, h = 50             30.1      3h8       6h
Tagspace, h = 200            35.6      5h32      15h
fastText, h = 50             31.2      6m40      48s
fastText, h = 50, bigram     36.7      7m47      50s
fastText, h = 200            41.1      10m34     1m29
fastText, h = 200, bigram    46.1      13m38     1m37

Table 5: Prec@1 on the test set for tag prediction on YFCC100M. We also report the training time and test time. Test time is reported for a single thread, while training uses 20 threads for both models.

4 Discussion and conclusion

In this work, we propose a simple baseline method for text classification. Unlike unsupervisedly trained word vectors from word2vec, our word features can be averaged together to form good sentence representations. In several tasks, fastText obtains performance on par with recently proposed methods inspired by deep learning, while being much faster. Although deep neural networks have in theory much higher representational power than shallow models, it is not clear if simple text classification problems such as sentiment analysis are the right ones to evaluate them. We will publish our code so that the research community can easily build on top of our work.

Acknowledgement. We thank Gabriel Synnaeve, Hervé Jégou, Jason Weston and Léon Bottou for their help and comments. We also thank Alexis Conneau, Duyu Tang and Zichao Zhang for providing us with information about their methods.
References

Alekh Agarwal, Olivier Chapelle, Miroslav Dudík, and John Langford. 2014. A reliable effective terascale linear learning system. Journal of Machine Learning Research, 15(Mar):1111–1133.

Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, ICML '08, pages 160–167, Helsinki, Finland. ACM.

Alexis Conneau, Holger Schwenk, Loïc Barrault, and Yann Lecun. 2016. Very deep convolutional networks for natural language processing. arXiv preprint arXiv:1606.01781.

Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407.

Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008. Liblinear: A library for large linear classification. Journal of Machine Learning Research, 9(Aug):1871–1874.

Joshua Goodman. 2001. Classes for fast maximum entropy training. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, volume 1, pages 561–564, Salt Lake City, USA. IEEE.

Thorsten Joachims. 1998. Text categorization with support vector machines: Learning with many relevant features. In Claire Nédellec and Céline Rouveirol, editors, 10th European Conference on Machine Learning, pages 137–142, Chemnitz, Germany. Springer Berlin Heidelberg.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751, Doha, Qatar, October. Association for Computational Linguistics.

Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3:211–225.

Andrew McCallum and Kamal Nigam. 1998. A comparison of event models for naive bayes text classification. In AAAI Workshop on Learning for Text Categorization, pages 41–48, Madison, USA.

Tomáš Mikolov, Anoop Deoras, Daniel Povey, Lukáš Burget, and Jan Černocký. 2011. Strategies for training large scale neural network language models. In Workshop on Automatic Speech Recognition and Understanding, pages 196–201, Waikoloa, USA. IEEE.

Tomáš Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In 1st International Conference on Learning Representations (ICLR), Scottsdale, USA.

Bo Pang and Lillian Lee. 2008. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1-2):1–135, January.

H. Schütze. 1992. Dimensions of meaning. In Proceedings of the 1992 ACM/IEEE Conference on Supercomputing, Supercomputing '92, pages 787–796, Los Alamitos, CA, USA. IEEE Computer Society Press.

Duyu Tang, Bing Qin, and Ting Liu. 2015. Document modeling with gated recurrent neural network for sentiment classification. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1422–1432, Lisbon, Portugal, September. Association for Computational Linguistics.

Bart Thomee, David A. Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. 2016. YFCC100M: The new data in multimedia research. Communications of the ACM, 59(2):64–73.

Sida Wang and Christopher Manning. 2012. Baselines and bigrams: Simple, good sentiment and topic classification. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 90–94, Jeju Island, Korea, July. Association for Computational Linguistics.

Kilian Weinberger, Anirban Dasgupta, John Langford, Alex Smola, and Josh Attenberg. 2009. Feature hashing for large scale multitask learning. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pages 1113–1120, New York, NY, USA. ACM.

Jason Weston, Samy Bengio, and Nicolas Usunier. 2011. Wsabie: Scaling up to large vocabulary image annotation. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume Three, IJCAI '11, pages 2764–2770. AAAI Press.

Jason Weston, Sumit Chopra, and Keith Adams. 2014. #tagspace: Semantic embeddings from hashtags. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1822–1827, Doha, Qatar, October. Association for Computational Linguistics.

Yijun Xiao and Kyunghyun Cho. 2016. Efficient character-level document classification by combining convolution and recurrent layers. arXiv preprint arXiv:1602.00367.

Xiang Zhang and Yann LeCun. 2015. Text understanding from scratch. arXiv preprint arXiv:1502.01710.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems 28, pages 649–657, Montreal, Canada.
