Learning To Generate Reviews and Discovering Sentiment
OpenAI, San Francisco, California, USA. Correspondence to: Alec Radford <[email protected]>.

Abstract

We explore the properties of byte-level recurrent language models. When given sufficient amounts of capacity, training data, and compute time, the representations learned by these models include disentangled features corresponding to high-level concepts. Specifically, we find a single unit which performs sentiment analysis. These representations, learned in an unsupervised manner, achieve state of the art on the binary subset of the Stanford Sentiment Treebank. They are also very data efficient. When using only a handful of labeled examples, our approach matches the performance of strong baselines trained on full datasets. We also demonstrate the sentiment unit has a direct influence on the generative process of the model. Simply fixing its value to be positive or negative generates samples with the corresponding positive or negative sentiment.

1. Introduction and Motivating Work

Representation learning (Bengio et al., 2013) plays a critical role in many modern machine learning systems. Representations map raw data to more useful forms, and the choice of representation is an important component of any application. Broadly speaking, there are two areas of research emphasizing different details of how to learn useful representations.

The supervised training of high-capacity models on large labeled datasets is critical to the recent success of deep learning techniques for a wide range of applications such as image classification (Krizhevsky et al., 2012), speech recognition (Hinton et al., 2012), and machine translation (Wu et al., 2016). Analysis of the task-specific representations learned by these models reveals many fascinating properties (Zhou et al., 2014). Image classifiers learn a broadly useful hierarchy of feature detectors re-representing raw pixels as edges, textures, and objects (Zeiler & Fergus, 2014). In the field of computer vision, it is now commonplace to reuse these representations on a broad suite of related tasks - one of the most successful examples of transfer learning to date (Oquab et al., 2014).

There is also a long history of unsupervised representation learning (Olshausen & Field, 1997). Much of the early research into modern deep learning was developed and validated via this approach (Hinton & Salakhutdinov, 2006) (Huang et al., 2007) (Vincent et al., 2008) (Coates et al., 2010) (Le, 2013). Unsupervised learning is promising due to its ability to scale beyond only the subsets and domains of data that can be cleaned and labeled given resource, privacy, or other constraints. This advantage is also its difficulty. While supervised approaches have clear objectives that can be directly optimized, unsupervised approaches rely on proxy tasks such as reconstruction, density estimation, or generation, which do not directly encourage useful representations for specific tasks. As a result, much work has gone into designing objectives, priors, and architectures meant to encourage the learning of useful representations. We refer readers to Goodfellow et al. (2016) for a detailed review.

Despite these difficulties, there are notable applications of unsupervised learning. Pre-trained word vectors are a vital part of many modern NLP systems (Collobert et al., 2011). These representations, learned by modeling word co-occurrences, increase the data efficiency and generalization capability of NLP systems (Pennington et al., 2014) (Chen & Manning, 2014). Topic modelling can also discover factors within a corpus of text which align to human interpretable concepts such as art or education (Blei et al., 2003).

How to learn representations of phrases, sentences, and documents is an open area of research. Inspired by the success of word vectors, Kiros et al. (2015) propose skip-thought vectors, a method of training a sentence encoder by predicting the preceding and following sentence. The representation learned by this objective performs competitively on a broad suite of evaluated tasks. More advanced training techniques such as layer normalization (Ba et al., 2016) further improve results. However, skip-thought vectors are still outperformed by supervised models which directly optimize the desired performance metric on a specific dataset. This is the case for both text classification tasks, which measure whether a specific concept is well encoded in a representation, and more general semantic similarity tasks. This occurs even when the datasets are relatively small by modern standards, often consisting of only a few thousand labeled examples.
In contrast to learning a generic representation on one large dataset and then evaluating on other tasks/datasets, Dai & Le (2015) proposed using similar unsupervised objectives such as sequence autoencoding and language modeling to first pretrain a model on a dataset and then finetune it for a given task. This approach outperformed training the same model from random initialization and achieved state of the art on several text classification datasets. Combining language modelling with topic modelling and fitting a small supervised feature extractor on top has also achieved strong results on in-domain document level sentiment analysis (Dieng et al., 2016).

Considering this, we hypothesize two effects may be combining to result in the weaker performance of purely unsupervised approaches. Skip-thought vectors were trained on a corpus of books. But some of the classification tasks they are evaluated on, such as sentiment analysis of reviews of consumer goods, do not have much overlap with the text of novels. We propose this distributional issue, combined with the limited capacity of current models, results in representational underfitting. Current generic distributed sentence representations may be very lossy - good at capturing the gist, but poor with the precise semantic or syntactic details which are critical for applications.

The experimental and evaluation protocols may be underestimating the quality of unsupervised representation learning for sentences and documents due to certain seemingly insignificant design decisions. Hill et al. (2016) also raises concern about current evaluation tasks in their recent work, which provides a thorough survey of architectures and objectives for learning unsupervised sentence representations - including the above mentioned skip-thoughts.

In this work, we test whether this is the case. We focus in on the task of sentiment analysis and attempt to learn an unsupervised representation that accurately contains this concept. Mikolov et al. (2013) showed that word-level recurrent language modelling supports the learning of useful word vectors, and we are interested in pushing this line of work. As an approach, we consider the popular research benchmark of byte (character) level language modelling due to its further simplicity and generality. We are also interested in evaluating this approach as it is not immediately clear whether such a low-level training objective supports the learning of high-level representations. We train on a very large corpus picked to have a similar distribution as our task of interest. We also benchmark on a wider range of tasks to quantify the sensitivity of the learned representation to various degrees of out-of-domain data and tasks.
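The low-level training objective referred to above is, in its standard form, next-step prediction over raw bytes: for a review encoded as a byte sequence $b_1, \ldots, b_T$, the model is trained to minimize

$\mathcal{L}(\theta) = -\sum_{t=1}^{T-1} \log p_\theta(b_{t+1} \mid b_1, \ldots, b_t),$

where each $b_t$ takes one of 256 possible values and $p_\theta$ is the model's predictive distribution over the next byte; no sentiment labels enter this objective.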
2. Dataset

Much previous work on language modeling has evaluated on relatively small but competitive datasets such as Penn Treebank (Marcus et al., 1993) and Hutter Prize Wikipedia (Hutter, 2006). As discussed in Jozefowicz et al. (2016), performance on these datasets is primarily dominated by regularization. Since we are interested in high-quality sentiment representations, we chose the Amazon product review dataset introduced in McAuley et al. (2015) as a training corpus. In de-duplicated form, this dataset contains over 82 million product reviews from May 1996 to July 2014, amounting to over 38 billion training bytes. Due to the size of the dataset, we first split it into 1000 shards containing equal numbers of reviews and set aside 1 shard for validation and 1 shard for test.
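A minimal sketch of this split; the round-robin assignment and the function names are our own illustrative choices, as the text only specifies 1000 equal-sized shards with one each held out for validation and test:

def split_into_shards(reviews, num_shards=1000):
    # Round-robin assignment yields shards with (nearly) equal numbers of reviews.
    shards = [[] for _ in range(num_shards)]
    for i, review in enumerate(reviews):
        shards[i % num_shards].append(review)
    # Hold out one shard for validation and one for test; the rest is training data.
    valid, test, train = shards[0], shards[1], shards[2:]
    return train, valid, test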
Figure 1. The mLSTM converges faster and achieves a better result within our time budget compared to a standard LSTM with the same hidden state size.

3. Model and Training Details

Many potential recurrent architectures and hyperparameter settings were considered in preliminary experiments on the dataset. Given the size of the dataset, searching the wide space of possible configurations is quite costly. To help alleviate this, we evaluated the generative performance of smaller candidate models after a single pass through the dataset. The model chosen for the large scale experiment is a single layer multiplicative LSTM (Krause et al., 2016) with 4096 units. We observed multiplicative LSTMs to converge faster than normal LSTMs for the hyperparameter settings explored.
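The multiplicative LSTM differs from a standard LSTM in that the recurrent contribution to each gate is first modulated element-wise by a projection of the current input. A minimal NumPy sketch of one step of such a cell, following the formulation of Krause et al. (2016); the weight naming and the omission of biases are simplifications for illustration:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mlstm_step(x, h_prev, c_prev, W):
    # x: input byte embedding (d_in,); h_prev, c_prev: previous states (d_h,);
    # W: dict of weight matrices. Biases and normalization are omitted.
    m = (W["mx"] @ x) * (W["mh"] @ h_prev)   # input-dependent intermediate state
    i = sigmoid(W["ix"] @ x + W["im"] @ m)   # input gate
    f = sigmoid(W["fx"] @ x + W["fm"] @ m)   # forget gate
    o = sigmoid(W["ox"] @ x + W["om"] @ m)   # output gate
    u = np.tanh(W["ux"] @ x + W["um"] @ m)   # candidate cell update
    c = f * c_prev + i * u                   # element-wise cell update
    h = o * np.tanh(c)                       # new hidden state
    return h, c

The element-wise products in the last two lines are the axis-aligned operations referred to in the discussion section.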
[Figure: example movie reviews displayed as figure content.]

Figure 5. Performance on the binary version of the Yelp reviews dataset as a function of labeled training examples. The model's performance plateaus after about ten labeled examples and only slowly improves with additional data.

Table 3. Microsoft Paraphrase Corpus
On a small sentence-level dataset of a known domain (the movie reviews of Stanford Sentiment Treebank) our model sets a new state of the art. But on a large, document-level dataset of a different domain (the Yelp reviews) it is only competitive with standard baselines.
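A result like Figure 5 can be produced by fitting a simple linear classifier on the frozen representation with varying amounts of labeled data. A sketch of that protocol; extract_features stands in for running the trained model over each document and reading out its states, and the solver settings are our own defaults:

from sklearn.linear_model import LogisticRegression

def probe_accuracy(extract_features, train_texts, train_labels,
                   test_texts, test_labels, n_labeled):
    # Fit a linear classifier on a frozen representation using only n_labeled examples.
    X_train = extract_features(train_texts[:n_labeled])
    X_test = extract_features(test_texts)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, train_labels[:n_labeled])
    return clf.score(X_test, test_labels)

# e.g. sweep n_labeled over 10, 100, 1000, ... to trace a curve like Figure 5.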
4.4. Other Tasks

Besides classification, we also evaluate on two other standard tasks: semantic relatedness and paraphrase detection. While our model performs competitively on the Microsoft Research Paraphrase Corpus (Dolan et al., 2004) in Table 3, it performs poorly on the SICK semantic relatedness task (Marelli et al., 2014) in Table 4. It is likely that the form and content of the semantic relatedness task, which is built on top of descriptions of images and videos and contains sentences such as "A sea turtle is hunting for fish", is effectively out-of-domain for our model, which has only been trained on the text of product reviews.

4.5. Generative Analysis

Although the focus of our analysis has been on the properties of our model's representation, it is trained as a generative model and we are also interested in its generative capabilities. Hu et al. (2017) and Dong et al. (2017) both designed conditional generative models to disentangle the content of text from various attributes like sentiment or tense. We were curious whether a similar result could be achieved using the sentiment unit. In Table 5 we show that by simply setting the sentiment unit to be positive or negative, the model generates corresponding positive or negative reviews. While all sampled negative reviews contain sentences with negative sentiment, they sometimes contain sentences with positive sentiment as well. This might be reflective of the bias of the training corpus, which contains over 5x as many five star reviews as one star reviews. Nevertheless, it is interesting to see that such a simple manipulation of the model's representation has a noticeable effect on its behavior. The samples are also high quality for a byte-level language model and often include valid sentences.
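A sketch of the manipulation just described: during sampling, a single coordinate of the recurrent state is overwritten at every step before the next byte is drawn. The model interface (step, initial_state), the choice of clamping the cell state, and the clamp values are illustrative assumptions rather than exact settings:

import numpy as np

def sample_with_fixed_unit(model, prompt, n_bytes, unit, value, rng=np.random):
    # Generate n_bytes from a byte-level RNN while holding one cell unit fixed.
    # `prompt` must be a non-empty bytes object used to condition the model.
    h, c = model.initial_state()
    for b in prompt:
        logits, (h, c) = model.step(b, (h, c))
        c[unit] = value                    # overwrite the sentiment cell
    out = []
    for _ in range(n_bytes):
        p = np.exp(logits - logits.max())  # softmax over the 256 byte values
        p /= p.sum()
        b = int(rng.choice(256, p=p))      # sample the next byte
        out.append(b)
        logits, (h, c) = model.step(b, (h, c))
        c[unit] = value                    # keep the unit clamped throughout
    return bytes(out)

# Fixing `value` to a positive or negative number yields samples with the
# corresponding sentiment, as in Table 5.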
5. Discussion and Future Work

It is an open question why our model recovers the concept of sentiment in such a precise, disentangled, interpretable, and manipulable way. It is possible that sentiment as a conditioning feature has strong predictive capability for language modelling. This is likely since sentiment is such an important component of a review. Previous work analysing LSTM language models showed the existence of interpretable units that indicate position within a line or presence inside a quotation (Karpathy et al., 2015). In many ways, the sentiment unit in this model is just a scaled-up example of the same phenomenon. The update equation of an LSTM could play a role. The element-wise
operation of its gates may encourage axis-aligned representations. Models such as word2vec have also been observed to have small subsets of dimensions strongly associated with specific tasks (Li et al., 2016).
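To make the point concrete, the cell update of a standard (or multiplicative) LSTM can be written in the usual form

$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \qquad h_t = o_t \odot \tanh(c_t),$

where $\odot$ denotes element-wise multiplication: every gate scales each cell coordinate independently, so information stored in one unit is carried forward without being mixed into other units between time steps, which may favor representations that align a single factor such as sentiment with a single coordinate.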
Our work highlights the sensitivity of learned representations to the data distribution they are trained on. The results make clear that it is unrealistic to expect a model trained on a corpus of books, where the two most common genres are Romance and Fantasy, to learn an encoding which preserves the exact sentiment of a review. Likewise, it is unrealistic to expect a model trained on Amazon product reviews to represent the precise semantic content of a caption of an image or a video.

There are several promising directions for future work highlighted by our results. The observed performance plateau, even on relatively similar domains, suggests improving the representation model both in terms of architecture and size. Since our model operates at the byte level, hierarchical/multi-timescale extensions could improve the quality of representations for longer documents. The sensitivity of learned representations to their training domain could be addressed by training on a wider mix of datasets with better coverage of target tasks. Finally, our work encourages further research into language modelling as it demonstrates that the standard language modelling objective with no modifications is sufficient to learn high-quality representations.
References
Ba, Jimmy Lei, Kiros, Jamie Ryan, and Hinton, Geoffrey E. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.

Bengio, Yoshua, Courville, Aaron, and Vincent, Pascal. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798-1828, 2013.

Blei, David M, Ng, Andrew Y, and Jordan, Michael I. Latent dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993-1022, 2003.

Chen, Danqi and Manning, Christopher D. A fast and accurate dependency parser using neural networks. In EMNLP, pp. 740-750, 2014.

Coates, Adam, Lee, Honglak, and Ng, Andrew Y. An analysis of single-layer networks in unsupervised feature learning. Ann Arbor, 1001(48109):2, 2010.

Collobert, Ronan, Weston, Jason, Bottou, Leon, Karlen, Michael, Kavukcuoglu, Koray, and Kuksa, Pavel. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493-2537, 2011.

Dai, Andrew M and Le, Quoc V. Semi-supervised sequence learning. In Advances in Neural Information Processing Systems, pp. 3079-3087, 2015.

Dieng, Adji B, Wang, Chong, Gao, Jianfeng, and Paisley, John. TopicRNN: A recurrent neural network with long-range semantic dependency. arXiv preprint arXiv:1611.01702, 2016.

Dolan, Bill, Quirk, Chris, and Brockett, Chris. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, pp. 350. Association for Computational Linguistics, 2004.

Dong, Li, Huang, Shaohan, Wei, Furu, Lapata, Mirella, Zhou, Ming, and Ke, Xu. Learning to generate product reviews from attributes. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, pp. 623-632. Association for Computational Linguistics, 2017.

Goodfellow, Ian, Bengio, Yoshua, and Courville, Aaron. Deep Learning. 2016.

Hill, Felix, Cho, Kyunghyun, and Korhonen, Anna. Learning distributed representations of sentences from unlabelled data. arXiv preprint arXiv:1602.03483, 2016.

Hinton, Geoffrey, Deng, Li, Yu, Dong, Dahl, George E, Mohamed, Abdel-rahman, Jaitly, Navdeep, Senior, Andrew, Vanhoucke, Vincent, Nguyen, Patrick, Sainath, Tara N, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82-97, 2012.

Hinton, Geoffrey E and Salakhutdinov, Ruslan R. Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507, 2006.

Hu, Minqing and Liu, Bing. Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 168-177. ACM, 2004.

Hu, Zhiting, Yang, Zichao, Liang, Xiaodan, Salakhutdinov, Ruslan, and Xing, Eric P. Controllable text generation. arXiv preprint arXiv:1703.00955, 2017.

Huang, Fu Jie, Boureau, Y-Lan, LeCun, Yann, et al. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In Computer Vision and Pattern Recognition, 2007. CVPR'07. IEEE Conference on, pp. 1-8. IEEE, 2007.
Hutter, Marcus. The human knowledge compression contest. URL http://prize.hutter1.net, 2006.

Jozefowicz, Rafal, Vinyals, Oriol, Schuster, Mike, Shazeer, Noam, and Wu, Yonghui. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410, 2016.

Karpathy, Andrej, Johnson, Justin, and Fei-Fei, Li. Visualizing and understanding recurrent networks. arXiv preprint arXiv:1506.02078, 2015.

Kim, Yoon. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882, 2014.

Kingma, Diederik and Ba, Jimmy. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Kiros, Ryan, Zhu, Yukun, Salakhutdinov, Ruslan R, Zemel, Richard, Urtasun, Raquel, Torralba, Antonio, and Fidler, Sanja. Skip-thought vectors. In Advances in Neural Information Processing Systems, pp. 3294-3302, 2015.

Krause, Ben, Lu, Liang, Murray, Iain, and Renals, Steve. Multiplicative LSTM for sequence modelling. arXiv preprint arXiv:1609.07959, 2016.

Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097-1105, 2012.

Kumar, Ankit, Irsoy, Ozan, Su, Jonathan, Bradbury, James, English, Robert, Pierce, Brian, Ondruska, Peter, Gulrajani, Ishaan, and Socher, Richard. Ask me anything: Dynamic memory networks for natural language processing. CoRR, abs/1506.07285, 2015.

Le, Quoc V. Building high-level features using large scale unsupervised learning. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pp. 8595-8598. IEEE, 2013.

Li, Jiwei, Monroe, Will, and Jurafsky, Dan. Understanding neural networks through representation erasure. arXiv preprint arXiv:1612.08220, 2016.

Looks, Moshe, Herreshoff, Marcello, Hutchins, DeLesley, and Norvig, Peter. Deep learning with dynamic computation graphs. arXiv preprint arXiv:1702.02181, 2017.

Maas, Andrew L, Daly, Raymond E, Pham, Peter T, Huang, Dan, Ng, Andrew Y, and Potts, Christopher. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pp. 142-150. Association for Computational Linguistics, 2011.

Madnani, Nitin, Tetreault, Joel, and Chodorow, Martin. Re-examining machine translation metrics for paraphrase identification. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 182-190. Association for Computational Linguistics, 2012.

Marcus, Mitchell P, Marcinkiewicz, Mary Ann, and Santorini, Beatrice. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313-330, 1993.

Marelli, Marco, Bentivogli, Luisa, Baroni, Marco, Bernardi, Raffaella, Menini, Stefano, and Zamparelli, Roberto. SemEval-2014 task 1: Evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment. SemEval-2014, 2014.

McAuley, Julian, Pandey, Rahul, and Leskovec, Jure. Inferring networks of substitutable and complementary products. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785-794. ACM, 2015.

Mesnil, Gregoire, Mikolov, Tomas, Ranzato, Marc'Aurelio, and Bengio, Yoshua. Ensemble of generative and discriminative techniques for sentiment analysis of movie reviews. arXiv preprint arXiv:1412.5335, 2014.

Mikolov, Tomas, Yih, Wen-tau, and Zweig, Geoffrey. Linguistic regularities in continuous space word representations. 2013.

Miyato, Takeru, Dai, Andrew M, and Goodfellow, Ian. Adversarial training methods for semi-supervised text classification. arXiv preprint arXiv:1605.07725, 2016.

Munkhdalai, Tsendsuren and Yu, Hong. Neural semantic encoders. arXiv preprint arXiv:1607.04315, 2016.

Ng, Andrew Y. Feature selection, L1 vs. L2 regularization, and rotational invariance. In Proceedings of the Twenty-First International Conference on Machine Learning, pp. 78. ACM, 2004.

Olshausen, Bruno A and Field, David J. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37(23):3311-3325, 1997.

Oquab, Maxime, Bottou, Leon, Laptev, Ivan, and Sivic, Josef. Learning and transferring mid-level image representations using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1717-1724, 2014.
Pang, Bo and Lee, Lillian. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, pp. 271. Association for Computational Linguistics, 2004.

Pang, Bo and Lee, Lillian. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pp. 115-124. Association for Computational Linguistics, 2005.

Pennington, Jeffrey, Socher, Richard, and Manning, Christopher D. GloVe: Global vectors for word representation. In EMNLP, volume 14, pp. 1532-1543, 2014.

Salimans, Tim and Kingma, Diederik P. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, pp. 901-901, 2016.

Socher, Richard, Perelygin, Alex, Wu, Jean Y, Chuang, Jason, Manning, Christopher D, Ng, Andrew Y, Potts, Christopher, et al. Recursive deep models for semantic compositionality over a sentiment treebank. Citeseer, 2013.

Tai, Kai Sheng, Socher, Richard, and Manning, Christopher D. Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075, 2015.

Vincent, Pascal, Larochelle, Hugo, Bengio, Yoshua, and Manzagol, Pierre-Antoine. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pp. 1096-1103. ACM, 2008.

Wang, Sida and Manning, Christopher D. Baselines and bigrams: Simple, good sentiment and topic classification. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2, pp. 90-94. Association for Computational Linguistics, 2012.

Wiebe, Janyce, Wilson, Theresa, and Cardie, Claire. Annotating expressions of opinions and emotions in language. Language Resources and Evaluation, 39(2):165-210, 2005.

Wieting, John, Bansal, Mohit, Gimpel, Kevin, and Livescu, Karen. Towards universal paraphrastic sentence embeddings. arXiv preprint arXiv:1511.08198, 2015.

Wu, Yonghui, Schuster, Mike, Chen, Zhifeng, Le, Quoc V, Norouzi, Mohammad, Macherey, Wolfgang, Krikun, Maxim, Cao, Yuan, Gao, Qin, Macherey, Klaus, et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.

Yergeau, Francois. UTF-8, a transformation format of ISO 10646. 2003.

Zeiler, Matthew D and Fergus, Rob. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pp. 818-833. Springer, 2014.

Zhang, Xiang, Zhao, Junbo, and LeCun, Yann. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, pp. 649-657, 2015.

Zhao, Han, Lu, Zhengdong, and Poupart, Pascal. Self-adaptive hierarchical sentence model. arXiv preprint arXiv:1504.05070, 2015.

Zhou, Bolei, Khosla, Aditya, Lapedriza, Agata, Oliva, Aude, and Torralba, Antonio. Object detectors emerge in deep scene CNNs. arXiv preprint arXiv:1412.6856, 2014.