Learning to Generate Reviews and Discovering Sentiment

Alec Radford 1 Rafal Jozefowicz 1 Ilya Sutskever 1

OpenAI, San Francisco, California, USA.
Alec Radford <[email protected]>.

Abstract

We explore the properties of byte-level recurrent language models. When given sufficient amounts of capacity, training data, and compute time, the representations learned by these models include disentangled features corresponding to high-level concepts. Specifically, we find a single unit which performs sentiment analysis. These representations, learned in an unsupervised manner, achieve state of the art on the binary subset of the Stanford Sentiment Treebank. They are also very data efficient. When using only a handful of labeled examples, our approach matches the performance of strong baselines trained on full datasets. We also demonstrate the sentiment unit has a direct influence on the generative process of the model. Simply fixing its value to be positive or negative generates samples with the corresponding positive or negative sentiment.

1. Introduction and Motivating Work

Representation learning (Bengio et al., 2013) plays a critical role in many modern machine learning systems. Representations map raw data to more useful forms and the choice of representation is an important component of any application. Broadly speaking, there are two areas of research emphasizing different details of how to learn useful representations.

The supervised training of high-capacity models on large labeled datasets is critical to the recent success of deep learning techniques for a wide range of applications such as image classification (Krizhevsky et al., 2012), speech recognition (Hinton et al., 2012), and machine translation (Wu et al., 2016). Analysis of the task specific representations learned by these models reveals many fascinating properties (Zhou et al., 2014). Image classifiers learn a broadly useful hierarchy of feature detectors re-representing raw pixels as edges, textures, and objects (Zeiler & Fergus, 2014). In the field of computer vision, it is now commonplace to reuse these representations on a broad suite of related tasks - one of the most successful examples of transfer learning to date (Oquab et al., 2014).

There is also a long history of unsupervised representation learning (Olshausen & Field, 1997). Much of the early research into modern deep learning was developed and validated via this approach (Hinton & Salakhutdinov, 2006) (Huang et al., 2007) (Vincent et al., 2008) (Coates et al., 2010) (Le, 2013). Unsupervised learning is promising due to its ability to scale beyond only the subsets and domains of data that can be cleaned and labeled given resource, privacy, or other constraints. This advantage is also its difficulty. While supervised approaches have clear objectives that can be directly optimized, unsupervised approaches rely on proxy tasks such as reconstruction, density estimation, or generation, which do not directly encourage useful representations for specific tasks. As a result, much work has gone into designing objectives, priors, and architectures meant to encourage the learning of useful representations. We refer readers to Goodfellow et al. (2016) for a detailed review.

Despite these difficulties, there are notable applications of unsupervised learning. Pre-trained word vectors are a vital part of many modern NLP systems (Collobert et al., 2011). These representations, learned by modeling word co-occurrences, increase the data efficiency and generalization capability of NLP systems (Pennington et al., 2014) (Chen & Manning, 2014). Topic modelling can also discover factors within a corpus of text which align to human interpretable concepts such as art or education (Blei et al., 2003).

How to learn representations of phrases, sentences, and documents is an open area of research. Inspired by the success of word vectors, Kiros et al. (2015) propose skip-thought vectors, a method of training a sentence encoder by predicting the preceding and following sentence. The representation learned by this objective performs competitively on a broad suite of evaluated tasks. More advanced training techniques such as layer normalization (Ba et al., 2016) further improve results. However, skip-thought vectors are still outperformed by supervised models which directly optimize the desired performance metric on a specific dataset. This is the case for both text classification
Generating Reviews and Discovering Sentiment

tasks, which measure whether a specific concept is well encoded in a representation, and more general semantic similarity tasks. This occurs even when the datasets are relatively small by modern standards, often consisting of only a few thousand labeled examples.

In contrast to learning a generic representation on one large dataset and then evaluating on other tasks/datasets, Dai & Le (2015) proposed using similar unsupervised objectives such as sequence autoencoding and language modeling to first pretrain a model on a dataset and then finetune it for a given task. This approach outperformed training the same model from random initialization and achieved state of the art on several text classification datasets. Combining language modelling with topic modelling and fitting a small supervised feature extractor on top has also achieved strong results on in-domain document level sentiment analysis (Dieng et al., 2016).

Considering this, we hypothesize two effects may be combining to result in the weaker performance of purely unsupervised approaches. Skip-thought vectors were trained on a corpus of books. But some of the classification tasks they are evaluated on, such as sentiment analysis of reviews of consumer goods, do not have much overlap with the text of novels. We propose this distributional issue, combined with the limited capacity of current models, results in representational underfitting. Current generic distributed sentence representations may be very lossy - good at capturing the gist, but poor with the precise semantic or syntactic details which are critical for applications.

The experimental and evaluation protocols may be underestimating the quality of unsupervised representation learning for sentences and documents due to certain seemingly insignificant design decisions. Hill et al. (2016) also raise concerns about current evaluation tasks in their recent work which provides a thorough survey of architectures and objectives for learning unsupervised sentence representations - including the above-mentioned skip-thoughts.

In this work, we test whether this is the case. We focus in on the task of sentiment analysis and attempt to learn an unsupervised representation that accurately contains this concept. Mikolov et al. (2013) showed that word-level recurrent language modelling supports the learning of useful word vectors and we are interested in pushing this line of work. As an approach, we consider the popular research benchmark of byte (character) level language modelling due to its further simplicity and generality. We are also interested in evaluating this approach as it is not immediately clear whether such a low-level training objective supports the learning of high-level representations. We train on a very large corpus picked to have a similar distribution as our task of interest. We also benchmark on a wider range of tasks to quantify the sensitivity of the learned representation to various degrees of out-of-domain data and tasks.

2. Dataset

Much previous work on language modeling has evaluated on relatively small but competitive datasets such as Penn Treebank (Marcus et al., 1993) and Hutter Prize Wikipedia (Hutter, 2006). As discussed in Jozefowicz et al. (2016), performance on these datasets is primarily dominated by regularization. Since we are interested in high-quality sentiment representations, we chose the Amazon product review dataset introduced in McAuley et al. (2015) as a training corpus. In de-duplicated form, this dataset contains over 82 million product reviews from May 1996 to July 2014 amounting to over 38 billion training bytes. Due to the size of the dataset, we first split it into 1000 shards containing equal numbers of reviews and set aside 1 shard for validation and 1 shard for test.

Figure 1. The mLSTM converges faster and achieves a better result within our time budget compared to a standard LSTM with the same hidden state size.

3. Model and Training Details

Many potential recurrent architectures and hyperparameter settings were considered in preliminary experiments on the dataset. Given the size of the dataset, searching the wide space of possible configurations is quite costly. To help alleviate this, we evaluated the generative performance of smaller candidate models after a single pass through the dataset. The model chosen for the large scale experiment is a single layer multiplicative LSTM (Krause et al., 2016) with 4096 units. We observed multiplicative LSTMs to converge faster than normal LSTMs for the hyperparameter settings that were explored both in terms of data and
wall-clock time. The model was trained for a single epoch on mini-batches of 128 subsequences of length 256 for a total of 1 million weight updates. States were initialized to zero at the beginning of each shard and persisted across updates to simulate full-backpropagation and allow for the forward propagation of information outside of a given subsequence. Adam (Kingma & Ba, 2014) was used to accelerate learning with an initial 5e-4 learning rate that was decayed linearly to zero over the course of training. Weight normalization (Salimans & Kingma, 2016) was applied to the LSTM parameters. Data-parallelism was used across 4 Pascal Titan X GPUs to speed up training and increase effective memory size. Training took approximately one month. The model is compact, containing approximately as many parameters as there are reviews in the training dataset. It also has a high ratio of compute to total parameters compared to other large scale language models due to operating at a byte level. The selected model reaches 1.12 bits per byte.

Table 1. Small dataset classification accuracies

METHOD              MR     CR     SUBJ   MPQA
NBSVM [49]          79.4   81.8   93.2   86.3
SkipThought [23]    77.3   81.8   92.6   87.9
SkipThought (LN)    79.5   83.1   93.7   89.3
SDAE [12]           74.6   78.0   90.8   86.9
CNN [21]            81.5   85.0   93.4   89.6
AdaSent [56]        83.1   86.3   95.5   93.3
Byte mLSTM          86.9   91.4   94.6   88.5

4. Experimental Setup and Results

Our model processes text as a sequence of UTF-8 encoded bytes (Yergeau, 2003). For each byte, the model updates its hidden state and predicts a probability distribution over the next possible byte. The hidden state of the model serves as an online summary of the sequence which encodes all information the model has learned to preserve that is relevant to predicting the future bytes of the sequence. We are interested in understanding the properties of the learned encoding. The process of extracting a feature representation is outlined as follows:

- Since newlines are used as review delimiters in the training dataset, all newline characters are replaced with spaces to avoid the model resetting state.
- Any leading whitespace is removed and replaced with a newline+space to simulate a start token. Any trailing whitespace is removed and replaced with a space to simulate an end token. The text is encoded as a UTF-8 byte sequence.
- Model states are initialized to zeros. The model processes the sequence and the final cell states of the mLSTM are used as a feature representation. Tanh is applied to bound values between -1 and 1.
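
The text-handling steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the function names `preprocess` and `bound_features` are hypothetical, and the model forward pass that would produce the final cell states is left out.

```python
import math


def preprocess(text: str) -> bytes:
    """Turn one document into the byte sequence fed to the model."""
    # Newlines delimit reviews in the training data, so replace them with spaces.
    text = text.replace("\n", " ")
    # Drop leading/trailing whitespace, then add the simulated start token
    # (newline + space) and the simulated end token (a single space).
    text = "\n " + text.strip() + " "
    return text.encode("utf-8")


def bound_features(final_cell_states):
    """Apply tanh so that every feature value lies between -1 and 1."""
    return [math.tanh(c) for c in final_cell_states]
```

For example, `preprocess("  Great\nproduct!  ")` yields `b"\n Great product! "`. In this sketch the list passed to `bound_features` stands in for the final cell states of the mLSTM.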

We follow the methodology established in Kiros et al. (2015) by training a logistic regression classifier on top of our model's representation on datasets for tasks including semantic relatedness, text classification, and paraphrase detection. For the details on these comparison experiments, we refer the reader to their work. One exception is that we use an L1 penalty for text classification results instead of L2 as we found this performed better in the very low data regime.

Figure 2. Performance on the binary version of SST as a function of labeled training examples. The solid lines indicate the average of 100 runs while the shaded regions indicate the 10th and 90th percentiles. Previous results on the dataset are plotted as dashed lines with the numbers indicating the amount of examples required for logistic regression on the byte mLSTM representation to match their performance. RNTN (Socher et al., 2013) CNN (Kim, 2014) DMN (Kumar et al., 2015) LSTM (Wieting et al., 2015) NSE (Munkhdalai & Yu, 2016) CT-LSTM (Looks et al., 2017)

4.1. Review Sentiment Analysis

Table 1 shows the results of our model on 4 standard text classification datasets. The performance of our model is noticeably lopsided. On the MR (Pang & Lee, 2005) and CR (Hu & Liu, 2004) sentiment analysis datasets we improve the state of the art by a significant margin. The MR and CR datasets are sentences extracted from Rotten Tomatoes, a movie review website, and Amazon product reviews (which almost certainly overlaps with our training corpus). This suggests that our model has learned a rich representation of text from a similar domain. On the other two datasets, SUBJ's subjectivity/objectivity detection (Pang & Lee, 2004) and MPQA's opinion polarity (Wiebe et al., 2005), our model has no noticeable advantage over other unsupervised representation learning approaches and is still outperformed by a supervised approach.
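
The evaluation protocol used in this section, a logistic regression classifier with an L1 penalty fit on top of fixed features, can be illustrated with a small self-contained sketch. Everything here is an assumption made for illustration: the paper does not say which solver was used, so this version uses plain proximal gradient descent, and the function names, hyperparameters, and toy data are hypothetical.

```python
import math


def soft_threshold(w, t):
    """Proximal step for the L1 penalty: shrink w toward zero by t."""
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_l1_logreg(X, y, l1=0.05, lr=0.5, steps=500):
    """Fit weights w and bias b by proximal gradient descent (bias unpenalized)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            grad_b += err / n
            for j in range(d):
                grad_w[j] += err * xi[j] / n
        b -= lr * grad_b
        # Gradient step on the log loss, then shrink each weight (L1 penalty).
        w = [soft_threshold(wj - lr * gj, lr * l1) for wj, gj in zip(w, grad_w)]
    return w, b


def predict(w, b, x):
    """Return 1 if the linear score is positive, else 0."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else 0


# Toy features in [-1, 1]: the first carries the label, the second is noise.
X = [[0.9, 0.1], [0.8, -0.1], [-0.9, 0.1], [-0.8, -0.1]]
y = [1, 1, 0, 0]
w, b = fit_l1_logreg(X, y)
```

On this toy data the noise feature keeps a weight of exactly zero while the informative feature receives a positive weight, which is the sparsity behaviour that makes an L1 penalty attractive when only a few of many features matter.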
To better quantify the learned representation, we also test on a wider set of sentiment analysis datasets with different properties. The Stanford Sentiment Treebank (SST) (Socher et al., 2013) was created specifically to evaluate more complex compositional models of language. It is derived from the same base dataset as MR but was relabeled via Amazon Mechanical Turk and includes dense labeling of the phrases of parse trees computed for all sentences. For the binary subtask, this amounts to 76961 total labels compared to the 6920 sentence level labels. As a demonstration of the capability of unsupervised representation learning to simplify data collection and remove preprocessing steps, our reported results ignore these dense labels and computed parse trees, using only the raw text and sentence level labels.

The representation learned by our model achieves 91.8%, significantly outperforming the state of the art of 90.2% by a 30 model ensemble (Looks et al., 2017). As visualized in Figure 2, our model is very data efficient. It matches the performance of baselines using as few as a dozen labeled examples and outperforms all previous results with only a few hundred labeled examples. This is under 10% of the total sentences in the dataset. Confusingly, despite a 16% relative error reduction on the binary subtask, it does not reach the state of the art of 53.6% on the fine-grained subtask, achieving 52.9%.

4.2. Sentiment Unit

We conducted further analysis to understand what representations our model learned and how they achieve the observed data efficiency. The benefit of an L1 penalty in the low data regime (see Figure 2) is a clue. L1 regularization is known to reduce sample complexity when there are many irrelevant features (Ng, 2004). This is likely to be the case for our model since it is trained as a language model and not as a supervised feature extractor. By inspecting the relative contributions of features on various datasets, we discovered a single unit within the mLSTM that directly corresponds to sentiment. In Figure 3 we show the histogram of the final activations of this unit after processing IMDB reviews (Maas et al., 2011) which shows a bimodal distribution with a clear separation between positive and negative reviews. In Figure 4 we visualize the activations of this unit on 6 randomly selected reviews from a set of 100 high contrast reviews which shows it acts as an online estimate of the local sentiment of the review. Fitting a threshold to this single unit achieves a test accuracy of 92.30% which outperforms a strong supervised result on the dataset, the 91.87% of NB-SVM trigram (Mesnil et al., 2014), but is still below the semi-supervised state of the art of 94.09% (Miyato et al., 2016). Using the full 4096 unit representation achieves 92.88%. This is an improvement of only 0.58% over the sentiment unit suggesting that almost all information the model retains that is relevant to sentiment analysis is represented in the very compact form of a single scalar. Table 2 has a full list of results on the IMDB dataset.

Figure 3. Histogram of cell activation values for the sentiment unit on IMDB reviews.

Table 2. IMDB sentiment classification

METHOD                                  ERROR
FullUnlabeledBoW (Maas et al., 2011)    11.11%
NB-SVM trigram (Mesnil et al., 2014)     8.13%
SA-LSTM (Dai & Le, 2015)                 7.24%
Byte mLSTM (ours)                        7.12%
TopicRNN (Dieng et al., 2016)            6.24%
Virtual Adv (Miyato et al., 2016)        5.91%

4.3. Capacity Ceiling

Encouraged by these results, we were curious how well the model's representation scales to larger datasets. We try our approach on the binary version of the Yelp Dataset
Challenge in 2015 as introduced in Zhang et al. (2015). This dataset contains 598,000 examples which is an order of magnitude larger than any other datasets we tested on. When visualizing performance as a function of number of training examples in Figure 5, we observe a capacity ceiling where the test accuracy of our approach only improves by a little over 1% across a four order of magnitude increase in training data. Using the full dataset, we achieve 95.22% test accuracy. This is better than a BoW TFIDF baseline at 93.66% but slightly worse than the 95.64% of a linear classifier on top of the 500,000 most frequent n-grams up to length 5.

Figure 5. Performance on the binary version of the Yelp reviews dataset as a function of labeled training examples. The model's performance plateaus after about ten labeled examples and only slowly improves with additional data.

25 August 2003 League of Extraordinary Gentlemen: Sean Connery is one of
the all time greats and I have been a fan of his since the 1950's. I went
to this movie because Sean Connery was the main actor. I had not read
reviews or had any prior knowledge of the movie. The movie surprised me
quite a bit. The scenery and sights were spectacular, but the plot was
unreal to the point of being ridiculous. In my mind this was not one of
his better movies it could be the worst. Why he chose to be in this movie
is a mystery. For me, going to this movie was a waste of my time. I will
continue to go to his movies and add his movies to my video collection.
But I can't see wasting money to put this movie in my collection

I found this to be a charming adaptation, very lively and full of fun.
With the exception of a couple of major errors, the cast is wonderful. I
have to echo some of the earlier comments -- Chynna Phillips is horribly
miscast as a teenager. At 27, she's just too old (and, yes, it DOES show),
and lacks the singing "chops" for Broadway-style music. Vanessa Williams
is a decent-enough singer and, for a non-dancer, she's adequate. However,
she is NOT Latina, and her character definitely is. She's also very
STRIDENT throughout, which gets tiresome. The girls of Sweet Apple's
Conrad Birdie fan club really sparkle -- with special kudos to Brigitta
Dau and Chiara Zanni. I also enjoyed Tyne Daly's performance, though I'm
not generally a fan of her work. Finally, the dancing Shriners are a riot,
especially the dorky three in the bar. The movie is suitable for the whole
family, and I highly recommend it.

Judy Holliday struck gold in 1950 withe George Cukor's film version of
"Born Yesterday," and from that point forward, her career consisted of
trying to find material good enough to allow her to strike gold again. It
never happened. In "It Should Happen to You" (I can't think of a blander
title, by the way), Holliday does yet one more variation on the dumb
blonde who's maybe not so dumb after all, but everything about this movie
feels warmed over and half hearted. Even Jack Lemmon, in what I believe
was his first film role, can't muster up enough energy to enliven this
recycled comedy. The audience knows how the movie will end virtually from
the beginning, so mostly it just sits around waiting for the film to catch
up. Maybe if you're enamored of Holliday you'll enjoy this; otherwise I
wouldn't bother. Grade: C

Once in a while you get amazed over how BAD a film can be, and how in the
world anybody could raise money to make this kind of crap. There is
absolutely No talent included in this film - from a crappy script, to a
crappy story to crappy acting. Amazing...

Team Spirit is maybe made by the best intentions, but it misses the warmth
of "All Stars" (1997) by Jean van de Velde. Most scenes are identic, just
not that funny and not that well done. The actors repeat the same lines as
in "All Stars" but without much feeling.

God bless Randy Quaid...his leachorous Cousin Eddie in Vacation and
Christmas Vacation hilariously stole the show. He even made the awful
Vegas Vacation at least worth a look. I will say that he tries hard in
this made for TV sequel, but that the script is so NON funny that the
movie never really gets anywhere. Quaid and the rest of the returning
Vacation vets (including the orginal Audrey, Dana Barron) are wasted here.
Even European Vacation's Eric Idle cannot save the show in a brief
cameo.... Pathetic and sad...actually painful to watch....Christmas
Vacation 2 is the worst of the Vacation franchise.

Figure 4. Visualizing the value of the sentiment cell as it processes six randomly selected high contrast IMDB reviews. Red indicates negative sentiment while green indicates positive sentiment. Best seen in color.

Table 3. Microsoft Paraphrase Corpus

METHOD                              Acc    F1
SkipThought (Kiros et al., 2015)    73.0   82.0
SDAE (Hill et al., 2016)            76.4   83.4
MTMetrics [31]                      77.4   84.1
Byte mLSTM                          75.0   82.8

Table 4. SICK semantic relatedness subtask

METHOD              r       ρ       MSE
SkipThought [23]    0.858   0.792   0.269
SkipThought (LN)    0.858   0.788   0.270
Tree-LSTM [47]      0.868   0.808   0.253
Byte mLSTM          0.792   0.725   0.390

The observed capacity ceiling is an interesting phenomenon and stumbling point for scaling our unsupervised representations. We think a variety of factors are contributing to cause this. Since our model is trained only on Amazon reviews, it does not appear to be sensitive to concepts specific to other domains. For instance, Yelp reviews are of businesses, where details like hospitality, location, and atmosphere are important. But these ideas are not present in reviews of products. Additionally, there is a notable drop in the relative performance of our approach transitioning from sentence to document datasets. This is likely due to our model working on the byte level which leads to it focusing on the content of the last few sentences instead of the whole document. Finally, as the amount of labeled data increases, the performance of the simple linear model we train on top of our static representation will eventually saturate. Complex models explicitly trained for a task can continue to improve and eventually outperform our approach with enough labeled data.

With this context, the observed results make a lot of sense.
Sentiment fixed to positive:

- Just what I was looking for. Nice fitted pants, exactly matched seam to color contrast with other pants I own. Highly recommended and also very happy!
- This product does what it is supposed to. I always keep three of these in my kitchen just in case ever I need a replacement cord.
- Best hammock ever! Stays in place and holds its shape. Comfy (I love the deep neon pictures on it), and looks so
- Dixie is getting her Doolittle newsletter we'll see another new one coming out next year. Great stuff. And, here's the contents - information that we hardly know about or
- I love this weapons look . Like I said beautiful !!! I recommend it to all. Would suggest this to many roleplayers , And I stronge to get them for every one I know. A must watch for any man who love Chess!

Sentiment fixed to negative:

- The package received was blank and has no barcode. A waste of time and money.
- Great little item. Hard to put on the crib without some kind of embellishment. My guess is just like the screw kind of attachment I had.
- They didn't fit either. Straight high sticks at the end. On par with other buds I have. Lesson learned to avoid.
- great product but no seller. couldn't ascertain a cause. Broken product. I am a prolific consumer of this company all the time.
- Like the cover, Fits good. . However, an annoying rear piece like garbage should be out of this one. I bought this hoping it would help with a huge pull down my back & the black just doesn't stay. Scrap off everytime I use it.... Very disappointed.

Table 5. Random samples from the model generated when the value of sentiment hidden state is fixed to either -1 or 1 for all steps. The sentiment unit has a strong influence on the model's generative process.
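
The intervention behind Table 5, overwriting one state unit with a fixed value at every step while sampling, can be sketched as follows. This is only an illustration under loud assumptions: the `update`/`readout` interface is hypothetical, and `toy_update` and `toy_readout` are tiny stand-ins that emit only the bytes "+" and "-". They are not the mLSTM used in the paper.

```python
import random


def sample_with_fixed_unit(update, readout, init_state, unit, value, n_bytes, seed=0):
    """Sample n_bytes bytes while clamping state[unit] to value at every step."""
    rng = random.Random(seed)
    state = list(init_state)
    prev = ord("\n")  # start from the review delimiter
    out = []
    for _ in range(n_bytes):
        state = list(update(state, prev))
        state[unit] = value  # the intervention: overwrite the chosen unit
        probs = readout(state)  # 256 next-byte probabilities
        prev = rng.choices(range(256), weights=probs)[0]
        out.append(prev)
    return bytes(out)


def toy_update(state, prev_byte):
    """Stand-in recurrence (hypothetical): simply decays the state."""
    return [0.5 * s for s in state]


def toy_readout(state):
    """Stand-in output layer (hypothetical): unit 0 trades off b'+' against b'-'."""
    p_plus = 0.5 + 0.49 * max(-1.0, min(1.0, state[0]))
    probs = [0.0] * 256
    probs[ord("+")] = p_plus
    probs[ord("-")] = 1.0 - p_plus
    return probs


positive = sample_with_fixed_unit(toy_update, toy_readout, [0.0, 0.0], 0, 1.0, 200)
negative = sample_with_fixed_unit(toy_update, toy_readout, [0.0, 0.0], 0, -1.0, 200)
```

The clamp is applied after the state update and before the output distribution is computed, so the fixed value influences every sampled byte. With the toy stand-ins, fixing the unit to 1 produces mostly "+" bytes and fixing it to -1 produces mostly "-" bytes.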

On a small sentence-level dataset of a known domain (the movie reviews of Stanford Sentiment Treebank) our model sets a new state of the art. But on a large, document-level dataset of a different domain (the Yelp reviews) it is only competitive with standard baselines.

4.4. Other Tasks

Besides classification, we also evaluate on two other standard tasks: semantic relatedness and paraphrase detection. While our model performs competitively on Microsoft Research Paraphrase Corpus (Dolan et al., 2004) in Table 3, it performs poorly on the SICK semantic relatedness task (Marelli et al., 2014) in Table 4. It is likely that the form and content of the semantic relatedness task, which is built on top of descriptions of images and videos and contains sentences such as "A sea turtle is hunting for fish", is effectively out-of-domain for our model which has only been trained on the text of product reviews.

4.5. Generative Analysis

Although the focus of our analysis has been on the properties of our model's representation, it is trained as a generative model and we are also interested in its generative capabilities. Hu et al. (2017) and Dong et al. (2017) both designed conditional generative models to disentangle the content of text from various attributes like sentiment or tense. We were curious whether a similar result could be achieved using the sentiment unit. In Table 5 we show that by simply setting the sentiment unit to be positive or negative, the model generates corresponding positive or negative reviews. While all sampled negative reviews contain sentences with negative sentiment, they sometimes contain sentences with positive sentiment as well. This might be reflective of the bias of the training corpus which contains over 5x as many five star reviews as one star reviews. Nevertheless, it is interesting to see that such a simple manipulation of the model's representation has a noticeable effect on its behavior. The samples are also high quality for a byte level language model and often include valid sentences.

5. Discussion and Future Work

It is an open question why our model recovers the concept of sentiment in such a precise, disentangled, interpretable, and manipulable way. It is possible that sentiment as a conditioning feature has strong predictive capability for language modelling. This is likely since sentiment is such an important component of a review. Previous work analysing LSTM language models showed the existence of interpretable units that indicate position within a line or presence inside a quotation (Karpathy et al., 2015). In many ways, the sentiment unit in this model is just a scaled up example of the same phenomenon. The update equation of an LSTM could play a role. The element-wise

