Customer Reviews Analysis With Deep Neural Network
INDEX TERMS recommender system, review, deep neural networks, recommendation, matrix factorization, latent Dirichlet allocation.
the weaknesses of MF methods by incorporating side information such as tags [10], [11], visual features [12], and social relations [13], [14]. Customer reviews are one of the critical resources in developing recommender systems. The written part of a review includes essential information on what the customer thinks about the product.

Consequently, researchers have proposed many models that exploit reviews together with ratings to improve recommendations. Some of these models are discussed in [15]–[18]. Sentiment analysis is one of the conventional approaches to analyzing customer reviews. Its main goal is to predict whether the attitude of a piece of text is positive or negative, supportive or opposed [19]. Semantic analysis is employed to analyze customer reviews [19]–[21] for different objectives, such as measuring e-commerce service quality [22]. Some recent studies use customer reviews in developing recommender systems. Their approaches include semantic analysis and aspect-based latent factor models [23]–[26]. In this paper, we mine customer reviews to extract a set of product characteristics that users mention in the reviews, and we use the Latent Dirichlet Allocation (LDA) method to finalize the set of characteristics. We then use the set of attributes to construct the users-attributes matrix. This matrix, however, is very sparse, as each user mentions only a few attributes in a review. Sparsity is a well-known challenge in developing recommender systems, and many papers propose solutions to it. To deal with this problem, we use a deep neural network that acts as an autoencoder and helps to learn more abstract, latent attributes. Given the users-attributes and users-items matrices, we use an MF model to predict ratings and provide recommendations.

The rest of this paper is organized as follows. Section II presents the literature survey and related work. Section III discusses the problem description and the proposed approach. Section IV provides information about the dataset, describes the data preprocessing, and analyzes the outcomes of the experiments. Finally, we conclude the current research and discuss its limitations and future research directions.

II. RELATED WORK

Dealing with text as unstructured data is challenging. Natural Language Processing (NLP) is a branch of computer science and artificial intelligence (AI) concerned with processing and analyzing natural language data. Deep learning for NLP is one of the approaches that is improving the capability of computers to understand human language [27]. A few studies try to incorporate customer-written reviews in generating recommendations.

Some research on the integration of customer reviews in recommendation systems falls under the category of aspect-based or aspect-aware recommender systems. As an aspect-based recommender system, [24] proposed a model called the aspect-based latent factor model, which integrates ratings and review texts via a latent factor model. The purpose of this research is predicting ratings by using aspects of information from users’ and items’ information. They constructed user-review and used it directly in the proposed model to provide rating predictions and recommendation lists. In addition, their model accomplishes a cross-domain task by transferring word embeddings.

In another aspect-based recommender system paper, [25] proposed an aspect-aware MF model that effectively combines reviews and ratings for rating prediction. It learns the latent topics from reviews and ratings without the constraint of a one-to-one mapping between latent factors and latent topics. The model also estimates aspect ratings and assigns weights to the aspects. Experiments on many real-world datasets showed that the model predicts ratings accurately.

Some aspect-based recommender systems apply semantic analysis to reviews. For example, [26] proposed a sentiment utility logistic model that uses sentiment analysis of user reviews: it predicts the sentiment that the user has about the item and then identifies the most valuable aspects of the user’s possible experience with that item. For instance, the system suggests that a user go to a specific restaurant (the primary recommendation) and also recommends an aspect of that restaurant, such as the time to go (breakfast, lunch, or dinner), as a valuable aspect to the user (the secondary recommendation). The experimental results demonstrated a better experience for those users who followed the recommendations.

In the context of analyzing reviews, [28] analyzed customer reviews to find out what makes a review helpful to other customers. They analyzed 1,587 reviews from Amazon and indicated that extremity, depth of review, and product type affect the perceived helpfulness of the review. While this research does not incorporate the reviews in making recommendations, it provides information that is potentially useful in developing recommender systems.

[29] proposed a new recommender system that integrates opinion mining and recommendation. They proposed a new feature and opinion extraction method based on the characteristics of online reviews, which can address the problem of data sparseness. They used a part-of-speech tagging approach based on association rule mining for each review. They performed their empirical study on online restaurant customer reviews written in Chinese and illustrated the performance of the proposed methods.

[15] modeled the review texts using topic modeling techniques and aligned the topics with rating dimensions to enhance prediction accuracy. They proposed a unified model combining content-based and collaborative filtering, which can deal with the cold-start problem. They applied the proposed framework to 27 classes of real-case datasets and showed a significant improvement in the recommendations compared to the baseline methods.

[16] tried to incorporate the implicit tastes of each user in order to predict ratings, as the text review justifies a user’s rating. They used latent review topics extracted from topic
models as highly interpretable textual labels for latent rating dimensions. They also accurately predicted product ratings using the information extracted from the reviews, which can improve the recommendations for those that have too few ratings. Moreover, their discovered topics are useful in facilitating tasks such as automated genre discovery. In a similar study, [17] exploited textual review information along with ratings to model user preferences and item attributes in a shared topic space. They used an MF model for generating recommendations and used 26 real-case datasets to evaluate the performance of their model.

As presented above, none of these studies used a deep neural network autoencoder to deal with the sparsity in the user-attributes matrix extracted from the reviews. To the best of the authors’ knowledge, this is the first study that extracts deep features from latent topics extracted from textual user reviews to develop a recommender system. In the next section, we present the proposed approach.

III. METHODOLOGY

In this section, we present the proposed methodology for incorporating customer-written reviews in developing recommender systems. Figure 1 depicts the general framework for transforming customer-written reviews into a dense users-attributes matrix and predicting ratings using this matrix and the users-items matrix. As described before, the idea of how to use customer written reviews is investigating what attributes

methods [30]. While latent factor models try to uncover hidden dimensions in review ratings, LDA aims to uncover hidden dimensions in the written part of the review. Introduced by Blei et al. [31], LDA is a generative statistical model for topic modeling in the natural language processing (NLP) context. Topic modeling is the task of describing a collection of documents by identifying a set of topics. LDA is a three-level hierarchical Bayesian model in which each item of the collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities, which provide an explicit representation of a document [23]. To describe LDA, consider a set of documents $d \in D$. LDA associates each document with a $K$-dimensional stochastic vector $\theta_d$ as its topic distribution. The entry $\theta_{d,k}$ encodes the fraction of words in document $d$ that discuss topic $k$. LDA also associates a word distribution $\phi_k$ with each topic to encode the probability of a word being used for that topic. LDA assumes a Dirichlet distribution for the topic distribution $\theta_d$. As a result of applying LDA, we have the word distribution of each topic and the topic distribution of each document. Given the word distributions and the topic assignments of the words, we can calculate the likelihood of a corpus $T$ as

$$p(T \mid \theta, \phi, z) = \prod_{d \in T} \prod_{j=1}^{N_d} \theta_{z_{d,j}} \, \phi_{z_{d,j}, w_{d,j}} \qquad (1)$$
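As an illustration of Equation (1), the following minimal sketch evaluates the corpus log-likelihood for a toy corpus. It assumes that $N_d$ is the number of words in document $d$, that $w_{d,j}$ and $z_{d,j}$ are the word and topic ids of the $j$-th word, and that $\theta$ is indexed by document (so $\theta_{z_{d,j}}$ is read as `theta[d, z]`). The function name `corpus_log_likelihood` and all numeric values are hypothetical and are not taken from the paper's experiments.

```python
import numpy as np

def corpus_log_likelihood(docs, assignments, theta, phi):
    """Log of Eq. (1): sum over documents d and word positions j of
    log(theta[d, z_dj]) + log(phi[z_dj, w_dj]).

    docs[d][j]        : word id w_{d,j}
    assignments[d][j] : topic id z_{d,j}
    theta[d, k]       : share of topic k in document d (rows sum to 1)
    phi[k, w]         : probability of word w under topic k (rows sum to 1)
    """
    total = 0.0
    for d, (words, topics) in enumerate(zip(docs, assignments)):
        words = np.asarray(words)
        topics = np.asarray(topics)
        total += np.sum(np.log(theta[d, topics]) + np.log(phi[topics, words]))
    return total

# Toy corpus (assumed values): 2 documents, 2 topics, 3 vocabulary words.
theta = np.array([[0.5, 0.5],
                  [0.25, 0.75]])
phi = np.array([[0.5, 0.25, 0.25],
                [0.25, 0.25, 0.5]])
docs = [[0, 2], [1]]          # word ids per document
assignments = [[0, 1], [1]]   # topic id of each word

log_lik = corpus_log_likelihood(docs, assignments, theta, phi)
print(log_lik)
```

Working in log space avoids numerical underflow, since the product in Equation (1) shrinks quickly as the number of words grows.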
FIGURE 1. General framework of the proposed approach of using customer reviews in recommendation systems
of the proposed deep neural networks to process the attributes extracted using LDA.

The reason for using sparse coding is to learn more interpretable features for machine learning applications [34]; it helps represent the input matrix as a weighted linear combination of a small number of basis vectors. The resulting matrix is capable of capturing high-level patterns that exist in the input layer. For instance, [35] developed a sparse autoencoder by combining sparse coding with the autoencoder. They implemented their idea by penalizing the deviation between the expected hidden representation and a preset average activation. In more relevant research, [8] developed an autoencoder using deep neural networks for tag-aware recommender systems. Through experimental results, they demonstrated the usefulness of sparse autoencoders for recommendation algorithms.

Inspired by [8], we use an autoencoder that consists of an input layer, a hidden layer, and an output layer. We can divide the autoencoder itself into an encoder and a decoder. The encoder comprises the input layer and the hidden layer, while the decoder comprises the hidden layer and the output layer. Figure 2 illustrates the purpose of the autoencoder, which is reconstructing the input data in the output layer with the same dimensionality. In other words, it follows an unsupervised learning framework.

FIGURE 2. Demonstration of a simple autoencoder

Letting $x_1, x_2, \ldots, x_m$ be an unlabeled dataset, we can obtain a nonlinear representation of the input data using an activation function [36]. Using a sigmoid activation for an unlabeled example $x_i$, the representation is

$$h(x_i; W, b) = \sigma(W x_i + b) \qquad (2)$$

where $W$ denotes the weight matrix, $\sigma$ is the sigmoid activation function, and $b$ is the bias term. The sigmoid is also known as the logistic function. On the other side, the decoder reconstructs the input in the output layer by minimizing the error between the input and the output layers. The minimizing term is defined in Equation (3).

$$\min \sum_{i=1}^{m} \left\| \sigma\left(W^{T} h(x_i; W, b) + c\right) - x_i \right\|^2 \qquad (3)$$

where $m$ denotes the number of examples. Since the minimization is a convex function, we can obtain the optimal value. We can calculate the sparsity penalty term as the Kullback-Leibler divergence between a preset activation ratio and the average activation of each hidden unit [37] using Equation (4).

$$P = \sum_{j=1}^{n} D_{KL}(\rho \,\|\, \hat{\rho}_j) \qquad (4)$$

in which $\rho$ is a preset average activation that is set close to zero in practice, and $n$ is the number of hidden units. Also, $D_{KL}(\rho \,\|\, \hat{\rho}_j) = \rho \log \frac{\rho}{\hat{\rho}_j} + (1 - \rho) \log \frac{1 - \rho}{1 - \hat{\rho}_j}$.

Combining the objective function and the penalty term introduced above, we obtain the final objective function for the autoencoder in Equation (5).

$$\min \sum_{i=1}^{m} \left\| \sigma\left(W^{T} h(x_i; W, b) + c\right) - x_i \right\|^2 + \beta P \qquad (5)$$

where $\beta$ is a hyperparameter that controls the weight of the penalty term.

To turn this architecture into a deep autoencoder, we add more hidden layers, where the output of each layer is the input of the next. In other words, the procedure described above is applied repeatedly, each time to the output of the previous stage. The input data train the first hidden layer, and the output of the first hidden layer serves as the input of the second hidden layer. We iterate these steps according to the number of hidden layers chosen for the autoencoder. We use this deep neural

The last step in generating recommendations is using the MF model to predict ratings. First, we update the user profile based on the users-deep features matrix obtained from the deep neural network. Conventional collaborative filtering models use only the user-item or the user-attributes matrix to generate a recommendation list. In order to use both users-items and users-deep features information, we employ the approach developed by Ricci et al. [38]. In this method, we find the neighborhood $N_u$ of the target user based on the similarity between the target user and other users using Equation (6).

$$\mathrm{sim}_{u,v} = \frac{\langle \hat{X}_u, \hat{X}_v \rangle}{\|\hat{X}_u\| \, \|\hat{X}_v\|} \qquad (6)$$

Having the similarity matrix between users, we can predict the rating of the target user using a weighted average of the ratings from the neighbor users using Equation (7).

$$S_{u,i} = \sum_{v \in N_u} (\pi_{UI} Y)_{v,i} \qquad (7)$$

The final step is to sort the predicted ratings for the items and return the top $n$ items as the recommendation list, where $n$ is the size of the list. Note that this approach exploits the ternary relation among users, attributes, and items [8].

IV. EXPERIMENTAL RESULTS

A. DATASET

In this research, we use the Amazon Review dataset [39]. This dataset contains 142.8 million reviews of Amazon products between May 1996 and July 2014, along with user profiles and item metadata. The dataset includes the ID of the reviewer, the ID of the product (ASIN), the name of the reviewer, the helpfulness rating of the review, the text of the review, the rating of the product, a summary of the review, and the time of the review. It also has the name of the product, the price in US dollars, related products, sales rank information, the brand name, and the list of the categories of the product. Table 1 presents the statistics of the Amazon Reviews dataset separated by product category [15]. Also, Figure 3 shows the sparsity of the reviews in the dataset, where the percentage for a product category indicates the percentage of users with no more than three ratings [17]. This sparsity can reduce the performance of recommender systems drastically. On average, each review contains roughly 120 words.

mance. During the performance evaluation, we performed both on a product category to find the best results. Figures 4, 5, and 6 show the MSE on two product datasets used to tune the hyperparameters. MSE does not improve beyond three hidden layers. MSE stops improving significantly at 1000 and 800 neurons in the first and second hidden layers, respectively.
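To make the quantities being tuned here concrete, the sketch below evaluates the sparse autoencoder objective of Section III (the encoder of Equation (2), the penalty of Equation (4), and the objective of Equation (5)) and chains two hidden layers so that the output of one layer is the input of the next. It is a forward-pass illustration only: the weights are random and untrained, the layer sizes are toy values rather than the 1000 and 800 neurons reported above, and `rho=0.05` and `beta=3.0` are assumed values that do not come from the paper. All function names are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def encode(X, W, b):
    """Eq. (2) applied row-wise: X is (m, d), W is (h, d), b is (h,)."""
    return sigmoid(X @ W.T + b)

def kl_penalty(H, rho):
    """Eq. (4): sum over hidden units j of KL(rho || rho_hat_j), where
    rho_hat_j is the average activation of unit j over the m examples."""
    rho_hat = H.mean(axis=0)
    return np.sum(rho * np.log(rho / rho_hat)
                  + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))

def objective(X, W, b, c, rho=0.05, beta=3.0):
    """Eq. (5): squared reconstruction error plus beta * P.
    The decoder reuses W transposed (tied weights), as in Eq. (3)."""
    H = encode(X, W, b)
    X_rec = sigmoid(H @ W + c)
    return np.sum((X_rec - X) ** 2) + beta * kl_penalty(H, rho)

rng = np.random.default_rng(0)
# Toy sparse users-attributes matrix: 6 users, 12 attributes (assumed data).
X = (rng.random((6, 12)) < 0.2).astype(float)

sizes = [12, 8, 4]   # input size followed by two toy hidden-layer sizes
layer_input = X
losses = []
for d_in, d_out in zip(sizes[:-1], sizes[1:]):
    W = rng.normal(scale=0.1, size=(d_out, d_in))   # untrained weights
    b = np.zeros(d_out)
    c = np.zeros(d_in)
    losses.append(objective(layer_input, W, b, c))
    # The output of this hidden layer becomes the input of the next one.
    layer_input = encode(layer_input, W, b)

deep_features = layer_input   # dense users-deep features matrix, shape (6, 4)
print(losses, deep_features.shape)
```

In practice each layer's `W`, `b`, and `c` would be fitted by minimizing `objective` with a gradient-based optimizer before its output is passed on; that training loop is omitted here to keep the sketch short.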
FIGURE 6. Analysis of MSE based on the number of neurons in the second hidden layer

It also employs matrix factorization to deal with the ratings.
• RMR is a hybrid model constituted of content-based filtering and collaborative filtering suggested by [15]. This model tries to exploit the information from reviews and improve the recommendation list accuracy across

We applied the proposed deep feature extractor to the datasets of all product categories, obtained the best MSE for our model, and compared these results with our baselines. Figure 7 and Table 2 show the results.

As these results show, the proposed method performs better for most of the product categories. Compared to the MF model, our method predicts ratings with an average improvement of 8.71%, and in some cases up to 20.19%. In three cases, Books, Movies and TV, and Music, MF shows better performance. For the Books and Music categories, the model is off by less than 1 percent, which implies that its performance is close to the best performance for these two categories. A similar situation holds between the HFT model and the proposed approach. Our deep neural network model
TABLE 2. MSE of the proposed method vs baselines and the percentage of the improvement
FIGURE 7. Comparing the MSE from the proposed method and the baselines
