Social Media Mining
Yi-Fan Wang
[email protected]
HR background.
Interests: business analytics.
Abstract
In a world awash with data across the digital landscape, Amazon is one of the leading e-commerce companies that collect and analyze customer data to improve their services and revenue. To demonstrate the power of text mining, we use these data sets to better understand the relationship between stock prices and customer comments. We also apply machine learning techniques to fake review detection and to the analysis of trend patterns.
Introduction
2.1 Feature extraction
Since we have more than two million reviews, extracting features from all of them and building a classifier with that many samples is computationally expensive and, in some cases, even infeasible. We therefore extracted a reduced set of reviews from each category, taking into account not only the polarity but also the rating value of each review. That is, we selected positive reviews with a polarity greater than 0.25 and a rating of at least 4, negative reviews with a polarity less than 0 and a rating of at most 2, and neutral reviews with a polarity between 0 and 0.25. Since we are dealing with reviews rather than complex texts, the vocabulary does not include many different words, so selecting the 15,000 most representative samples of each category is enough to represent the entire data set.
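A minimal sketch of this filtering step is shown below. It assumes pandas for the data frame and TextBlob for polarity, since the report does not name its tools, and toy rows stand in for the real review set.

```python
import pandas as pd
from textblob import TextBlob

# Toy rows stand in for the full Amazon review set (loading code omitted);
# the column names "reviewText" and "overall" are assumptions.
reviews = pd.DataFrame({
    "reviewText": ["great headset, works perfectly",
                   "stopped working after a week, waste of money",
                   "it is okay, nothing special"],
    "overall": [5, 1, 3],
})

# Polarity in [-1, 1] from the review text; TextBlob is an assumed choice.
reviews["polarity"] = reviews["reviewText"].apply(
    lambda t: TextBlob(str(t)).sentiment.polarity)

# Thresholds from the text: positive = polarity > 0.25 and rating >= 4,
# negative = polarity < 0 and rating <= 2, neutral = polarity in [0, 0.25].
positive = reviews[(reviews["polarity"] > 0.25) & (reviews["overall"] >= 4)]
negative = reviews[(reviews["polarity"] < 0) & (reviews["overall"] <= 2)]
neutral = reviews[reviews["polarity"].between(0, 0.25)]

# Cap each category at 15,000 reviews (a simple cap; how the "most
# representative" samples were chosen is not specified).
positive, negative, neutral = (df.head(15000)
                               for df in (positive, negative, neutral))
```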
After this filtering process, we used the bag-of-words approach on the text. The most intuitive way to do this is to assign a fixed integer id to each word occurring in any sample of the training set. Then, for each document i, we count the number of occurrences of each word w and store it as the value X[i, j] of feature j, where j is the index of word w in the dictionary. Although the bag-of-words approach is a good start, it has an issue: longer reviews will have higher average count values than shorter reviews. To avoid this, we can divide the number of occurrences of each word in a review by the total number of words in that review; these new features are called tf, for term frequencies. A further improvement on top of tf is to downscale the weights of words that occur in many reviews in the data set and are therefore less informative than words that occur in only a small portion of it. This downscaling is called tf-idf, for term frequency times inverse document frequency (Baeza-Yates et al., 1999; Manning et al., 2008). It is a well-known method widely used by researchers in text mining. In some cases, only the 100 or even the 25 most frequent words are enough to describe the documents of a particular corpus.
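For concreteness, the sketch below implements the count, tf, and tf-idf steps with scikit-learn's CountVectorizer and TfidfTransformer; the library choice and the toy reviews are assumptions, not the report's actual code.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# A few toy reviews stand in for the filtered sample
# (up to 15,000 reviews per category).
texts = [
    "great headset, sound quality is excellent",
    "battery died after two days, very disappointed",
    "works fine, nothing special",
]

# Bag-of-words: X_counts[i, j] is how often word j occurs in review i.
count_vect = CountVectorizer()
X_counts = count_vect.fit_transform(texts)

# tf-idf: normalise counts by review length (tf) and downweight words
# that occur in many reviews (idf).
tfidf = TfidfTransformer()
X_tfidf = tfidf.fit_transform(X_counts)

print(X_tfidf.shape)  # (number of reviews, vocabulary size)
```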
2.2 Classification
Accuracy    Time
72.95%      0.1307 sec
80.11%      16 min 37.8846 sec
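The table above lists accuracy and training time for the two classifiers compared in this section. Purely as a hedged sketch, the snippet below shows how such numbers could be obtained for one common choice, a multinomial Naive Bayes model on tf-idf features; the classifier, the toy data, and the 80/20 split are assumptions rather than the report's actual setup.

```python
import time
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Toy reviews and polarity labels stand in for the filtered sample.
texts = ["great product", "terrible quality", "works fine",
         "awful support", "love it", "broke in a week"]
labels = ["pos", "neg", "neu", "neg", "pos", "neg"]

# tf-idf features as described in Section 2.1.
X = TfidfVectorizer().fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=0)

# Train the (assumed) classifier and measure accuracy and training time.
start = time.time()
clf = MultinomialNB().fit(X_train, y_train)
elapsed = time.time() - start

print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
print("training time: %.4f sec" % elapsed)
```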
Ranking    Polarity    RS
2          1           Truthful
2          1           Truthful
2          1           Deceptive
1          1           Deceptive
1          1           Truthful
See the table above for more details.
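The first-round fake-review filter described in the conclusions flags reviews whose star rating disagrees with the sentiment of the comment text, which is what the RS column above reflects. The sketch below illustrates that rule; the exact thresholds and the column names are assumptions.

```python
import pandas as pd

# Toy rows: star rating and text polarity (real values would come from
# the review set and the sentiment step in Section 2.1).
df = pd.DataFrame({
    "overall": [5, 5, 1, 1, 4],
    "polarity": [0.6, -0.4, 0.5, -0.7, 0.3],
})

def first_round_filter(row):
    """Flag a review as possibly fake when rating and sentiment disagree."""
    if row["overall"] >= 4 and row["polarity"] < 0:
        return "Deceptive"   # high rating but negative comment
    if row["overall"] <= 2 and row["polarity"] > 0.25:
        return "Deceptive"   # low rating but positive comment
    return "Truthful"

df["RS"] = df.apply(first_round_filter, axis=1)
print(df)
```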
4
4.1
Visualizing the most frequent words in the comments for each brand helps us easily distinguish the overall user opinions of these brands' accessories, as sketched below.
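As an illustration of that view, the sketch below counts the most frequent words per brand with a plain Counter; the brand grouping, the stop-word list, and the toy comments are assumptions, and the report may instead use a word-cloud style plot.

```python
import re
from collections import Counter

# Toy comments grouped by brand; real data would come from the review set.
comments_by_brand = {
    "Motorola": ["good battery but weak bluetooth range", "battery lasts long"],
    "Apple": ["great sound quality", "sound is crisp and clear"],
}

stop_words = {"the", "is", "and", "but", "a", "of"}

for brand, comments in comments_by_brand.items():
    words = []
    for comment in comments:
        words += [w for w in re.findall(r"[a-z']+", comment.lower())
                  if w not in stop_words]
    # Most frequent words for this brand's comments.
    print(brand, Counter(words).most_common(5))
```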
4.3 Customer subjectivity
Figure 14: Motorola, BlackBerry, and Apple polarity and subjectivity distribution
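A minimal sketch of how such a polarity/subjectivity view can be produced is given below; it assumes TextBlob for sentiment and matplotlib for plotting, with toy comments standing in for the real data.

```python
import matplotlib.pyplot as plt
from textblob import TextBlob

# Toy comments per brand; the real distributions come from the review data.
brand_comments = {
    "Motorola": ["decent headset for the price", "terrible battery life"],
    "BlackBerry": ["works perfectly with my phone", "quite a disappointing build"],
    "Apple": ["absolutely love the sound", "the cable broke far too quickly"],
}

# One scatter series per brand: x = polarity, y = subjectivity.
for brand, comments in brand_comments.items():
    sentiments = [TextBlob(c).sentiment for c in comments]
    plt.scatter([s.polarity for s in sentiments],
                [s.subjectivity for s in sentiments], label=brand)

plt.xlabel("polarity")
plt.ylabel("subjectivity")
plt.legend()
plt.show()
```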
4.5
4.7
its stock price, even though our data set only covers 2000-03-24 to 2014-07-02. Also, from 2012 to 2014, its stock price, average rating, and polarity show comparable trends.
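As a rough sketch of how that comparison can be made, the snippet below resamples ratings and polarity to monthly averages and aligns them with month-end stock prices; the column names, the toy values, and the use of a simple correlation are assumptions.

```python
import pandas as pd

# Toy frames stand in for the real review data and a stock price feed.
reviews = pd.DataFrame({
    "date": pd.to_datetime(["2012-01-05", "2012-01-20", "2012-02-10"]),
    "overall": [4, 5, 3],
    "polarity": [0.4, 0.6, 0.1],
})
stock = pd.DataFrame({
    "date": pd.to_datetime(["2012-01-31", "2012-02-29"]),
    "close": [520.0, 540.0],
})

# Monthly average rating and polarity, aligned with month-end stock prices.
monthly = reviews.set_index("date").resample("M")[["overall", "polarity"]].mean()
trend = monthly.join(stock.set_index("date")["close"], how="inner")

# A simple way to quantify how similar the three trends are.
print(trend.corr())
```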
Conclusions
they prefer in the future. Through the emotion distribution by subjectivity and polarity, we can also get a clear view of which brands' commenters tend to be more dissatisfied. However, due to the limits of our computing resources, we cannot provide a comprehensive scatter plot at this stage.
3. Regarding fake review detection, we built our own first-round filter to investigate possible fake reviews. However, according to our findings, without further emotion analysis these candidates can only be identified as reviews whose ratings do not match the emotions expressed in the comments. Thus, for future work, we will consider adapting classifiers to obtain more accurate findings.
4. Understanding each brand's pricing design requires a great deal of insightful market data. Here we offer another perspective on this pricing question by examining how vendors such as Jabra design and set prices for accessories of different brands. In the future, our findings should be compared with reputable market surveys to confirm whether our customer review data set is representative enough to inform pricing strategy.
5. With our three examples from HTC, Apple, and BlackBerry, we find that during the period from 2011 to 2014 their stock prices, ratings, and polarity share almost identical trends, which motivates us to gather more information to understand why these customer reviews started to match the financial market.
As for the division of work in this project, we discussed the tasks together and contributed equal amounts of work, although the business insights and theory were provided by Yi-Fan, since his background is related to that area. The code used throughout this work will be added to the project's folder.
References
McAuley, Julian, et al. Image-based recommendations on styles and substitutes. Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2015.