
11/4/2019

Text Mining with R

Paulo Canas Rodrigues


Department of Statistics, Federal University of Bahia, Salvador, Brazil

Text Mining with R


(Sentiment analysis in social media)


Sentiment analysis in social media


Data: Twitter US Airline Sentiment

Assumptions:
The more addressees in a tweet, the harsher its words
Longer tweets are also less likely to contain favorable language

Sources:
https://www.kaggle.com/mrisdal/d/crowdflower/twitter-airline-sentiment/exploring-audience-text-length
https://www.kaggle.com/solegalli/d/crowdflower/twitter-airline-sentiment/airline-sentiment-part-1/notebook
https://www.kaggle.com/solegalli/d/crowdflower/twitter-airline-sentiment/airline-sentiment-part-2

Files used in this course:


Twitter.Visualization.USAirlines.R
Tweets.csv (https://www.kaggle.com/crowdflower/twitter-airline-sentiment)

R packages needed:
See Twitter.Visualization.USAirlines.R


Required R packages

R packages required (Twitter.Visualization.USAirlines.R):

install.packages("readr")
install.packages("ggplot2")
install.packages("ggthemes")
install.packages("dplyr")
install.packages("stringr")
install.packages("gridExtra")
install.packages("tm")
install.packages("SnowballC")
install.packages("wordcloud")
install.packages("fpc")
install.packages("cluster")
install.packages("maps")
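When the script is re-run, reinstalling all twelve packages each time is unnecessary. A minimal sketch of installing only what is missing follows; the helper name `missing_packages` is hypothetical and not part of Twitter.Visualization.USAirlines.R.

```r
# Packages listed on this slide
pkgs <- c("readr", "ggplot2", "ggthemes", "dplyr", "stringr", "gridExtra",
          "tm", "SnowballC", "wordcloud", "fpc", "cluster", "maps")

# Hypothetical helper: return the packages that are not installed yet
missing_packages <- function(pkgs,
                             installed = rownames(installed.packages())) {
  setdiff(pkgs, installed)
}

to_install <- missing_packages(pkgs)

# Install only in an interactive session and only if something is missing
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```

The `interactive()` guard is a design choice for this sketch, so that sourcing the file in a batch job never triggers a download.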


Read in the data and inspect


# Load packages
library('readr') # read files
library('ggplot2') # visualization
library('ggthemes') # visualization
library('dplyr') # data manipulation
library('stringr') # text manipulation

tweets <- read_csv('../input/Tweets.csv')

str(tweets)

## Classes 'tbl_df', 'tbl' and 'data.frame': 14640 obs. of 15 variables:
## $ tweet_id : num 5.7e+17 5.7e+17 5.7e+17 5.7e+17 5.7e+17 ...
## $ airline_sentiment : chr "neutral" "positive" "neutral" "negative" ...
## $ airline_sentiment_confidence: num 1 0.349 0.684 1 1 ...
## $ negativereason : chr NA NA NA "Bad Flight" ...
## $ negativereason_confidence : num NA 0 NA 0.703 1 ...
## $ airline : chr "Virgin America" "Virgin America" "Virgin America" "Virgin America" ...
## $ airline_sentiment_gold : chr NA NA NA NA ...
## $ name : chr "cairdin" "jnardino" "yvonnalynn" "jnardino" ...
## $ negativereason_gold : chr NA NA NA NA ...
## $ retweet_count : int 0 0 0 0 0 0 0 0 0 0 ...
## $ text : chr "@VirginAmerica What @dhepburn said." "@VirginAmerica plus you've added commercials to the experience... tacky." "@VirginAmerica I didn't today... Must mean I need to take another trip!" "@VirginAmerica it's really aggressive to blast obnoxious \"entertainment\" in your guests' faces &amp; they have little recours"| __truncated__ ...
## $ tweet_coord : chr NA NA NA NA ...
## $ tweet_created : chr "2015-02-24 11:35:52 -0800" "2015-02-24 11:15:59 -0800" "2015-02-24 11:15:48 -0800" "2015-02-24 11:15:36 -0800" ...
## $ tweet_location : chr NA NA "Lets Play" NA ...
## $ user_timezone : chr "Eastern Time (US & Canada)" "Pacific Time (US & Canada)" "Central Time (US & Canada)" "Pacific Time (US & Canada)" ...

Source: https://www.kaggle.com/mrisdal/d/crowdflower/twitter-airline-sentiment/exploring-audience-text-length


Prepping the data


# Create a variable holding the number of `@` characters in each tweet
# (str_count() is vectorised, so sapply() is not needed)
tweets$at_count <- str_count(tweets$text, '@')

maxAt <- max(tweets$at_count)

# Collapse the number of `@`s into three categories: 1, 2 and 3 or more
tweets$at_countD <- NA_character_
tweets$at_countD[tweets$at_count == 1] <- '1'
tweets$at_countD[tweets$at_count == 2] <- '2'
tweets$at_countD[tweets$at_count %in% c(3:maxAt)] <- '3+'

# Change to a factor variable
tweets$at_countD <- factor(tweets$at_countD)

# Store the length of each tweet (number of characters)
tweets$text_length <- nchar(tweets$text)

The stringr package is used to count the number of @ symbols in each tweet. If
there is only one, it is the airline's handle.
Base R's nchar() is used to count the number of characters in each tweet,
where the maximum length is assumed to be 170.
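To make the preparation step concrete, here is a small sketch on three made-up tweets (hypothetical data, not taken from Tweets.csv). It uses `cut()` to bin the counts, which is an alternative to the three separate assignments above rather than the method used in the course script.

```r
library(stringr)

# Three made-up tweets (hypothetical, for illustration only)
toy <- data.frame(
  text = c("@united thanks!",
           "@united @AmericanAir my bag is lost",
           "@united @AmericanAir @SouthwestAir @delta still waiting"),
  stringsAsFactors = FALSE
)

# Count the literal '@' characters in each tweet
toy$at_count <- str_count(toy$text, fixed("@"))

# Bin the counts: (0,1] -> '1', (1,2] -> '2', (2,Inf] -> '3+'
toy$at_countD <- cut(toy$at_count, breaks = c(0, 1, 2, Inf),
                     labels = c("1", "2", "3+"))

# Number of characters in each tweet
toy$text_length <- nchar(toy$text)

print(toy[, c("at_count", "at_countD", "text_length")])
```

As in the original code, a tweet with no @ at all would end up as `NA` in `at_countD`.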

Source: https://www.kaggle.com/mrisdal/d/crowdflower/twitter-airline-sentiment/exploring-audience-text-length


Preliminary visual inspection

# Getting my sentiment colors & breaks ready.
# Colors are named so each one stays tied to its sentiment level
sentPlt <- c(negative = '#f93822', neutral = '#fedd00', positive = '#27e833')
sentBreaks <- c('positive','neutral','negative')

# Visualize distribution of `@` counts
ggplot(tweets, aes(x = at_count)) +
  geom_density(fill = '#99d6ff', alpha = 0.4) +
  labs(x = 'Number of @s') +
  theme_few() +
  theme(text = element_text(size = 12))

# Show counts
table(tweets$at_countD)
##
##     1     2    3+
## 12995  1420   225

Source: https://www.kaggle.com/mrisdal/d/crowdflower/twitter-airline-sentiment/exploring-audience-text-length


Preliminary visual inspection

# Visualize distribution of tweet length by sentiment
ggplot(tweets, aes(x = text_length,
                   fill = airline_sentiment)) +
  geom_density(alpha = 0.2) +
  scale_fill_manual(name = 'Tweet\nSentiment',
                    values = sentPlt,
                    breaks = sentBreaks) +
  geom_vline(xintercept = 170,
             lwd = 1, lty = 'dashed') +
  labs(x = 'Tweet Length') +
  theme_few() +
  theme(text = element_text(size = 12))

Source: https://www.kaggle.com/mrisdal/d/crowdflower/twitter-airline-sentiment/exploring-audience-text-length


Deeper analysis: angrier tweets have more @ symbols?

# Visualize proportions of positive, neutral, and negative
# sentiment tweets by number of @ symbols used
ggplot(tweets, aes(x = at_countD, fill = airline_sentiment)) +
  geom_bar(position = 'fill', colour = 'black') +
  scale_fill_manual(name = 'Tweet\nSentiment',
                    values = sentPlt,
                    breaks = sentBreaks) +
  labs(x = 'Number of @s', y = 'Proportion') +
  theme_few() +
  theme(text = element_text(size = 12))

Tweets containing 1, 2, and 3+ @ symbols have roughly the same proportion of
positive tweets, but as the number of @ symbols increases, the proportion of
negative tweets goes down and the proportion of neutral tweets goes up.
This is probably because the share of text that is useful for sentiment analysis
decreases as the number of addressees increases, resulting in greater
uncertainty/neutrality.
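The proportions that the stacked bars display can also be read off as numbers. The sketch below uses a tiny made-up data frame (hypothetical values, not the real counts) with the same column names as `tweets`, and normalises each row of the cross-tabulation so that it sums to 1.

```r
# Hypothetical toy data standing in for the tweets data frame
toy <- data.frame(
  at_countD = c("1", "1", "1", "1", "2", "2", "3+", "3+"),
  airline_sentiment = c("negative", "negative", "negative", "positive",
                        "negative", "neutral", "neutral", "positive")
)

# Cross-tabulate @ category against sentiment
counts <- table(toy$at_countD, toy$airline_sentiment)

# margin = 1 divides every row by its row total
props <- prop.table(counts, margin = 1)
print(round(props, 2))
```

On the real data, the same two calls applied to `tweets$at_countD` and `tweets$airline_sentiment` should give the heights of the bar segments.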

Source: https://www.kaggle.com/mrisdal/d/crowdflower/twitter-airline-sentiment/exploring-audience-text-length


Deeper analysis: angrier tweets have more @ symbols?


# Visualize the same plot as before but add airline
ggplot(tweets, aes(x = at_countD, fill = airline_sentiment)) +
  geom_bar(position = 'fill', colour = 'black') +
  facet_wrap(~airline) +
  scale_fill_manual(name = 'Tweet\nSentiment',
                    values = sentPlt,
                    breaks = sentBreaks) +
  labs(x = 'Number of @s', y = 'Proportion') +
  theme_few() +
  theme(text = element_text(size = 12))

Source: https://www.kaggle.com/mrisdal/d/crowdflower/twitter-airline-sentiment/exploring-audience-text-length


Deeper analysis: nicer tweets are longer?


# Visualize tweet length by sentiment, one panel per airline
# (the facet_wrap() argument is spelled `scales`)
ggplot(tweets, aes(x = text_length, fill = airline_sentiment)) +
  geom_density(alpha = 0.2) +
  facet_wrap(~airline, scales = 'free') +
  scale_fill_manual(name = 'Tweet\nSentiment',
                    values = sentPlt,
                    breaks = sentBreaks) +
  labs(x = 'Tweet Length') +
  theme_few() +
  theme(text = element_text(size = 12))

We see that negative tweets tend to be considerably longer than positive or
neutral ones.

In fact, it is interesting to see the ceiling effect of the 170-character limit
among tweets directed at Virgin America.
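The visual impression from the density plots can be backed up with a simple summary statistic. The sketch below compares median tweet length per sentiment on made-up numbers (hypothetical values, not results from Tweets.csv); the choice of the median is an assumption of this sketch, not something the original notebook reports.

```r
# Hypothetical toy data with the same column names as `tweets`
toy <- data.frame(
  airline_sentiment = c("negative", "negative", "negative",
                        "neutral", "neutral",
                        "positive", "positive"),
  text_length = c(138, 120, 131, 60, 85, 45, 90)
)

# Median tweet length within each sentiment group
length_summary <- tapply(toy$text_length, toy$airline_sentiment, median)
print(length_summary)
```

Replacing `toy` with `tweets` would give the corresponding medians for the real data.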

Source: https://www.kaggle.com/mrisdal/d/crowdflower/twitter-airline-sentiment/exploring-audience-text-length


Proportion of tweets with each sentiment

We can see from the bar plot and the pie chart that most tweets contain negative
sentiment.

Source: https://www.kaggle.com/solegalli/d/crowdflower/twitter-airline-sentiment/airline-sentiment-part-1/notebook


Proportion of tweets per airline

Most of the tweets are directed towards United Airlines, followed by American
and US Airways. Very few tweets are targeted towards Virgin America.

Source: https://www.kaggle.com/solegalli/d/crowdflower/twitter-airline-sentiment/airline-sentiment-part-1/notebook


Proportion of negative sentiment tweets per airline

The second plot is more informative, in the sense that it allows us to see the
proportion of negative sentiment tweets per airline.

We see that tweets directed at American, United and US Airways are mostly
negative.

In contrast, tweets directed towards Delta, Southwest and Virgin contain a good
proportion of neutral and positive sentiment tweets.

Source: https://www.kaggle.com/solegalli/d/crowdflower/twitter-airline-sentiment/airline-sentiment-part-1/notebook


Reasons for negative sentiment tweets

We see that negative sentiment is mostly elicited by Customer Service Issues
(presumably bad customer service), followed by Late Flights.
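A ranking like this can be obtained by tabulating the `negativereason` column for negative tweets. The sketch below does so on a few made-up rows (hypothetical values; only the column names follow Tweets.csv), overall and per airline.

```r
# Hypothetical toy rows with the column names used in Tweets.csv
toy <- data.frame(
  airline = c("United", "United", "Delta", "Delta", "Delta", "Virgin America"),
  airline_sentiment = c("negative", "negative", "negative",
                        "negative", "positive", "neutral"),
  negativereason = c("Customer Service Issue", "Late Flight",
                     "Customer Service Issue", "Lost Luggage", NA, NA)
)

# Keep the negative tweets only
neg <- toy[toy$airline_sentiment == "negative", ]

# Overall ranking of reasons, most frequent first
reason_counts <- sort(table(neg$negativereason), decreasing = TRUE)
print(reason_counts)

# Reasons broken down by airline
by_airline <- table(neg$airline, neg$negativereason)
print(by_airline)
```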

Source: https://www.kaggle.com/solegalli/d/crowdflower/twitter-airline-sentiment/airline-sentiment-part-1/notebook


Reasons for negative sentiment per airline


From the plots we can see that for American Airlines, negative sentiment is
elicited mostly by Customer Service Issues, and not so much by Late Flights. We
could speculate that American flights mostly depart on time.

The same seems to be true for Virgin and Southwest. Virgin seems to have a
suboptimal booking system, as booking problems are the second most common reason
for negative sentiment in tweets.

US Airways and United have a number of complaints about Customer Service Issues,
followed closely by Late Flights.

In contrast, most of the complaints about Delta are due to late flights. We
could then speculate that Delta has problems getting its flights to depart on
time, yet perhaps offers better customer service.

Source: https://www.kaggle.com/solegalli/d/crowdflower/twitter-airline-sentiment/airline-sentiment-part-1/notebook


Location of tweets: Visualisation on maps

Most tweets come from US & Canada time zones.

Most tweets come from the United States.

Source: https://www.kaggle.com/solegalli/d/crowdflower/twitter-airline-sentiment/airline-sentiment-part-1/notebook


Conclusions
Most tweets have negative sentiment (>60%).

Most tweets are targeted towards United Airlines, followed by American and US Airways.

Virgin receives very few tweets.

Most of the tweets targeted towards American, United and US Airways contain negative
sentiment.

Tweets targeted towards Delta, Virgin and Southwest contain roughly the same proportion
of negative, neutral and positive sentiment.

Main reasons for negative sentiment are Customer Service Issues and Late Flights.

Negative sentiment tweets towards Delta are based mostly on late flights and not so
much on Customer Service Issues as for the rest of the airlines.

Most tweets are not retweeted.

Most tweets come from US & Canada time zones.

Most tweets come from the States.


Cloud of words for each sentiment

The word clouds provide a nice visual representation of the word frequency for
each type of sentiment (negative: left; positive: right).

The size of each word reflects its frequency across all tweets.

We can get an idea of what people are talking about.

For example, for negative sentiment, people seem to complain about cancelled or
delayed flights, and hours waiting.

For positive sentiment, however, people are mostly thankful and talk about great
service/flight.
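The word sizes in a cloud come from a plain frequency table. The sketch below builds one from three made-up negative tweets using base R only; the tweets and the tiny stop-word list are assumptions for illustration, whereas the original notebook works on the real data with the tm package. The cloud itself is drawn only if the wordcloud package is available.

```r
# Hypothetical negative tweets (made up for illustration)
neg_tweets <- c("@united flight cancelled and no help",
                "@united two hours on hold, flight delayed",
                "@AmericanAir cancelled flight, hours waiting")

# Lower-case, drop @handles and punctuation, then split on whitespace
clean <- tolower(neg_tweets)
clean <- gsub("@\\w+", " ", clean)
clean <- gsub("[[:punct:]]", " ", clean)
words <- unlist(strsplit(clean, "\\s+"))
words <- words[nchar(words) > 0]

# Tiny hand-picked stop-word list (assumption; real lists are much longer)
stop_words <- c("and", "no", "on", "two")
words <- words[!words %in% stop_words]

# Word frequencies, most frequent first
freq <- sort(table(words), decreasing = TRUE)
print(freq)

# Draw the cloud only in an interactive session with wordcloud installed
if (interactive() && requireNamespace("wordcloud", quietly = TRUE)) {
  wordcloud::wordcloud(names(freq), as.numeric(freq), min.freq = 1)
}
```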

Source: https://www.kaggle.com/solegalli/d/crowdflower/twitter-airline-sentiment/airline-sentiment-part-2


Clustering analysis of words

In the dendrogram, words that are linked by short arms are highly associated.
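A dendrogram of this kind can be produced by hierarchical clustering of a term-by-tweet matrix. The sketch below uses a small made-up count matrix; the Euclidean distance and the Ward linkage (`"ward.D2"`) are assumptions of this sketch, since the slides do not state which settings the original notebook used.

```r
# Hypothetical term-by-tweet count matrix (rows = words, columns = tweets)
tdm <- matrix(
  c(1, 1, 0, 0,   # customer
    1, 1, 0, 0,   # service
    0, 0, 1, 1,   # cancelled
    0, 0, 1, 1,   # flight
    0, 1, 1, 0),  # hours
  nrow = 5, byrow = TRUE,
  dimnames = list(c("customer", "service", "cancelled", "flight", "hours"),
                  paste0("tweet", 1:4))
)

# Euclidean distance between word profiles, then agglomerative clustering
d <- dist(tdm)
fit <- hclust(d, method = "ward.D2")

# Cut the tree into three groups of words
groups <- cutree(fit, k = 3)
print(groups)

# plot(fit) would draw the dendrogram
```

Words that always occur in the same tweets, such as customer and service here, have distance zero and are therefore joined by the shortest arms.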

Source: https://www.kaggle.com/solegalli/d/crowdflower/twitter-airline-sentiment/airline-sentiment-part-2


Clustering analysis of words

Although the dendrogram does not seem to be particularly informative, we again
observe the association of words like customer and service, and cancelled and
flight.

Words that reflect complaints more generally, like waiting, bag (presumably
lost), hours, time and hold, cluster together.

Source: https://www.kaggle.com/solegalli/d/crowdflower/twitter-airline-sentiment/airline-sentiment-part-2


Clustering analysis of words

The positive tweet dendrogram is somewhat more informative.

We can see associations such as customer-service, best-airline, love-guys and
good-time, which indicate more clearly what the airline client's experience was.

Source: https://www.kaggle.com/solegalli/d/crowdflower/twitter-airline-sentiment/airline-sentiment-part-2
