Analyzing Political Discourse On Reddit

Network Analysis Final Report
Max Candocia
December 12, 2014
Abstract
The purpose of this paper is to show how to apply semi-supervised generative models to classify
the political ideologies of users in an online network and then analyze the interactions
throughout the website using homophily measures. For this study, Reddit, a large social media
website, is used as the sample source. The model uses two sets of variables: the parts of the
website where users post comments, split by whether those comments are positively received,
and various n-grams used within these comments. Users of the Internet often discuss and debate
political topics, but the manner in which these discussions take place varies greatly. After
classifying users, analyzing the homophily of different portions of the website provides insight
into how diversity of ideologies (or lack thereof) plays a role in facilitating discussions. In
addition, the proportion of positively-scored comments between people of different or similar
ideologies is used as a way of detecting hostilities between groups. For the generative model,
one of the interesting results is which n-grams are used more often by people of differing
ideologies. While these words do not include context, the second set of variables, which
involves posting locations, helps provide that aspect using Reddit's website structure and
scoring system. Hyperparameters control how sharply different variable sets discriminate between
classes, and are shown to reduce both error and impurity under 5-fold cross-validation. Finally,
the limitations of the algorithm are discussed, as comment histories are treated as corpora,
which can be tens of thousands of words long.
Introduction
User interactions on websites are often a topic of interest. However, the amount of data
one can gather on a particular user is often limited, especially if they do not have much
information publicly available or do not have a large amount of text available for analysis.
For the purposes of this project, Reddit is used. Reddit, one of the largest social media
websites in the United States, hosts tens of millions of discussions between millions of users.
Many of these users have extensive posting histories, and the discussions are divided into
sections of the website called Subreddits. A Subreddit is a section of the website whose content
revolves around a single topic, which can be either specific, such as a sports team in Germany,
or broad, such as US politics. There are Subreddits for a variety of topics, and they are usually
named by putting /r/ (pronounced like the English letter "r") before them, such as /r/politics,
/r/Conservative, or /r/DebateCommunism. Anecdotally, these communities have biases, which are
relatively obvious to any user who traverses the site. However, no attempt has been made to
measure this bias in terms of political ideology or affiliation, largely due to the difficulty
of classifying the ideologies of tens of thousands of users.
With classification results in hand, it is possible to analyze the compositions of various
Subreddits, as well as quantify interactions between users of different ideologies in them. The
results can be interpreted on a scale of -1 to 1, where -1 indicates that a Subreddit is a
debating ground (i.e., the sole purpose of commenting is to argue with someone of a different
political ideology), while 1 indicates that it is a pure hive mind (i.e., believers of an
ideology only discuss amongst themselves). Of course, the limitations of the classification
algorithm should be taken into account whenever analyzing the results.
Background
Ideologies
For this study, users are grouped into five different ideological categories. This is not a 100%
accurate representation of the political spectrum, but the groups are general enough to apply
to a large number of people, and they are different enough in terms of philosophical aspects
that they can be considered distinct. These terms are relatively unambiguous when used in
the context of United States politics.
Anarcho-Capitalism
Anarcho-Capitalism is a subset of libertarianism as it is described below. Generally speaking,
it is the belief that all interactions between humans should be facilitated according to
principles of laissez-faire capitalism and that government should not exist.

Conservatism
In this context, conservatism refers to conservatism in the United States. Generally speaking,
it is the belief that the government should have a smaller role in economic issues.
As far as social issues go, there are varying degrees of belief that government should control
certain social interactions between people, but on Reddit, the number of conservatives who
believe in strong government control over social interactions is small relative to the total
number of conservatives. Most of the followers of this ideology on Reddit lean towards
the US Republican party.
Liberalism
In this context, liberalism refers to what is generally known as left-liberalism, which consists
of the beliefs of less involvement of government in social affairs between people and more
government involvement in people's well-being and regulation of businesses. Most of the
followers of this ideology on Reddit lean towards the US Democratic party.
Libertarianism
In this context, libertarianism refers to what is generally known as right-libertarianism, which
is the belief that government should have minimal involvement in both social and economic
affairs of people. While anarcho-capitalism is a subset of this ideology, it is excluded from
this category because of the divide between the idea of minimal government and the idea of no
government.
Socialism
In this context, socialism mostly refers to communism, but it can also refer to the ideologies
of social democrats, who prefer more gradual change than other socialists, who often describe
themselves as anarchists. One of the common goals of socialism is to eliminate hierarchy in
society via the elimination of capitalism.
Learning Algorithm
Learning algorithms for large networks require relatively simple models that can support a
large number of parameters. For this reason, generative models are often used. Nigam et al.
describe a basic framework for working with a semi-supervised model in general. Wang et al.
describe some different hyperparameter choices, such as link type importance, that are used in
this methodology. However, CATHYHIN, as described in Wang et al., is an unsupervised learning
method that is designed to work on bibliographic data and that creates subtopics for clustering
(clusters within clusters). While this could certainly be done with ideologies (e.g., treating
Anarcho-Capitalism as a subtopic of Libertarianism), this aspect of the algorithm is ignored
for the purposes of exploring a simpler model.
Homophilies of Online Communities

Homophilies of online communities are of great interest to researchers, as data is easier to
gather and far more abundant than real-life data. Generally, those measuring homophily
use race, gender, and location as attribute variables. However, Bisgin et al. describe
how online communities have the potential to attract people of similar interests due to the
very low barrier to entry. They measure pairwise similarity using Jaccard similarity
coefficients, with the shared interests between users as the similarity attribute. However,
for this data, we rely on clustering to determine whether users are similar,
so that method cannot be applied here. Granted, by definition of the generative
model used, users who use similar words or post in similar places will have a tendency to be
clustered together.
Data
Sampling
The data was collected from http://www.reddit.com/ using the Reddit API over the course
of two weeks. The scraper sequentially cycled through 24 different Subreddits and grabbed
the most recent post with at least 1 comment reply that was at least 2 days old. The
reason for the cutoff is that most posts do not acquire new comments after 1 day. Within
each comment thread, each comment is collected, as well as the commenting history (last
200 comments) of the user who posted the comment. There are cutoffs for threads with
an abnormally high number of comments. However, this occurs very rarely, and most of
the omitted comments have less substance and no replies, making them less
meaningful. This is partly due to Reddit's ranking algorithm, which places more recent and
more positively-voted comments most visibly.
Labeling
Since ideologies are not labeled a priori, they were manually labeled by selecting a random
sample from the set of all users. Their entire comment histories were analyzed, and a decision
was made regarding classification. Some users could not be classified due to insufficient
information, so they were skipped. The group sizes from this classification were used to
construct priors for the generative model. The group sizes for anarcho-capitalism, conservatism,
liberalism, libertarianism, and socialism were 53, 80, 296, 87, and 59, respectively, for a total
of 575 users. This equates to priors of 9.2%, 13.9%, 51.5%, 15.1%, and 10.3%. Note that
this prior is specific to the sample, not to Reddit as a whole.
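The priors follow directly from the group sizes. A minimal Python sketch of the arithmetic is given below; the names `GROUP_SIZES` and `class_priors` are illustrative and not part of the original analysis. The five group sizes sum to 575, and dividing each by that sum reproduces the percentages quoted above.

```python
# Group sizes of the manually labeled sample, as reported in the text.
GROUP_SIZES = {
    "anarcho-capitalism": 53,
    "conservatism": 80,
    "liberalism": 296,
    "libertarianism": 87,
    "socialism": 59,
}


def class_priors(sizes):
    """Return each group's share of the labeled sample."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


priors = class_priors(GROUP_SIZES)
for name, p in priors.items():
    print(f"{name:20s} {100 * p:.1f}%")
```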
Description
A total of 18,545 users were recorded, with up to 200 comments from each user's comment
history, subject to a time cutoff of 4 months before the date of collection. 116,810 comments
from 100 different Subreddits were used in the learning algorithm. 3,550 different threads were
analyzed, with a total of 70,738 comments between users. Note that this value is the sum of
all weighted edges between users.
The n-grams used for the general algorithm were hand-selected, as topical clustering can
occur in many other ways. For example, people from all ideologies might talk about the Ebola
virus, ISIS's invasion of Iraq, or other current events that aren't politically lopsided. Also,
note that the values for Subreddits were weighted logarithmically in order to give more weight
to comments that are received more positively or negatively by the community. Additionally,
the Subreddit variables were divided into two parts: positively-scored comments on the
Subreddit and negatively-scored comments. This increases the number of Subreddit variables from
100 to 200.
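As an illustration of how such Subreddit variables could be built, the following Python sketch aggregates a user's comments into log-weighted, sign-split features. The exact weighting function is not given in this report, so the form 1 + log(1 + |score|) and the rule that a score of zero or less counts as negative are assumptions made only for this example; `subreddit_features` is a hypothetical helper.

```python
import math
from collections import defaultdict


def subreddit_features(comments):
    """Aggregate (subreddit, score) pairs into log-weighted feature values.

    Assumed weighting: each comment contributes 1 + log(1 + |score|).
    Assumed split: score > 0 is "pos", anything else is "neg".
    """
    features = defaultdict(float)
    for subreddit, score in comments:
        sign = "pos" if score > 0 else "neg"
        features[(subreddit, sign)] += 1.0 + math.log1p(abs(score))
    return dict(features)


# One user's (hypothetical) comment history.
history = [("politics", 12), ("politics", -3), ("Libertarian", 1)]
feats = subreddit_features(history)
```

Each Subreddit contributes two variables, one per sign, which is how 100 Subreddits become 200 variables.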
Method
Semi-Supervised Classification
Symbol               Meaning
$b$                  the set of ideologies
$s$                  the set of Subreddits
$t$                  the set of n-grams
$b_k$                the $k$th ideology
$s_i$                the $i$th Subreddit
$t_i$                the $i$th n-gram
$v_t$                the relative importance of terms versus Subreddits
$W_s$                the initial weight of manually labeled data versus the entire data set for Subreddits
$W_t$                the initial weight of manually labeled data versus the entire data set for n-grams
$d_s$                the exponential decay rate for the weight of manually labeled data for Subreddits
$d_t$                the exponential decay rate for the weight of manually labeled data for n-grams
$w_s$                the minimum weight of manually labeled data after applying decay for Subreddits
$w_t$                the minimum weight of manually labeled data after applying decay for n-grams
$\lambda_s$          the calculated weight of trained data for Subreddits
$\lambda_t$          the calculated weight of trained data for n-grams
$\alpha_s$           the raw count added to each ideology's Subreddit count; a smoothing parameter
$\alpha_t$           the raw count added to each ideology's n-gram count; a smoothing parameter
$s_{ik}$             the probability of $s_i$ appearing in $b_k$
$t_{ik}$             the probability of $t_i$ appearing in $b_k$
$u_i$                the $i$th user
$u^s_{ij}$           the weighted value of $s_j$ for $u_i$
$u^t_{ij}$           the weighted value of $t_j$ for $u_i$
$u^b_i$              the ideology of $u_i$
$\tilde{u}^b_{ik}$   the likelihood of $b_k$ for $u_i$
$\hat{u}^b_{ik}$     the probability of $b_k$ for $u_i$
$q$                  the iteration number of the EM algorithm, starting at 0
$N$                  the total number of users
$N_s$                the total number of Subreddits
$N_t$                the total number of n-grams
$M$                  the total number of manually labeled users
$u_m$                the set of manually labeled users
Assumptions
1. Each user will post a certain number of comments with no relation to their ideology.
2. Each user will use a certain number of n-grams with no relation to their ideology.
3. When a user who follows $b_k$ posts a comment in a Subreddit, there is a probability
$s_{ik}$ that the Subreddit is $s_i$.
4. When a user who follows $b_k$ uses a key n-gram in a comment, there is a probability
$t_{ik}$ that that particular n-gram is $t_i$.
5. For any randomly selected user, there is a probability $p_k$ that they belong to $b_k$,
where $p_k$ is the prior probability.
Algorithm
For each iteration number q, we first estimate the weights of manually labeled data,
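$$\lambda_s = \frac{N}{M}\, W_s \max\left(e^{-q d_s},\, w_s\right)$$
$$\lambda_t = \frac{N}{M}\, W_t \max\left(e^{-q d_t},\, w_t\right)$$
where $\lambda_s$ and $\lambda_t$ are the calculated weights of the trained data for Subreddits
and n-grams, $N$ is the total number of users, and $M$ is the number of manually labeled users.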
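The weight schedule can be sketched in a few lines of Python, assuming the weight takes the form (N/M) W max(exp(-q d), w), i.e. an exponentially decaying factor with a floor. The function name `labeled_weight` and the hyperparameter values in the usage lines (W = 1.0, d = 0.5, w = 0.1) are hypothetical and chosen only to show the shape of the schedule; N = 18,545 and M = 575 come from the data description.

```python
import math


def labeled_weight(q, n_users, n_labeled, initial_weight, decay_rate, min_weight):
    """Extra weight given to manually labeled users at EM iteration q.

    The decay factor starts at 1 (q = 0), shrinks exponentially with q,
    and is never allowed to drop below min_weight.
    """
    decay = max(math.exp(-q * decay_rate), min_weight)
    return (n_users / n_labeled) * initial_weight * decay


# Hypothetical hyperparameters; N and M are taken from the data description.
schedule = [labeled_weight(q, 18545, 575, 1.0, 0.5, 0.1) for q in range(12)]
```

With these values the weight starts at N/M and levels off at one tenth of that once exp(-0.5 q) falls below 0.1.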
Normally algorithms run in supervised mode
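Then we calculate the Subreddit and n-gram probabilities for each ideology,
$$s_{jk} = \frac{\alpha_s + \sum_{u_i \mid u^b_i = b_k} \left(1 + \lambda_s I(u_i \in u_m)\right) u^s_{ij}}{\alpha_s N_s + \sum_{s_l \in s} \sum_{u_i \mid u^b_i = b_k} \left(1 + \lambda_s I(u_i \in u_m)\right) u^s_{il}}$$
$$t_{jk} = \frac{\alpha_t + \sum_{u_i \mid u^b_i = b_k} \left(1 + \lambda_t I(u_i \in u_m)\right) u^t_{ij}}{\alpha_t N_t + \sum_{t_l \in t} \sum_{u_i \mid u^b_i = b_k} \left(1 + \lambda_t I(u_i \in u_m)\right) u^t_{il}}$$
where $\alpha_s$ and $\alpha_t$ are the smoothing counts, $\lambda_s$ and $\lambda_t$ are the
weights of manually labeled data, $I(\cdot)$ is the indicator function, and $u_m$ is the set of
manually labeled users.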
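A minimal Python sketch of this step is shown below, assuming the probabilities are smoothed, weighted relative frequencies that sum to 1 within each ideology. The function `feature_probabilities` and its argument names are illustrative; the same function would serve for Subreddits and for n-grams.

```python
def feature_probabilities(values, labels, labeled, n_classes, alpha, lam):
    """Smoothed per-class feature probabilities.

    values[i][j] : weighted value of feature j for user i
    labels[i]    : current class index assigned to user i
    labeled      : set of indices of manually labeled users
    alpha        : smoothing count added to every (class, feature) cell
    lam          : extra weight given to manually labeled users
    """
    n_features = len(values[0])
    counts = [[alpha] * n_features for _ in range(n_classes)]
    for i, row in enumerate(values):
        weight = 1.0 + (lam if i in labeled else 0.0)
        for j, value in enumerate(row):
            counts[labels[i]][j] += weight * value
    # Normalize each class so its feature probabilities sum to 1.
    return [[c / sum(row) for c in row] for row in counts]


# Three users, two features, two classes; user 0 is manually labeled.
probs = feature_probabilities(
    values=[[2, 0], [0, 3], [1, 1]],
    labels=[0, 1, 0],
    labeled={0},
    n_classes=2,
    alpha=1.0,
    lam=2.0,
)
```

In this toy case the labeled user counts three times as much as an unlabeled one, giving class 0 the probabilities 0.8 and 0.2.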
Using these values, we then calculate the ideology membership likelihoods of different
users.
$$\tilde{u}^b_{ik} = \prod_{s_j \in s} s_{jk}^{\,u^s_{ij}} \left( \prod_{t_j \in t} t_{jk}^{\,u^t_{ij}} \right)^{v_t}$$
Then the probabilities are calculated by dividing each likelihood by the sum of likelihoods:
$$\hat{u}^b_{ik} = \frac{\tilde{u}^b_{ik}}{\sum_{b_l \in b} \tilde{u}^b_{il}}$$
Then new memberships are assigned by selecting the ideology with the maximum probability,
$$u^b_i = b_{k^*}, \qquad k^* = \arg\max_k \hat{u}^b_{ik}$$
This process is repeated several times (usually one to two dozen) for convergence. Note
that logarithms are preferred for computational purposes, but the mathematics is the same.
Additionally, class priors were not updated, as they usually are in generative models. Doing
so favors the largest groups, which disproportionately increases error in other groups.
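The likelihood, normalization, and assignment steps can be sketched in Python as follows, working in logarithms as noted above. The sketch assumes the likelihood is a product of feature probabilities raised to the user's weighted values, with the n-gram part raised to the relative-importance exponent; `class_posteriors` and `assign` are hypothetical names.

```python
import math


def class_posteriors(sub_values, term_values, sub_probs, term_probs, v_t):
    """Normalized class probabilities for a single user.

    sub_values[j], term_values[j] : the user's weighted feature values
    sub_probs[k][j], term_probs[k][j] : per-class feature probabilities (> 0)
    v_t : relative importance of n-grams versus Subreddits
    """
    log_like = []
    for k in range(len(sub_probs)):
        ll = sum(u * math.log(p) for u, p in zip(sub_values, sub_probs[k]))
        ll += v_t * sum(u * math.log(p) for u, p in zip(term_values, term_probs[k]))
        log_like.append(ll)
    # Subtract the maximum before exponentiating to avoid underflow.
    top = max(log_like)
    unnormalized = [math.exp(ll - top) for ll in log_like]
    total = sum(unnormalized)
    return [x / total for x in unnormalized]


def assign(posteriors):
    """Index of the class with the highest probability."""
    return max(range(len(posteriors)), key=posteriors.__getitem__)


post = class_posteriors(
    sub_values=[2, 0],
    term_values=[1, 1],
    sub_probs=[[0.8, 0.2], [0.2, 0.8]],
    term_probs=[[0.5, 0.5], [0.5, 0.5]],
    v_t=2.0,
)
```

Here the n-gram probabilities are identical across classes, so the result depends only on the Subreddit part: 0.64 / (0.64 + 0.04).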
Variable Selection
Because variables were initially selected without much regard for their usefulness from a
statistical perspective, a simple and efficient method is used to reduce the number of
Subreddit and n-gram variables: after the generative model algorithm is run, it is rerun with
every variable removed that does not have a factor of at least 20 between the highest and lowest
probabilities among its ideologies. This means that variables must be more discriminatory
in order to be considered in the model. Other selection methods, using cross-validation to
evaluate model criteria, are possible, but this cutoff is the most efficient considering the
size of the dataset.
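The cutoff rule amounts to a single pass over the fitted probabilities. A small Python sketch follows; `select_features` is an illustrative name, and the probabilities in the usage lines are made up.

```python
def select_features(probs, min_ratio=20.0):
    """Indices of features whose highest class probability is at least
    min_ratio times their lowest class probability.

    probs[k][j] is the probability of feature j under class k and is
    assumed to be strictly positive (which smoothing guarantees).
    """
    keep = []
    for j in range(len(probs[0])):
        column = [row[j] for row in probs]
        if max(column) >= min_ratio * min(column):
            keep.append(j)
    return keep


# Two classes, three features (hypothetical values).
kept = select_features([[0.50, 0.30, 0.20],
                        [0.01, 0.25, 0.74]])
```

Only the first feature survives here: its probabilities differ by a factor of 50, while the other two differ by less than 4.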
Validation
Because manually labeled data exists, the model may be validated via k-fold cross-validation.
1. Choose a number k that represents the number of folds (i.e., iterations of validation)
you want to use.
2. Divide the manually labeled data into k partitions.
3. For each partition, run the generative model with the labels of that partition withheld.
4. Compare the results of the classifications for each partition with their true (manually
labeled) classifications. Record the error rate and the impurity rate for each classification.
Depending on the goal of classification, other loss criteria may be used to quantify the
model's accuracy.
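The bookkeeping for this procedure can be sketched in Python as below. The fold split is a simple round-robin partition, and error and impurity are computed per class as defined in this report (the share of a class's true members that were misclassified, and the share of users assigned to a class that do not belong to it). The function names and the toy labels are illustrative.

```python
def k_folds(items, k):
    """Split a list into k roughly equal partitions (round-robin)."""
    return [items[i::k] for i in range(k)]


def error_and_impurity(true_labels, predicted_labels, classes):
    """Per-class (error rate, impurity rate)."""
    result = {}
    for c in classes:
        actual = sum(1 for t in true_labels if t == c)
        assigned = sum(1 for p in predicted_labels if p == c)
        correct = sum(1 for t, p in zip(true_labels, predicted_labels)
                      if t == c and p == c)
        error = 1 - correct / actual if actual else float("nan")
        impurity = 1 - correct / assigned if assigned else float("nan")
        result[c] = (error, impurity)
    return result


folds = k_folds(list(range(10)), 5)
metrics = error_and_impurity(
    true_labels=["A", "A", "B", "B", "B"],
    predicted_labels=["A", "B", "B", "B", "A"],
    classes=["A", "B"],
)
```

In practice the labels of each fold would be withheld in turn, the model rerun, and the predictions for the withheld users passed to `error_and_impurity`.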
Homophily Measures
Note that if one of these variables appears without operating on a Subreddit, it refers to the
entire dataset.
Variable             Meaning
$f(s_k)$             the frequency matrix of comments from users of one ideology to another in $s_k$
$f_{ij}(s_k)$        the frequency of comment responses from someone of $b_i$ to someone of $b_j$ in $s_k$
$f_{\cdot j}(s_k)$   the frequency of responses from all users to someone of $b_j$ in $s_k$
$f_{i \cdot}(s_k)$   the frequency of responses from someone of $b_i$ to all users in $s_k$
$H(s_i)$             the homophily of $s_i$
$E_{ij}(s_k)$        the expected number of responses from someone of $b_i$ to someone of $b_j$ in $s_k$,
                     assuming that there is no correlation with respect to $i$ and $j$
$ER_{ij}(s_k)$       $f_{ij}(s_k)/E_{ij}(s_k)$, defined as 1 when it would otherwise be $0/0$
$S_{ij}(s_k)$        the proportion of comment replies from someone of $b_i$ to $b_j$ in $s_k$ that are
                     positively-scored
Overall Homophily
Since homophily is defined in terms of class identification, a matrix of responses, $f(s_k)$,
in which the entry in the $i$th row and $j$th column is the number of comment replies from users
of $b_i$ to users of $b_j$ in $s_k$, is used to calculate it. The homophily, $H(s_k)$, is
defined to be
$$H(s_k) = \frac{2\,\mathrm{trace}\left(f(s_k)\right) - \sum f(s_k)}{\sum f(s_k)}$$
where $\sum f(s_k)$ is the sum of all entries of the matrix.
A value of 1 means that all interactions are between people of similar ideologies, and a
value of -1 means all interactions are between users of different ideologies.
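As a quick check of the two extremes, the measure can be computed in Python as twice the share of same-ideology replies minus one, which is equivalent to (2 trace(f) - sum(f)) / sum(f). The function name `homophily` and the small matrices are illustrative.

```python
def homophily(f):
    """Homophily of a square reply-count matrix f.

    f[i][j] is the number of replies from users of ideology i to users
    of ideology j. Returns (2 * trace - total) / total, which is 1 when
    every reply stays within an ideology and -1 when none do.
    """
    total = sum(sum(row) for row in f)
    trace = sum(f[i][i] for i in range(len(f)))
    return (2 * trace - total) / total


hive_mind = homophily([[5, 0], [0, 5]])        # only within-group replies
debate_ground = homophily([[0, 5], [5, 0]])    # only cross-group replies
mixed = homophily([[3, 1], [1, 3]])            # three quarters within-group
```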
Response Likelihood
Another metric of interest is how likely someone is to respond to someone else in a Subreddit
given both their ideologies, versus what you'd expect if they responded completely randomly.
The matrix of expected values is the product of the row and column totals divided by the total
number of replies:
$$E_{ij}(s_k) = \frac{f_{i \cdot}(s_k)\, f_{\cdot j}(s_k)}{\sum f(s_k)}$$
Then the ratio is calculated,
$$ER_{ij}(s_k) = \frac{f_{ij}(s_k)}{E_{ij}(s_k)}$$
Ratios that have large enough cell counts are then reported.
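The ratio can be checked against the reported figures. The Python sketch below uses the reply counts shown in Figure 2 and assumes the expected count for a cell is the usual independence estimate (row total times column total, divided by the grand total); the names `BY_POSTER` and `response_ratios` are illustrative.

```python
IDEOLOGIES = ["AnarchoCapitalism", "Conservatism", "Liberalism",
              "Libertarianism", "Socialism"]

# Reply counts transcribed from Figure 2: one row per original poster's
# ideology, one column per replier's ideology (order as in IDEOLOGIES).
BY_POSTER = [
    [1957, 123, 1762, 569, 670],
    [133, 1248, 3081, 478, 123],
    [1740, 3131, 34300, 3893, 2134],
    [562, 482, 3742, 1636, 192],
    [677, 113, 1970, 191, 4516],
]

# f[i][j]: replies from repliers of ideology i to posters of ideology j.
f = [list(col) for col in zip(*BY_POSTER)]


def response_ratios(f):
    """Ratio of actual to expected reply counts for every cell of f."""
    n = len(f)
    total = sum(sum(row) for row in f)
    row_sums = [sum(row) for row in f]
    col_sums = [sum(f[i][j] for i in range(n)) for j in range(n)]
    ratios = []
    for i in range(n):
        row = []
        for j in range(n):
            expected = row_sums[i] * col_sums[j] / total if total else 0.0
            # A zero expectation implies a zero count, i.e. 0/0, defined as 1.
            row.append(f[i][j] / expected if expected else 1.0)
        ratios.append(row)
    return ratios


ratios = response_ratios(f)
```

Rounded to one decimal, the diagonal entries for anarcho-capitalists, liberals, and socialists come out as 5.3, 1.2, and 5.5, which agrees with the values shown in Figure 4.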
Response Positivity
One final metric that is used, although not directly related to homophily, is the proportion
of responses from users of bi to bj within sk that are positive. This metric is extremely
straightforward to calculate, although it is mostly useful in analyzing specific Subreddits.
The main purpose of this metric is to indicate hostility towards particular groups from
another group, which is useful information when analyzing homophily.
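A minimal Python sketch of this metric is given below. It assumes each reply is available as a (replier ideology, original poster ideology, score) triple and that "positive" means a score above zero; both the data layout and the name `positive_share` are assumptions made for the example.

```python
from collections import defaultdict


def positive_share(replies):
    """Share of positively-scored replies for each (replier, poster) pair.

    replies is an iterable of (replier_ideology, poster_ideology, score).
    A reply counts as positive when its score is greater than zero.
    """
    positive = defaultdict(int)
    total = defaultdict(int)
    for replier, poster, score in replies:
        total[(replier, poster)] += 1
        if score > 0:
            positive[(replier, poster)] += 1
    return {pair: positive[pair] / total[pair] for pair in total}


# Hypothetical replies within a single Subreddit.
shares = positive_share([
    ("Conservatism", "Liberalism", -2),
    ("Conservatism", "Liberalism", 3),
    ("Liberalism", "Liberalism", 5),
    ("Liberalism", "Liberalism", 1),
])
```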
Results
Semi-Supervised Classification
For variable selection, 144 out of the original 200 Subreddits were used in the model, and
154 out of 330 of the original n-grams were used in the model.
For the semi-supervised classification, 5-fold cross-validation was used in order to determine the error rate and the impurity of each group. The error rate for each group is the
proportion of members of each group that were not classified into the proper group. The
impurity rate for each group is the percentage of users classified into a group that do not
belong to that group. In this experiment, the primary goal was to minimize impurity, as it
helps ensure that the subjects used for inference are most likely to have accurate labels.
            Anarcho-Capitalism   Conservatism   Liberalism   Libertarianism   Socialism
Error       32.8%                61.3%          11.5%        48.1%            39.2%
Impurity    26.7%                34.0%          29.0%        33.8%            21.7%
While these results are significantly better than random guesses, there is still a considerable amount of error.
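With rows representing actual values and columns representing predicted values (in the order
Anarcho-Capitalism, Conservatism, Liberalism, Libertarianism, Socialism), this is the
classification matrix from the cross-validation. Only the larger cell counts are legible, so
they are listed in column order for each row rather than placed in columns:
Actual ideology      Legible counts
Anarcho-Capitalism   33, 12
Conservatism         31, 43
Liberalism           10, 262, 13
Libertarianism       3, 32, 45
Socialism            20, 36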
Ideology Representation from Top 10 Hub and Authority-Ranked Users in Each Subreddit
[Two panels, Authority and Hub. Vertical axis: Subreddit. Horizontal axis: user rank, 1 to 10.
Color: ideology (Anarcho-Capitalism, Conservatism, Liberalism, Libertarianism, Socialism).
Subreddits shown: AmericanPolitics, Anarchism, Anarcho_Capitalism, Bad_Cop_No_Donut,
BasicIncome, changemyview, Conservative, conspiracy, DebateAnarchism, DebateFascism, Economics,
EnoughLibertarianSpam, gunpolitics, Libertarian, MensRights, moderatepolitics, NeutralPolitics,
occupywallstreet, PoliticalDiscussion, politics, privacy, progressive, socialism, worldpolitics.]
Figure 1: The ideologies of the most important users in various Subreddits. Hub rankings
and authority rankings derived from the HITS algorithm are used here. Users with high hub
rankings tend to respond to more, and more important, individuals, and users with high
authority rankings tend to be responded to by more, and more important, people.
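For readers unfamiliar with HITS, a compact Python sketch of the iteration on a reply network is shown below. It treats each reply as an unweighted edge from replier to the user replied to, which is an assumption; the report does not state how edges were weighted when the rankings were computed. The function name `hits` and the toy edges are illustrative.

```python
def hits(edges, iterations=50):
    """Hub and authority scores for a directed reply network.

    edges is a list of (replier, replied_to) pairs. A user's authority is
    the sum of the hub scores of those who reply to them; a user's hub
    score is the sum of the authority scores of those they reply to.
    Scores are rescaled to unit length after every update.
    """
    nodes = sorted({user for edge in edges for user in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            auth[dst] += hub[src]
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {n: v / norm for n, v in auth.items()}

        hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            hub[src] += auth[dst]
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {n: v / norm for n, v in hub.items()}
    return hub, auth


# Users a and b both reply to c; a also replies to d.
hub, auth = hits([("a", "c"), ("b", "c"), ("a", "d")])
```

In this toy network, c receives the most replies and ends up with the highest authority score, while a, who replies to both c and d, ends up with the highest hub score.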
Homophily Measures
The following results combine cross-validated averages with the results obtained when all
labeled data are used for classification. Additionally, results were averaged over
classification-calculated ideologies (i.e., values of 1 or 0) and probability-calculated
ideologies (derived in the model).
The overall homophily for all Subreddits combined was 0.25.
Frequency of Replies in Sampled Subreddits by Ideology of Replier and Original Poster
Displayed values are averaged over the different methods of predicting ideology.
Rows: original poster's ideology. Columns: replier's ideology.

Original Poster     AnarchoCapitalism  Conservatism  Liberalism  Libertarianism  Socialism   Total
AnarchoCapitalism               1,957           123       1,762             569        670   5,080
Conservatism                      133         1,248       3,081             478        123   5,062
Liberalism                      1,740         3,131      34,300           3,893      2,134  45,197
Libertarianism                    562           482       3,742           1,636        192   6,614
Socialism                         677           113       1,970             191      4,516   7,467
Total                           5,068         5,096      44,855           6,767      7,634  69,421

Figure 2: The frequencies of various response types across all Subreddits. Note that the
number of comments in the summary statistics differs from the total here, as some comments are
mediated by unclassified users due to various forms of censoring on the website (e.g., banning,
comment removal).
Figure 3: The homophilies of various Subreddits, ordered from highest to lowest.

Ratio of Actual Responses to Expected Responses on Sampled Subreddits
Cell value: actual replies / expected replies. Color is based on a log scale (0.1 to 10.0).
Rows: original poster's ideology. Columns: replier's ideology.

Original Poster     AnarchoCapitalism  Conservatism  Liberalism  Libertarianism  Socialism
AnarchoCapitalism                 5.3           0.3         0.5             1.1        1.2
Conservatism                      0.4           3.4         0.9             1.0        0.2
Liberalism                        0.5           0.9         1.2             0.9        0.4
Libertarianism                    1.2           1.0         0.9             2.5        0.3
Socialism                         1.2           0.2         0.4             0.3        5.5

Figure 4: The ratios of actual responses to expected responses for the entire network, by
ideologies of replier and original poster.
Percentage of Positively-Scored Responses by Ideology of Responder and Original Poster
Rows: original poster's ideology. Columns: replier's ideology.

Original Poster     AnarchoCapitalism  Conservatism  Liberalism  Libertarianism  Socialism
AnarchoCapitalism               89.5%         80.9%       86.3%           87.9%      84.5%
Conservatism                    83.7%         82.4%       80.9%           84.9%      78.2%
Liberalism                      83.5%         76.6%       82.5%           80.3%      84.1%
Libertarianism                  87.1%         84.0%       80.3%           83.7%      80.4%
Socialism                       82.8%         78.6%       86.3%           79.8%      87.3%

Figure 5: The percent of responses scored positively by ideologies of the replier and original
poster. Lower values mean that comment replies receive negative feedback from the
community they were posted in, and likely from the original poster as well.
Some specific results for Subreddits that will be used as examples in the conclusion:
Frequency of Replies in /r/Anarcho_Capitalism by Ideology of Replier and Original Poster
Rows: original poster's ideology. Columns: replier's ideology. Cells marked "-" are blank in the figure.

Original Poster     AnarchoCapitalism  Conservatism  Liberalism  Libertarianism  Socialism   Total
AnarchoCapitalism               1,766             -         324             221        306   2,623
Conservatism                       18             -           -               -          -      24
Liberalism                        323             -          69              48         55     496
Libertarianism                    212             -          46              32         37     328
Socialism                         320             -          56              38         69     485
Total                           2,639             -         499             340        468   3,955

Frequency of Replies in /r/Libertarian by Ideology of Replier and Original Poster
Rows: original poster's ideology. Columns: replier's ideology. Cells marked "-" are blank in the figure.

Original Poster     AnarchoCapitalism  Conservatism  Liberalism  Libertarianism  Socialism   Total
AnarchoCapitalism                  60            22         235             221         13     552
Conservatism                       23            12         108              92          -     243
Liberalism                        238            92         912             799         59   2,100
Libertarianism                    228            91         745             814         39   1,917
Socialism                          12             -          49              40          -     110
Total                             561           222       2,050           1,965        122   4,922

Figure 6: Frequencies of /r/Anarcho_Capitalism and /r/Libertarian comment replies by
ideology of replier and original poster.
Percentage of Comment Reply Scores that are Positive for /r/politics
Blank cells have too few entries to display.
Rows: original poster's ideology. Columns: replier's ideology.

Original Poster     AnarchoCapitalism  Conservatism  Liberalism  Libertarianism  Socialism
AnarchoCapitalism                 86%           66%         81%             87%        94%
Conservatism                      75%           78%         78%             76%        84%
Liberalism                        75%           64%         78%             72%        84%
Libertarianism                    83%           76%         81%             82%        91%
Socialism                         82%           40%         70%             64%        88%

Figure 7: Percentage of comments with positive scores by ideologies of replier and original
poster in /r/politics.
Conclusions
Classification
The semi-supervised classification algorithm was fairly effective at classifying liberals correctly, but across the board, the error rate was somewhat high, even for a classification
problem with 5 different groups. There is not too much to say about the algorithm without
using a competing algorithm and comparing results. The primary purpose of classification
is to have the labels as opposed to testing the algorithm.
Homophily
Unsurprisingly, the majority of interactions on the sampled Subreddits are facilitated by
liberals. This effect can be seen in the fact that a majority of Subreddits have high
homophilies. Two notable exceptions are /r/Anarcho_Capitalism and /r/Libertarian. The frequency
tables for these specific Subreddits (Figure 6) detail what is happening. In
the case of /r/Anarcho_Capitalism, anarcho-capitalists tend to respond to many people of
differing ideologies, which lowers the homophily significantly. In the case of /r/Libertarian,
the majority of users (not shown in the figures) are libertarians, but most of the
interactions are facilitated by liberals. This means that there are many liberals responding
to libertarians and vice-versa, which makes /r/Libertarian a heterophilic Subreddit.
While the sampling procedures introduce bias into statistics generalized to the whole
website, the users of each ideology demonstrated a level of preferential self-attachment,
as is seen in Figure 4. What is most notable, though, is the tendency for anarcho-capitalists
and socialists to communicate much less frequently with those of other ideologies. There
is an obvious exception, where anarcho-capitalists are slightly more likely to respond to
libertarians and socialists and vice-versa, but with the majority of the website consisting of
liberals, clustering is obvious.
Finally, looking at the tendency for comments to have a positive score, it is apparent that
there is some hostility toward conservatives from liberals, but this effect is averaged over the
entire sample. If you take a look at the main politics Subreddit, /r/politics, in Figure 7, the
story becomes much clearer. It is evident there that conservatives are not very welcome among
socialists, liberals, or anarcho-capitalists in /r/politics. Looking at these results by Subreddit
reduces biases caused by sampling and allows for a much more precise understanding of how
interactions take place in a community.
As for the practical use of these results, individuals can use them to decide which
sections of the website are more hospitable to discussion given their ideology, and content
makers (e.g., blog writers) can use the general census results of the Subreddits to
strategically post content targeting the appropriate audience.
Limitations and Future Work

There are many limitations of this analysis, the largest one being the error introduced by the
learning algorithm. Oftentimes people post little information about their beliefs, so it is very
difficult to classify them. Additionally, even when a user's ideology is obvious to a human,
machine learning algorithms become very complicated when a high level of accuracy is desired.
One possible solution in future experiments is to shift the model to use natural language
processing to establish variables, rather than rely purely on n-grams, which are much noisier
in the context of comments on social media. Additionally, a model with hyperparameters for
each class, such as those implemented in CATHYHIN in Wang et al., would allow for more
flexibility in discriminating between ideologies. Also, akin to the topical hierarchies that are
created in CATHYHIN, a more advanced model could find subclusters within ideologies so
that top-level groups do not have to be as homogeneous with respect to their parameters.
There is also the question of how dissimilar users are in their beliefs, regardless of
ideology. If two users have one political viewpoint in common and discuss it with each other,
should that be considered heterophilic? Coincidentally, that type of process is described by
Damon Centola in "Homophily, Cultural Drift, and the Co-Evolution of Cultural Groups"
as one that can actually bridge groups together. However, trying to extend this model to a
network evolution model would be very difficult given the Reddit API limitations and the
vast amounts of data that would need to be collected.
Lastly, while the sample size was sufficient for the analysis that was done, a larger sample
size would enable more complex algorithms to be used, since the number of parameters is very
high for such models. One major difficulty of this is the time-consuming task of manually
labeling users, but perhaps the nature of semi-supervised learning would not necessitate an
impractically large training size.
References
1. Bisgin, Halil; Agarwal, Nitin; and Xu, Xiaowei. Investigating Homophily in Online
Social Networks.
http://www.researchgate.net/publication/221158453_Investigating_Homophily_in_Online_Social_Networks/links/0fcfd5089ae5348b1c000000
2. Centola, Damon et al. Homophily, Cultural Drift, and the Co-Evolution of Cultural
Groups. http://nsr.asc.upenn.edu/files/Centola-et-al-2007-JCR.pdf
3. Wickham, Hadley. ggplot2: Elegant Graphics for Data Analysis. Springer New York, 2009.
4. Wickham, Hadley (2014). scales: Scale functions for graphics. R package version 0.2.4.
http://CRAN.R-project.org/package=scales
5. Nigam, Kamal; McCallum, Andrew; and Mitchell, Tom M. Semi-Supervised Text
Classification Using EM. http://people.cs.umass.edu/~mccallum/papers/semisup-em.pdf
6. R Core Team (2014). R: A language and environment for statistical computing. R
Foundation for Statistical Computing, Vienna, Austria. http://www.R-project.org/
7. Wang et al. Constructing Topical Hierarchies in Heterogeneous Information Networks.
http://web.engr.illinois.edu/~hanj/pdf/icdm13_cwang.pdf
Appendix
Figure 8: Visualization of comment replies in sample. Gold = anarcho-capitalist, Red =
conservative, Blue = liberal, Green = libertarian, Pink = socialist, Black = unlabeled.
Layout created using the Yifan Hu algorithm in Gephi.
