
SocialVec: Social Entity Embeddings

2021, ArXiv


Nir Lotan, Einat Minkov
University of Haifa
arXiv:2111.03514v1 [cs.SI] 5 Nov 2021

Abstract

This paper introduces SocialVec, a general framework for eliciting social world knowledge from social networks, and applies this framework to Twitter. SocialVec learns low-dimensional embeddings of popular accounts, which represent entities of general interest, based on their co-occurrence patterns within the accounts followed by individual users, thus modeling entity similarity in socio-demographic terms. Similar to word embeddings, which facilitate tasks that involve text processing, we expect social entity embeddings to benefit tasks of social flavor. We have learned social embeddings for roughly 200,000 popular accounts from a sample of the Twitter network that includes more than 1.3 million users and the accounts that they follow, and evaluate the resulting embeddings on two different tasks. The first task involves the automatic inference of personal traits of users from their social media profiles. In another study, we exploit SocialVec embeddings for gauging the political bias of news sources in Twitter. In both cases, we show SocialVec embeddings to be advantageous compared with existing entity embedding schemes. We will make the SocialVec entity embeddings publicly available to support further exploration of social world knowledge as reflected in Twitter.

Introduction

World knowledge about entities and the relationships between them is vital for information processing and communication by humans and machines alike. Much effort has been invested over the last decades in constructing factual knowledge bases that describe entities and the relationships between them in a structured relational form (e.g., Mitchell et al. 2018).
Following the advances in deep learning, researchers have proposed schemes for learning entity and relationship embeddings, based on the information available in structured knowledge bases (Lerer et al. 2019), as well as on the textual contexts that surround the entity mentions (Yamada et al. 2020).

However, much of the knowledge that is needed for intelligent information processing and communication extends beyond relational facts, and is in fact of social nature. Consider for example the social aspect of political polarity. Relevant knowledge about the stance of individuals who disseminate information, or with whom one engages in a conversation, is necessary for effective information processing and communication. Likewise, the political leaning of news sources should be taken into account for critical processing of the content disseminated by them (An et al. 2012).

In this work, we introduce SocialVec, a general framework for eliciting world knowledge from social networks. SocialVec learns entity representations based on the social contexts in which they occur within the social network. We apply this framework to Twitter,[1] a popular and public social networking service that is considered a credible source of social information (e.g., O’Connor et al. 2010). We rely on the fact that public entities, including politicians, artists, national and local businesses, and so forth, maintain an active presence in social networks in general, and Twitter in particular (Marwick and Boyd 2011). We further exploit the fact that Twitter users typically follow other accounts of interest, to consume the content posted by those accounts. We focus our attention on the most popular accounts within a large sample of the social network, assuming that these accounts represent entities of general interest.
Importantly, we consider entities that are co-followed by individual users to be contextually related in that they reflect the interests, opinions, and socio-demographics of each user; for example, users would typically follow entities of similar political orientation to their own (Eady et al. 2019).

[1] https://twitter.com/

The proposed SocialVec framework employs neural computing to process this information into low-dimensional entity embedding representations. Similar to Word2Vec (Mikolov et al. 2013a), which learns low-dimensional word representations from the words with which they co-occur in large text corpora, SocialVec learns entity representations based on other entities that users tend to co-follow, as observed in a large sample of the Twitter network. While word embeddings facilitate tasks that involve text processing, we expect social entity embeddings to benefit information processing tasks of social flavor, such as the exploration of entity similarity in the social space, providing useful representations for downstream applications, and potentially supporting researchers and practitioners in deriving various social insights.

In this work, we apply SocialVec to a large portion of the Twitter network, which includes more than 1.3 million users sampled uniformly at random, and the accounts that they follow. We then exploit and evaluate the entity embeddings produced by SocialVec in two empirical case studies. The first study applies to the task of automatically inferring the personal traits of users from their social media profiles. We show that modeling users in terms of the popular accounts followed by them yields state-of-the-art or competitive performance in predicting various personal traits, such as gender, race, and political leaning. In another case study, we exploit SocialVec embeddings for gauging the political bias of news sources in Twitter. An evaluation against the results of polls conducted by Pew Research (Pew Research Center 2014; Jurkowitz et al.
2020) shows high accuracy of our approach. In both studies, we show a clear advantage of SocialVec over existing entity embeddings, which rely on structured and textual information sources (Lerer et al. 2019; Yamada et al. 2020).

In summary, this paper presents several main contributions: (1) we outline SocialVec, a new framework for learning social vector representations of popular entities from social media; (2) we apply SocialVec to, and achieve high-performance results on, the tasks of personal trait prediction and the identification of political bias of news sources; and (3) we make the SocialVec entity embeddings publicly available, which we believe has the potential to make a significant impact in exploring social world knowledge as reflected in Twitter.

Related work

The success of Word2Vec (Mikolov et al. 2013a) has inspired many related works, both within and beyond the textual domain. In the networks domain, the models of DeepWalk (Perozzi, Al-Rfou, and Skiena 2014) and Node2Vec (Grover and Leskovec 2016) learn node embeddings by sampling node sequences via random walks in the graph, and predicting selected nodes based on the representations of the neighboring nodes in the sequence. Here, we do not aim to learn embeddings for all nodes (users) in the very large Twitter graph. Rather, we aim to learn world knowledge in the form of entity embeddings, where we consider a bipartite graph that includes sampled users and the popular accounts that they follow within the Twitter network. In practice, we model two-hop node sequences, moving from a node that denotes a popular account to a user node who follows that account, and then to another popular account followed by the same user. In this manner, we avoid sampling paths from the very large graph of the whole of Twitter, efficiently modeling relevant information within a sub-graph of Twitter.
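The two-hop sequences described above can be illustrated with a minimal sketch. It assumes the follow relation is available as a plain dictionary from user to followed accounts; the function names, the toy accounts, and the `min_followers` threshold are hypothetical and chosen only for illustration, not taken from the authors' code.

```python
from collections import Counter
from itertools import permutations


def popular_accounts(follows, min_followers):
    """Return the accounts followed by at least `min_followers` sampled users."""
    counts = Counter(a for accounts in follows.values() for a in set(accounts))
    return {account for account, n in counts.items() if n >= min_followers}


def two_hop_pairs(follows, popular):
    """Yield (account, co-followed account) pairs: account -> user -> account."""
    for accounts in follows.values():
        context = sorted(set(accounts) & popular)
        # Each ordered pair of distinct popular accounts followed by the same
        # user is one two-hop path through that user's node in the bipartite graph.
        yield from permutations(context, 2)


# Toy follow lists (hypothetical users and accounts).
follows = {
    "user_1": ["nytimes", "cnn", "local_cafe"],
    "user_2": ["nytimes", "cnn", "espn"],
    "user_3": ["espn", "nytimes"],
}
popular = popular_accounts(follows, min_followers=2)
pairs = list(two_hop_pairs(follows, popular))
print(sorted(popular))
print(len(pairs))
```

Only the sampled users' follow lists are touched, so no paths need to be sampled from the full Twitter graph; the user node itself is discarded once the pair is emitted.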
Our approach is closely related to Item2Vec (Barkan and Koenigstein 2016), a model that learns item embeddings from a bipartite graph of user-item rating history for recommendation purposes. They too compute the embeddings of items given the ratings of individual users, considering other items known to be liked by the same users as relevant contexts. They found that the Item2Vec embeddings outperformed SVD in recommendation, especially when the rating matrix was sparse. In our work, we experiment with both models of Word2Vec, namely CBOW and skip-gram, whereas they only explore the latter. We also model full contextual information as opposed to sampling by shuffling, as detailed below.

Entity embeddings

Using SocialVec, we elicit social world knowledge from social media, learning the representations of popular accounts, which likely correspond to entities of general interest. Accordingly, we compare our learned embeddings with existing entity encoding schemes, all of which rely on factual sources. Below, we describe and motivate the comparison of SocialVec embeddings with the entity representations of Wikipedia2Vec and Wikidata graph embeddings.

Wikipedia2Vec

Wikipedia is considered to be a high-quality semi-structured resource for learning entity representations.[2] In this space, entities correspond to concepts represented by dedicated Wikipedia articles. A useful feature of Wikipedia is the availability of human-curated annotations of entity mentions within its articles, and the mapping of the entity mentions onto their unique identifiers via hyperlinks, pointing to relevant textual contexts of the entity mentions. The Wikipedia2Vec model learns the embeddings of both words and entities from Wikipedia, with the aim of placing semantically similar words and entities close to each other in a joint vector space (Yamada et al. 2016, 2020).
Concretely, this model learns word representations using Word2Vec, predicting the neighboring words of a given word in all Wikipedia pages. In addition, for each hyperlink in Wikipedia, Wikipedia2Vec aims to predict the words that surround it given the referenced entity, thus modeling word-entity relationships, and further predicts the neighboring entities of each entity in Wikipedia’s link graph, i.e., the entities with which it is connected over some hyperlink, thus modeling direct inter-entity similarity. The resulting word and entity embeddings have been applied to various downstream tasks of natural language and knowledge processing, including entity linking and knowledge base completion, and have been shown to outperform multiple baselines (Yamada et al. 2020).

Wikidata graph embeddings

Wikidata is a popular large collaborative knowledge base developed and operated by the Wikimedia Foundation (Tanon et al. 2016), hence it is aligned with Wikipedia.[3] Wikidata represents entities of various types as nodes, and relational facts as typed edges between entity pairs. Additional relation types encoded in Wikidata include taxonomic hierarchies (’is a’ relationships), and mappings to entity properties. Recently, Lerer et al. (2019) have introduced the scalable PyTorch-BigGraph (PBG) framework, designed to efficiently learn graph-based node embeddings in very large graphs. We consider the entity embeddings inferred using PBG and the TransE (Bordes et al. 2013) graph embedding method from the whole of Wikidata. As reported by the authors, their application of TransE to Wikidata yielded higher-quality embeddings compared with the DeepWalk algorithm, as evaluated on the task of link prediction.

[2] https://www.wikipedia.org/
[3] https://www.wikidata.org/

Relying on curated knowledge sources like Wikipedia or Wikidata has limitations, however. First, these resources are inherently incomplete, where some popular entities do not have Wikipedia pages (Hoffart et al. 2012).
Similarly, the modeling of entity relationships and properties by these resources is partial (Mitchell et al. 2018). Here, we further argue and show that these methods lack the representation of social world knowledge about entities.

Finally, we note that while several researchers have previously learned user embeddings from social network structure, these efforts were typically ad hoc, being applied to specific tasks and datasets of limited size. Some related works consider the content associated with users for learning user embeddings (e.g., Benton, Arora, and Dredze 2016). To the best of our knowledge, this is the first work that outlines and evaluates an approach for learning entity embeddings from social media at scale with the aim of capturing social world knowledge.

Learning social embeddings

Users on social networks typically associate themselves with other accounts of interest, consuming the content posted by those accounts. These directed social links correspond to a graph structure, where vertices denote user accounts and edges represent follower-to-followee relationships. Naturally, a small fraction of the accounts have a high number of followers, whereas there exists a long tail of accounts that are followed by small social circles. We focus our attention on those accounts that are most popular, assuming that they represent public entities of general interest. Our goal is to learn low-dimensional entity representations that characterise the social contexts in which the entities appear on social media.

Concretely, we consider a large sample of users $U$ mapped to the set of Twitter accounts that each one of them follows, $u_i \rightarrow \{a_{ij}\}$, $u_i \in U$, $a_{ij} \in A$, where $A$ denotes the union of all the accounts that are followed by the users in our dataset. We define the set of entities of interest as the subset of the most popular accounts, $E \subset A$, for which there exist at least $k$ users who follow them in our dataset. In modeling the social context of each entity $e \in E$, we focus our attention on the sets of other entities that are co-followed by the individual users in our sample, i.e., considering the sets $\{\{a_{ij}\} : e \in \{a_{ij}\},\ a_{ij} \in E\}$. Overall, the set of entities followed by an individual user is expected to be small, representing their personal interests and tastes. We discard information about the identity of the users, merely treating them as coherent samples of social contexts. Similarly to text sequences, which contain words that are related to each other grammatically and topically, we expect the sets of entities followed by the individual users to form meaningful units of local social contexts.

Learning

We follow closely the Word2Vec approach that learns word embeddings from context word information (Mikolov et al. 2013a), adapting it to learn social contexts of entities. Given unlabeled text corpora comprised of a sequence of words $(w_i)_{i=1}^{K}$, the skip-gram network variant of Word2Vec is trained to predict the neighboring words that surround each word $w_i$ in turn within a window of a fixed size $c$. The loss function of skip-gram is defined as:

$$L = -\sum_{i=1}^{K} \sum_{-c \le j \le c,\ j \ne 0} \log P(w_{i+j} \mid w_i) \tag{1}$$

and the conditional probability $P(w_j \mid w_i)$ of a context word $w_j$ given the target word $w_i$ is defined using the following softmax function:

$$P(w_j \mid w_i) = \frac{\exp(u_i^T v_j)}{\sum_{k \in W} \exp(u_i^T v_k)} \tag{2}$$

where $u_i \in \mathbb{R}^d$ and $v_i \in \mathbb{R}^d$ are latent vectors that denote the target and context word representations in the vocabulary $w_i \in W$, respectively. Equation 2 is overly costly however, as it applies to the whole vocabulary $W$. To alleviate this cost, negative sampling replaces the softmax function with:

$$P(w_j \mid w_i) = \sigma(u_i^T v_j) \prod_{k=1}^{N} \sigma(-u_i^T v_k) \tag{3}$$

where $\sigma(x) = 1/(1 + \exp(-x))$, and $N$ is a parameter.
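As a numeric illustration of Equations 2 and 3, the following minimal sketch contrasts the full softmax, which normalises over every context vector in the vocabulary, with the negative-sampling score, which only touches the observed pair and N sampled negatives. The toy vectors and function names are assumptions made for illustration; this is not the training code used in the paper.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax_prob(u_i, v_j, context_vectors):
    """Eq. 2: exp(u_i . v_j) normalised over all context vectors in the vocabulary."""
    denom = sum(math.exp(dot(u_i, v_k)) for v_k in context_vectors)
    return math.exp(dot(u_i, v_j)) / denom


def negative_sampling_score(u_i, v_j, negative_vectors):
    """Eq. 3: sigma(u_i . v_j) times the product of sigma(-u_i . v_k) over N negatives."""
    score = sigmoid(dot(u_i, v_j))
    for v_k in negative_vectors:
        score *= sigmoid(-dot(u_i, v_k))
    return score


# Toy 2-dimensional target vector and a 3-item "vocabulary" of context vectors.
target = [0.5, -0.2]
contexts = [[0.4, 0.1], [-0.3, 0.8], [0.0, -0.5]]

probs = [softmax_prob(target, v, contexts) for v in contexts]
ns_score = negative_sampling_score(target, contexts[0], contexts[1:])
print(probs, ns_score)
```

The softmax cost grows with the vocabulary size (roughly 200K entities here), whereas the negative-sampling score costs N + 1 inner products per pair.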
Thus, the objective is to distinguish observed word pairs from $N$ negative examples drawn from a noise distribution, typically the unigram distribution raised to the 3/4 power (Mikolov et al. 2013b). Training is performed by minimizing the loss function using stochastic gradient descent.

While Word2Vec captures the linguistic context of words, learning to predict the adjacent words in a sequence, we wish to model socio-demographic contexts of entities, predicting other popular accounts that are co-followed by individual users. Adapting the formulation of the skip-gram model, we replace the words $w_i, w_j$ with account pairs $e_i, e_j$, modifying Equation 1 as follows:

$$L = -\sum_{u_i \in U} \ \sum_{e_i, e_j \in \{e_{u_i}\},\ e_i \ne e_j} \log P(e_j \mid e_i) \tag{4}$$

where $\{e_{u_i}\}$ denotes the set of entities followed by user $u_i$.

Our formulation is similar to the model of Item2Vec (Barkan and Koenigstein 2016), which learns item embeddings from user-item rating data for recommendation purposes. In both cases, the context corresponds to a set as opposed to a sequence. Barkan and Koenigstein shuffle the context items multiple times in order to obtain diverse context representations within a local context size $c$ (Eq. 1). We apply non-stochastic modeling of the entity neighborhoods (Eq. 4) by setting the window size to be large enough to include all of the co-followed entities for every focus entity $e_i$.

In addition to the skip-gram neural model, we experiment with the CBOW network variant of Word2Vec. Using the CBOW formulation, the focus word $w_i$ is predicted given all of the neighboring words within the context window. Similarly, we learn entity embeddings by learning to predict each focus entity $e_i$ in turn from the representations of all of the other entities co-followed by the user, $e_j \in \{e_{u_i}\}$, $e_j \ne e_i$.

Figure 1: Account popularity based on our sample of 1.3 million Twitter users: a small number of highly popular accounts are followed by more than 25K users, whereas a long tail of accounts are followed by up to 100 users. We learn the embeddings of the most followed ∼200K accounts.

The Social Corpus

It is desirable to train the neural embeddings using data that is high-quality, representative, and abundant. We obtained a large number of unique Twitter user identifiers, sampled uniformly at random from a pool of users in the U.S. who posted tweets in the English language,[4] and retrieved the full list of accounts followed by each user using the Twitter API. Overall, we collected information about the accounts followed by a total of 1.3 (1.265) million distinct users. Our data, collected at the beginning of 2020, includes 1,236 million relationships, mapping the users to 90.4 million unique accounts that are followed by them.

Figure 1 shows the distribution of accounts by their number of followers in our data. As shown, a small number of accounts (1.4K) are followed by more than 25K users, i.e., by more than 1.9% of the users in our dataset. A higher number of accounts (∼4,000) are followed by 10-25K users, i.e., by 0.8-1.9% of the users. Next, there are many more accounts (89K) that are followed by 1-10K users, and a long tail of accounts that are followed by fewer than 1K users.

Learning embeddings for all of the users in our dataset is infeasible, both computationally and statistically, since sufficient context information is required for learning meaningful representations. We therefore set a threshold based on account popularity, considering the accounts that are followed by at least k = 350 users in our dataset. Roughly 200K accounts (201,247) meet this condition, and they comprise our vocabulary of entities E.

Naturally, we expect there to be differences in the scope of entities represented by a knowledge base like Wikipedia and by Twitter.
We aligned these resources, exploiting the fact that Wikidata links entity entries with Wikipedia as well as with their Twitter account information.[5] We found that 31.5K out of the 200K accounts represented in SocialVec (15.8%) have a Wikipedia page or Wikidata entry associated with them. Naturally, many of those entities that are represented in Wikipedia are widely known and hence popular in Twitter. Overall, the users in our sample follow 978 other accounts on average, among which 228 are considered popular and represented in SocialVec. About half of the accounts that a user follows and that are represented by SocialVec also map to Wikipedia. Thus, there is a substantial overlap between the SocialVec vocabulary of entities and Wikipedia. Nevertheless, as one may expect, factual knowledge bases like Wikipedia are limited in their coverage of the world knowledge that is represented in social networks.

[4] We sampled the user identifiers from a corpus of 600 million tweets authored by over 10 million users in 2015, which was acquired from Twitter for research purposes.

Experimental setup

We learned entity embeddings using the skip-gram and CBOW variants of Word2Vec as implemented in gensim (Rehurek and Sojka 2010). In order to capture the full context of co-followed accounts by every user, we set the window size to c = 1000 (Eq. 4). We discarded the records of users who follow more than 1000 accounts, assuming that these users are non-selective, and may not represent coherent contexts for our purposes. We applied common parameter choices in training the models, setting the initial learning rate to 0.03, and having it decrease gradually to a minimum value of 7×10−5. We set the number of negative examples to N = 20, downsampling popular accounts by a factor of 1e-5. We tuned these parameters based on cross-validation results using the training portion of our personal trait prediction dataset.
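The reported settings can be collected into a small configuration sketch. The keyword names below follow gensim's `Word2Vec` constructor as of gensim 4.x, which is an assumption about the library version; the authors' actual script is not shown in the paper. The embedding size of 100 is the value the paper reports for the final embeddings, and `min_count=1` assumes that the entity vocabulary has already been filtered by popularity. The helper names are hypothetical, and gensim is imported lazily so that the configuration part runs without it.

```python
def word2vec_kwargs(skip_gram):
    """Map the reported hyperparameters to gensim 4.x keyword names (assumed)."""
    return {
        "vector_size": 100,   # embedding dimensions
        "window": 1000,       # c = 1000, wide enough to cover every co-followed entity
        "sg": 1 if skip_gram else 0,  # 1 = skip-gram, 0 = CBOW
        "negative": 20,       # N negative examples
        "sample": 1e-5,       # downsampling of very popular accounts
        "alpha": 0.03,        # initial learning rate
        "min_alpha": 7e-5,    # final learning rate
        "min_count": 1,       # assumption: vocabulary filtered beforehand
    }


def keep_selective_users(follows, max_follows=1000):
    """Drop users who follow more than `max_follows` accounts."""
    return {u: accts for u, accts in follows.items() if len(accts) <= max_follows}


def train(contexts, skip_gram=True):
    """Train on lists of co-followed entity ids, one list per user (needs gensim)."""
    from gensim.models import Word2Vec  # third-party dependency, imported lazily
    return Word2Vec(sentences=contexts, **word2vec_kwargs(skip_gram))


# Toy check of the user filter with hypothetical follow lists.
toy = {"selective": ["a", "b"], "non_selective": ["x"] * 1001}
kept = keep_selective_users(toy)
print(sorted(kept), word2vec_kwargs(skip_gram=False)["sg"])
```

Each "sentence" passed to `train` would be the list of popular accounts followed by one user, so that co-followed entities play the role of neighboring words.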
We further experimented with learning embeddings of varying sizes, and set the embedding size to 100 dimensions, similar to Wikipedia2Vec entity embeddings, as this choice yielded good results in our cross-validation tuning experiments. The training of the models was conducted using an Intel Core i9-7920X CPU @ 2.90GHz computer with 24 CPUs, 128GB RAM and an NVIDIA GV100 GPU. Training the CBOW and skip-gram models lasted about five days and two weeks, respectively.

Evaluation

We evaluate SocialVec embeddings on two case studies, in which we use these embeddings as features in learning, as well as employ vector semantics to assess semantic similarity between entities. In our experiments, we compare SocialVec embeddings with the Wikipedia2Vec entity representations learned from Wikipedia, and the graph-based entity embeddings learned from Wikidata; both methods have been shown to yield state-of-the-art results on downstream tasks, as discussed above. Concretely, we experiment with a version of Wikipedia2Vec embeddings which we trained on a dump of Wikipedia in English from October 2020.[6]

[5] See Wikidata property P2002, and the Twitter user numeric id property P6552. We used Wikidata’s SPARQL query service, available at https://pypi.org/project/qwikidata/, to retrieve the Wikipedia identifier of each Twitter account represented in SocialVec.
[6] We used the code provided by Yamada et al. (2020).

Table 1: Personal trait prediction: dataset statistics

Attribute   Class distribution             Profiles
Age         ≤ 25 y.o. (56%), > 25 y.o.     3,485
Children    No (82%), Yes                  3,485
Education   High School (67%), Degree      3,485
Ethnicity   Caucasian (57%), Afr. Amer.    2,905
Gender      Female (56%), Male             3,475
Income      ≤ 35K (64%), > 35K             3,485
Political   Democrat (76%), Republican     1,790

The Wikidata graph is very large, containing 78M entities and ∼4K relation types, posing a computational challenge.
We use pre-trained entity embeddings learned using the TransE method from the official dump of Wikidata as of 2019-03-06 with the scalable PyTorch-BigGraph framework (Lerer et al. 2019).[7] The Wikipedia2Vec and graph embeddings are of 100 and 200 dimensions, respectively.

Personal Trait Prediction

Researchers have long been exploring methods for automatically inferring the socio-demographic attributes of users from their digital footprints, mainly based on the content posted or consumed by them (Youyou, Kosinski, and Stillwell 2015; Volkova and Bachrach 2016). Such personal information about users is beneficial for applications like personal recommendation (Pritsker, Kuflik, and Minkov 2017), as well as for social analytics (Mueller et al. 2021). Here, we employ the SocialVec embeddings of popular accounts that users follow on Twitter as features in a supervised classification framework, targeting the prediction of various socio-demographic traits for each user.

Dataset

We use a labeled dataset due to Volkova et al. (2015). This dataset includes the identifiers of sampled Twitter users, labeled by means of crowdsourcing with respect to the personal attributes of age, gender, ethnicity, family status, education, income level, and political orientation. The labels were determined based on public information on Twitter, including the user’s self-authored account description, and any metadata and historical tweets available for the user. Thus, the labels reflect subjective and proximate judgements. All of the labels are binary, where continuous attributes, namely age and income, were manually split into distinct binary ranges, as detailed in Table 1.

In order to obtain information about the accounts that the users in the dataset follow, we queried the Twitter API with the relevant user account identifiers. Overall, we tracked 3,558 active user profiles in Twitter (out of 5,000 users in the source dataset).
We further retrieved tweets posted by these users,[8] considering text as an alternative information source for attribute prediction. Overall, we collected up to 200 tweets posted by each user, similar to Volkova et al. (2015), obtaining 180 tweets per user, on average. While this data was collected some time after the user labels were assigned, we believe that as the labels are categorical, and were obtained based on coarse human judgement, the labeling accuracy is not severely compromised. Furthermore, we evaluate the various methods using the same data and under the same conditions, which forms a viable evaluation setup.

[7] https://github.com/facebookresearch/PyTorch-BigGraph
[8] Due to legal restrictions, public datasets specify user ids, but do not contain the content posted by them.

Methods

We perform supervised classification experiments, predicting the various attribute values for each user as independent binary classification tasks. For each target attribute, we randomly split the set of labeled examples into distinct train (80%) and test (20%) sets, maintaining similar class proportions across these sets. Once the models are trained, prediction performance is evaluated against the gold labels of the test examples.

In learning, we represent each user $u$ using the vector embeddings of the popular accounts that they follow, $\{e_u\}$. We follow the practice of averaging the bag-of-embedding vectors into a unified summary vector of the same dimension, $\bar{e}_u$ (Shen et al. 2018). We then feed the averaged vector representation into a logistic regression classification network, in which the output layer consists of a single sigmoid unit. While we also experimented with multi-layer network architectures, we found this single-layer classifier to work best. Below, we report our results applying this framework using SocialVec and the existing entity embedding schemes.
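The user representation and classifier described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: the 2-dimensional embeddings, account names, labels, learning rate, and number of epochs are all hypothetical, and the plain stochastic gradient descent loop stands in for whatever optimizer the authors used.

```python
import math


def average_embedding(followed, embeddings):
    """Average the embeddings of the followed accounts found in the vocabulary."""
    vectors = [embeddings[a] for a in followed if a in embeddings]
    if not vectors:
        return None  # the user follows no account that has an embedding
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict_proba(x, weights, bias):
    """Single sigmoid output unit over the averaged user vector."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train_logistic_regression(X, y, lr=0.5, epochs=200):
    """Fit weights by stochastic gradient descent on the log loss."""
    weights, bias = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for x, target in zip(X, y):
            grad = predict_proba(x, weights, bias) - target
            weights = [w - lr * grad * xi for w, xi in zip(weights, x)]
            bias -= lr * grad
    return weights, bias


# Toy 2-dimensional entity embeddings and labeled users (all hypothetical).
embeddings = {"acct_a": [1.0, 0.0], "acct_b": [0.8, 0.2],
              "acct_c": [0.0, 1.0], "acct_d": [0.2, 0.8]}
users = [(["acct_a", "acct_b"], 1), (["acct_c", "acct_d"], 0),
         (["acct_a", "unknown"], 1), (["acct_d"], 0)]

X = [average_embedding(followed, embeddings) for followed, _ in users]
y = [label for _, label in users]
weights, bias = train_logistic_regression(X, y)
print(predict_proba(average_embedding(["acct_b"], embeddings), weights, bias))
```

Accounts without an embedding are simply skipped in the average, which is why coverage of the followed accounts matters for how well a user is represented.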
As a reference, we also evaluate content-based attribute classification using the tweets posted by each user as relevant evidence. In this case, the tweets authored by each user, $t_u$, are first converted into a 300-dimension text embedding vector using the pre-trained convolutional FastText neural model (Joulin et al. 2017),[9] which is often used for tweet processing (Sosea and Caragea 2020). We average the FastText tweet embeddings (Adi et al. 2017), and feed the resulting user representation to the logistic regression network.

[9] https://github.com/facebookresearch/fastText

Results

The classification results for each of the target attributes are given in Table 2 in terms of the ROC AUC measure. Let us first examine our results using the SocialVec embeddings as features in classification, and compare these results with the Wikipedia2Vec and Wikidata entity embeddings. As detailed in the table, SocialVec outperforms the other embedding schemes by a large gap across all of the target attributes. For example, classification performance on age prediction is 0.74 in terms of ROC AUC using SocialVec, compared with 0.69 and 0.61 using the Wikidata graph embeddings and Wikipedia2Vec, respectively. Similar or larger gaps in favor of SocialVec are observed for each of the other traits.

Table 2: Personal trait prediction results [ROC AUC] (∗ the results by (Volkova and Bachrach 2016) were obtained on a larger version of our dataset, and are not directly comparable.)

Method                           Age    Children  Education  Ethnicity  Gender  Income  Political
SocialVec                        0.738  0.683     0.739      0.953      0.890   0.732   0.798
Wikipedia entities:
  Wikidata:TransE                0.686  0.614     0.690      0.864      0.803   0.682   0.694
  Wikipedia2Vec                  0.614  0.610     0.628      0.704      0.641   0.635   0.599
  SocialVec ∩ Wiki.              0.705  0.665     0.698      0.924      0.859   0.709   0.748
Content-based:
  FastText                       0.695  0.575     0.740      0.785      0.768   0.748   0.654
  ∗ (Volkova and Bachrach 2016)  0.63   0.72      0.77       0.93       0.90    0.73    -

Importantly, only a subset of the popular accounts that are represented by SocialVec have respective embeddings based on Wikipedia and Wikidata. Figure 2 shows relevant statistics about the number of accounts followed by the users in our dataset that are represented using each method. As illustrated, the number of relevant SocialVec embeddings that are available is substantially higher compared with the other methods. Consequently, the ratio of users that are poorly represented, being associated with fewer than 10 account embeddings, is 5% using SocialVec, vs. 22% and 32% using the Wikipedia- and Wikidata-based methods. These gaps in coverage illustrate the wider applicability of our approach in the social domain of Twitter.

Figure 2: Personal trait prediction: the number of popular account embeddings that are associated with each user in the dataset, and the proportions of users that have a limited number of embeddings (less than 5, 10, or 50) associated with them using each method.

Yet, in order to conduct a fair comparison between the different embedding schemes, we performed another experiment, in which we eliminated from the feature set any SocialVec embeddings of accounts that are not represented by either one of the other methods. As illustrated in Figure 2, this variant has a similar coverage profile to the Wikipedia-based methods. The results of using the restricted set of SocialVec embeddings are shown in Table 2 (’SocialVec ∩ Wiki.’). We observe that the classification performance using this strict evaluation is lower, as fewer features are used, e.g., ROC AUC drops from 0.74 to 0.71 on the age category; however, SocialVec still outperforms the alternative entity representations on each and every one of the target categories.
Thus, we conclude that our social entity embeddings inferred from Twitter are more informative compared with the respective entity embeddings inferred from relational knowledge bases for personal trait prediction.

Feature analysis. Table 3 demonstrates in more detail the valuable social information that SocialVec encodes for personal trait prediction. The table presents the top accounts associated with each class label in our dataset, based on the pointwise mutual information (PMI) measure (Rudinger, May, and Van Durme 2017). In general, high PMI values indicate a distinctive feature-class correlation. We observe, for example, that the top Twitter accounts followed by male (as opposed to female) users belong to men who specialize in sports, whereas the top accounts that characterize female users belong to women. Likewise, we find that the top accounts that characterise Afro-American users belong to Afro-Americans, and vice versa. We further observe that users with an academic degree distinctively follow media accounts such as the New Yorker and the Economist magazines, whereas non-academic users tend to follow rappers. Finally, distinctive accounts that represent political polarity include rappers on the Democratic side. On the Republican side, we find accounts that are related to Country music, echoing previous findings by which Country music fans are twice as likely to vote Republican as fans of other genres,[10] and the accounts of Tim Tebow, a former football player, and the fast-food brand Chick-fil-A, which are both known for their conservative views.[11] Thus, the accounts that one follows on social media are highly predictive of their personal traits and interests; this social information is encoded more effectively and with better coverage using SocialVec compared with the Wikipedia-based methods.
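The PMI ranking used in the feature analysis above can be sketched as follows. This is a minimal illustration assuming that PMI is computed over users, i.e., as the log-ratio between the joint probability of following an account and carrying a class label and the product of the two marginals; the natural logarithm, the function name, and the toy data are assumptions, as the paper does not spell out these details.

```python
import math
from collections import Counter


def pmi_scores(follows, labels):
    """PMI(account, label) = log( P(account, label) / (P(account) * P(label)) )."""
    n_users = len(labels)
    label_counts = Counter(labels.values())
    account_counts = Counter()
    joint_counts = Counter()
    for user, label in labels.items():
        for account in set(follows.get(user, [])):
            account_counts[account] += 1
            joint_counts[(account, label)] += 1
    scores = {}
    for (account, label), joint in joint_counts.items():
        p_joint = joint / n_users
        p_account = account_counts[account] / n_users
        p_label = label_counts[label] / n_users
        scores[(account, label)] = math.log(p_joint / (p_account * p_label))
    return scores


# Toy data: four labeled users and the accounts they follow (hypothetical).
follows = {"u1": ["x", "y"], "u2": ["x"], "u3": ["y"], "u4": ["y", "z"]}
labels = {"u1": "A", "u2": "A", "u3": "B", "u4": "B"}

scores = pmi_scores(follows, labels)
top_for_a = sorted((s, acct) for (acct, lab), s in scores.items() if lab == "A")[-1]
print(top_for_a)
```

A positive score means that the account is followed by users of that class more often than expected under independence; sorting the scores per class yields a ranking of the kind shown in Table 3.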
Comparison with other approaches

A question of interest is how the modeling of users based on the entities that they follow, as encoded by SocialVec, compares with the more traditional approach of representing the users in terms of the content authored by them. Table 2 also presents our trait prediction results based on the textual content authored by the users (’FastText’). The best results per trait are highlighted in boldface in the table. As shown, the SocialVec approach achieves top performance by a large margin on all categories, except for education and income, for which content-based classification achieves comparable or slightly higher results. (The difference in performance between FastText and SocialVec on the education category is not significant according to the McNemar χ2 statistical test as applied to accuracy results.) We find this sensible, as education and income levels are known to be manifested through writing style (Flekova, Preoţiuc-Pietro, and Ungar 2016).

10 https://news.gallup.com/poll/13942/Music-Cars-2004Election.aspx
11 See “Tebowing” at https://en.wikipedia.org/wiki/Tim_Tebow, and https://en.wikipedia.org/wiki/Chick-fil-A: Same-sex marriage controversy.

Male
  Ian Rapoport, Sports writer and analyst (1.04)
  Chris Broussard, Sports analyst, Fox Sports (1.02)
  Adam Schefter, Sports analyst (1.02)
Female
  Chelsea DeBoer, a reality TV persona (0.81)
  womenshumor, “tweets made for a woman” (0.80)
  Maci Bookout, a reality television personality (0.76)
White
  starwars, Star Wars on Twitter (0.80)
  John Krasinski, an actor, director and producer (0.78)
  Luke Bryan, a country music singer and songwriter (0.78)
Afro-American
  KYLESISTER (1.17)
  Emmanuel Hudson, actor (1.16)
  Erica Dixon, TV personality (1.15)
High-school
  21 Savage, a rapper, songwriter, and producer (0.45)
  AccessJui, Music Producer (0.44)
  Desi Banks, a comedian, actor, and writer (0.42)
Academic
  The New Yorker, an American magazine (1.30)
  The Economist, an international newspaper (1.20)
  Jake Tapper, anchor and host at CNN (1.19)
Republican
  Chick-fil-A, a large fast food restaurant chain (1.15)
  Carrie Underwood, a Country singer (1.14)
  Tim Tebow (1.13)
Democratic
  Bryson Tiller, a rapper (0.35)
  Kevin Gates, a rapper (0.35)
  Tami Roman, a TV personality and rapper (0.34)

Table 3: The top Twitter accounts that are characteristic of different subpopulations, as measured using our datasets labeled with personal attributes and the Pointwise Mutual Information (PMI) measure.

Finally, Table 2 includes the results previously reported by Volkova et al. (2015; 2016). They trained log-linear models using n-gram features extracted from the tweets posted by each user, showing gains over alternative models. We stress that their results are not directly comparable to ours, as the original dataset included many more labeled users, some of whom are no longer active (5K vs. 3.5K labeled users in our dataset). Moreover, they relied on tweets obtained shortly after user labeling. Nevertheless, we observe that our results applied to the reduced dataset using SocialVec are comparable to or exceed the results by Volkova and Bachrach (2016) on the majority of the categories. Another work that predicted gender and ethnicity from user names reported results that are lower than those reported by Volkova and Bachrach (Wood-Doughty et al. 2018).

In summary, our results indicate that many sociodemographic attributes can be predicted with high accuracy using SocialVec entity embeddings. We showed superior performance compared with entity embeddings inferred from knowledge bases, both due to better coverage and the encoding of social aspects in SocialVec. Moreover, our results outperform content-based classification for multiple attributes, even when trained using fewer labeled examples. To improve prediction results further, it is possible to model the network information of the users’ direct friends, exploiting social homophily (Pan et al. 2019).
Also, trait prediction results may potentially improve via the integration of network and content information, ideally using larger datasets.

Political polarity of news sources

As a second case study, we investigate whether the political orientation of news sources can be inferred from the social patterns encoded in SocialVec. A recent survey by Pew Research estimated that 62% of U.S. adults consume news primarily from social media sites (Mitchell 2016). A lack of awareness of the biases of these accounts can play a critical role in how news is assimilated and spread on social media, shaping people’s opinions and influencing their choices to the extreme of swaying the outcomes of political elections (Allcott and Gentzkow 2017). In addition, identifying the slant of news accounts on social media may help address political bubbles, where users are exposed primarily to ideologically congenial political information (Eady et al. 2019).

Various research works aimed to infer the political slant of media sources based on the language used by them, the framing of political issues by these sources (Baly et al. 2020), or the language used by their followers (Stefanov et al. 2020).12 Ribeiro et al. (2018) quantified the biases of news outlets by analyzing their readership directly, considering the proportions of liberal and conservative users within the source’s audience. Their work is limited to Facebook, as they relied on explicit account statistics that it provides to advertisers. In this work, we exploit SocialVec social entity embeddings for predicting the political leaning of news accounts on Twitter. As we show empirically, the resulting assessments are highly accurate, yielding similar results to formal polls, whereas factual entity embeddings lack the necessary social information that is encoded by SocialVec.

Methods

The Word2Vec metric tends to place two words close to each other if they occur in similar contexts (Levy and Goldberg 2014).
Likewise, we gauge the social similarity between entities based on the cosine similarity of their embeddings. Assuming that individual users follow the accounts of politicians, media sources, and other entities with similar political orientation to their own, we expect the distribution of accounts that are followed by right- and left-leaning populations to be distinguishable. That is, the embeddings of entities of similar political orientation should exhibit higher similarity in the vector space, compared with the embeddings of accounts of opposite political polarity.

12 Stefanov et al. (2020) referred to somewhat disputable judgements by the mediaBiasFactCheck website. We could not obtain the subset of accounts that they evaluated for comparison purposes.

We therefore compute the bias of news accounts on Twitter based on their similarity to popular accounts with distinct political polarity in the embedding space. Specifically, we consider the accounts of the Republican Donald Trump, the incumbent U.S. president at the time that our data was collected,13 and of Barack Obama, the former Democratic president.14 As of 2020, both accounts were among the top-followed Twitter accounts in the U.S., ranked at fourth and first positions, respectively, based on the number of their followers.15 Let us denote the SocialVec embedding of a specified news account as $e_n$, and the embeddings of the Democratic and Republican anchor accounts, which we set to the accounts of Obama and Trump, as $e_D$ and $e_R$, respectively. We measure the similarity of the news source in the embedding space with these Republican and Democratic anchors. We then assess the political orientation of the news account, considering the difference between those similarity scores.
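The scoring rule described above, namely the difference between a news account's cosine similarity to the Republican anchor and its cosine similarity to the Democratic anchor, can be sketched in Python as follows. The function names are hypothetical and the two-dimensional vectors are toy values chosen for illustration; they are not actual SocialVec embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def political_orientation(e_n, e_r, e_d):
    """Similarity to the Republican anchor minus similarity to the
    Democratic anchor; positive values indicate a conservative leaning."""
    return cosine_similarity(e_r, e_n) - cosine_similarity(e_d, e_n)


# Toy anchors (hypothetical values, not real embeddings).
e_republican = [1.0, 0.0]
e_democratic = [0.0, 1.0]

print(political_orientation([1.0, 0.0], e_republican, e_democratic))  # aligned with the Republican anchor
print(political_orientation([0.0, 2.0], e_republican, e_democratic))  # aligned with the Democratic anchor
print(political_orientation([1.0, 1.0], e_republican, e_democratic))  # equidistant from both anchors
```

Because cosine similarity ignores vector length, the score depends only on the direction of the news account's embedding relative to the two anchors, and it is bounded between -2 and 2.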
Formally, we compute the political orientation (PO) score of $e_n$ as the difference between the cosine similarities:

$PO(e_n) = \mathrm{Sim}(e_R, e_n) - \mathrm{Sim}(e_D, e_n)$   (5)

Accordingly, a positive score indicates an overall conservative (Republican) social orientation, whereas a negative score indicates a liberal (Democratic) social bias. The greater the gap between the similarity scores, the greater the social political polarity.

In our experiments, we rank selected news accounts according to their computed political orientation scores, and compare our results against formal polls. Again, we evaluate the embeddings of SocialVec, as well as the embeddings of Wikipedia2Vec and the graph-based Wikidata embeddings, gauging the extent to which each of these methods captures the social phenomenon of political leaning.

Ground-truth datasets

We refer to the results of two formal polls conducted by Pew Research in 2014 and 2019 (Jurkowitz et al. 2020), with the goal of gauging the political polarization in the American public. The participants in the polls were recruited using random sampling of residential addresses, and the data was weighted to match the U.S. adult population by gender, race, ethnicity, education and other categories. In both polls, Pew researchers classified the audience of selected popular news media outlets based on a ten-question survey covering a range of issues like homosexuality, immigration, economic policy, and the role of government. The media sources were then ranked according to those poll participants who said they got political and election news there in the week before, taking into consideration the party identification (Republican or Democrat) and ideology (conservative, moderate or liberal) of those participants.

13 https://twitter.com/realDonaldTrump; suspended in Jan. 2021.
14 https://twitter.com/BarackObama
15 https://www.socialbakers.com/statistics/twitter/profiles/unitedstates

Poll            2014         2020
# Accounts      31           30
SocialVec       0.82 (31)    0.85 (30)
Wikipedia2Vec   0.36 (28)    0.28 (28)
Wikidata        -0.40 (23)   -0.32 (23)

Table 4: Spearman’s correlation results of ranking news accounts by political slant using different entity embeddings, compared with the poll-based rankings reported by Pew Research in 2014 and 2020. The number of available news account embeddings is given in parentheses for each method.

Poll            2014                     2020
Accounts        All         Common      All         Common
SocialVec       0.94 (31)   0.95 (22)   0.97 (30)   0.95 (22)
Wikipedia2Vec   0.55 (28)   0.55 (22)   0.60 (28)   0.50 (22)
Wikidata        0.32 (23)   0.27 (22)   0.27 (23)   0.23 (22)

Table 5: Accuracy results: predicting political slant as binary polarity, for all accounts available per method (‘all’), or for the accounts represented by all methods (‘common’).

Overall, the polls conducted in 2014 and 2019 apply to 36 and 30 selected news media outlets, respectively. These two sets include 43 unique media outlets jointly, where there are 18 news sources that overlap between the two surveys. We manually mapped the various sources to their Twitter accounts, identifying the accounts for the majority of news sources included in the earlier poll (31 out of 36), and practically all of the news sources included in the more recent poll (30). As detailed in Table 4, all of those Twitter accounts are included in SocialVec, whereas only some of the news sources have respective Wikipedia2Vec and Wikidata embeddings.

Results

The Pew surveys assign a numerical political polarity score to each of the news sources; these scores do not match the range, or interpretation, of the computed cosine similarity scores. In order to assess the correlation between our entity similarity metric and the survey results, we therefore consider the relative ranking of the various news sources, ranging from conservative/Republican to liberal/Democrat.
Table 4 reports the similarity of the poll-based rankings with the rankings generated using Eq. 5 and the different entity embedding schemes in terms of Spearman’s rank correlation measure (Hill, Reichart, and Korhonen 2015). A perfect Spearman correlation of +1 indicates that the rankings are identical, whereas a correlation of -1 would mean that the rankings are perfectly inverse. As shown in the table, the rankings produced using SocialVec are highly similar to the ground-truth rankings, as measured by the high correlation scores of 0.82 and 0.85 per the two polls. In contrast, the rankings produced by Wikipedia2Vec are not well aligned with the ground-truth rankings, as indicated by the low Spearman correlation scores of 0.36 and 0.28. The rankings generated using the Wikidata graph embeddings are not meaningful at all, yielding negative correlation scores.

Figure 3: Ranking of political polarity based on our embeddings

Figure 3 illustrates the distribution of the political orientation scores of the news sources included in the poll of 2020, as computed using SocialVec. The accounts are placed on the range of Democratic (left) to Republican (right), and are spaced along this range relative to their scores. Similar to the poll results, we observe that some news sources lie close to each other on this scale of political bias.

Notably, gauging ranking similarity is highly sensitive, in that any differences in the ordering of accounts with similar scores are penalized. Table 5 reports the results of a more lenient evaluation, where we consider the proportion of evaluated news accounts for which the polarity is correctly estimated. Here, we assign the computed political orientation to be conservative/Republican if the PO score is positive, and liberal/Democratic otherwise (Eq. 5).
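The two evaluations described above, rank correlation against the poll-based ranking and binary polarity accuracy, can be sketched in plain Python. This is an illustrative sketch under stated assumptions rather than the evaluation code used in the paper: ties are given average ranks, Spearman's correlation is computed as the Pearson correlation of the ranks, the poll labels are encoded as the hypothetical strings "R" and "D", and all scores are made-up toy values.

```python
import math


def average_ranks(values):
    """Rank values in ascending order (1-based); ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def polarity_accuracy(po_scores, poll_labels):
    """Share of accounts whose PO sign matches the poll label ('R' or 'D')."""
    predicted = ["R" if score > 0 else "D" for score in po_scores]
    correct = sum(1 for p, g in zip(predicted, poll_labels) if p == g)
    return correct / len(poll_labels)


# Hypothetical toy values for four news accounts.
toy_po_scores = [0.30, -0.20, 0.10, -0.40]
toy_poll_scores = [2.0, -1.0, 1.5, -3.0]
toy_poll_labels = ["R", "D", "D", "D"]

print(spearman(toy_po_scores, toy_poll_scores))
print(polarity_accuracy(toy_po_scores, toy_poll_labels))
```

The toy data illustrate why the binary evaluation is more lenient: it only checks the sign of each score, whereas the rank correlation is also affected by swaps between accounts with nearly identical scores.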
As shown in the table, the binary political orientation predicted using SocialVec is accurate for 94% and 97% of the evaluated news accounts, where there exists a single mistake per the polls of 2014 and 2020, respectively. In contrast, Wikipedia2Vec embeddings yield low accuracy of 55% and 60% per those polls, and the Wikidata graph embeddings yield poor accuracies of 32% and 27%. To account for the differences in coverage, Table 5 also reports prediction accuracy for the subset of accounts which are represented by all methods (‘common’). As shown, the same trends persist.

In error analysis, we found that the faulty polarity predictions by SocialVec occurred for news accounts for which the number of contexts (followers) in our sample of Twitter was the lowest among all the accounts included in each poll. If accounts with fewer than 5,000 followers in our data are removed from the evaluation, then SocialVec achieves perfect polarity classification results for all of the Twitter news accounts that are included in both of the reference polls.

Overall, our results demonstrate that the political orientation of Twitter accounts can be accurately predicted based on the social contexts embedded in SocialVec. In contrast, the Wikipedia- and Wikidata-based entity embeddings fail to relate different entities by social aspects such as political affinity. We believe that the framework presented here for predicting political bias can be employed in future research for the assessment of various social biases using SocialVec.

Conclusion

We presented SocialVec, a framework for learning social entity embeddings from social networks, considering the accounts that users tend to co-follow as relevant contexts. We demonstrated the applicability of SocialVec embeddings in two case studies, obtaining competitive or state-of-the-art results using minimal or no supervision, and showing advantageous performance over entity representations derived from knowledge bases.
There are naturally some limitations to our approach. An inherent limitation is that the social network of Twitter may provide a biased reflection of the real world. It has been shown, for example, that Twitter users are younger and more Democrat than the general public.16 In addition, while public figures like politicians and artists typically maintain popular Twitter accounts, some entity types, e.g., locations, may not be well-represented in Twitter, or invoke low interest in this platform. Furthermore, accounts may be banned from social networks like Twitter. Yet, we believe that SocialVec encapsulates valuable, wide and truthful social knowledge. Another inherent limitation of SocialVec, as of any other embedding method, is that the quality of particular entity embeddings depends on sufficient context statistics for those entities. In the future, we plan to extend the scope of data that is modeled in SocialVec, which would enable the learning of high-quality representations for entities that are popular locally, or within particular sub-communities.

This may be the first work to present social entity embeddings. We believe that the implications of this work go far beyond the particular case studies presented here. We will make SocialVec embeddings publicly available, hoping to promote social knowledge modeling and exploration. As future research directions, we are interested in enhancing the social knowledge encoded by SocialVec with account semantic types. We also wish to explore the integration of social context in content analysis, for example, for identifying entity mentions in tweets, and for the modeling of social context in applications of opinion mining.

16 https://www.pewresearch.org/internet/2019/04/24/sizing-up-twitter-users

References

Adi, Y.; Kermany, E.; Belinkov, Y.; Lavi, O.; and Goldberg, Y. 2017. Fine-grained Analysis of Sentence Embeddings Using Auxiliary Prediction Tasks. In Intl. Conf. on Learning Representations.
Allcott, H.; and Gentzkow, M. 2017. Social media and fake news in the 2016 election. Journal of economic perspectives 31(2): 211–36.
An, J.; Cha, M.; Gummadi, K.; Crowcroft, J.; and Quercia, D. 2012. Visualizing media bias through Twitter. In ICWSM, volume 6.
Baly, R.; Karadzhov, G.; An, J.; Kwak, H.; Dinkov, Y.; Ali, A.; Glass, J.; and Nakov, P. 2020. What Was Written vs. Who Read It: News Media Profiling Using Text Analysis and Social Media Context. In Proceedings of the Annual Meeting of ACL.
Barkan, O.; and Koenigstein, N. 2016. ITEM2VEC: Neural item embedding for collaborative filtering. In Workshop on MLSP.
Benton, A.; Arora, R.; and Dredze, M. 2016. Learning Multiview Embeddings of Twitter Users. In ACL (Vol.2).
Bordes, A.; Usunier, N.; Garcia-Duran, A.; Weston, J.; and Yakhnenko, O. 2013. Translating embeddings for modeling multirelational data. In Neural Information Processing Systems (NIPS).
Center, P. R. 2014. Political polarization in the american public. Annual Review of Political Science.
Eady, G.; Nagler, J.; Guess, A.; Zilinsky, J.; and Tucker, J. A. 2019. How many people live in political bubbles on social media? Evidence from linked survey and Twitter data. Sage Open 9(1).
Flekova, L.; Preoţiuc-Pietro, D.; and Ungar, L. 2016. Exploring Stylistic Variation with Age and Income on Twitter. In ACL.
Grover, A.; and Leskovec, J. 2016. Node2Vec: Scalable feature learning for networks. In Proceedings of the ACM SIGKDD.
Hill, F.; Reichart, R.; and Korhonen, A. 2015. SimLex-999: Evaluating Semantic Models With (Genuine) Similarity Estimation. Computational Linguistics 41(4).
Hoffart, J.; Seufert, S.; Nguyen, D. B.; Theobald, M.; and Weikum, G. 2012. KORE: keyphrase overlap relatedness for entity disambiguation. In Information and knowledge management (CIKM).
Joulin, A.; Grave, E.; Bojanowski, P.; and Mikolov, T. 2017. Bag of Tricks for Efficient Text Classification. In Proc. of EACL.
Jurkowitz, M.; Mitchell, A.; Shearer, E.; and Walker, M. 2020. US media polarization and the 2020 election: A nation divided. Pew Research Center 24.
Lerer, A.; Wu, L.; Shen, J.; Lacroix, T.; Wehrstedt, L.; Bose, A.; and Peysakhovich, A. 2019. PyTorch-BigGraph: A Large-scale Graph Embedding System. In Proc. of SysML.
Levy, O.; and Goldberg, Y. 2014. Neural word embedding as implicit matrix factorization. Proc. of NIPS 27.
Marwick, A.; and Boyd, D. 2011. To see and be seen: Celebrity practice on Twitter. Convergence 17(2): 139–158.
Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J. 2013a. Efficient Estimation of Word Representations in Vector Space. In 1st International Conference on Learning Representations, ICLR.
Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. S.; and Dean, J. 2013b. Distributed Representations of Words and Phrases and their Compositionality. In Conference on Neural Information Processing Systems NIPS.
Mitchell, A. 2016. Key findings on the traits and habits of the modern news consumer. Pew Research Center.
Mitchell, T.; Cohen, W.; Hruschka, E.; Talukdar, P.; Yang, B.; Betteridge, J.; Carlson, A.; Dalvi, B.; Gardner, M.; Kisiel, B.; et al. 2018. Never-ending learning. Communications of the ACM 61(5): 103–115.
Mueller, A.; Wood-Doughty, Z.; Amir, S.; Dredze, M.; and Nobles, A. L. 2021. Demographic representation and collective storytelling in the me too twitter hashtag activism movement. Proceedings of the ACM on Human-Computer Interaction 5.
O’Connor, B.; Balasubramanyan, R.; Routledge, B.; and Smith, N. 2010. From tweets to polls: Linking text sentiment to public opinion time series. In Proceedings of the International AAAI Conference on Web and Social Media, volume 4.
Pan, J.; Bhardwaj, R.; Lu, W.; Chieu, H. L.; Pan, X.; and Puay, N. Y. 2019. Twitter homophily: Network based prediction of user’s occupation. In Proceedings of the Annual Meeting of ACL.
Perozzi, B.; Al-Rfou, R.; and Skiena, S. 2014. Deepwalk: Online learning of social representations. In Proceedings of the ACM SIGKDD international conference.
Pritsker, E. W.; Kuflik, T.; and Minkov, E. 2017. Assessing the contribution of twitter’s textual information to graph-based recommendation. In Proc. of the Intl Conf. on Intelligent User Interfaces.
Rehurek, R.; and Sojka, P. 2010. Software framework for topic modelling with large corpora. In Proceedings of the workshop on new challenges for NLP frameworks.
Ribeiro, F.; Henrique, L.; Benevenuto, F.; Chakraborty, A.; Kulshrestha, J.; Babaei, M.; and Gummadi, K. 2018. Media bias monitor: Quantifying biases of social media news outlets at large-scale. In Proc. of the Intl AAAI Conference on Web and Social Media.
Rudinger, R.; May, C.; and Van Durme, B. 2017. Social Bias in Elicited Natural Language Inferences. In Proceedings of the First ACL Workshop on Ethics in Natural Language Processing.
Shen, D.; Wang, G.; Wang, W.; Min, M. R.; Su, Q.; Zhang, Y.; Li, C.; Henao, R.; and Carin, L. 2018. Baseline Needs More Love: On Simple Word-Embedding-Based Models and Associated Pooling Mechanisms. In Proceedings of the Annual Meeting of ACL.
Sosea, T.; and Caragea, C. 2020. CANCEREMO: A Dataset for Fine-Grained Emotion Detection. In EMNLP.
Stefanov, P.; Darwish, K.; Atanasov, A.; and Nakov, P. 2020. Predicting the Topical Stance and Political Leaning of Media using Tweets. In Proceedings of the Annual Meeting of ACL.
Tanon, T. P.; Vrandečić, D.; Schaffert, S.; Steiner, T.; and Pintscher, L. 2016. From freebase to wikidata: The great migration. In WWW.
Volkova, S.; and Bachrach, Y. 2016. Inferring perceived demographics from user emotional tone and user-environment emotional contrast. In Proceedings of the Annual Meeting of ACL.
Volkova, S.; Bachrach, Y.; Armstrong, M.; and Sharma, V. 2015. Inferring latent user properties from texts published in social media. In Proceedings of the AAAI Conf. on Artificial Intelligence.
Wood-Doughty, Z.; Andrews, N.; Marvin, R.; and Dredze, M. 2018. Predicting twitter user demographics from names alone. In Proceedings of the Workshop on Computational Modeling of People’s Opinions, Personality, and Emotions in Social Media.
Yamada, I.; Asai, A.; Sakuma, J.; Shindo, H.; Takeda, H.; Takefuji, Y.; and Matsumoto, Y. 2020. Wikipedia2Vec: An Efficient Toolkit for Learning and Visualizing the Embeddings of Words and Entities from Wikipedia. In EMNLP: System demos.
Yamada, I.; Shindo, H.; Takeda, H.; and Takefuji, Y. 2016. Joint Learning of the Embedding of Words and Entities for Named Entity Disambiguation. In Proceedings of the SIGNLL Conference on Computational Natural Language Learning.
Youyou, W.; Kosinski, M.; and Stillwell, D. 2015. Computer-based personality judgments are more accurate than those made by humans. PNAS 112(4): 1036–1040.