This document discusses modeling individuals using distributional semantics based on their personal linguistic experiences and interactions. It proposes representing individuals as vectors in a semantic space based on the language data they are exposed to, such as edits they make on Wikipedia. Over 4000 individuals were modeled from Wikipedia edit logs. Their "person vectors" encapsulate aspects of their individuality and can predict attributes about them, such as areas of expertise. The vectors were tested to identify the most relevant person to answer information requests, outperforming standard information retrieval.
You and me... in a vector space: modelling individual speakers with distributional semantics

Aurélie Herbelot                              Behrang QasemiZadeh
Centre for Mind/Brain Sciences                DFG Collaborative Research Centre 991
University of Trento                          Heinrich-Heine-Universität Düsseldorf
[email protected]                           [email protected]
Abstract

The linguistic experiences of a person are an important part of their individuality. In this paper, we show that people can be modelled as vectors in a semantic space, using their personal interaction with specific language data. We also demonstrate that these vectors can be taken as representative of 'the kind of person' they are. We build over 4000 speaker-dependent subcorpora using logs of Wikipedia edits, which are then used to build distributional vectors that represent individual speakers. We show that such 'person vectors' are informative to others, and that they influence basic patterns of communication, like the choice of one's interlocutor in conversation. Tested on an information-seeking scenario, where natural language questions must be answered by addressing the most relevant individuals in a community, our system outperforms a standard information retrieval algorithm by a considerable margin.

1 Introduction

Distributional Semantics (DS) (Turney and Pantel, 2010; Clark, 2012; Erk, 2012) is an approach to computational semantics with historical roots in the philosophical work of Wittgenstein, in particular in the claim that 'meaning is use', i.e. that words acquire a semantics which is a function of the contexts in which they are used (Wittgenstein, 1953). The technique has been used in psycholinguistics to model various phenomena, from priming to similarity judgements (Lund and Burgess, 1996), and even aspects of language acquisition (Landauer and Dumais, 1997; Kwiatkowski et al., 2012). The general idea is that an individual speaker develops the verbal side of his or her conceptual apparatus from the linguistic experiences he or she is exposed to, together with the perceptual situations surrounding those experiences.

One natural consequence of the distributional claim is that meaning is both speaker-dependent and community-bound. On the one hand, depending on who they are, speakers will be exposed to different linguistic and perceptual experiences, and by extension develop separate vocabularies and conceptual representations. For instance, a chef and a fisherman may have different representations of the word fish (Wierzbicka, 1984). On the other hand, the vocabularies and conceptual representations of individual people should be close enough that they can successfully communicate: this is ensured by the fact that many linguistic utterances are shared amongst a community.

There is a counterpart to the claim that 'language is speaker-dependent': speakers are language-dependent. That is, the type of person someone is can be correlated with their linguistic experience. For instance, the fact that fish and boil are often seen in the linguistic environment of an individual may indicate that this individual has much to do with cooking (contrast with high co-occurrences of fish and net). In some contexts, linguistic data might even be the only source of information we have about a person: in an academic context, we often infer from the papers a person has written and cited what kind of expertise they might have.

This paper offers a model of individuals based on (a subset of) their linguistic experience. That is, we model how, by being associated with particular types of language data, people develop a uniqueness representable as a vector in a semantic space. Further, we evaluate those 'person vectors' along one particular dimension: the type of knowledge we expect them to hold.

The rest of this paper is structured as follows. We first give a short introduction to the topic of modelling linguistic individuality (§2) and discuss how DS is a suitable tool to represent the associated characteristics of a given person (§3). We describe a model of individuals in a community using 'person vectors' (§4). We then highlight the challenges associated with evaluating such vectors, and propose a prediction task whose goal is to identify someone with a particular expertise, given a certain information need (§5, §6). Concretely, we model a community of over 4000 individuals from their linguistic interaction with Wikipedia (§7). We finally evaluate our model on the suggested task and compare results against a standard information retrieval algorithm.
2 Individuality and how it is seen

A speaker's linguistic experience—what they read, write, say and hear—is individual in all the ways language can be described, from syntax to pragmatics, including stylistics and register. One area of work where linguistic individuality has been extensively studied is author profiling and identification (Zheng et al., 2006; Stamatatos, 2009). It has been shown, in particular, how subtle syntactic and stylistic features (including meta-linguistic features such as sentence length) can be a unique signature of a person. This research, often conducted from the point of view of forensic linguistics, has person identification as its main goal and does not delve much into semantics, for the simple reason that the previously mentioned syntactic and structural clues often perform better in evaluation (Baayen et al., 1996).

This paper asks in which way the semantic aspects of someone's linguistic experience contribute to their individuality.
One aspect that comes to mind is variation in word usage (as mentioned in the introduction). Unfortunately, this aspect of the problem is also the most difficult to approach computationally, for sheer lack of data: we highlight in §5 some of the reasons why obtaining (enough) speaker-specific language data remains a technical and privacy minefield. Another aspect, which is perhaps more straightforwardly modellable, is the extent to which the type of linguistic material someone is exposed to broadly correlates with who they are. It is likely, for instance, that the authors of this paper write and read a lot about linguistics, and this correlates with broad features of theirs, e.g. they are computational linguists and are interested in language. So, just as particular stylistic features can predict who a person is, a specific semantic experience might give an insight into what kind of person they are.

In what follows, we describe how, by selecting a public subset of a person's linguistic environment, we can build a representation of that person which encapsulates and summarises a part of their individuality. The term 'public subset' is important here, as the entire linguistic experience of an individual is (at this point in time!) only accessible to them, and the nature of the subset dictates which aspect of the person we can model. For instance, knowing what a particular academic colleague has written, read and cited may let us model their work expertise, while chatting with them at a barbecue party might give us insight into their personal life. We further contend that what we know about a person conditions the type of interaction we have with them: we are more likely to start a conversation about linguistics with someone we see as a linguist, and to talk about the bad behaviour of our dog with a person we have primarily modelled as a dog trainer. In other words, the model we have of people helps us successfully communicate with them.

3 Some fundamentals of DS

The basis of any DS system is a set of word meaning representations ('distributions') built from large corpora. In their simplest form,[1] distributions are vectors in a so-called semantic space where each dimension represents a term from the overall system's vocabulary. The value of a vector along a particular dimension expresses how characteristic the dimension is for the word modelled by the vector (as calculated using, e.g., Pointwise Mutual Information). It will be found, typically, that the vector cat has high weight along the dimension meow but low weight along politics. More complex architectures result in compact representations with reduced dimensionality, which can integrate a range of non-verbal information such as visual and sound features (Feng and Lapata, 2010; Kiela and Clark, 2015).

Word vectors have been linked to conceptual representations both theoretically (Erk, 2013) and experimentally, for instance in psycholinguistic and neurolinguistic work (Anderson et al., 2013; Mitchell et al., 2008). The general idea is that a distribution encapsulates information about what kind of thing a particular concept might be. Retrieving such information in ways that can be verbalised is often done by looking at the 'nearest neighbours' of a vector. Indeed, a natural consequence of the DS architecture is that similar words cluster in the same area of the semantic space: it has been shown that the distance between DS vectors correlates well with human similarity judgements (Baroni et al., 2014b; Kiela and Clark, 2014). So we can find out what a cat is by inspecting the subspace in which the vector cat lives, and finding items such as animal, dog, pet, scratch etc. In what follows, we use this feature of vector spaces to give an interpretable model of an individual, i.e., we can predict that a person might be a linguist by knowing that their vector is a close neighbour of, say, semantics, reference, model.

[1] There are various possible ways to construct distributions, including predictive language models based on neural networks (Mikolov et al., 2013).
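To make the nearest-neighbour idea concrete, here is a minimal sketch in Python, assuming unit-length word vectors stored in a plain dictionary. The words and random vectors are hypothetical stand-ins for a real distributional space, and the function names are ours, not part of any released code.

    import numpy as np

    rng = np.random.default_rng(0)
    space = {}
    # Hypothetical toy space: word -> unit-length vector. A real setup would
    # load vectors from a large distributional model instead.
    for w in ["cat", "dog", "animal", "pet", "politics"]:
        v = rng.random(400)
        space[w] = v / np.linalg.norm(v)

    def nearest_neighbours(vec, space, m=3):
        # For unit-length vectors, cosine similarity reduces to a dot product.
        sims = {w: float(np.dot(vec, v)) for w, v in space.items()}
        return sorted(sims, key=sims.get, reverse=True)[:m]

    print(nearest_neighbours(space["cat"], space))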
For in- stance, by virtue of being a chef, or someone inter- 4 A DS model of a community ested in cooking, someone will have many inter- 4.1 People in semantic spaces connected experiences related to food. In particu- lar, a good part of their linguistic experiences will Summing up what we have said so far, we follow involve talking, reading and writing about food. the claim that we can theoretically talk about the It follows that we can represent a person by sum- linguistic experience of a speaker in distributional ming the vectors corresponding to the words they terms. The words that a person has read, written, have been exposed to. When aggregating the vo- spoken or heard, are a very individual signature cabulary most salient for a chef, we would hope- for that person. The sum of those words carries fully create a vector inhabiting the ‘food’ section important information about the type of concepts of the space. As we will see in §6, the model we someone may be familiar with, about their social propose is slightly more complex, but the intuition environment (indicated by the registers observed remains the same. in their linguistic experience) and, broadly speak- Note that, in spite of being ‘coherent’, peo- ing, their interests. ple are not one-sided, and a cook can also be a We further posit that people’s individuality can bungee-jumper in their spare time. So depending be modelled as vectors in a semantic space, in a on the spread of data we have about a person, our way that the concepts surrounding a person’s vec- method is not completely immune to creating vec- tor reflect their experience. For instance, a cook tors which sit a little too far away from the topics might ‘live’ in a subspace inhabited by other cooks they encapsulate. This is a limit of our approach and concepts related to cooking. In that sense, the which could be solved by attributing a set of vec- person can be seen as any other concept inhabiting tors, rather than a single representation, to each that space. person. In this work, however, we do not consider In order to compute such person vectors, we this option and assume that the model is still dis- expand on a well-known result of compositional criminative enough to distinguish people. distributional semantics (CDS). CDS studies how words combine to form phrases and sentences. 4.2 From person vectors to interacting agents While various, more or less complex frameworks have been proposed (Clark et al., 2008; Mitchell In what sense are person vectors useful represen- and Lapata, 2010; Baroni et al., 2014a), it has re- tations? We have said that, as any distribution in a semantic space, they give information about the type of thing/person modelled by the vector. We also mentioned in §2 that knowing who someone is (just like knowing what something is) influences our interaction with them. So we would like to model in which ways our people representations help us successfully communicate with them. For the purpose of this paper, we choose an in- formation retrieval task as our testbed, described in §5. The task, which involves identifying a rel- evant knowledge holder for a particular question, requires us to embed our person vectors into sim- ple agent-like entities, with a number of linguis- tic, knowledge-processing and communicative ca- pabilities. A general illustration of the structure of each agent is shown in Fig. 1. 
4.2 From person vectors to interacting agents

In what sense are person vectors useful representations? We have said that, like any distribution in a semantic space, they give information about the type of thing/person modelled by the vector. We also mentioned in §2 that knowing who someone is (just like knowing what something is) influences our interaction with them. So we would like to model in which ways our people representations help us successfully communicate with them.

For the purpose of this paper, we choose an information retrieval task as our testbed, described in §5. The task, which involves identifying a relevant knowledge holder for a particular question, requires us to embed our person vectors into simple agent-like entities, with a number of linguistic, knowledge-processing and communicative capabilities. A general illustration of the structure of each agent is shown in Fig. 1. An agent stores (and dynamically updates) a) a person vector; b) a memory which, for the purpose of our evaluation (§5), is a store of linguistic experiences (some data the person has read or written, e.g. information on Venezuelan cocoa beans). The memory acts as a knowledge base which can be queried, i.e. relevant parts can be 'remembered' (e.g. the person remembers reading about some Valrhona cocoa, with a spicy flavour). Further, the agent has some awareness of others: it holds a model of its community consisting of other people's vectors (e.g., the agent knows Bob, who is a chef, and Alice, who is a linguist). When acted upon by a particular communication need, the agent can direct its attention to the appropriate people in its community and engage with them.

Figure 1: A person is exposed to a set of linguistic experiences. Computationally, each experience is represented as a vector in a memory store. The sum of those experiences makes up the individual's 'person vector'. The person also has a model of their community in the form of other individuals' person vectors. In response to a particular communication need, the person can direct their attention to the relevant actors in that community.

We assume that those agents are fully connected and aware of each other, in such a way that they can direct specific questions to the individuals most likely to answer them. Our evaluation procedure tests whether, for a given information need, expressed in natural language by one agent (e.g. What is Venezuelan chocolate like?), the community is modelled in a way that an answer can be successfully obtained (i.e. an agent with relevant expertise has been found, and 'remembers' some information that satisfies the querier's need). Note that we are not simulating any real communication between agents, which would require that the information holder generate a natural language answer to the question. Rather, the contacted agent simply returns the information in its memory store which seems most relevant to the query at hand. We believe this is enough to confirm that the person vector was useful in acquiring the information: if the querying agent contacts the 'wrong' person, the system has failed in successfully fulfilling the information need.
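The agent structure just described can be sketched as a small data class holding the three components of Fig. 1: a memory of experience vectors, the person vector, and a model of the community. This is a minimal sketch; all names are illustrative and do not reproduce the authors' released implementation.

    from dataclasses import dataclass, field
    from typing import Dict, Optional
    import numpy as np

    @dataclass
    class Agent:
        name: str
        memory: Dict[str, np.ndarray] = field(default_factory=dict)     # doc id -> experience vector
        community: Dict[str, np.ndarray] = field(default_factory=dict)  # others' person vectors
        person_vector: Optional[np.ndarray] = None

        def add_experience(self, doc_id: str, vec: np.ndarray) -> None:
            # Store the experience and refresh the person vector as the
            # normalised sum of all stored experiences (Eq. 1 in section 6).
            self.memory[doc_id] = vec
            total = np.sum(list(self.memory.values()), axis=0)
            self.person_vector = total / np.linalg.norm(total)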
5 Evaluating person vectors

5.1 The task

To evaluate our person vectors, we choose a task which relies on having a correct representation of the expertise of an individual.

Let us imagine a person with a particular information need, for instance, getting sightseeing tips for a holiday destination. Let us also say that we are in a pre-Internet era, where information is typically sought from other actors in one's real-world community. The communication process associated with satisfying this information need takes two steps: a) identifying the actors most likely to hold relevant knowledge (perhaps a friend who has done the trip before, or a local travel agent); b) asking them to share relevant knowledge.

In the following, we replicate this situation using a set of agents, created as described in §4.

5.2 Comparative evaluation

We note that the task we propose can be seen as an information retrieval (IR) problem over a distributed network: a query is matched to some relevant knowledge unit, with all available knowledge being split across a number of 'peers' (the individuals in our community). So in order to know how well the system does at retrieving relevant information, we can use standard IR software as a benchmark.

We compare the performance of our system with a classic, centralised IR algorithm, as implemented in the Apache Lucene search engine. Lucene is an open-source library for implementing (unstructured) document retrieval systems, which has been employed in many full-text search engine systems (for an overview of the library, see (Bialecki et al., 2012)). We use the out-of-the-box 'standard' indexing solution provided by Lucene,[2] which roughly implements a term-by-document Vector Space Model, in which terms are lemmatised and associated to documents using their tf-idf scores (Spärck-Jones, 1972) computed from the input Wikipedia corpus of our evaluation. Similarly, queries are parsed using Lucene's standard query parser and then searched and ranked by the computed 'default' similarities.[3]

Our hypothesis is that, if our system can match the performance of a well-known IR system, we can also conclude that the person vectors were a good summary of the information held by a particular agent.

[2] Ver. 5.4.1, obtained from http://apache.lauf-forum.at/lucene/java/5.4.1.
[3] For an explanation of query matching and similarity computation see http://lucene.apache.org/core/5_4_1/core/org/apache/lucene/search/similarities/Similarity.html.

5.3 Data challenges

Finding data to set up the evaluation of our system is an extremely challenging task. It involves finding a) personalised linguistic data which can be split into coherent 'linguistic experiences'; b) realistic natural language queries; c) a gold standard matching queries and relevant experiences. There is very little openly available data on people's personal linguistic experience. What is available comes mostly from the Web science and user personalisation communities, and such data is either not annotated for IR evaluation purposes (e.g. (von der Weth and Hauswirth, 2013)) or proprietary and not easily accessible or re-distributable (e.g. (Collins-Thompson et al., 2011)). Conversely, standard IR datasets do not give any information about users' personal experiences. We attempt to solve this conundrum by using information freely available on Wikipedia: we combine a Wikipedia-based Question Answering (QA) dataset with contributor logs from the online encyclopedia.

We use the freely available WikiQA dataset of (Yang et al., 2015).[4] This dataset contains 3047 questions sampled from the Bing search engine's data. Each question is associated with a Wikipedia page which received user clicks at query time. The dataset is further annotated with the particular sentence in the Wikipedia article which answers the query – if it exists. Many pages that were chosen by the Bing users do not actually hold the answer to their questions, reducing the data to 1242 queries and the 1194 corresponding pages which can be considered relevant for those queries (41% of all questions). We use this subset for our experiments, regarding each document in the dataset as a 'linguistic experience' which can be stored in the memory of the agent exposed to it.

[4] http://aka.ms/WikiQA

To model individuals, we download a log of Wikipedia contributions (March 2015). This log is described as a 'log events to all pages and users'. We found that it does not, in fact, contain all possible edits (presumably because of storage issues). Of the 1194 pages in our WikiQA subset, only 625 are logged. We record the usernames of all contributors to those 625 documents, weeding out contributors whose usernames contain the string bot and who have more than 10,000 edits (under the assumption that those are, indeed, bots). Finally, for each user, we download and clean all articles they have contributed to.

In summary, we have a dataset which consists of a) 662 WikiQA queries linked to 625 documents relevant for those queries; b) a community of 4379 individuals/agents, with just over 1M documents spread across the memories of all agents.
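The bot-filtering heuristic described above is simple enough to state in code. A sketch under stated assumptions (the case-insensitive substring match is our interpretation of "contain the string bot"):

    def is_probably_bot(username: str, n_edits: int) -> bool:
        # Heuristic from section 5.3: accounts whose usernames contain 'bot'
        # and which have more than 10,000 edits are assumed to be automated.
        return "bot" in username.lower() and n_edits > 10000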
6 Implementation

Our community is modelled as a distributed network of 4379 agents {a_1, ..., a_4379}. Each agent a_k has two components: a) a personal profile component, which fills the agent's memory with information from the person's linguistic experience (i.e., documents she/he reads or edits) and calculates the corresponding person vector; b) an 'attention' component which gets activated when a communication need is felt. All agents share a common semantic space S which gives background vectorial representations for words in the system's vocabulary. In our current implementation, S is given by the CBOW semantic space of (Baroni et al., 2014b), a 400-dimension vector space of 300,000 items built using the neural network language model of (Mikolov et al., 2013). This space shows high correlation with human similarity judgements (ρ = 0.80) over the 3000 pairs of the MEN dataset (Bruni et al., 2012). Note that using a standard space means we assume shared meaning representations across the community (i.e., at this stage, we don't model inter-speaker differences at the lexical item level).

Person vectors: A person vector is the normalised sum of that person's linguistic experiences:

    \vec{p} = \sum_{k=1}^{n} \vec{e}_k.    (1)

As mentioned previously, in our current setup, linguistic experiences correspond to documents.

Document/experience vectors: We posit that the (rough) meaning of a document can be expressed as an additive function acting over (some of) the words of that document. Specifically, we sum the 10 words that are most characteristic for the document. While this may seem to miss out on much of the document's content, it is important to remember that the background DS representations used in the summation are already rich in content: the vector for Italy, for instance, will typically sit next to Rome, country and pasta in the semantic space. The summation roughly captures the document's content in a way equivalent to a human describing a text as being about so and so.

We need to individually build document vectors for potentially sparse individual profiles, without necessitating access to the overall document collection of the system (because a_k is not necessarily aware of a_m's experiences). Thus, standard measures such as tf-idf are not suitable to calculate the importance of a word for a document. We alleviate this issue by using a static list of word entropies (calculated over the 2-billion-word ukWaC corpus, (Baroni et al., 2009)) and the following weighting measure:

    w_t = \frac{freq(t)}{\log(H(t) + 1)},    (2)

where freq(t) is the frequency of term t in the document and H(t) is its entropy, as calculated over a larger corpus. The representation of the document is then the weighted sum of the 10 terms[5] with highest importance for that text:

    \vec{e} = \sum_{t \in \{t_1, \ldots, t_{10}\}} w_t \cdot \vec{t}.    (3)

Note that both vectors \vec{t} and \vec{e} are normalised to unit length.

For efficiency reasons, we compute weights only over the first 20 lines of documents, also following the observation that the beginning of a document is often more informative as to its topic than the rest (Manning et al., 2008).

[5] We experimented with a range of values, not reported here for space reasons.
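A possible rendering of Eqs. 1-3 is sketched below, assuming a dictionary `space` of unit-length word vectors and a precomputed `entropy` map standing in for the static ukWaC-based list; all function names are ours, not the paper's released code.

    import math
    from collections import Counter
    import numpy as np

    def experience_vector(tokens, space, entropy, n_terms=10):
        # Eq. 2: weight each known term by freq(t) / log(H(t) + 1)
        # (assumes H(t) > 0 so the denominator never vanishes).
        freqs = Counter(t for t in tokens if t in space and t in entropy)
        weights = {t: f / math.log(entropy[t] + 1) for t, f in freqs.items()}
        # Eq. 3: weighted sum of the n_terms highest-scoring terms, renormalised.
        top = sorted(weights, key=weights.get, reverse=True)[:n_terms]
        vec = np.sum([weights[t] * space[t] for t in top], axis=0)
        return vec / np.linalg.norm(vec)

    def person_vector(experience_vectors):
        # Eq. 1: normalised sum of a person's experience vectors.
        total = np.sum(experience_vectors, axis=0)
        return total / np.linalg.norm(total)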
Attention: The 'attention' module directs the agent to the person most relevant for its current information need. In this paper, it is operationalised as cosine similarity between vectors. The module takes a query q and translates it into a vector \vec{q} by summing the words in the query, as in Eq. 3. It then goes through a 2-stage process: 1) find potentially helpful people by calculating the cosine distance between \vec{q} and all person vectors \vec{p}_1 ... \vec{p}_n; 2) query the m most relevant people, who will calculate the distance between \vec{q} and all documents in their memory, D_k = {d_1 ... d_t}, and receive the documents corresponding to the highest scores, ranked in descending order.
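The two-stage process might be rendered as follows, reusing the hypothetical Agent sketch from §4.2 and assuming unit-length vectors throughout, so that dot products stand in for cosine similarities.

    import numpy as np

    def attend(query_vec, agents, m=5):
        # Stage 1: rank community members by the cosine between the query
        # vector and their person vectors.
        ranked = sorted(agents,
                        key=lambda a: float(np.dot(query_vec, a.person_vector)),
                        reverse=True)
        # Stage 2: the m most relevant agents score the documents in their
        # own memory stores; results come back best-first.
        hits = [(float(np.dot(query_vec, vec)), doc_id, agent.name)
                for agent in ranked[:m]
                for doc_id, vec in agent.memory.items()]
        return sorted(hits, reverse=True)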
We observe that that the distribution of documents across people is 176 documents are contained in only one agent out highly skewed: 12% of all agents only contain one of 4379. Overall, around 70% of the documents document, 31% contain less than 10 documents. that answer a query in the dataset are to be found in Table 1 shows the overall distribution. less than 10 agents. So as far as our pages of inter- Topic coherence: We compute the ‘topic coher- est are concerned, the knowledge base of our com- ence’ of each person vector, that is, the extent to munity is minimally redundant, making the task which it focuses on related topics. We expect that all the more challenging. it will be easier to identify a document answer- 8 Evaluation ing a query on e.g. baking if it is held by an agent which contains a large proportion of other The WikiQA dataset gives us information about cooking-related information. Following the intu- the document dgold that was clicked on by users ition of (Newman et al., 2010), we define the co- after issuing a particular query q. This indicates herence of a set of documents d1 , · · · , dn as the that dgold was relevant for q, but does not give us mean of their pairwise similarities: information about which other documents might have also be deemed relevant by the user. In this Coherence(d1...n ) = mean{Sim(di , dj ), respect, the dataset differs from fully annotated IR ij ∈ 1 . . . n, i < j} (4) collections like the TREC data (Harman, 1993). where Sim is the cosine similarity between two In what follows, we report Mean Reciprocal Rank documents. (MRR), which takes into account that only one The mean coherence over the 4379 person vec- document per query is considered relevant in our tors is 0.40 with a variance of 0.06. The high vari- dataset: X ance is due to the number of agents containing one M RR = P (q), (5) q∈Q document only (which have coherence 1.0). When only considering the agents with at least two doc- where Q is the set of all queries, and P (q) is the uments, the mean coherence is 0.32, with variance precision of the system for query q. P (q) itself is 0.3 only uses simple linear algebra operations over MRR score raw, non-lemmatised data. 0.2 M RR figures are not necessarily very intuitive, 0.1 Lucene Our system so we inspect how many times an agent is found 0 who can answer the query (i.e. its memory store 5 10 15 20 contains the document that was marked as holding cut-off point the answer to the query in WikiQA). We find that the system finds a helpful hand 39% of the time for Figure 2: MRR for Lucene and our system (best 5 m = 5 and 52% at m = 50. These relatively mod- person vectors). est figures demonstrate the difficulty of our task and dataset. We must however also acknowledge that finding appropriate helpers amongst a com- given by: munity of 4000 individuals is highly non-trivial. ( 1 if rq < cutoff Overall, the system is very precise once a good rq P (q) = , agent has been identified (i.e., it is likely to re- 0 otherwise turn the correct document in the first few results). where rq is the rank at which the correct docu- This is shown by the fact that the M RR only in- ment is returned for query q, and the cutoff is a creases slightly between cut-off point 1 and 20, predefined number of considered results (e.g., top from 0.29 to 0.31 (compare with Lucene, which 20 documents). achieves M RR = 0.02 at rank 1). This behaviour The MRR scores for Lucene and our system are can be explained by the fact that the agent over- shown in Fig. 2. 
8 Evaluation

The WikiQA dataset gives us information about the document d_gold that was clicked on by users after issuing a particular query q. This indicates that d_gold was relevant for q, but does not give us information about which other documents might also have been deemed relevant by the user. In this respect, the dataset differs from fully annotated IR collections like the TREC data (Harman, 1993). In what follows, we report Mean Reciprocal Rank (MRR), which takes into account that only one document per query is considered relevant in our dataset:

    MRR = \frac{1}{|Q|} \sum_{q \in Q} P(q),    (5)

where Q is the set of all queries, and P(q) is the precision of the system for query q. P(q) itself is given by:

    P(q) = \begin{cases} 1/r_q & \text{if } r_q < \text{cutoff} \\ 0 & \text{otherwise,} \end{cases}

where r_q is the rank at which the correct document is returned for query q, and the cutoff is a predefined number of considered results (e.g., the top 20 documents).

The MRR scores for Lucene and our system are shown in Fig. 2.

Figure 2: MRR for Lucene and our system (best 5 person vectors). [Plot: MRR score (y-axis, 0 to 0.3) against cut-off point (x-axis, 5 to 20) for both systems.]

The x-axis shows different cut-off points (e.g., cut-off point 10 means that we are only considering the top 10 documents returned by the system). The graph gives results for the case where the agent contacts the m = 5 people potentially most relevant for the query. We also tried m = {10, 20, 50} and found that end results are fairly stable, despite the fact that the chance of retrieving at least one 'useful' agent increases. This is due to the fact that, as people are added to the first phase of querying, confusion increases (more documents are inspected) and the system is more likely to return the correct page at a slightly lower rank (e.g., as witnessed by the performance of Lucene's centralised indexing mechanism).

Our hypothesis was that matching the performance of an IR algorithm would validate our model as a useful representation of a community. We find, in fact, that our method considerably outperforms Lucene, reaching MRR = 0.31 for m = 5 against MRR = 0.22. This is a very interesting result, as it suggests that retaining the natural relationship between information and knowledge holders increases the ability of the system to retrieve it, and this despite the intrinsic difficulty of searching in a distributed setting. This is especially promising, as the implementation presented here is given in its purest form, without heavy pre-processing or parameter setting. Aside from a short list of common stopwords, the agent only uses simple linear algebra operations over raw, non-lemmatised data.

MRR figures are not necessarily very intuitive, so we also inspect how many times an agent is found who can answer the query (i.e. its memory store contains the document that was marked as holding the answer to the query in WikiQA). We find that the system finds a helpful hand 39% of the time for m = 5 and 52% for m = 50. These relatively modest figures demonstrate the difficulty of our task and dataset. We must however also acknowledge that finding appropriate helpers amongst a community of 4000 individuals is highly non-trivial.

Overall, the system is very precise once a good agent has been identified (i.e., it is likely to return the correct document in the first few results). This is shown by the fact that the MRR only increases slightly between cut-off points 1 and 20, from 0.29 to 0.31 (compare with Lucene, which achieves MRR = 0.02 at rank 1). This behaviour can be explained by the fact that the agent overwhelmingly prefers 'small' memory sizes: 78% of the agents selected in the first phase of the querying process contain fewer than 100 documents. This is an important aspect which should guide further modelling. We hypothesise that people with larger memory stores are perhaps less attractive to the querying agent because their profiles are less topically defined (i.e., as the number of documents browsed by a user increases, it is more likely that they cover a wider range of topics). As pointed out in §4, we suggest that our person representations may need more structure, perhaps in the form of several coherent 'topic vectors'. It makes intuitive sense to assume that a) the interests of a person are not necessarily close to each other (e.g. someone may be a linguist and a hobby gardener); b) when a person with an information need selects 'who can help' amongst their acquaintances, they only consider the relevant aspects of an individual (e.g., the hobby gardener is a good match for a query on gardening, irrespective of their other persona as a linguist).

Finally, we note that all figures reported here are below their true value (including those pertaining to Lucene). This is because we attempt to retrieve the page labelled as containing the answer to the query in the WikiQA dataset. Pages which are relevant but not contained in WikiQA are incorrectly given a score of 0. For instance, the query what classes are considered humanities returns Outline of the humanities as the first answer, but the chosen document in WikiQA is Humanities.
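For completeness, a small sketch of the evaluation measure. The normalisation by the number of queries reflects the 'mean' in Mean Reciprocal Rank and is our reading of Eq. 5; the strict inequality follows the piecewise definition of P(q) above.

    def mean_reciprocal_rank(ranks, n_queries, cutoff=20):
        # `ranks` maps query id -> 1-based rank at which the gold document
        # was returned; queries with no hit are simply absent and score 0.
        total = sum(1.0 / r for r in ranks.values() if r < cutoff)
        return total / n_queries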
9 Conclusion

We have investigated the notion of 'person vector', built from a set of linguistic experiences associated with a real individual. These 'person vectors' live in the same semantic space as concepts and, like any semantic vector, give information about the kind of entity they describe, i.e. what kind of person someone is. We modelled a community of speakers from 1M 'experiences' (documents read or edited by Wikipedians), shared across over 4000 individuals. We tested the representations obtained for each individual by engaging them in an information-seeking task necessitating some understanding of the community for successful communication. We showed that our system outperforms a standard IR algorithm, as implemented by the Lucene engine. We hope to improve our modelling by constructing structured sets of person vectors that explicitly distinguish the various areas of expertise of an individual.

One limit of our approach is that we assumed person vectors to be unique across the community, i.e. that there is some kind of ground truth about the representation of a person. This is of course unrealistic: the picture that Bob has of Alice should be different from the picture that Kim has of her, and again different from the picture that Alice has of herself. Modelling these fine distinctions, and finding an evaluation strategy for such modelling, is reserved for future work.

A more in-depth analysis of our model would also need to consider more sophisticated composition methods. We chose addition in this paper for its ease of implementation and efficiency, but other techniques are known to perform better for representing sentences and documents (Le and Mikolov, 2014).

We believe that person vectors, aside from being interesting theoretical objects, are also useful constructs for a range of applications, especially in the social media area. As a demonstration of this, we have made our system available at https://github.com/PeARSearch in the form of a distributed information retrieval engine. The code for the specific experiments presented in this paper is at https://github.com/PeARSearch/PeARS-evaluation.

Acknowledgements

We thank Germán Kruszewski, Angeliki Lazaridou and Ann Copestake for interesting discussions about this work. The first author is funded through ERC Starting Grant COMPOSES (283554).

References

Andrew J. Anderson, Elia Bruni, Ulisse Bordignon, Massimo Poesio, and Marco Baroni. 2013. Of words, eyes and brains: Correlating image-based distributional semantic models with neural representations of concepts. In EMNLP, pages 1960-1970.

Harald Baayen, Hans Van Halteren, and Fiona Tweedie. 1996. Outside the cave of shadows: Using syntactic annotation to enhance authorship attribution. Literary and Linguistic Computing, 11(3):121-132.

Marco Baroni, Silvia Bernardini, Adriano Ferraresi, and Eros Zanchetta. 2009. The WaCky wide web: a collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation, 43(3):209-226.

Marco Baroni, Raffaela Bernardi, and Roberto Zamparelli. 2014a. Frege in space: A program of compositional distributional semantics. Linguistic Issues in Language Technology, 9.

Marco Baroni, Georgiana Dinu, and Germán Kruszewski. 2014b. Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of ACL, pages 238-247.

A. Bialecki, R. Muir, and G. Ingersoll. 2012. Apache Lucene 4. pages 17-24.

Elia Bruni, Gemma Boleda, Marco Baroni, and Nam-Khanh Tran. 2012. Distributional semantics in technicolor. In Proceedings of ACL, pages 136-145.

Stephen Clark, Bob Coecke, and Mehrnoosh Sadrzadeh. 2008. A compositional distributional model of meaning. In Proceedings of the Second Quantum Interaction Symposium (QI-2008), pages 133-140.

Stephen Clark. 2012. Vector space models of lexical meaning. In Shalom Lappin and Chris Fox, editors, Handbook of Contemporary Semantics – second edition. Wiley-Blackwell.

Kevyn Collins-Thompson, Paul N. Bennett, Ryen W. White, Sebastian de la Chica, and David Sontag. 2011. Personalizing web search results by reading level. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, pages 403-412. ACM.
Katrin Erk. 2012. Vector space models of word meaning and phrase meaning: A survey. Language and Linguistics Compass, 6(10):635-653.

Katrin Erk. 2013. Towards a semantics for distributional representations. In Proceedings of the Tenth International Conference on Computational Semantics (IWCS 2013).

Yansong Feng and Mirella Lapata. 2010. Visual information in semantic representation. In NAACL-HLT 2010, pages 91-99, Los Angeles, California, June.

Donna K. Harman. 1993. The first text retrieval conference (TREC-1). Information Processing & Management, 29(4):411-414.

Aurélie Herbelot. 2015. Mr Darcy and Mr Toad, gentlemen: distributional names and their kinds. In Proceedings of the 11th International Conference on Computational Semantics, pages 151-161.

Douwe Kiela and Stephen Clark. 2014. A systematic study of semantic vector space model parameters. In Proceedings of the 2nd Workshop on Continuous Vector Space Models and their Compositionality (CVSC) at EACL, pages 21-30.

Douwe Kiela and Stephen Clark. 2015. Multi- and cross-modal semantics beyond vision: Grounding in auditory perception. In EMNLP.

Tom Kwiatkowski, Sharon Goldwater, Luke Zettlemoyer, and Mark Steedman. 2012. A probabilistic model of syntactic and semantic acquisition from child-directed utterances and their meanings. In EACL, pages 234-244, Avignon, France.

Thomas K. Landauer and Susan T. Dumais. 1997. A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, pages 211-240.

Quoc V. Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. arXiv preprint arXiv:1405.4053.

Kevin Lund and Curt Burgess. 1996. Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instruments, & Computers, 28:203-208, June.

Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval, volume 1. Cambridge University Press, Cambridge, UK.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Jeff Mitchell and Mirella Lapata. 2010. Composition in distributional models of semantics. Cognitive Science, 34(8):1388-1429, November.

Tom M. Mitchell, Svetlana V. Shinkareva, Andrew Carlson, Kai-Min Chang, Vicente L. Malave, Robert A. Mason, and Marcel Adam Just. 2008. Predicting human brain activity associated with the meanings of nouns. Science, 320(5880):1191-1195.

David Newman, Jey Han Lau, Karl Grieser, and Timothy Baldwin. 2010. Automatic evaluation of topic coherence. In NAACL, pages 100-108.

Karen Spärck-Jones. 1972. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1):11-21.

Efstathios Stamatatos. 2009. A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3):538-556.

Peter D. Turney and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37:141-188.

Eva Maria Vecchi, Marco Baroni, and Roberto Zamparelli. 2011. (Linear) maps of the impossible: capturing semantic anomalies in distributional space. In Proceedings of the Workshop on Distributional Semantics and Compositionality, pages 1-9. Association for Computational Linguistics.

Christian von der Weth and Manfred Hauswirth. 2013. DOBBS: Towards a comprehensive dataset to study the browsing behavior of online users. In Web Intelligence (WI) and Intelligent Agent Technologies (IAT), 2013, volume 1, pages 51-56. IEEE.

Anna Wierzbicka. 1984. Cups and mugs: Lexicography and conceptual analysis. Australian Journal of Linguistics, 4(2):205-255.
Ludwig Wittgenstein. 1953. Philosophical Investigations. Wiley-Blackwell (reprint 2010).

Yi Yang, Wen-tau Yih, and Christopher Meek. 2015. WikiQA: A challenge dataset for open-domain question answering. In EMNLP.

Rong Zheng, Jiexun Li, Hsinchun Chen, and Zan Huang. 2006. A framework for authorship identification of online messages: Writing-style features and classification techniques. Journal of the American Society for Information Science and Technology, 57(3):378-393.