COMPLEX STRUCTURES AND SEMANTICS IN FREE WORD ASSOCIATION

2012

We investigate the directed and weighted complex network of free word associations in which players write a word in response to another word given as input. We analyze in detail two large datasets resulting from two very different experiments: on the one hand the massively multiplayer web-based Word Association Game known as Human Brain Cloud, and on the other hand the South Florida Free Association Norms experiment. In both cases the networks of associations exhibit quite robust properties like the small world property, a slight assortativity and a strong asymmetry between in-degree and out-degree distributions. A particularly interesting result concerns the existence of a typical scale for the word association process, arguably related to specific conceptual contexts for each word. After mapping the Human Brain Cloud network onto the WordNet semantic network, we point out the basic cognitive mechanisms underlying word associations when they are represented as paths in an underlying semantic network.

Author manuscript, published in "Advances in Complex Systems 15 (2012) 1250054"
July 11, 2011 11:26 WSPC/INSTRUCTION FILE main
Advances in Complex Systems, © World Scientific Publishing Company
hal-00701709, version 1 - 25 May 2012

Pietro Gravino
Dipartimento di Fisica, Sapienza Università di Roma, Piazzale Aldo Moro 5, 00185 Roma, Italy.
Dipartimento di Fisica, “Alma Mater Studiorum” Università di Bologna, Viale Berti Pichat 6/2, 40127 Bologna, Italy.

Vito D. P. Servedio
Dipartimento di Fisica, Sapienza Università di Roma, Piazzale Aldo Moro 5, 00185 Roma, Italy.

Alain Barrat
Centre de Physique Théorique (CNRS UMR 6207), Luminy, 13288 Marseille Cedex 9, France.
Institute for Scientific Interchange (ISI), Torino, Italy.

Vittorio Loreto
Dipartimento di Fisica, Sapienza Università di Roma, Piazzale Aldo Moro 5, 00185 Roma, Italy.
Institute for Scientific Interchange (ISI), Torino, Italy.

Keywords: Complex Networks, Language Dynamics, Word Association, Semantic Network, WordNet

1. Introduction

In recent years, the physicists’ toolbox has become increasingly used for the study of complex systems in areas traditionally far from the pure realm of physics. Many works have concerned the interdisciplinary field of complex networks [8, 21, 3, 1], as well as the statistical physics of social dynamics [4], where the collective properties of populations of human individuals are investigated. From this perspective, a lot of emphasis has been put on opinion, cultural and language dynamics, crowd behavior, hierarchy formation, human dynamics, and social spreading phenomena, to name just the most important examples. In social phenomena, the basic constituents are not particles but humans [2]. Though in many cases one can neglect the intrinsic complexity of human beings, as for instance for collective phenomena like traffic or crowd behaviour [12, 18], the detailed behavior of each individual is already the complex outcome of many cognitive, psychological and physiological processes, still largely unknown.

A very interesting example is represented by social annotation processes [13, 10], through which users annotate resources (web pages, bibliographic references, digital photographs, etc.) with free-form text keywords, known as tags. The emergent data structures are complex networks that represent an externalization of semantic structures (networks of concepts [22]) grounded in cognition and typically hard to access.
In addition, these networks are collectively built by the uncoordinated activity of thousands to millions of users, entangling semantics and user behaviour into the so-called technosocial systems. In [5] it has been shown that the process of social annotation can be seen as a collective exploration of a semantic space, modeled as a graph, through a series of random walks that should represent sequences of word associations in a hypothetical conceptual space. This simple approach reproduces several aspects of social annotation, among which the peculiar growth of the size of the vocabulary used by the community [11] and its complex network structure [6].

In [5] the semantic space was modeled as a graph. Though very reasonable, this hypothesis has never been tested in a quantitative way. Since the elementary cognitive processes underlying social annotation are word associations, it is quite natural to investigate the structure of our conceptual spaces by looking at experiments where word associations have been measured in a quantitative way. This is precisely the point of view we take in this paper: we perform a thorough analysis of the two most important word association databases, obtained by collecting the responses (target words) given by humans to specific input words (cue words). We consider in particular the Human Brain Cloud database and the University of South Florida Free Association Norms database [19]. Human Brain Cloud (HBC) represents the largest available word association database. It was obtained through the implementation of a massively multiplayer online game.
As HBC was not conceived with a specific scientific purpose in mind and may thus suffer from uncontrolled biases, we also consider the smaller, though scientifically more controlled, database known as the University of South Florida Free Association Norms [19]. We analyze the graphs built from both databases, though we focus more specifically on HBC. We derive in particular an expression describing the growth of the HBC graph and we highlight the existence of a typical scale for the word association process. Next we present an analysis aimed at grounding the observations made in the word association graphs by a direct comparison with WordNet (WN) [15, 17], the largest lexical database of words and semantic relations. The aim here is to associate a semantic value to each node of the graph as well as to label the word associations in terms of semantically charged cognitive paths on the WordNet graph. Taken together, these results help shed light on the cognitive processes underlying free word associations.

2. Free Word Associations: data and networks

Word associations have been experimentally studied in detail, especially by linguists and cognitive scientists [7, 19]. An interesting point of view, not yet fully explored, is to look at the ensemble of words and associations as a complex network, where nodes are words and links are associations, and to analyze it as such [23, 14]. Typical word association experiments involve a relatively limited number of subjects (100 - 1,000), in controlled conditions. A word (called cue word) is presented to the recruited subjects, who are asked to write the first related word that comes to their minds (called target word).
Due to high costs in terms of time and money, the number of cue-target associations gathered in these classical experiments has been relatively limited, at most of the order of 100,000. The experimental data we analyze were obtained from the “massively multiplayer word association game” Human Brain Cloud (HBC)^a. HBC was designed as a web-based game in English that simply proposed a cue word to the player, asking for a target word (e.g., volcano-lava, house-roof, dog-cat, etc.). The cue was taken from an internal self-consistent dictionary, constructed by gathering the answered target words at the end of game sessions. Unlike in usual experiments, no control is performed on the number of players, the number of cue-target associations given by each player, etc. On the other hand, the obtained dataset is considerably larger than that of previous experiments, and consists of approximately 600,000 words and 7,000,000 associations, gathered within a period of one year (while pre-existing experiments involved teams of specialists for much longer periods of time). As each player could enter any word or set of words, the data contain a certain volume of inconsistent words, which have to be discarded. After a suitable filtering procedure, which we describe in Appendix A, we obtain a strongly connected directed weighted graph with almost 90,000 words and 6,000,000 associations, i.e., a dataset still considerably larger than those of previous experiments.

^a Dataset courtesy of Kyle Gabler [9].

Fig. 1. Graphical representation of the word association network obtained from the HBC dataset. Labeled nodes represent the first 5,000 words entered in the system while arcs represent word associations. Label colors distinguish different parts of speech (nouns, verbs, etc.).
Arc width encodes the weight of the corresponding association, i.e., how many people ever made that association.

In order to detect possible systematic biases in HBC, we also compare some properties of the HBC dataset with the dataset of the most important word association experiment, the Nelson et al. “University of South Florida Free Association Norms” database (SF) [19]. SF is the outcome of a great effort that started back in 1973 and lasted almost thirty years. It consists of 5,000 words and 700,000 associations.

Each dataset yields a graph whose nodes are the words and whose edges correspond to the associations between words made by the players/subjects. Each edge is moreover weighted by the number of times that the corresponding association has been made. A snapshot of a part of the HBC graph is shown in Fig. 1. We compare in Fig. 2 the statistical properties of the two networks. It turns out that almost all (99%) of the words used as cues in SF were present in HBC and that 72% of the SF associations were present in the HBC dataset.

To deepen our analysis, we consider for each word $w$ the set of associations in which $w$ was proposed as a cue word to players, and define the out-strength $s_{out}(w)$ of the corresponding node as the total number of associations in which $w$ was the cue, and the out-degree $k_{out}(w)$ as the number of distinct target words answered in response to $w$. Analogously, we define the in-strength $s_{in}(w)$ and the in-degree $k_{in}(w)$ by referring to the associations in which $w$ was answered as target: $s_{in}(w)$ is the total number of times that $w$ appears as a target, and $k_{in}(w)$ is the number of distinct other words which yielded $w$ as an answer.

[Figure 2: $P(k)$ versus $k/\langle k \rangle$ in log-log scale; series: HBC in-degree, HBC out-degree, SF in-degree, SF out-degree.]

Fig. 2.
The log-binned degree distributions for the word association networks of HBC and SF, in log-log scale. Since we are comparing datasets of different sizes, we divided the degree by the average degree value. In red, the in-degree distributions for HBC (filled circles) and for SF (empty circles). In blue, the out-degree distributions for HBC (filled squares) and for SF (empty squares). Both in-degree distributions show a power-law form (a straight line in log-log). Both out-degree distributions have a scale-rich shape. The difference in shape at lower degrees is a consequence of the filtering procedure described in Appendix A.

Figure 2 shows the distribution of in- and out-degrees in both HBC and SF, as a function of $k/\langle k \rangle$ (where $\langle k \rangle$ is the average of the distribution). The distribution of out-degrees is narrow, which means that the number of distinct answers given by distinct persons to a given cue is rather limited, as could be expected. On the other hand, the distribution of in-degrees is broad: the number of distinct cues from which a given target can be obtained ranges from 1 to $10\langle k \rangle$, with $\langle k \rangle^{HBC}_{in} = \langle k \rangle^{HBC}_{out} = 34$, $\langle k \rangle^{SF}_{in} = 36.5$ and $\langle k \rangle^{SF}_{out} = 6.15$. [Authors' note in the manuscript: the in and out averages have to be equal, except if the nodes with $k = 0$ are not counted.] The degree distributions of both HBC and SF exhibit very similar shapes, except for a difference caused by the filtering procedure (described in Appendix A) for small in-degrees in HBC. It is worth noticing that even though the systems have different sizes, the out-degree distributions show a peak at comparable values of $k/\langle k \rangle$, a clear indication of an intrinsic similarity of this specific kind of network, despite their different origin.
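The definitions of degree and strength used above can be sketched as follows. This is a minimal illustration, assuming the associations are available as (cue, target, count) triples; the function name and the toy data are hypothetical and are not taken from the HBC or SF datasets.

```python
from collections import defaultdict

def degree_and_strength(associations):
    """Compute in/out degree and strength from (cue, target, count) triples.

    Degree counts distinct neighbours; strength sums association counts.
    """
    # Merge repeated (cue, target) pairs into a single weighted edge.
    weights = defaultdict(int)
    for cue, target, count in associations:
        weights[(cue, target)] += count

    k_in, k_out = defaultdict(int), defaultdict(int)
    s_in, s_out = defaultdict(int), defaultdict(int)
    for (cue, target), w in weights.items():
        k_out[cue] += 1      # one more distinct target for this cue
        k_in[target] += 1    # one more distinct cue for this target
        s_out[cue] += w      # total associations with this word as cue
        s_in[target] += w    # total associations with this word as target
    return k_in, k_out, s_in, s_out

# Toy data (illustrative only).
toy = [
    ("dog", "cat", 3),
    ("dog", "bone", 1),
    ("cat", "dog", 2),
    ("house", "roof", 4),
    ("mouse", "cat", 1),
]
k_in, k_out, s_in, s_out = degree_and_strength(toy)
print(k_out["dog"], s_out["dog"], k_in["cat"], s_in["cat"])
```

Summed over all nodes, the in-degrees and the out-degrees both equal the number of edges, so the two averages can only differ when they are taken over different sets of nodes, for instance when nodes with zero degree are excluded.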
Note that the difference between the in- and out-degree distributions is not a surprise: if one chooses at random targets taken from a list of targets obeying a Zipf’s distribution, the number of distinct targets obtained from a given cue will turn out to have a narrow distribution.

Figure 3 reports the strength distributions for HBC and SF, as a function of $s/\langle s \rangle$, with $\langle s \rangle^{HBC}_{in} = \langle s \rangle^{HBC}_{out} = 68.7$, $\langle s \rangle^{SF}_{in} = 144$ and $\langle s \rangle^{SF}_{out} = 24.3$. [Authors' note in the manuscript: once again, the in and out averages have to be equal.] The out-strength distributions are completely determined by the way in which the cue words were chosen. Both in-strength distributions show a power-law form (a straight line in log-log). The different shape at lower degrees is a consequence of the filtering procedure described in Appendix A.

In order to quantify more precisely the similarity between SF and HBC, we computed the cosine similarity between the neighborhoods of each word. Given a word $w_i$ which belongs to both SF and HBC, we are interested in how similar the associations having $w_i$ as a cue are in our two datasets. We define $l^{SF}_{ij}$ (resp. $l^{HBC}_{ij}$) as the number of times that the association between $w_i$ and $w_j$ has been made in the SF (resp. HBC) dataset. The cosine similarity between the associations made with $w_i$ in the two datasets is then defined as

$$CS_i = \frac{\sum_j l^{SF}_{ij} \, l^{HBC}_{ij}}{\sqrt{\sum_j \left(l^{SF}_{ij}\right)^2} \, \sqrt{\sum_j \left(l^{HBC}_{ij}\right)^2}} \qquad (1)$$

where the sums run over all common words between the two datasets. The cosine similarity ranges between 0 (no common word is associated to $w_i$ in SF and HBC) and 1, reached when $l^{SF}_{ij}$ and $l^{HBC}_{ij}$ are equal (or, more generally, proportional) for all $j$. The average cosine similarity turns out to be 0.851^b. This very large value demonstrates a very strong correlation between the data obtained in the SF controlled experiment and in the web-based HBC game. This is important evidence for the reliability of the HBC dataset.

^b If we do not limit the sums over $j$ in Eq.
(1) to the words that appear in both datasets, we obtain a much lower average cosine similarity of 0.215: this lower value is simply due to the fact that HBC is much larger than SF, so that many words can contribute to the denominator term $\sqrt{\sum_j (l^{HBC}_{ij})^2}$ but not to the numerator.

[Figure 3: $P(s)$ versus $s/\langle s \rangle$ in log-log scale; series: HBC in-strength, HBC out-strength, SF in-strength, SF out-strength.]

Fig. 3. The log-binned strength distributions for the word association networks of HBC and SF, in log-log scale. Since we are comparing datasets of different sizes, we divided the strength by the average strength value. In red, the in-strength distributions for HBC (filled circles) and for SF (empty circles). In blue, the out-strength distributions for HBC (filled squares) and for SF (empty squares). Both in-strength distributions show a power-law form (a straight line in log-log). The different shape at lower degrees is a consequence of the filtering procedure described in Appendix A. The out-strength distributions have a peaked but different shape, because of the different ways in which cue words were chosen.

3. Human Brain Cloud in depth

Having gained enough confidence in the reliability of the HBC dataset, let us now deepen our analysis by focusing only on HBC, the largest available word association dataset. The directed graph of HBC is composed of 88,747 nodes and 3,013,125 edges in total, corresponding to 6,097,806 registered associations. As we have seen in Fig. 2, while the in-degree distribution follows a power law, the out-degree distribution is peaked.
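The cosine similarity of Eq. (1), used in the previous section to validate HBC against SF, can be sketched as below. This is only an illustration under stated assumptions: the association counts of one cue are taken to be dictionaries mapping target words to counts, and `common_vocab` stands for the set of words present in both datasets. The function name, the argument names and the toy counts are hypothetical.

```python
import math

def cosine_similarity(l_sf, l_hbc, common_vocab=None):
    """Cosine similarity between the target counts of one cue in two datasets.

    l_sf, l_hbc: dicts mapping target word -> number of associations.
    common_vocab: if given, the sums are restricted to these words.
    """
    if common_vocab is not None:
        l_sf = {w: c for w, c in l_sf.items() if w in common_vocab}
        l_hbc = {w: c for w, c in l_hbc.items() if w in common_vocab}
    numerator = sum(c * l_hbc.get(w, 0) for w, c in l_sf.items())
    norm_sf = math.sqrt(sum(c * c for c in l_sf.values()))
    norm_hbc = math.sqrt(sum(c * c for c in l_hbc.values()))
    if norm_sf == 0 or norm_hbc == 0:
        return 0.0
    return numerator / (norm_sf * norm_hbc)

# Toy counts for a single cue (illustrative only).
sf_counts = {"cat": 6, "bone": 2}
hbc_counts = {"cat": 30, "bone": 10, "lolcat": 5}  # "lolcat" is absent from SF
vocab = {"cat", "bone", "dog"}

restricted = cosine_similarity(sf_counts, hbc_counts, vocab)
unrestricted = cosine_similarity(sf_counts, hbc_counts)
print(restricted, unrestricted)
```

In the toy data the HBC counts restricted to the common vocabulary are proportional to the SF counts, so the restricted similarity is 1 up to rounding. Without the restriction, the extra HBC-only target enlarges the denominator and lowers the value, which is the effect described in the footnote to Eq. (1).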
We measure a power-law in-degree distribution of HBC with an exponent $\beta \simeq 1.76$, smaller than 2, which implies that the average value of the in-degree $\langle k_{in} \rangle$ is not well defined and diverges with the size of the system: the number of cues that can yield a given target word as outcome has no typical value. On the contrary, the out-degree shows a Gamma-like distribution peaked around a typical value $k_{out} \simeq 28$, corresponding therefore to a well-defined characteristic number of distinct words obtained as targets for each word inserted in the game as a cue. The distribution drops very rapidly at large $k_{out}$ with a sharp cut-off around 100, indicating that the number of distinct words that humans spontaneously associate to a given cue is quite restricted.

For a better understanding of the origin of this $k_{out}$ scale, we can study the average growth of $k_{out}$ as a function of $s_{out}$, i.e., the average number of different target words obtained from a given cue word as a function of the number of times the given cue was extracted for a game (see Fig. 4).

[Figure 4: $\langle k_{out} \rangle$ versus $s_{out}$ with the best fit; inset: the same data in log-log scale with the power law $s_{out}^{0.87}$.]

Fig. 4. Average $\langle k_{out} \rangle$ as a function of $s_{out}$ (red), obtained by averaging the $k_{out}$ values of all those cue words with the same $s_{out}$ value. The fitted curve obtained from Eq. (2) is shown in black (parameters: $V \simeq 74$ and $B \simeq 0.92$). Inset: same data in log-log scale together with a fit to a pure sub-linear power law, to highlight the deviation from such a law at large $s_{out}$.

The inset of Fig. 4 shows that $\langle k_{out} \rangle$ can be approximately described by a sub-linear power law with an exponent of about 0.87, but a deviation from this law is observed at large $s_{out}$. The sub-linear power-law behavior is well known to appear in the dictionary growth of texts and is known as Heaps’ law in this framework [11].
The Heaps’ law is symptomatic of an underlying generalized Zipf’s law [24], which in our case should appear in the marginal distribution of the target word frequencies, given a fixed cue word. Figure 5 shows the averaged frequency-rank distributions for different values of $s_{out}$, confirming the presence of a Zipf’s law, which becomes better defined as $s_{out}$ grows.

[Figure 5: averaged frequency-rank distributions $P(r)$ versus rank $r$ in log-log scale, for out-strength values ranging from $s = 2$ to $s = 80$, with a reference line $x^{-1}$.]

Fig. 5. For each cue word we calculate the frequency-rank (FR) plot of its target words. Each curve in the picture corresponds to an averaged FR plot obtained by averaging the FR curves corresponding to cue words with the same out-strength value. When $s_{out}$ is sufficiently large (i.e., we have sufficiently large statistics) we obtain a power law with an exponent close to (minus) unity, followed by a flat tail.

To estimate the functional shape of $k_{out}(s_{out})$, we investigate in Appendix B the impact of finite-size effects combined with the above-mentioned Zipf’s law. We define $B$ as the exponent of the frequency-rank distribution of the targets of a given cue and assume that there exists a maximum number, $V$, of different target words that can be associated to a cue word. In this way we are implicitly assuming that $k_{out}$ will converge asymptotically to $V$ (possibly with a residual logarithmic growth that we neglect here). With these hypotheses in mind we derive an expression (see Appendix B) for the behaviour of $k_{out}$ as a function of $s_{out}$ that reads:

$$k_{out}(s_{out}) = V \, \frac{B \, {s'_{out}}^{1/B} - s'_{out}}{B - 1} \qquad (2)$$

where:

$$s'_{out} = \frac{s_{out}}{s^{lim}_{out}}, \qquad s^{lim}_{out} = \frac{V^B - V}{B - 1}.$$

We also defined the auxiliary parameter $s^{lim}_{out}$: it is the value of $s_{out}$ for which $k_{out} = V$.
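As a small numerical sketch of Eq. (2), the functions below evaluate $s^{lim}_{out}$ and $k_{out}(s_{out})$ for the parameter values quoted in the caption of Fig. 4 ($V \simeq 74$, $B \simeq 0.92$). The function names are hypothetical, and the code assumes $B \neq 1$ and $0 < s_{out} \le s^{lim}_{out}$, the range on which the expression is meant to be used since $k_{out} = V$ at $s_{out} = s^{lim}_{out}$.

```python
def s_lim(V, B):
    """Value of s_out at which Eq. (2) gives k_out = V (requires B != 1)."""
    return (V ** B - V) / (B - 1)

def k_out(s_out, V, B):
    """Average number of distinct targets after s_out associations, Eq. (2)."""
    s_prime = s_out / s_lim(V, B)  # rescaled out-strength s'_out
    return V * (B * s_prime ** (1.0 / B) - s_prime) / (B - 1)

# Parameter values quoted for the best fit in Fig. 4.
V_FIT, B_FIT = 74, 0.92

limit = s_lim(V_FIT, B_FIT)
for s in (10, 50, 100, 200):
    print(s, k_out(s, V_FIT, B_FIT))
print(limit, k_out(limit, V_FIT, B_FIT))
```

By construction the last line prints $V$ as the second value, which is a quick consistency check between Eq. (2) and the definition of $s^{lim}_{out}$.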
We can use Eq. (2) to fit the experimental data, in order to check whether our finite-size analysis holds in this case and to deduce an estimate for $B$ and $V$. The best fit is shown in Fig. 4, leading to the values $B \simeq 0.92$ and $V \simeq 74$. It is worth noting that, while Eq. (2) is significantly different from a typical Heaps’ law, one recovers the pure Heaps’ behaviour for $V \gg s_{out}$. In this limit, for small $s_{out}$ and for $B > 1$ the behaviour of $k_{out}$ can be approximated by a power law with exponent $1/B$. If instead $B \le 1$, the dominant term of Eq. (2) will be the linear one (note that the growth cannot be more than linear, since $k_{out} \le s_{out}$). It is important to remark that we cannot exclude that the actual formula fitting the data of Fig. 4 contains logarithmic corrections, implying that $k_{out}$ would keep growing asymptotically with $s_{out}$, though with a logarithmic or sub-logarithmic law. Here we are neglecting this possibility.

The value of $V$ obtained is somewhat surprising, since it implies that the asymptotic number of different targets obtained for a given cue is several orders of magnitude smaller than the total number of words in the system (about 90,000). Hence, we find that on average there is a limited number of target words for a given cue, reflecting the existence of a limited semantic context associated by humans to each word. This is in contrast with the fact that the number of words (cues) that yield a given target does not show any particular scale: the semantic context is thus not a “symmetric” concept in terms of free word associations.

Other measures can help in characterizing in a more detailed way the topology of the graph. A measure of the distances between the nodes of the graph, i.e., the shortest path between nodes, indicates that the filtered strongly connected component of the HBC network satisfies the “small world” property with an average node distance of approximately 4.
This means that with paths of around 5 steps we can reach almost every node of the graph, so that we can easily and quickly explore it. This is not a surprise, as most complex networks have been shown to share this property [21, 3, 1].

We also analyzed the mixing patterns [20] of the HBC graph, in order to measure the tendency of nodes with similar degree to preferentially connect to each other. The assortativity coefficient $r$ is defined as the Pearson’s correlation coefficient calculated between the out-degrees of the nodes linked by an edge. If, for each edge, $k^{cue}_{out}$ is the out-degree of the cue and $k^{target}_{out}$ is the out-degree of the target, we have:

$$r = \frac{\langle k^{cue}_{out} \cdot k^{target}_{out} \rangle - \langle k_{out} \rangle^2}{\langle k^2_{out} \rangle - \langle k_{out} \rangle^2} \qquad (3)$$

where the averages are calculated on the ensemble of edges. If we consider the number of times a given cue has been associated with a given target as the weight of that edge, we can weight the averages and measure the weighted assortativity coefficient $r_w = 0.1$, which points to a slight assortativity.

These results give us an overview of the properties of the HBC network. Using the HBC word association network as a proxy for the way in which our mind stores and organizes words and their related meanings, we observe that it has indeed the properties we should expect from such a network: every word defines a limited context; we can explore the network in a fast and efficient way, for example to recover meanings; and the network remains connected even if we forget some words or do not know them. While this first analysis has concerned the network in itself, considering nodes and edges as abstract entities, in the next section we shall deal with their semantic content.

4. Introducing semantics

The nodes of the HBC network are words, and as such, they have a semantic value.
Let us therefore aim at measuring quantitatively what kind of semantic relations are used when two words are associated. In order to analyze these semantic connections, a database of semantic relations between words is needed. We choose to use WordNet (WN) [15, 17], a large lexical database developed at Princeton University. In WordNet, words are related to each other according to their mutual semantic relations, a list of which is given in Appendix C. WordNet allows us to build another directed graph in which the nodes (143,963) are again words but the edges (1,345,801) now represent the semantic relations between them. In order to find which kind of relation corresponds to a given free association between words (i.e., a link in the HBC graph), we can examine the shortest paths in the WN graph between every pair of words linked by an association in the HBC graph, as shown in Fig. 6.

Fig. 6. The mapping of HBC onto WN: given a pair of words linked by an association in the HBC graph (the blue arrow), we look for the shortest path (black arrows and nodes) between the same words in the WN graph (gray graph in the background). In this example an HBC association maps into a 3-step WN path.

In this mapping procedure, we have to take into account that the words of the WN database are actually lemmas, i.e., they are reported in their dictionary form, while HBC words are subject to any kind of morphological derivation (third person, plurals, etc.). For this reason, in performing the mapping we assign to two words of HBC the path existing between their respective lemmas. For example, since the WN connection between “buy” and “purchase” is of the synonymy type, we also consider the pair “bought” and “purchased” as synonyms. It may moreover happen that two words associated in HBC correspond to multiple shortest paths with the same length in WN. In such a case, we consider all paths as equiprobable by assigning to each possibility a normalized weight. For example, “oak” and “pine” are linked by an association in HBC. In the WN graph, to go from “oak” to “pine” we can go either through “tree” or through “forest”. The two alternative semantic paths would be: oak --hyperonymy^c--> tree --hyponymy^d--> pine, or oak --holonymy^e--> forest --meronymy^f--> pine; we consider both by assigning to each of them a weight 0.5.

^c Hyperonymy is the link between a word with a particular meaning and another word with a more general meaning which includes the first one; e.g., “red” is the hyperonym of “scarlet”.
^d Hyponymy is the inverse relation of hyperonymy.
^e Holonymy is the link between a word denoting a part of a whole and the word denoting the whole itself; e.g., “hand” is the holonym of “finger”.
^f Meronymy is the inverse relation of holonymy.
^g Synonymy is the link between two words with the same meaning; e.g., “student” and “pupil”.

We first analyze the 74,903 HBC associations that map into one-step paths on WN. They correspond to the directed edges which are present in both the WN and HBC graphs. Figure 7 shows the normalized distribution of their semantic relations. The distribution shows that synonymy^g (32% of the common links between HBC and WN), hyperonymy (26%) and hyponymy (18%) are the most used semantic relations for an association. In Fig. 7 we also show the distribution of the semantic relations in the whole WN graph for comparison. We notice that, for some relations, there is a substantial difference between their occurrence in HBC and in WN. For example, the normalized occurrence of the causal (CAU) nexus is much larger in the mapped HBC (1.6%) than in the WN graph (0.05%). This means that if a given cue word has a causal link in WN, then players will be more inclined to follow it while making the association instead of following other kinds of semantic relations.

[Figure 7: normalized occurrence (logarithmic scale, from 0.01% to 100%) of each semantic relation; series: paths with l = 1 and WN; horizontal axis: semantic relationships.]

Fig. 7. Red bars represent (in logarithmic scale) the normalized distribution of the semantic relations characterizing those HBC associations that map into a one-step path on the WN network. Blue bars represent the distribution of semantic relations in the whole WN graph.

It is then clear that we need to quantify the occurrences of semantic relations in HBC with respect to our reference system, WN, by analyzing the actual possibility for players to make an association in HBC following a certain semantic relation of WN. If we look at all associations starting from a cue word $w$, we could measure the ratio of associations following the synonymy relation or the hyperonymy relation. In general, we could measure all the ratios for any kind of semantic relation. If we have enough statistics, we may assume that these ratios are the probabilities of making these associations starting from $w$. Let us consider the example case in which $w$ has $k_{out} = 5$ and $s_{out} = 10$. Consider also that 4 links are synonymy links and the other one is a causal link. Finally, consider that, of the 10 associations, 4 follow the causal link and 6 follow one of the synonymy links. Even though there are overall more synonymy associations, it is clear that the causal link is used more than we could expect by just looking at the links of $w$.
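The mapping step described earlier in this section (all shortest WN paths between an associated pair, each with the same normalized weight) can be sketched with a breadth-first search. This is a minimal sketch, not the procedure actually used for the paper: the graph is assumed to be a dictionary from a word to a list of (relation, neighbour) pairs, the function name is hypothetical, and the tiny graph below only encodes the oak/pine example from the text rather than real WordNet data.

```python
from collections import deque

def all_shortest_paths(graph, source, target):
    """Return every shortest path from source to target as a list of relation labels.

    graph: dict mapping a word to a list of (relation, neighbour) pairs.
    """
    dist = {source: 0}
    parents = {source: []}  # node -> list of (previous node, relation)
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for relation, nxt in graph.get(node, []):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                parents[nxt] = [(node, relation)]
                queue.append(nxt)
            elif dist[nxt] == dist[node] + 1:
                # Another shortest way of reaching nxt.
                parents[nxt].append((node, relation))
    if target not in dist:
        return []

    def unwind(node):
        if node == source:
            return [[]]
        return [path + [relation]
                for prev, relation in parents[node]
                for path in unwind(prev)]

    return unwind(target)

# Toy graph encoding only the oak/pine example (illustrative, not real WN data).
toy_wn = {
    "oak": [("hyperonymy", "tree"), ("holonymy", "forest")],
    "tree": [("hyponymy", "pine"), ("hyponymy", "oak")],
    "forest": [("meronymy", "pine"), ("meronymy", "oak")],
}
paths = all_shortest_paths(toy_wn, "oak", "pine")
# Equiprobable paths: each one gets the same normalized weight.
weights = {tuple(p): 1.0 / len(paths) for p in paths}
print(weights)
```

On the toy graph this gives the two alternative paths of the example, each with weight 0.5. Enumerating all shortest paths can become expensive when many of them exist, so a real analysis on the full graph might prefer to count them instead of listing them.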
This means that human beings construct associations with probabilities that could strongly deviate from what would be expected from the pure statistical structure of WN. This is a very interesting feature to quantify further. Sticking to the example of the causal link, it is evident that in order to quantify its actual relevance, we should not estimate the probability of making a causal link association, but rather the probability of making a causal link association conditioned on the existence of a certain number of available causal links for $w$. In this case it would be one causal link out of the 5 total links, i.e., $P(\mathrm{cau} \mid 1\ \mathrm{cau};\ 4\ \mathrm{syn})$. In general, the fraction of associations of a given type represents the probability of choosing a certain semantic link given the underlying WN structure of the possible available links.

In order to quantify the relative importance of a given semantic association, we normalize its frequency by its frequency of occurrence in WN. To this end, we define $\rho_i$ as the percentage of relations of kind $i$ in one-step paths and $p_i$ as the percentage of relations of kind $i$ in WN, and we define a normalized effective probability as:

$$\pi_i = \frac{\rho_i / p_i}{\sum_j \rho_j / p_j} \qquad (4)$$

$\pi_i$ thus represents the probability of performing the association $i$ as if at each step all the semantic associations were equally possible. The results of this computation are presented in Fig. 8, where we also report the size of the set of each type of semantic association, to give a qualitative and quantitative idea of the reliability of these results.

The analysis reported in Fig. 8 demonstrates how the word association process in HBC features important deviations with respect to WN. If the semantic relation occurrences in HBC were exactly the same as in WN, all symbols would lie on the dashed line. The ones which lie below occur less frequently, and vice versa.
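The normalization of Eq. (4) amounts to a few lines of code. The sketch below is only an illustration: the function name is hypothetical, and the input fractions are the toy numbers of the single-cue example above (4 synonymy links and 1 causal link, with 6 and 4 associations respectively), not the global percentages measured on HBC and WN.

```python
def effective_probabilities(rho, p):
    """Normalized effective probabilities pi_i of Eq. (4).

    rho: dict relation -> fraction of associations following that relation.
    p:   dict relation -> fraction of available links of that relation.
    """
    # Ratio of observed usage to availability, for relations that are available.
    ratios = {rel: rho[rel] / p[rel] for rel in rho if p.get(rel, 0) > 0}
    total = sum(ratios.values())
    return {rel: ratio / total for rel, ratio in ratios.items()}

# Toy example from the text: 5 links (4 synonymy, 1 causal),
# 10 associations (6 along synonymy links, 4 along the causal link).
p_links = {"syn": 4 / 5, "cau": 1 / 5}
rho_assoc = {"syn": 6 / 10, "cau": 4 / 10}

pi = effective_probabilities(rho_assoc, p_links)
print(pi)
```

Although synonymy collects more associations overall, the causal relation ends up with the larger effective probability once availability is factored out. If $\rho_i = p_i$ for every relation, all $\pi_i$ are equal to one over the number of relation types.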
Deviations are in both directions. For example, hyperonymy is slightly more frequent than expected while hyponymy is less frequent. This means that the target word tends to be a more general term rather than a more specific one. The already mentioned causal link is largely overrepresented in HBC. This means that, if the cue word has a causal link, players tend to prefer it when making an association. On the other hand, there are associations poorly represented in HBC, like “is member” (ISMEM) or “has member” (HASMEM). We also see that many relations that are not purely semantic (such as “see also” or “keyword”) are chosen frequently.

[Figure 8: effective probability $\pi_i$ for each semantic relationship. Horizontal axis: Semantic Relationships. Dashed reference line labelled 1/Norm. Color scale: logarithm of the population size. Series: L=1, WN.]
Fig. 8. Normalized “effective” probabilities of the different types of WN semantic relations with error bars. The dashed line represents the probabilities of the associations in WN. So, if the semantic relation occurrences in the mapped HBC were exactly the same as in WN, all symbols would lie on the dashed line. To give an idea of the significance of the values, for each value we draw two triangles whose colors indicate the logarithm of the number of relations of the given type and therefore indicate the accuracy of the measure (blue and violet values correspond to smaller sample sizes than yellow and red ones). Upward triangles refer to the mapped HBC with unit length paths, while downward triangles refer to the whole WN.

We also considered the HBC associations that map into a path of two or three steps in WN, as reported in Fig. 9. The first positions in the distribution correspond again to combinations of hyperonymy, hyponymy and synonymy, which were already the most frequent relations in the previously discussed case of one step paths. By normalizing as we did before for the one step path case, and retaining the most significant results (i.e., those corresponding to the largest statistics, such as the yellow and the red ones in Fig. 8), we obtain the situation shown in Fig. 10.

These probabilities seem to show the emergence of a particular kind of exploration path of the WN graph. Many of the WN relations are hierarchical. For example, hyperonymy binds a specific term to a more general one (e.g., tree is the hyperonym of oak). Other examples are holonymy (the relation between a whole and a part: tree is the holonym of bark) or geographical collocation. In Figs. 9 and 10 we can see that the sequences exploring these hierarchical structures show a recurrent pattern, made of one step towards a more general term followed by one step towards a more specific term. We call this pattern the “brother” pattern because, starting from a given node, it takes us to another child (specific term) of the same parent (general term). In order to study the importance of this pattern, we measure its occurrence together with that of two other patterns: grandparent (two steps towards more general terms) and grandchild (two steps towards more specific terms). For paths of two and three steps, we find:
pattern    brother   grandparent   grandchild
2 steps    31%       8%            5%
3 steps    51%       18%           14%

These results confirm the existence of preferential patterns of exploration of hierarchical structures in the process of word association, in a way that most often maintains the same level of specificity between cue and target. Note that we limit the analysis to paths of up to three steps because the distribution for longer paths is practically equal to the distribution occurring in an uncorrelated artificial system built associating randomly extracted words.

[Figure 9: two bar charts. Horizontal axis: Semantic Relations. Vertical axis: Probability, logarithmic scale. Bar values in rank order, top chart (two steps): 25.9%, 8.0%, 4.9%, 4.8%, 3.8%, 3.6%, 3.5%, 3.1%, 2.7%, 2.6%. Bottom chart (three steps): 9.7%, 8.1%, 4.7%, 3.6%, 3.4%, 3.3%, 2.1%, 1.3%, 1.3%, 1.2%.]
Fig. 9. The top ten occurrence ranking of the semantic relations sequences for those HBC associations that map onto a path of two (top chart) and three (bottom chart) steps in WN.

5. Conclusions

In this paper, we have presented the results of the analysis performed on the Word Association Graph constructed in two different experiments: Human Brain Cloud (HBC) and the South Florida Free Association Norms (SF). Word association graphs are quite interesting because they represent a proxy of the way in which our mind stores and organizes all words and related meanings.
The HBC dataset is substantially larger than any previously available dataset and it has been constructed through a web-based game, while the South Florida Free Association Norms is the outcome of a controlled linguistic experiment. The comparison between the two databases has been an important preliminary step in our analysis, in order to be sure that no major biases were present in the HBC database. After a filtering procedure we ended up with an HBC database whose statistical properties exhibit strong correlations with those of SF, giving us confidence in the generality and reliability of the approach. In this way, for the first time the huge database of the Human Brain Cloud experiment has been brought to the attention of the scientific community as a valuable tool to shed light on the possible mechanisms of word and meaning retrieval processes underlying human language skills.

[Figure 10: bar chart. Horizontal axis: Semantic Relations. Vertical axis: ∆, logarithmic scale. Series: L=2, L=3.]
Fig. 10. The 5 most significant normalized semantic relations sequences for those HBC associations that map onto a path of two and three steps in WN.

Our results have shown that the associations to a given cue word tend to result in a limited set of words, whose size is considerably smaller than that of the system, while the number of words that yield a given target can fluctuate enormously. Hence, the input of a word seems to define a sort of “semantic context” as an output, which can be the subject of further analysis. It is worth pointing out that the existence of these contexts should influence the design of future word association experiments. Equation (2) may be used to understand how large a sample is needed in order to fully explore these semantic contexts, or at least their most significant part. On the other hand, the size of the context which leads to a given target word has unbounded fluctuations.

We found that the HBC network is robust and can be efficiently explored. In the framework of the hypothesis of the existence of a cognitive networked structure revealed by free word association games or experiments, the robustness and navigational efficiency of such a network are thus compatible with what we should expect from our mental “meanings management system”.

We further extended our analysis by grounding the observations made on the HBC graph through a direct comparison with the largest lexical database of words and semantic relations, WordNet (WN) [15, 17]. Through WordNet, we classified the semantic character of word associations collected in HBC. This classification leads to a preliminary understanding of the cognitive processes underlying word association. The most used semantic relations turn out to be synonymy, hyperonymy and hyponymy. By comparison with the overall number of semantic relations present in WordNet, we have shown that other, less common types of relations are in fact important, such as the causal nexus. Moreover, when the association corresponds to a sequence of semantic steps, preferential combinations of semantic relations emerge, with an overall tendency to keep the same level of specificity between the cue word and the target.

Acknowledgments

The authors wish to thank Francesca Tria for very interesting discussions. This research has been partly supported by the EveryAware project funded by the Future and Emerging Technologies program of the European Commission under Grant Agreement Number 265432.
References

[1] Barrat, A., Barthélemy, M., and Vespignani, A., Dynamical processes on complex networks (Cambridge University Press, 2008).
[2] Buchanan, M., The social atom (Bloomsbury, New York, NY, USA, 2007).
[3] Caldarelli, G., Scale-Free Networks (Oxford University Press, 2007).
[4] Castellano, C., Fortunato, S., and Loreto, V., Statistical physics of social dynamics, Rev. Mod. Phys. 81 (2009) 591–646.
[5] Cattuto, C., Barrat, A., Baldassarri, A., Schehr, G., and Loreto, V., Collective dynamics of social annotation, PNAS 106 (2009) 10511–10515.
[6] Cattuto, C., Schmitz, C., Baldassarri, A., Servedio, V. D. P., Loreto, V., Hotho, A., Grahl, M., and Stumme, G., Network properties of folksonomies, AI Communications Journal, Special Issue on “Network Analysis in Natural Sciences and Engineering” (2007).
[7] Church, K. and Hanks, P., Word association norms, mutual information, and lexicography, Computational Linguistics 16 (1990) 22–29.
[8] Dorogovtsev, S. N. and Mendes, J. F. F., Evolution of Networks: From Biological Nets to the Internet and WWW (Oxford University Press, 2003).
[9] Gabler, K., Kyle Gabler’s web page, http://kylegabler.com/.
[10] Golder, S. and Huberman, B. A., The structure of collaborative tagging systems, Journal of Information Science 32 (2006) 198–208.
[11] Heaps, H. S., Information Retrieval: Computational and Theoretical Aspects (Academic Press, Inc., Orlando, FL, USA, 1978).
[12] Helbing, D., Traffic and related self-driven many-particle systems, Rev. Mod. Phys. 73 (2001) 1067–1141.
[13] Mathes, A., Folksonomies – Cooperative Classification and Communication Through Shared Metadata (2004), http://www.adammathes.com/academic/computer-mediated-communication/folksonomies.html.
[14] Mehler, A., Large text networks as an object of corpus linguistic studies, in Corpus Linguistics. An International Handbook of the Science of Language and Society, eds. Lüdeling, A. and Kytö, M. (de Gruyter, Berlin/New York, 2007).
[15] Miller, G. A., WordNet - about us, http://wordnet.princeton.edu.
[16] Miller, G. A., WordNet: A lexical database for English, Commun. ACM 38 (1995) 39–41.
[17] Miller, G. A. and Fellbaum, C., WordNet: An electronic lexical database (MIT Press, Cambridge, MA, 1998).
[18] Nagatani, T., The physics of traffic jams, Rep. Prog. in Phys. 65 (2002) 1331–1386.
[19] Nelson, D., McEvoy, C., and Schreiber, T., The University of South Florida free association, rhyme, and word fragment norms, Behavior Research Methods 36 (2004) 402–407.
[20] Newman, M. E. J., Assortative mixing in networks, Physical Review Letters 89 (2002) 208701.
[21] Pastor-Satorras, R. and Vespignani, A., Evolution and Structure of the Internet: A Statistical Physics Approach (Cambridge University Press, New York, NY, USA, 2004).
[22] Sowa, J. F., Conceptual structures: information processing in mind and machine (Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1984).
[23] Steyvers, M. and Tenenbaum, J. B., The large-scale structure of semantic networks: Statistical analyses and a model of semantic growth, Cognitive Science 29 (2005) 41–78.
[24] Zipf, G. K., Human behavior and the principle of least effort (Hafner, New York, 1965).

Appendix A. Filtering Human Brain Cloud data

Human Brain Cloud was conceived as a game, and not designed for scientific purposes. Nevertheless, the system included a “quality control” for the word dataset based on two factors: word popularity, i.e., a word used often was assumed to be a “valid word”, and reports of users about misspelled or offensive words. These two factors contributed to a sort of quality score. A word was used as cue by the system only if its quality score was above a given threshold. We used the same strategy to filter words in the first part of our study.
In the second part, the matching procedure with WordNet itself provided a filtering mechanism, as invalid words do not have a match in WordNet.

Appendix B. A limited Heaps’ law

To obtain Eq. (2), we start from the assumption that target words to a given cue have a power law frequency rank. This assumption is justified by the measured average frequency rank displayed in Fig. 5. Let us now describe the problem in an abstract way. We consider a set of V events, each with a given probability. These probabilities, arranged in decreasing order, form a power law of the form:

\[ f_{th}(r) = A(V, B) \cdot r^{-B} \tag{B.1} \]

where r is the rank, A(V, B) is the normalization coefficient and B is a parameter of the distribution. We perform s extractions of such events, and we want to compute the number k of distinct events obtained. After s extractions, the minimum finite value for the frequency of an event is 1/s. The empirically obtained frequency rank will therefore give a curve such as the ones of Fig. 5, i.e., a power-law decay followed by a plateau at 1/s. Only the probabilities larger than 1/s can thus be correctly estimated. The number of distinct events with probability larger than 1/s, $n_{>1/s}$, is given by the rank corresponding to the probability 1/s, i.e., from Eq. (B.1):

\[ f(r) = A(V, B) \cdot r^{-B} \simeq 1/s \;\Rightarrow\; n_{>1/s} \simeq r(1/s) = (s \cdot A)^{1/B} \tag{B.2} \]

We also have to estimate the number of distinct events which have been extracted although their probability is lower than 1/s, as the cumulative probability

\[ P_{<1/s} = \sum_{r > n_{>1/s}}^{V} f(r) \tag{B.3} \]

can be larger than 1/s.
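The role of the low-probability tail can be checked numerically with a minimal sketch. The values of V, B and s below are illustrative assumptions, not parameters from the paper, and the function names are hypothetical. The sketch evaluates $n_{>1/s}$ from Eq. (B.2) and $P_{<1/s}$ from Eq. (B.3), and compares them with the exact expected number of distinct events for s independent extractions, $\sum_r [1 - (1 - f(r))^s]$.

```python
def zipf_probabilities(V, B):
    """Return A and the list f(r) = A * r**(-B) for r = 1..V (Eq. B.1)."""
    weights = [r ** (-B) for r in range(1, V + 1)]
    A = 1.0 / sum(weights)  # normalization so that the f(r) sum to 1
    return A, [A * w for w in weights]


def head_and_tail(V, B, s):
    """Evaluate n_{>1/s}, P_{<1/s} and the expected number of distinct events."""
    A, f = zipf_probabilities(V, B)
    n_gt = (s * A) ** (1.0 / B)          # Eq. (B.2): rank where f(r) ~ 1/s
    first_tail_rank = int(n_gt) + 1      # smallest integer rank r > n_gt
    p_lt = sum(f[first_tail_rank - 1:])  # Eq. (B.3): cumulative tail probability
    # Exact expectation of the number of distinct events in s independent draws.
    expected_k = sum(1.0 - (1.0 - fr) ** s for fr in f)
    return n_gt, p_lt, expected_k


# Illustrative values (assumptions, not taken from the paper).
V, B, s = 1000, 1.5, 200
n_gt, p_lt, expected_k = head_and_tail(V, B, s)
print(n_gt, p_lt, expected_k, n_gt + s * p_lt)
```

For these values the head count $n_{>1/s}$ alone falls short of the expected number of distinct events, which is why the tail has to be accounted for in the next step of the derivation. Adding $s \cdot P_{<1/s}$ gives an upper estimate, since each tail event contributes at most $s \cdot f(r)$ to the expectation.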
We can reasonably assume that these events are extracted at most once, and estimate their number $n_{<1/s}$ as:

\[ n_{<1/s} \simeq P_{<1/s} \cdot s \tag{B.4} \]

The total number k of distinct events observed after s extractions is then obtained by summing (B.2) and (B.4):

\[ k(s) \simeq n_{<1/s} + n_{>1/s} = P_{<1/s} \cdot s + (s \cdot A)^{1/B} \tag{B.5} \]

After straightforward computations we obtain

\[ k(s) = V \cdot \frac{B \, (s/s_{lim})^{1/B} - s/s_{lim}}{B - 1} \tag{B.6} \]

where $s_{lim} = V^B / A = (V^B - V)/(B - 1)$, which is equivalent to Eq. (2). The treatment presented so far does not include the case B = 1. In this case it is easy to find:

\[ k(s) = V \, \frac{s}{s_{lim}} \left[ 1 - \log\left( \frac{s}{s_{lim}} \right) \right] \tag{B.7} \]

It is worth noting that, if V ≫ s, one recovers the Heaps’ law, which is a particular case of Eq. (2). In this limit, in fact, $s/s_{lim} \ll 1$ and the dominant term in (B.6) is the one with exponent 1/B for B > 1 and the linear one for B < 1. For B > 1 one recovers the right behaviour, a power law with exponent 1/B:

\[ k(s) \sim V \, \frac{B}{B - 1} \left( \frac{s}{s_{lim}} \right)^{1/B} \tag{B.8} \]

while for B < 1 one recovers a linear behaviour:

\[ k(s) \sim \frac{V}{1 - B} \, \frac{s}{s_{lim}} \tag{B.9} \]

Of course, in the limit V → ∞ one is able to observe these behaviours in a very large range of values for s.

Appendix C. WordNet semantic relations abbreviations

Here we report a list of the semantic relations recognized by WordNet with their abbreviations. For further information we refer to the WordNet documentation [15, 17, 16].

Semantic relation             Abbreviation  E.g.
keyword                       KEY           “cold” is the keyword for “icy”
synonym                       SYN           “auto” is a synonym of “car”
antonym                       ANT           “up” is an antonym of “down”
hypernym                      HYPER         “tree” is a hypernym of “oak”
hyponym                       HYPO          “trout” is a hyponym of “fish”
entails                       ENT           “dream” entails “sleep”
similar                       SIM           “mistaken” is similar to “wrong”
is member                     ISMEM         “juror” is member of “jury”
is material                   ISMAT         “iron” is material of “steel”
is part                       ISPAR         “month” is part of a “year”
has member                    HASMEM        “fleet” has “ship” as member
has material                  HASMAT        “air” has “oxygen” as material
has part                      HASPAR        “hour” has “minute” as part
cause to                      CAU           “teach” cause to “learn”
is participle                 PRTC          “forced” is participle of “force”
see also                      SEE           “race” see also “speed”
refers to                     REF           “italian” refers to “Italy”
is attribute                  ATTR          “height” is the attribute of “tall”
verb group                    VERBG         “incinerate” belongs to the verb group of “burn”
derivation                    DER           “sailor” derives from “sail”
belongs to domain (category)  INDOMC        “lithograph” belongs to the “art” domain (category)
belongs to domain (use)       INDOMU        “google” belongs to the “trademark” domain (use)
belongs to domain (region)    INDOMR        “chili” belongs to the “Mexico” domain (region)
is domain (category)          ISDOMC        “biology” is the domain (category) for “cell”
is domain (use)               ISDOMU        “jargon” is the domain (use) for “trash”
is domain (region)            ISDOMR        “Japan” is the domain (region) for “origami”