
De gustibus disputandum (forecasting opinions by knowledge networks)

2004, Physica A: Statistical Mechanics and its Applications

A model for opinion formation and anticipation based on the match between hidden personal "preferences" and product qualities is presented. We assume that products and individuals are represented by means of vectors in an L-dimensional "taste" space. The opinion of an individual on a given product is proportional to the overlap between the corresponding vectors. Assuming that both individual preferences and product qualities are hidden degrees of freedom, and that only the expressed opinion is observable, we use the correlations among individuals' opinions on products to extract information about the hidden quantities. In particular, the method can be used to anticipate the opinion of an individual on a given product, to study the overlaps of preferences of two individuals, and to extract the dimensionality (L) of the hidden taste space.

Physica A 332 (2004) 509–518. doi:10.1016/j.physa.2003.09.065

Franco Bagnoli (a,b,c,d,*), Arturo Berrones (a), Fabio Franci (a,b,d)

(a) Dipartimento di Energetica, Università di Firenze, via Santa Marta 3, I-50139 Firenze, Italy
(b) INFM, Sezione di Firenze, Italy
(c) INFN, Sezione di Firenze, Italy
(d) Centro interdipartimentale per lo Studio delle Dinamiche Complesse, Firenze, Italy
(*) Corresponding author.

Received 17 June 2003; received in revised form 17 July 2003

© 2003 Elsevier B.V. All rights reserved.

PACS: 01.75.+m; 87.23.Ge; 89.65.−s

Keywords: Opinion formation; Social systems; Knowledge networks

1. Introduction

Personal tastes are universally considered very difficult to analyze (an old Latin proverb states "De gustibus non disputandum est", i.e., "There's no accounting for taste"). Nevertheless, there is evidence of certain regularities in personal preferences, which allow people to successfully choose Christmas presents for friends and relatives. Indeed, any market model assumes rational behavior of agents, who choose among the possible options on a quantitative basis. In a famous paper [1], Stigler and Becker argued that all people have fixed tastes except for small variations, and that the different patterns in taste investments (like buying new music records) are computable from the expectations in revenues (i.e., the forecasted enjoyment of future music exposure). Others have argued that this is a purely economic point of view, which ignores the "enormous role of historical and cultural forces, education, and values, as the initial shapers of our preferences" [2]. In any case, in order to test any economic, psychological or moral point of view about tastes, we need a quantitative model of opinion formation and, more importantly, of opinion anticipation based on past experience.

The starting point of our analysis is that the opinion of an agent on a given product is formed by the match between the agent's set of preferences or tastes and the product's qualities. While many commercial studies are based on surveys about the customer's preferences, we assume that both preferences and qualities are hidden degrees of freedom, and that only the expressed opinion is observable. One of the goals of our study is to develop techniques that are able to extract information about the hidden parts from the correlations among the agents' opinions on products.

Let us suppose that there exists a database of agents' opinions on a given set of products. This database can be seen as a sparse matrix, with holes corresponding to missing opinions (say, agents that have never been exposed to a given product).
In geometrical terms, one represents an agent's preferences as a vector in a hypothetical taste space, whose dimension and basis vectors are unknown. A product is represented by a similar vector of qualities. An agent's opinion on a given product is assumed to be proportional to the overlap between preferences and qualities, which can be expressed by an operation analogous to the scalar product between the corresponding vectors. Therefore, products act like a basis, and opinions act as the agent's coordinates on that basis. However, unlike in the usual geometrical problems, we do not know what the basis is, whether it is complete, etc.

As we shall recall in Section 2, Maslov and Zhang (MZ) [3] have shown that it is possible, if we know the basis of one agent's preferences, to reconstruct the vectors of the individual tastes from the knowledge of a sparsely connected network of the overlaps (scalar products) among preferences. We want to extend this result to the more usual case in which basis information is not at our disposal, as discussed in Section 3.

One of the outcomes of our analysis is the possibility of opinion anticipation, i.e., the possibility of exploiting the correlations in the database to forecast the missing opinions. Alternatively, we can obtain information about the overlaps of tastes between two individuals from the knowledge of their expressed opinions. What we consider our main result is the possibility of extracting information about the hidden degrees of freedom, and in particular the dimensionality of the hidden space, from the opinion database. In this way customers' commercial interests can be used as tools of cognitive psychology.

As we shall discuss in Section 4, the sparseness of data and a bias in the database can be included in the model. The results of the comparisons between the theory and numerical simulations over randomly generated data are presented in Section 5.
Finally, in Section 6, we summarize our work and draw some conclusions.

2. The model

We consider a population of $M$ individuals interacting with a set of $N$ products. We assume that each product is characterized by an $L$-dimensional array $a = (a^{(1)}, a^{(2)}, \dots, a^{(L)})$ of features, while each individual has the corresponding list of $L$ personal tastes on the same features, $b = (b^{(1)}, b^{(2)}, \dots, b^{(L)})$. For numerical simulations we have chosen both $a_n^{(l)}$ and $b_m^{(l)}$ in the set $\{-1, 1\}$. The opinion of individual $m$ on product $n$, denoted by $s_{m,n}$, is defined to be proportional to the scalar product between $b_m$ and $a_n$:

\[ s_{m,n} = \alpha(L)\, b_m \cdot a_n , \]

where $\alpha(L)$ is a suitably chosen normalization factor. In general, $\alpha(L)$ should scale as $L^{-1}$ and depend on the ranges of $a$ and $b$. For our choice of hidden parameters, we use $\alpha(L) = 1/L$, so that $s_{m,n}$ lies in the interval $[-1, 1]$.

In order to predict whether person $j$ will like or dislike a certain product $a_n$, assuming that $a_n$ is known, it is sufficient to obtain the tastes of that individual, i.e., the vector $b_j$. The similarity between the tastes of two individuals $i$ and $j$ is defined by the overlap $\Omega_{ij} = b_i \cdot b_j$ between the preferences $b_i$ and $b_j$. One can build a knowledge network among people, using the vectors $b_m$ as nodes and the overlaps $\Omega_{ij}$ as edges.

MZ assume that a fraction $p$ of these overlaps is known. They show that there are two important thresholds for $p$ that determine whether the missing information can be reconstructed. The first one is a percolation threshold, reached when the fraction of edges $p$ is greater than $p_1 = 1/(M-1)$, where $M$ is the number of people. This means that there must be at least one path between two randomly chosen nodes in order to be able to predict the second node starting from the first one. Since the vectors $b_m$ lie in an $L$-dimensional space, and a single link "kills" only one degree of freedom, a reliable prediction needs more than one path connecting two individuals.
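As a concrete illustration of the quantities defined so far in this section, the following is a minimal NumPy sketch, not the authors' code. It assumes the choices stated above (components in {−1, 1} and normalization 1/L). The function names `make_population`, `taste_overlaps` and `percolation_threshold`, the seed, and the sizes used in the example are hypothetical and chosen only for illustration.

```python
import numpy as np

def make_population(L, M, N, seed=0):
    """Draw hidden features and tastes in {-1, 1} and return (A, B, S).

    A has shape (L, N): column n is the feature vector a_n of product n.
    B has shape (L, M): column m is the taste vector b_m of individual m.
    S has shape (M, N): S[m, n] = (1/L) * (b_m . a_n), so S lies in [-1, 1].
    """
    rng = np.random.default_rng(seed)
    A = rng.choice([-1, 1], size=(L, N))
    B = rng.choice([-1, 1], size=(L, M))
    S = (B.T @ A) / L  # normalization factor 1/L, as chosen in the text
    return A, B, S

def taste_overlaps(B):
    """Overlaps b_i . b_j between all pairs of individuals, shape (M, M)."""
    return B.T @ B

def percolation_threshold(M):
    """Fraction of known overlaps p_1 = 1/(M - 1) quoted in the text."""
    return 1.0 / (M - 1)

# Illustrative sizes only (assumed, not taken from the paper).
A, B, S = make_population(L=10, M=50, N=80)
Omega = taste_overlaps(B)
print("opinion matrix shape:", S.shape)
print("opinions within [-1, 1]:", bool(np.abs(S).max() <= 1.0))
print("percolation threshold p_1:", percolation_threshold(50))
```

In this sketch every individual's overlap with themselves equals L, because each component squares to 1; the off-diagonal overlaps are the edge weights of the knowledge network.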
MZ show that there is a "rigidity" threshold $p_2$, of the order of $2L/M$, such that for $p > p_2$ the mutual orientation of the vectors in the network is fixed, and the knowledge of the preferences of just one person is sufficient to reconstruct those of all the other individuals.

3. Extracting information from hidden quantities

In general one does not have access to individuals' preferences, nor does one know the dimensionality $L$ of this space. In order to address this problem, let us define the correlation $C_{i,j}$ between the opinions of agents $i$ and $j$ by

\[ C_{i,j} = \frac{\sum_{n=1}^{N} (s_{i,n} - \bar{s}_i)(s_{j,n} - \bar{s}_j)}{\sqrt{\sum_{n=1}^{N} (s_{i,n} - \bar{s}_i)^2 \, \sum_{n=1}^{N} (s_{j,n} - \bar{s}_j)^2}} , \tag{1} \]

where $\bar{s}_i$ is the average opinion of individual $i$. The elements $C_{i,j}$ can be conveniently stored in an $M \times M$ opinion correlation matrix $C$. We show below that one can compute an accurate opinion anticipation $\tilde{s}_{m,n}$ of a true value $s_{m,n}$ using the formula

\[ \tilde{s}_{m,n} = \frac{k}{M} \sum_{i=1}^{M} C_{m,i}\, s_{i,n} , \tag{2} \]

where $k$ is a factor that in general depends on $L$ and on the statistical properties of the hidden components. However, it will be shown that if the components of $a_n$ and $b_m$ are independent random variables, $k$ is independent of $n$ and $m$, so it can simply be chosen so that $\tilde{s}_{m,n}$ is defined over the same interval as $s_{m,n}$. For instance, if we define

\[ \tilde{s}^{*}_{m,n} = \frac{1}{M} \sum_{i=1}^{M} C_{m,i}\, s_{i,n} , \tag{3} \]

then in order to keep the estimations in the range $[-1, 1]$ we take $k = 1/\tilde{S}^{*}_{\max}$, where $\tilde{S}^{*}_{\max} = \max |\tilde{s}^{*}_{m,n}|$. As we shall illustrate in the following, from this estimation of $k$ we can get information about the dimensionality $L$ of the space of individual preferences.
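Eqs. (1)–(3) can be tried out on synthetic data with the minimal sketch below. It is an illustration under stated assumptions rather than the authors' implementation: the hidden components are drawn uniformly from {−1, 1}, the normalization is 1/L, the opinion matrix is complete, and the sum in Eq. (2) includes the term i = m exactly as the formula is written. NumPy's `np.corrcoef` is used for the Pearson correlation of Eq. (1); the function names `opinion_correlations` and `anticipate`, the seed, and the sizes are hypothetical.

```python
import numpy as np

def opinion_correlations(S):
    """Eq. (1): Pearson correlation between the rows (individuals) of S."""
    return np.corrcoef(S)

def anticipate(S, k=None):
    """Eqs. (2)-(3): anticipated opinions (k / M) * C @ S.

    If k is None it is set to 1 / max|s*|, the rescaling described in the
    text, so that the anticipated opinions fall in [-1, 1].
    Note: the sum over i includes i = m, as Eq. (2) is written.
    """
    M = S.shape[0]
    C = opinion_correlations(S)
    s_star = (C @ S) / M                # Eq. (3)
    if k is None:
        k = 1.0 / np.abs(s_star).max()  # estimate of k from the data
    return k * s_star, k

# Synthetic, unbiased data (sizes are assumptions made for this example).
rng = np.random.default_rng(0)
L, M, N = 8, 1000, 1000
A = rng.choice([-1, 1], size=(L, N))    # product features, one column each
B = rng.choice([-1, 1], size=(L, M))    # individual tastes, one column each
S = (B.T @ A) / L                       # true opinions, shape (M, N)

S_tilde, k_est = anticipate(S)
rms_error = np.sqrt(np.mean((S_tilde - S) ** 2))
agreement = np.corrcoef(S_tilde.ravel(), S.ravel())[0, 1]
print(f"k estimated from the data: {k_est:.2f} (hidden L = {L})")
print(f"RMS error: {rms_error:.3f}, correlation with true opinions: {agreement:.3f}")
```

Calling `anticipate(S, k=L)` instead applies the identification of k with L for unbiased components, which is only possible when L is known, as in a simulation.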
We now justify the proposed formulas for the case in which the components of $a_n$, $b_m$ are independent random variables distributed according to

\[ P(a_n^{(l)}, b_m^{(l)}) = P_{n,l}(a)\, P_{m,l}(b) . \tag{4} \]

Let us introduce the definition

\[ \langle h \rangle = \sum_{m,n,l}^{\infty} h(a_n^{(l)}, b_m^{(l)})\, P_{n,l}(a)\, P_{m,l}(b) , \tag{5} \]

so the operation $\langle h \rangle$ represents the average, computed in the thermodynamic limit, of an arbitrary function $h(a_n^{(l)}, b_m^{(l)})$ over $P(a_n^{(l)}, b_m^{(l)})$. For a set of hidden components distributed according to Eq. (4), the opinions are uncorrelated in the thermodynamic limit. However, the idea is that the system presents fluctuations mainly because $L$ is finite, so correlations between opinions arise and can be used to predict unknown opinions. In order to keep the algebra simple, the discussion will be made for the case in which the variables $a_n^{(l)}$ and $b_m^{(l)}$ have zero mean. At the end a generalization to biased components will be given.

The components can be written in matrix form as

\[ A = \begin{pmatrix} a_{1,1} & \cdots & a_{N,1} \\ \vdots & & \vdots \\ a_{1,L} & \cdots & a_{N,L} \end{pmatrix} , \qquad B = \begin{pmatrix} b_{1,1} & \cdots & b_{M,1} \\ \vdots & & \vdots \\ b_{1,L} & \cdots & b_{M,L} \end{pmatrix} , \tag{6} \]

putting the vectors $a_n$, $b_m$ as columns in the corresponding matrices. The opinion matrix is defined by

\[ S = \alpha(L)\, B^{T} A , \tag{7} \]

where $\alpha(L)$ is the normalization constant. The opinion correlation matrix is essentially equivalent to

\[ C = \frac{S S^{T}}{N \langle s^2 \rangle} , \tag{8} \]

where $\langle s^2 \rangle$ denotes the average of $s^2$ over $P_{n,l}(a) P_{m,l}(b)$. Because of the finite size of the system, there are differences between the normalization factors in the definitions Eqs. (1) and (8). These differences are small for large $N$ and $L$, and we neglect them at this point because they give non-dominant contributions to the errors in the final expressions. An element of the opinion matrix $S$ is expressed by the inner product

\[ s_{m,n} = \alpha(L) \sum_{l=1}^{L} b_{m,l}\, a_{n,l} , \tag{9} \]

so, averaging over the distribution $P_{n,l}(a) P_{m,l}(b)$,

\[ \langle s^2 \rangle = \alpha^2(L)\, L\, \langle a^2 \rangle \langle b^2 \rangle . \tag{10} \]

Using Eq.
(10), the correlation matrix can be written as

\[ C = \frac{B^{T} A A^{T} B}{L N \langle a^2 \rangle \langle b^2 \rangle} . \tag{11} \]

Let us now consider the expression

\[ \frac{1}{M} C S = \frac{\alpha(L)\, B^{T} A A^{T} B B^{T} A}{M L N \langle a^2 \rangle \langle b^2 \rangle} . \tag{12} \]

If $N$ and $M$ are large, the central limit theorem can be applied to the following matrix products:

\[ A A^{T} = \langle a^2 \rangle \begin{pmatrix} N + O(\sqrt{N}) + \cdots & O(1) & \cdots \\ O(1) & N + O(\sqrt{N}) + \cdots & \cdots \\ \vdots & \vdots & \ddots \end{pmatrix} , \tag{13} \]

\[ B B^{T} = \langle b^2 \rangle \begin{pmatrix} M + O(\sqrt{M}) + \cdots & O(1) & \cdots \\ O(1) & M + O(\sqrt{M}) + \cdots & \cdots \\ \vdots & \vdots & \ddots \end{pmatrix} . \tag{14} \]

Introducing Eqs. (13) and (14) in Eq. (12), we obtain, to leading order in the corrections,

\[ \frac{1}{M} C S = \frac{S}{L} + O\!\left( \frac{1}{\sqrt{M}} + \frac{1}{\sqrt{N}} + \cdots \right) S . \tag{15} \]

For large values of $N$ and $M$, by comparing Eq. (15) with Eq. (2), we can identify the factor $k$ with the number of components $L$, and obtain an estimate for the average prediction error

\[ \varepsilon = \sqrt{\frac{1}{MN} \sum_{m,n} (\tilde{s}_{m,n} - s_{m,n})^2} \simeq L^{3/2}\, \gamma\, \frac{\sqrt{M} + \sqrt{N}}{\sqrt{MN}} , \tag{16} \]

where

\[ \gamma = \alpha(L) \sqrt{\langle a^2 \rangle \langle b^2 \rangle} . \tag{17} \]

Formula (16) implies that the predictive power of Eq. (2) grows with $MN$ and diminishes with $L$. This is a consequence of the decay of the correlations among opinions with $L$, so that more information is needed in order to make a prediction as $L$ grows. This condition can be compared with the "rigidity" threshold $p_2$ in the MZ analysis.

4. Sparse and biased data

In the real world one cannot expect to have a fully connected opinion matrix at one's disposal. Indeed, one of the most important features of an anticipation system is its hole-filling capability. One can extend the previous formalism to sparse datasets by considering the parameters $M$, $N$ as functions of the pair $(m, n)$ formed by an individual and a product. Let $M_n$ represent the number of available opinions on product $n$ given by any agent, and $N_m$ the number of opinions expressed by agent $m$ about any product. Using formula (2) with the redefined parameters $M_n$ and $N_m$, it follows from Eq.
(15) that an unknown opinion $s_{m,n}$ can be estimated with an accuracy that scales as

\[ |\tilde{s}_{m,n} - s_{m,n}| \sim L^{3/2}\, \gamma\, \frac{\sqrt{M_n} + \sqrt{N_m}}{\sqrt{M_n N_m}} \tag{18} \]

for large values of $N_m$, $M_n$ and $L$. The accuracy of our approach can be related to the "rigidity" threshold $p_2$. To illustrate this, let us consider a situation in which $N_m = M_n$ and $N = M$. From formula (18) it turns out that the relative error in the estimation of an opinion will be

\[ \frac{|\tilde{s}_{m,n} - s_{m,n}|}{|s_{m,n}|} \sim \frac{2L}{\sqrt{M_n}} , \tag{19} \]

so in order to have relative errors of order one or less, the inequality $M_n \gtrsim 4L^2$ must hold. For the density of known opinions among all the elements of the opinion matrix, this implies that

\[ p = \frac{\sum_{n=1}^{M} M_n}{M^2} \gtrsim 2 L\, p_2 , \tag{20} \]

which means that our formulas work above the "rigidity" threshold $p_2$.

Our formalism can be generalized to systems with biased components, exploiting essentially the same arguments used to justify Eq. (2). It is found that in this case the factor $k$ that appears in the estimation formula (2) is given by

\[ k = \left[ \frac{1}{L} + \frac{L-1}{L}\, \frac{\langle b \rangle^2}{\langle b^2 \rangle} \right]^{-1} . \tag{21} \]

Notice that $k$ does not depend on the $a_n^{(l)}$ variables, whether or not these variables are biased. The existence of a constant value of $k$ independent of $n$ and $m$ justifies the previously proposed normalization approach $k = 1/\tilde{S}^{*}_{\max}$. Moreover, $k$ can be interpreted as the effective number of components of the vector of internal preferences $b_m$. For instance, if the variance $\langle b^2 \rangle - \langle b \rangle^2$ is zero, then $b_m^{(l)}$ can take a unique value, so $b_m$ has only one effective degree of freedom, which is reflected by the value $k = 1$. On the other hand, the variance of $b_m^{(l)}$ is maximum when $\langle b \rangle = 0$, implying the value $k = L$, for which all the $L$ degrees of freedom are relevant. The behavior of the distance between the anticipated and actual values of opinions in the biased case is again given as in Eqs.
(16) and (18), with

\[ \gamma = \alpha(L) \sqrt{\langle a^2 \rangle \left[ \langle b^2 \rangle - \langle b \rangle^2 \right]} . \tag{22} \]

The asymmetry of formulas (21) and (22) with respect to the variables $a_n^{(l)}$ and $b_m^{(l)}$ is related to the fact that the opinion correlation matrix $C$ basically reflects the overlap between the preferences of agents. To see this, let us consider the following normalized overlap between $b_i$ and $b_j$:

\[ \Omega_{i,j} = \frac{\sum_{l=1}^{L} b_{i,l}\, b_{j,l}}{\sqrt{\sum_{l=1}^{L} b_{i,l}^2 \, \sum_{l=1}^{L} b_{j,l}^2}} . \tag{23} \]

For a large system size the opinion correlation matrix is written as

\[ C = \frac{\left( \alpha(L) B^{T} A - \langle s \rangle \mathbf{1} \right) \left( \alpha(L) A^{T} B - \langle s \rangle \mathbf{1} \right)}{N \left[ \langle s^2 \rangle - \langle s \rangle^2 \right]} . \tag{24} \]

By introducing the product $A A^{T}$ given in Eq. (13) in formula (24), it is found that

\[ C_{i,j} = \Omega_{i,j} \left[ 1 + O\!\left( \frac{1}{\sqrt{N}} \right) \right] \tag{25} \]

and the average error

\[ \delta = \frac{1}{M} \sum_{m,m'} \left| C_{m,m'} - \left( 1 + \frac{1}{\sqrt{L}} \right) \Omega_{m,m'} \right| \tag{26} \]

should scale like $\delta \sim N^{-1/2}$. Eq. (25) states that for increasing $N$ the correlation between the expressed opinions of agents $i$ and $j$ tends to be equivalent to the overlap $\Omega_{i,j}$.

5. Numerical results

In order to test the relationships obtained above, we have performed simple simulations using random data. The quantities $L$, $M$ and $N$ are free parameters. We have used discrete components in the $\{-1, 1\}$ set, randomly generated with variable average. We have computed the opinion matrix $S$ (Eq. (7)), the correlation matrix $C$ (Eq. (1)) and the actual overlap matrix $\Omega$ (Eq. (23)). Then we have iterated over all the individuals' opinions $s_{m,n}$, computing $\tilde{s}_{m,n}$ from Eq. (2) and accumulating the average quadratic estimation error $\varepsilon$, Eq. (16). Figs. 1–3 show that the theoretical average errors, Eq. (16), are in good agreement with the simulations.

Fig. 1. Average estimation error $\varepsilon$ for $L = 10$ as a function of the number of products $N$ for $M = 500$ (circles) and $M = 1000$ (crosses). The lines represent the best linear fit, with exponents −0.498 and −0.493, respectively.

Fig. 2. Average estimation error $\varepsilon$ for $L = 10$ as a function of the population size $M$ for $N = 500$ (circles) and $N = 1000$ (crosses).
The lines represent the best linear fit, with exponents −0.515 and −0.530, respectively.

Fig. 3. Average estimation error $\varepsilon$ as a function of $L$ for $N = M = 400$. The line represents the best linear fit, with exponent 0.523.

Fig. 4. Average error $\delta$ (Eq. (26)) as a function of the number of products $N$. The dashed line marks the linear fit $\delta \sim N^{-0.41}$.

Moreover, we show in Fig. 4 that the distance $\delta$ between $C$ and $\Omega$ goes like $N^{-1/2}$, as expected from Eq. (25).

6. Discussion and conclusions

We assumed that an opinion is formed as a scalar product between individual preferences and products' properties (both unobservable). This assumption relies on a kind of "universality" in cognitive processes, so that the opinion formation process should be analogous to other brain activities such as the olfactory system, but honestly we do not have any rigorous justification.

Assuming that individuals' opinions are stored in a database, we have shown, using the central limit theorem, that it is possible to anticipate an opinion, i.e., to exploit the correlations in the database to forecast the missing opinions. Alternatively, we can obtain information about the overlaps of tastes between two individuals from the knowledge of their expressed opinions. We have also shown that one can extract information about the dimensionality of the hidden taste space from the opinion database. We have also recovered the (almost trivial) expectation that the prediction error decreases when the sizes of both the individual and product pools grow, and increases with the dimension of the hidden space.

We have not considered here the problem of the coevolution of tastes and product qualities (which are produced in accordance with expectations about clients' expectations).
The coevolution of products' features and individuals' preferences induces correlations: people are not expected to blindly choose one movie from those available; rather, they tend to watch films based on their anticipated opinion, thus filling the dataset with correlated data. On the other hand, films are produced based on market expectations, which reduces the variability still further.

The role of education emerges from this simple model: reliable opinion anticipations, which constitute an expectation of "revenues" from cultural investments, can come only from an assorted background of experiences, both from the personal point of view and from that of the community (due to the need for correlations among individuals). Finally, this model illustrates the value contained in personal information and the need for its protection.

Experimental verifications of the model are difficult, since personal data are jealously guarded. One possibility is to extract data from public internet pages, in the spirit of the Google search engine [5]. Comparisons with "real" personal preferences extracted from home pages will be presented in a future work. Nevertheless, it is possible to identify a similar "scalar product-like" mechanism in chemical or biological interactions [4], for which experimental data may be more easily available. At present we are extending the model in this direction.

Acknowledgements

We acknowledge fruitful discussions with P. Palmerini and the DOCS group [6]. We thank R. Rechtman for his careful reading of the manuscript. A.B. acknowledges the financial support of CONACYT.

References

[1] G. Stigler, G. Becker, Am. Econ. Rev. 67 (1977) 76–90.
[2] A. Etzioni, The Moral Dimension: Toward a New Economics, Free Press, New York, 1988, and http://www.gwu.edu/~ccps/etzioni/B299.html
[3] S. Maslov, Y.-C. Zhang, Phys. Rev. Lett. 87 (2001) 248701.
[4] S. Maslov, K. Sneppen, Science 296 (2002) 910.
[5] http://www.google.com
[6] http://www.docs.unifi.it