Cond Mat0312586
5 authors, including:
All content following this page was uploaded by Alexandre Souto Martinez on 07 February 2018.
Abstract
A thesaurus is one of many possible representations of term (or word)
connectivity. The terms of a thesaurus are seen as the nodes, and their
relationships as the links, of a directed graph. The directionality of the links
retains all the thesaurus information and allows the measurement of several
quantities. This has led to a new term classification according to the
characteristics of the nodes, for example, nodes with no incoming links, no
outgoing links, etc. Using an electronically available thesaurus, we have
obtained the incoming and outgoing link distributions. While the incoming link
distribution follows a stretched exponential function, the lower bound for the
outgoing link distribution has the same envelope as the scientific paper
citation distribution proposed by Albuquerque and Tsallis [1]. However, a better
fit is obtained by a simpler function, which is the solution of Riccati’s
differential equation. We conjecture that this differential equation is the
continuous limit of a stochastic growth model of the thesaurus network. We also
propose a new manner to arrange a thesaurus using the “inversion method”.
Words are the building blocks used to construct sentences and to transmit
information. During the last decades, much effort has been spent on the
statistics of words. Attention has centered on the similarities and differences
among word distributions, which may be useful for applications in automatic
information retrieval and thesaurus construction.
∗ Email address: [email protected] (Alexandre Souto Martinez).
URL: http://www.fisicamedica.com.br/martinez/ (Alexandre Souto Martinez).
Other studies have taken a different approach: words are the nodes of a graph,
and the relations that tie them to each other are its links. Exhaustive studies
of thesauruses [8,9] indicate that words are related among themselves as a
small-world and scale-free network [10]. This means that words may be embedded
in a low-dimensional space, but with a small fraction of long-distance
connections. The existence of the low-dimensional space has been suggested by
deterministic “tourist” walks [11,12] on the graph, which constitute an
independent sampling procedure [13].
a directed graph. Terms can be classified by looking at the links (arcs); for
instance, head words (root words) are words with at least one outgoing link
(kout > 0), and non-root words are words with no outgoing links (kout = 0).
Apparently there is a giant strong component (a percolating cluster of directed
links) which connects a large fraction of the words [15,16]. We stress that the
working thesaurus is a simple and unstructured related-term thesaurus. Other
kinds of thesauruses exist, such as WordNet [17] and definition-term
thesauruses (Roget’s thesaurus), which may be modeled as bipartite graphs, but
they are not considered here.
If only co-linked terms (mutually referred terms) are considered, this structure
forms a digraph and reduces to the previously studied one, whose small-world
and scale-free structure has been pointed out [9]. In this case, the number of
connections k of a node is called the degree of the node. The node degree
statistics show exponential behavior for small values of k and power-law
behavior for large values of k [9].
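The restriction to co-linked terms can be illustrated with a minimal sketch. It assumes the thesaurus is available as a mapping from each root word to the list of terms it points to; the function name `mutual_degree_histogram` and the toy entries are hypothetical and are not taken from the working thesaurus.

```python
from collections import Counter

def mutual_degree_histogram(entries):
    """entries: dict mapping each root word to the list of terms it points to
    (an assumed input format, chosen only for illustration)."""
    # Directed links as (source, target) pairs, ignoring self-references.
    arcs = {(a, b) for a, targets in entries.items() for b in targets if a != b}
    # Keep a pair of terms only if both directions are present (co-linked terms).
    mutual = {frozenset(arc) for arc in arcs if (arc[1], arc[0]) in arcs}
    degree = Counter()
    for pair in mutual:
        for term in pair:
            degree[term] += 1
    # Histogram: degree k -> number of terms with that degree.
    # Terms without any mutual link (k = 0) do not appear in it.
    return Counter(degree.values())

# Hypothetical toy entries.
toy = {"big": ["large", "huge"], "large": ["big"], "huge": ["big", "vast"], "vast": []}
hist = mutual_degree_histogram(toy)
print(sorted(hist.items()))
```

Storing each mutual pair as a `frozenset` counts a two-way reference once, so the degree k is the number of distinct co-linked partners of a term.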
sink: the 73,046 terms with kout = 0. For example: glucose, password,
all-around, grape juice, send word, put to, lap dog, afterbirth;
source: the 30,260 terms with at least one outgoing link (kout > 0), usually
called main entries, entries, head words or root words. The source can be
divided into three categories:
absolute source: the 877 terms without incoming links (kin = 0).
For example: rackets, grammatical, double quick, half moon, blinded;
normal source: the 29,333 terms that receive links and send links to other
source and sink terms (kout > 0 and kin > 0). For example: ablation,
analogy, call out, factitious, laid low, make a deal;
bridge source: the 16 terms without outgoing links to source terms
(kout (source) = 0), namely: androgyny, Christian sectarians, Congress,
detector, electric meter, enzyme, Esperanto, et cetera, Geiger counter,
ghetto dwellers, harp, in fun, lobotomy, penicillin, perversely, Senate.
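The classification above can be sketched in a few lines. This is only an illustration under stated assumptions: the input format (a mapping from root word to the terms it points to), the function name `classify_terms`, and the toy data are hypothetical, and the order in which the tests are applied (absolute source checked before bridge source) is an assumption, since the text does not say how a term satisfying both conditions is labeled.

```python
def classify_terms(entries):
    """entries: dict mapping each root word to the list of terms it points to
    (an assumed input format, chosen only for illustration)."""
    # Outgoing links as sets, without duplicates or self-references.
    out_links = {head: set(targets) - {head} for head, targets in entries.items()}
    terms = set(out_links)
    for targets in out_links.values():
        terms |= targets
    # Count incoming links for every term.
    k_in = dict.fromkeys(terms, 0)
    for targets in out_links.values():
        for t in targets:
            k_in[t] += 1
    # Source terms have at least one outgoing link (kout > 0).
    sources = {t for t in terms if out_links.get(t)}
    classes = {}
    for t in terms:
        if t not in sources:
            classes[t] = "sink"                # kout = 0
        elif k_in[t] == 0:
            classes[t] = "absolute source"     # kin = 0 (checked first: an assumption)
        elif not (out_links[t] & sources):
            classes[t] = "bridge source"       # no outgoing link to a source term
        else:
            classes[t] = "normal source"       # kout > 0 and kin > 0
    return classes

# Hypothetical toy thesaurus.
toy = {
    "happy": ["glad", "joyful", "harp"],
    "glad": ["happy"],
    "elated": ["happy"],
    "harp": ["lyre"],
}
classes = classify_terms(toy)
for term in sorted(classes):
    print(term, "->", classes[term])
```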
$$f(k_{\mathrm{out}}) = \frac{N_0}{\left[1 + (q-1)\lambda k_{\mathrm{out}}\right]^{q/(q-1)}}, \qquad (1)$$
Fig. 1. Coarse-grained view of the thesaurus as a directed graph. The region
composed of subgraphs 1 to 3 is the source, and subgraph 4 is referred to as the
sink. The source contains the normal source (subgraph 1), the bridge source
(subgraph 2) and the absolute source (subgraph 3).
$$f(k_{\mathrm{out}}) = \frac{N_0}{1 + \lambda k_{\mathrm{out}}^{\kappa}}, \qquad (2)$$
On the other hand, we show in Figure 3 that the frequency of words with a
given number of incoming links (kin ) is very well described by the stretched
exponential curve:
$$f(k_{\mathrm{in}}) = N_0 \exp\left[-\left(\frac{k_{\mathrm{in}}}{\bar{k}_{\mathrm{in}}}\right)^{\kappa}\right], \qquad (3)$$
5 Note that by introducing a new parameter one can write
f(x) = N0/[1 + λx^κ]^{q/(q−1)}, which is the Burr XII distribution function; it
appears as a result of a q-logarithm entropy maximization [18] and generalizes
both functions.
Fig. 2. The frequency of outgoing links kout (root words) is well described by
Eq. 2, which is the rightwards curve, in contrast with the curve of Eq. 1. The
point [kout = 17, f(kout) = 3] has been excluded in both fitting procedures. The
corresponding words are: for good, for keeps, and grin.
where we have found N0 = 12000 ± 300, k̄in = 4.9 ± 0.3 and κ = 0.52 ± 0.01
(r² = 0.993 and χ² = 4.58). We stress that a fitting curve of the type of Eq. 2
also describes these data if λ is taken small enough. A simple approximation
may be used: f(kin) ∝ exp(−√kin). Low values of incoming links (kin < 10) are
dominated by non-root words, while high values (kin > 100) are dominated by
root words, as seen in Figure 3.
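As a minimal sketch, the stretched exponential of Eq. 3 can be evaluated with the central values of the fitted parameters quoted above (uncertainties ignored). The function names are hypothetical, and the second function gives only the shape of the simple approximation, up to a constant factor.

```python
import math

# Central values of the fit quoted in the text; uncertainties are ignored here.
N0, K_BAR, KAPPA = 12000.0, 4.9, 0.52

def stretched_exponential(k_in, n0=N0, k_bar=K_BAR, kappa=KAPPA):
    """Eq. 3: f(k_in) = N0 * exp(-(k_in / k_bar)**kappa)."""
    return n0 * math.exp(-((k_in / k_bar) ** kappa))

def sqrt_approximation(k_in):
    """Shape of the simple approximation exp(-sqrt(k_in)), up to a constant factor."""
    return math.exp(-math.sqrt(k_in))

for k in (1, 10, 100, 1000):
    print(k, round(stretched_exponential(k), 1))
```

At kin = k̄in the function has dropped to N0/e, which gives k̄in the role of a characteristic scale of the distribution.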
Although empirically f(kin) and f(kout) are apparently different, this may be
due to a finite-size effect of the database. This is suggested by a kin × kout
plot (Fig. 4), where kin and kout are ranked by decreasing values and plotted
jointly to show the correlation between them. From Fig. 4, it is clear that a
linear correlation occurs for k > 100. A perfect thesaurus should have the
symmetric property kin = kout.
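One possible reading of the ranking procedure is sketched below: both lists of link counts are sorted by decreasing value, and the r-th largest kout is paired with the r-th largest kin. This pairing by rank is an assumption about how the plot was built, and the function name `ranked_pairs` is hypothetical.

```python
def ranked_pairs(k_in_values, k_out_values):
    """Pair the r-th largest kout with the r-th largest kin (assumed procedure)."""
    ranked_in = sorted(k_in_values, reverse=True)
    ranked_out = sorted(k_out_values, reverse=True)
    # zip stops at the shorter list, so unmatched ranks are dropped.
    return list(zip(ranked_out, ranked_in))

# Hypothetical link counts.
pairs = ranked_pairs([3, 1, 2], [10, 30, 20])
print(pairs)
```

Each pair can then be drawn as one point (kout, kin) of the scatter plot.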
As suggested by the above analysis, let Eq. 2 represent the distributions of
both outgoing and incoming links. If one takes the variable to be continuous,
it is not hard to notice that Eq. 2 is the solution of the Riccati differential
Fig. 3. Frequency of incoming links kin for all words (•), root words (△) and
non-root words (□). The curve for all words (•) is well described by a stretched
exponential (line) expressed by Eq. 3 (N0 = 12000 ± 300, k̄in = 4.9 ± 0.3 and
κ = 0.52 ± 0.01), which is dominated by non-root words for low kin values
(kin ≤ 10) and by root words for high kin values (kin ≥ 100).
Fig. 4. The numbers of links kin and kout are ranked by decreasing values and
plotted jointly to show the correlation.
equation 6
$$y'(x) = -\frac{\kappa\lambda x^{\kappa-1}}{N_0}\, y^2(x). \qquad (4)$$
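As a short check, one can differentiate the function of Eq. 2 directly, writing y for f and x for the continuous link variable, and confirm that it satisfies Eq. 4:

```latex
\begin{align*}
y(x)  &= \frac{N_0}{1+\lambda x^{\kappa}}, \\
y'(x) &= -\frac{N_0\,\kappa\lambda x^{\kappa-1}}{\left(1+\lambda x^{\kappa}\right)^{2}}
       = -\frac{\kappa\lambda x^{\kappa-1}}{N_0}
          \left[\frac{N_0}{1+\lambda x^{\kappa}}\right]^{2}
       = -\frac{\kappa\lambda x^{\kappa-1}}{N_0}\, y^{2}(x).
\end{align*}
```

The solution also satisfies y(0) = N0 for κ > 0, so N0 plays the role of the initial condition.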
The authors are deeply grateful to Vera Lúcia Coelho Villar from the Instituto
Antônio Houaiss de Lexicografia, Brazil, for fruitful discussions. The authors
also thank F. Brouers, M. G. V. Nunes, B. C. D. da Silva and C. Tsallis for
stimulating discussions. This work has been partially funded by the Brazilian
agencies FAPESP, CAPES and CNPq.
References
[3] M. A. Montemurro, D. H. Zanette, Entropic analysis of the role of words in
literary texts, cond-mat/0109218 (2001).
[4] D. Volchenkov, P. Blanchard, S. Sharoff, Core lexicon and contagious words,
cond-mat/0303454 (2003).
[5] T. K. Landauer, S. T. Dumais, A solution to Plato’s problem: The latent
semantic analysis theory of acquisition, induction, and representation of
knowledge, Psychol. Rev. 104 (2) (1997) 211–240.
[6] W. Kintsch, The potential of latent semantic analysis for machine grading of
clinical case summaries, J. Biomed. Inf. 104 (2002) 3–7.
[7] T. L. Griffiths, M. Steyvers, A probabilistic approach to semantic
representation, http://www-psych.stanford.edu/~gruffydd/papers/semrep.pdf.
[8] M. Sigman, G. A. Cecchi, Global organization of the lexicon, cond-mat/0106509
(2001).
[9] A. E. Motter, A. P. S. de Moura, Y.-C. Lai, P. Dasgupta, Topology of the
conceptual network of language, Phys. Rev. E 65 (2002) 065102(R).
[10] D. J. Watts, S. H. Strogatz, Collective dynamics of ‘small-world’ networks,
Nature (London) 393 (1998) 440–442.
[11] G. F. Lima, A. S. Martinez, O. Kinouchi, Deterministic walks in random media,
Phys. Rev. Lett. 87 (1) (2001) 010603.
[12] H. E. Stanley, S. V. Buldyrev, Statistical physics - the salesman and the tourist,
Nature (London) 413 (6854) (2001) 373–374.
[13] O. Kinouchi, A. S. Martinez, G. F. Lima, G. M. Lourenço, S. Risau-Gusman,
Deterministic walks in random networks: an application to thesaurus graphs,
Physica A 315 (3/4) (2002) 665–676.
[14] G. Ward, Moby Thesaurus II, Project Gutenberg Literary Archive Foundation,
2002, ftp://ibiblio.org/pub/docs/books/gutenberg/etext02/mthes10.zip.
[15] M. E. J. Newman, S. H. Strogatz, D. J. Watts, Random graphs with arbitrary
degree distributions and their applications, Phys. Rev. E 64 (2) (2001) 026118.
[16] S. N. Dorogovtsev, J. F. F. Mendes, A. N. Samukhin, Giant strongly connected
component of directed networks, Phys. Rev. E 64 (2) (2001) 025101.
[17] WordNet: A lexical database for the English language,
http://www.cogsci.princeton.edu/~wn/.
[18] F. Brouers, private communication (November 2003).
[19] J. Laherrère, D. Sornette, Stretched exponential distributions in nature and
economy: ‘fat tails’ with characteristic scales, Eur. Phys. J. B 2 (1998) 525–539.
[20] W. E. Boyce, R. C. DiPrima, Elementary Differential Equations and Boundary
Value Problems, Seventh Edition, John Wiley & Sons, New York, 2001.
[Figure: Kin (vertical axis) versus Kout (horizontal axis); both axes have ticks at 10, 100 and 1000.]