0% found this document useful (0 votes)
32 views10 pages

Cond Mat0312586

Download as pdf or txt
Download as pdf or txt
Download as pdf or txt
You are on page 1/ 10

See discussions, stats, and author profiles for this publication at: https://www.researchgate.

net/publication/234502578

Thesaurus as a complex network

Article  in  Physica A: Statistical Mechanics and its Applications · November 2004


DOI: 10.1016/j.physa.2004.06.025 · Source: arXiv

CITATIONS READS
60 277

5 authors, including:

Ivan Torres Pisa Osame Kinouchi


Universidade Federal de São Paulo University of São Paulo
143 PUBLICATIONS   759 CITATIONS    105 PUBLICATIONS   2,963 CITATIONS   

SEE PROFILE SEE PROFILE

Alexandre Souto Martinez Evandro E S Ruiz


University of São Paulo University of São Paulo
156 PUBLICATIONS   1,988 CITATIONS    65 PUBLICATIONS   300 CITATIONS   

SEE PROFILE SEE PROFILE

Some of the authors of this publication are also working on these related projects:

Statistical analysis View project

development of biological plausible neuronal networks View project

All content following this page was uploaded by Alexandre Souto Martinez on 07 February 2018.

The user has requested enhancement of the downloaded file.


Thesaurus as a complex network
arXiv:cond-mat/0312586v1 [cond-mat.stat-mech] 22 Dec 2003

Adriano de Jesus Holanda, Ivan Torres Pisa, Osame Kinouchi,


Alexandre Souto Martinez ∗ and Evandro Eduardo Seron Ruiz
Faculdade de Filosofia, Ciências e Letras de Ribeirão Preto (FFCLRP)
Universidade de São Paulo (USP)
Av. Bandeirantes, 3900
14040-901 Ribeirão Preto, SP, Brazil.

Abstract

A thesaurus is one, out of many, possible representations of term (or word) connec-
tivity. The terms of a thesaurus are seen as the nodes and their relationship as the
links of a directed graph. The directionality of the links retains all the thesaurus
information and allows the measurement of several quantities. This has lead to a
new term classification according to the characteristics of the nodes, for example,
nodes with no links in, no links out, etc. Using an electronic available thesaurus
we have obtained the incoming and outgoing link distributions. While the incom-
ing link distribution follows a stretched exponential function, the lower bound for
the outgoing link distribution has the same envelope of the scientific paper citation
distribution proposed by Albuquerque and Tsallis [1]. However, a better fit is ob-
tained by simpler function which is the solution of Ricatti’s differential equation.
We conjecture that this differential equation is the continuous limit of a stochastic
growth model of the thesaurus network. We also propose a new manner to arrange
a thesaurus using the “inversion method”.

Key words: complex networks, directed graphs, thesaurus


PACS: 05.90.+m, 02.50.-r,

Words are the building blocks to construct sentences and to transmit infor-
mation. During last decades much effort has been spent on the statistics of
words. Concern has been centered in the similarities and differences among
word distributions which may be useful for application in automatic informa-
tion retrieval and thesaurus construction.

Email address: [email protected] (Alexandre Souto Martinez).
URL: http://www.fisicamedica.com.br/martinez/ (Alexandre Souto
Martinez).

Preprint submitted to Physica A 2 February 2008


Zipf [2] has shown that word frequency obeys a power law if words are ranked
from the most to the less frequent ones. Statistical linguistics, at its lowest
level, can be exemplified by the Zipf’s exponent, which is very sensitive to the
writer’s instruction degree but much less sensitive to language (culture char-
acteristics). Beyond word level, word connectivity has been treated in several
manners. These treatments include entropic measures [3] and the construction
of other quantities, such as the distribution of documents over the frequency of
words [4]. Another interesting way to treat data is the Latent Semantic Analy-
sis (LSA) [5] which deals with word covariance in a corpus. LSA is a principal
component analysis (PCA) technique , i.e., the covariance matrix is diagonal-
ized and from the most important eigenvalues (around 300) the eigenvectors
are considered to span an Euclidean vector space. A curious application of
LSA is the automatic grading of high school texts [6]. However, LSA has been
criticized as a poor approach for predicting semantic neighborhood [7].

Other studies have focused on a different approach. Words are tied to each
other as links of a graph where the words are the nodes of it. Exhaustive
studies over thesaurus [8,9] indicate that words are related among themselves
as a small-world and scale-free network [10]. This means that words may be
embedded in a low dimensional space but with a small fraction of long distance
connections. The existence of the low dimensional space has been suggested by
the deterministic “tourist” walks [11,12] on the graph, which is an independent
sampling procedure [13].

A thesaurus is a list of terms. A term can be a word, a composed word or


even an expression. The list of related terms to a main entry term (head-
word) provides alternatives for these entries. Following previous studies, we
will consider terms as being “words” in a broad sense.

As in a previous work [9], our study is based on unstructured thesaurus, the


Moby Thesaurus II which is the largest 1 and most comprehensive free the-
saurus data source in English available [14]. It has 30,260 (main) entries, also
called root words 2 or head-words 3 and 73,046 words which are referred from
the entries but they are not entries. They are called non-root words. These
add up to 103,306 different words. Each root word points, in average, to 83
words 4 .

The thesaurus derived network is defined considering each term as a node.


Connections are established from an entry to its related list of terms forming
1 The file has 24,271 KB.
2 A root word should not be confused with a root node which is defined as one that
has no incoming link.
3 Some curiosities are: 877 words which are not referred from other entries, 16 words

are entry words but point only to non-root words.


4 From which, in average, 54 are root words and 29 are non-root words.

2
a directed graph. Classification of terms can be accomplished looking at the
links (arcs), for instance: head words (root words) are words with at least one
emerging link (kout > 0) and non-root words are words with no emerging links
(kout = 0). Apparently there is a giant strong component (percolative cluster of
directed links) which connects a large fraction of words [15,16]. We stress that
the working thesaurus is a simple and unstructured related term thesaurus
and we point out the existence of other thesauruses such as WordNet [17] and
definition terms thesaurus (Roget’s thesaurus) which may be modeled as a
bipartite graph, but they are be considered here.

If only co-linked terms (mutually referred terms) are considered, this structure
forms a digraph and reduces to the previous studied one, where its small-world
and scale free structure has been pointed out [9]. In this case, the number of
connections k of a node is called degree of a node. The node degree statistics
shows an exponential behavior for small values of k and a power law behavior
for large values of k [9].

The directed graph concepts permit us to classify sets of terms according to


its links properties as follows:

sink composed of the 73,046 terms with kout = 0. For example: glucose, pass-
word, all-around, grape juice, send word, put to, lap dog, afterbirth;
source are the 30,260 terms with at least one outgoing link (kout > 0), usually
called main entries, entries, head-words or root words. The source can be
divided into three categories;
absolute source is related to 877 terms without incoming links kin = 0.
For example: rackets, grammatical, double quick, half moon, blinded;
normal source are 29,333 terms that receive links and send links to other
source and sink terms (kout > 0 and kin > 0). For example: ablation,
analogy, call out, factitious, laid low, make a deal;
bridge source they are the 16 terms without outgoing links to source
terms (kout (source) = 0), listed: androgyny, Christian sectarians, Congress,
detector, electric meter, enzyme, Esperanto, et cetera, Geiger counter,
ghetto dwellers, harp, in fun, lobotomy, penicillin, perversely, Senate;

These definitions are illustrated by the subgraphs 1–4 in Figure 1.

It is interesting to observe that the outgoing link distribution could be fitted


by the scientific-papers citation frequency curve [1]:

N0
f (kout ) = , (1)
[1 + (q − 1)λkout ]q/(q−1)

with: N0 = 654 ± 7, λ = (1.66 ± 0.01) × 10−2 and q = 0.95 ± 0.01 (r 2 = 0.993


and χ2 = 77), see Fig. 2. However, its limiting behavior for small kout is

3
1
2

3
4

Fig. 1. Coarse grained view of the thesaurus as a directed graph. The region com-
posed by subgraphs 1 to 3 is the source and subgraph 4 is referred as the sink. The
source contains: the normal source, named as subgraph 1; bridge source, named as
subgraph 2 and absolute source, here called subgraph 3.

exponential instead of the measured stretched exponential. So, we propose


another fitting function with power law behavior for large kout and stretched
exponential behavior for small kout 5

N0
f (kout ) = κ
, (2)
1 + λkout

with: N0 = 468 ± 3, λ = (2.0 ± 0.4) × 10−5 and κ = 2.55 ± 0.03 (r 2 = 0.990


and χ2 = 99), see Fig. 2.
κ
For λkout ≪ 1 this curve presents the same behaviour of a stretched expo-
nential [19]: f (kout ) = N0 exp(− kout /k̄out )κ , which permits us to estimate the
mean value of outgoing links equal to k̄out = λ−1/κ = 70 ± 6, which must be
κ
compared to 83 which is obtained from the thesaurus statistics. For λkout ≫ 1,
−κ
the distribution of kout is a power law: f (kout ) = (N0 /λ)kout .

On the other hand, we show in Figure 3 that the frequency of words with a
given number of incoming links (kin ) is very well described by the stretched
exponential curve:

kin
f (kin ) = N0 exp − , (3)
k̄in
5 Note that introducing a new parameter one can write f (x) = N0 /[1 + λxκ ]q/(q−1)
which is the Burr XII distribution function that appears as a result of a q-logarithm
entropy maximization [18] and generalizes both functions.

4
Fig. 2. The frequency of outgoing links kout (root words) is well described by
Eq. 2 which is the rightwards curve, in contrast with curve of Eq. 1. The point
[kout = 17, f (kout ) = 3] has been excluded in both fitting procedures. These words
are: for good, for keeps, and grin.

where we have found N0 = 12000 ± 300, k̄in = 4.9 ± 0.3 and κ = 0.52 ± 0.01,
(r 2 = 0.993 and χ2 = 4.58). We shall stress that a fitting curve of the type of
Eq. 2 also describe this data if λ is taken
√ small enough. A simple approxima-
tion may be used as: f (kin ) ∝ exp(− kin ). The low values of incoming links
(kin < 10) are dominated by non-root words while high values (kin > 100) are
dominated by root words, as seen in Figure 3.

Although empirically f (kin ) and f (kout ) are apparently different, this may be
due to a finite size database effect. This is suggested by a kin ×kout plot (Fig. 4)
where kin and kout are ranked by decreasing values and plotted jointly to show
the correlation between them. From Fig. 4, it is clear that a linear correlation
occurs for k > 100. A perfect thesaurus should have a symmetric property
kin = kout .

As suggested by the above analysis, let Eq. 2 represent both the distribution
of outgoing and incoming links. If one takes the variable to be continuous,
it is not hard to notice that Eq. 2 is the solution of the Ricatti’s differential

5
Fig. 3. Frequency of incoming links kin for all words (•), root words (△) and non-root
words (2). The curve for all words (•) is well described by a stretched exponential
(line) expressed by Equation 3 (N0 = 12000±300, k̄in = 4.9±0.3 and κ = 0.52±0.01)
which is dominated by non-root words for low kin values (kin ≤ 10) and by root
words for high kin values (kin ≥ 100).

Fig. 4. The number of links kin and kout are ranked by decreasing values and plotted
jointly to show the correlation.

6
equation 6

κλxκ−1 2
y ′ (x) = − y (x) . (4)
N0

This equation is known to represent contact processes such as the propagation


of diseases [20]. Presently we are searching for a microscopic network growth
model that has the Ricatti’s equation as a continuum limit.

A thesaurus is a attempt to synthesize terms and their relationships as natural


as possible. Nevertheless this trial is artificial and subjective. Our work of
treating the thesaurus as a directed graph has provided new insights into its
macro structure. From this graph theoretical approach the counting of kin
and kout could lead to a novel proposition of term arrangement and term
connectivities in it.

The standard thesaurus classification is made according to word frequency in


a corpus. But our approach suggested to rank the term related to a given root-
word by its kin ranking or any other connectivity index. For a finite thesaurus
where the number of root-words is smaller than the total number of terms,
we suggest that it is always possible to construct a new closed thesaurus
by inversion between the initial kout > 0 root-words and kin > 0 non-root
words. The information in both cases are the same, but the latter leads to
practical facilities, for instance, the fact that all words present in the thesaurus
become root-words (the thesaurus becomes closed). We are submitting our
closed version of Moby Thesaurus as a freeware database to the Gutenberg
project [14].

The authors are deeply grateful to Vera Lúcia Coelho Villar from the Insti-
tuto Antônio Houaiss de Lexicografia, Brazil, for the fruitful discussions The
authors thank stimulation discussion with F. Brouers , M. G. V. Nunes, B.
C. D. da Silva and C. Tsallis. This work has been partially funded by the
Brazilian agencies: FAPESP, CAPES and CNPq.

References

[1] C. Tsallis, M. P. de Albuquerque, Are citations of scientific papers a case of


nonextensivity?, Eur. Phys. J. B 13 (2000) 777–780.

[2] G. K. Zipf, Human behavior and principle of least effort, Addison-Wesley,


Cambridge-MA, 1949.
6 Ricatti’s differential equation is a particular instance of Bernoulli’s differential
equation [20].

7
[3] M. A. Montemurro, D. H. Zanette, Entropic analysis of the role of words in
literary texts, cond-mat/0109218 (2001).
[4] D. Volchenkov, P. Blanchard, S. Sharoff, Core lexicon and contagious words,
cond-mat/0303454 (2003).
[5] T. K. Landauer, S. T. Dumais, A solution to Plato’s problem: The latent
semantic analysis theory of acquisition, induction, and representation of
knowledge, Physcol. Rev. 104 (2) (1997) 211–240.
[6] W. Kintsch, The potential of latent semantic analysis for machine grading of
clinical case summaries, J. Biomed. Inf. 104 (2002) 3–7.
[7] T. L. Griffiths, M. Steyvers, A probabilistic approach to semantic
representation, http://www-psych.stanford.edu∼gruffydd/papers/semrep.pdf.
[8] M. Sigman, G. A. Gecchi, Global organization of the lexicon, cond-mat/0106509
(2001).
[9] A. E. Motter, A. P. S. de Moura, Y.-C. Lai, P. Dasgupta, Topology of the
conceptual network of language, Phys. Rev. E 65 (2002) 065102(R).
[10] D. J. Watts, S. H. Strogatz, Collective dynamics of ‘small world’ networks,
Nature (London) 393 (1998) 440–442.
[11] G. F. Lima, A. S. Martinez, O. Kinouchi, Deterministic walks in random media,
Phys. Rev. Lett. 87 (1) (2001) 010603.
[12] H. E. Stanley, S. V. Buldyrev, Statistical physics - the salesman and the tourist,
Nature (London) 413 (6854) (2001) 373–374.
[13] O. Kinouchi, A. S. Martinez, G. F. Lima, G. M. Loureno, S. Risau-Gusman,
Deterministic walks in random networks: an application to thesaurus graphs,
Physica A 315 (3/4) (2002) 665–676.
[14] G. Ward, Moby Thesaurus II, Project Gutenberg Literary Archive Foundation,
2002, ftp://ibiblio.org/pub/docs/books/gutenberg/etext02/mthes10.zip.
[15] M. E. J. Newman, S. H. Strogatz, D. J. Watts, Random graphs with arbitrary
degree distributions and their applications, Phys. Rev. E 64 (2) (2001) 026118.
[16] S. N. Dorogovtsev, J. F. F. Mendes, A. N. Samukhin, Giant strongly connected
component of directed networks, Phys. Rev. E 64 (2) (2001) Art. No. 025101.
[17] Wordnet: A lexical database for the english language,
http://www.cogsci.princeton.edu/∼wn/.
[18] F. Brouers, private communication (November 2003).
[19] J. Laherrère, D. Sornette, Stretched exponential distributions in nature and
economy: ‘fat tails’ with characteristic scales, Eur. Phys. J. B 2 (1998) 525–
539.
[20] W. E. Boyce, R. C. DiPrima, Elementary Differential Equation and Boundary
Value Problem, Seventh Edition, John Wiley & Sons, New York, 2001.

8
1000
Kin

100

10
10 100 1000
Kout

View publication stats

You might also like