3.2. Distributions

We simplify a document to a bag of words, terms or keywords, in the following always called terms. We consider a collection of n term occurrences W. Each term occurrence is an instance of exactly one term t in T = {t_1, ..., t_m}, and can be found in exactly one source document d in a collection C = {d_1, ..., d_M}. Let n(d, t) be the number of occurrences of term t in d, n(t) = Σ_d n(d, t) be the number of occurrences of term t, and N(d) = Σ_t n(d, t) the number of term occurrences in d.

We consider the natural probability distributions Q on C × T, a distribution Q on C and q on T, that measure the probability to randomly select an occurrence of a term, from a source document, or both:

  Q(d, t) = n(d, t)/n   on C × T,
  Q(d)    = N(d)/n      on C,
  q(t)    = n(t)/n      on T.
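For illustration only (this sketch is not part of the original text; the toy count matrix and variable names are ours), the three distributions can be computed directly from a document-term count matrix holding the values n(d, t):

```python
import numpy as np

# counts[d, t] = n(d, t): occurrences of term t in document d (toy data)
counts = np.array([[3, 0, 1],
                   [0, 2, 2],
                   [1, 1, 0]], dtype=float)

n = counts.sum()              # total number of term occurrences
Q_dt = counts / n             # Q(d, t) = n(d, t)/n  on C x T
Q_d = counts.sum(axis=1) / n  # Q(d)    = N(d)/n     on C
q_t = counts.sum(axis=0) / n  # q(t)    = n(t)/n     on T

# each of the three is a probability distribution
assert np.isclose(Q_dt.sum(), 1) and np.isclose(Q_d.sum(), 1) and np.isclose(q_t.sum(), 1)
```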
Likewise, given a document distribution P(d), the one-step Markov chain evolution is the term distribution

  p_P(t) = Σ_d q(t|d) P(d),

where q(t|d) = n(d, t)/N(d) is the conditional term distribution of document d. Since P(d) gives the probability to find a term occurrence in document d, p_P is the P-weighted average of the term distributions in the documents. Combining these, i.e. running the Markov chain twice, every term distribution p gives rise to a new distribution

  p̄(t) = p_{P_p}(t) = Σ_d q(t|d) P_p(d) = Σ_{t',d} q(t|d) Q(d|t') p(t'),

where P_p(d) = Σ_{t'} Q(d|t') p(t') and Q(d|t) = n(d, t)/n(t). In particular, starting from the degenerate "known to be z" term distribution

  p_z(t) = p(t|z) = 1 if t = z, and 0 otherwise,
the distribution of co-occurring terms p̄_z then is

  p̄_z(t) = Σ_{d,t'} q(t|d) Q(d|t') p_z(t') = Σ_d q(t|d) Q(d|z).

This distribution is the weighted average of the term distributions of the documents containing z, where the weight is the probability Q(d|z) that a term occurrence in d is an occurrence of z.

Note that the probability measure p̄_z is very similar to the setup in [10, section 3]. The difference is that we keep track of the density of a keyword in a document rather than just the mere occurrence or non-occurrence of a keyword in a document. This difference is particularly relevant for long documents, where a word may occur with very low density and yet, because the document contains many words, make a significant contribution to the mean word distribution. Unfortunately, p̄_z is expensive to compute.
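A minimal sketch of this computation (again illustrative, reusing the toy `counts` matrix from above and assuming every document and term occurs at least once):

```python
import numpy as np

counts = np.array([[3, 0, 1],
                   [0, 2, 2],
                   [1, 1, 0]], dtype=float)

q_t_given_d = counts / counts.sum(axis=1, keepdims=True)  # q(t|d) = n(d,t)/N(d)
Q_d_given_t = counts / counts.sum(axis=0, keepdims=True)  # Q(d|t) = n(d,t)/n(t)

def p_bar(z):
    """Distribution of terms co-occurring with term index z:
    p̄_z(t) = Σ_d q(t|d) Q(d|z)."""
    return q_t_given_d.T @ Q_d_given_t[:, z]

print(p_bar(0))  # a probability distribution over all terms, summing to 1
```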
3.3. Distance Measures

An effective way to define "similarity" between two elements is through a metric d(i, j) between the elements i, j satisfying the usual axioms of nonnegativity, identity of indiscernibles and the triangle inequality. Two elements are more similar if they are closer. For this purpose any monotone increasing function of a metric will suffice, and we will call such a function a distance function.

For clustering we use a hierarchical top-down method that requires that in each step the center of each cluster is computed. Thus our choice of distance function is restricted to distances defined on a space that allows us to compute a center and distances between keywords and this center. In particular we cannot use popular similarity measures like the Jaccard coefficient.

In the following we will compare results with four different distance functions for keywords t and s: (a) the cosine similarity of the document distributions Q_t and Q_s considered as vectors on the document space, (b) the cosine similarity of the vectors of tf.idf values of the keywords, (c) the Jensen-Shannon divergence between the document distributions Q_t and Q_s, and (d) the Jensen-Shannon divergence between the term distributions p̄_t and p̄_s.

The cosine similarity of two terms t and s is defined as

  cos sim(t, s) = ⟨v_t, v_s⟩ / (‖v_t‖ ‖v_s‖),

where v_t and v_s are the vector representations of t and s (the document distributions for (a), the tf.idf vectors for (b)). Since the arccos of this similarity function is a proper metric, (1 − cos)(arccos(cos sim(t, s))) = 1 − cos sim(t, s) is a distance function.
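A one-line sketch of this distance (our naming, not the paper's):

```python
import numpy as np

def cosine_distance(u, v):
    """1 - cos sim(u, v); u and v are e.g. the document distributions
    Q_t and Q_s of two keywords, viewed as vectors over the documents."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
```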
The Jensen-Shannon divergence or information radius [11, 4] between two distributions p and q is defined as

  JSD(p||q) = 1/2 D(p||m) + 1/2 D(q||m),

where m = 1/2 (p + q) is the mean distribution and D(p||q) is the relative entropy or Kullback-Leibler divergence between p and q, which is defined by

  D(p||q) = Σ_{i=1}^{n} p_i log(p_i / q_i).

Since the square root of the Jensen-Shannon divergence is a proper metric [4], we have two distances

  JSD sim_doc(t, s) = JSD(Q_t, Q_s)

and

  JSD sim_term(t, s) = JSD(p̄_s, p̄_t).
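A sketch of these two distances (our helper names; base-2 logarithms are assumed, so divergences are measured in bits, matching the 1-bit cutoff used later in section 4.1):

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence D(p||q) in bits; entries with p_i = 0 contribute 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def jsd(p, q):
    """Jensen-Shannon divergence JSD(p||q) = 1/2 D(p||m) + 1/2 D(q||m), with m = (p+q)/2."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# JSD sim_doc(t, s)  = jsd(Q_t, Q_s)        (document distributions)
# JSD sim_term(t, s) = jsd(pbar_t, pbar_s)  (co-occurrence term distributions)
```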
3.4. Clustering Method

We have used the induced bisecting k-means clustering algorithm as described by [1], which is based on the standard bisecting k-means algorithm, see e.g. [14]. An informal description of the method is as follows. We start by selecting the two elements that have the largest distance, which we use as the seeds for two clusters. Next all other items are assigned to the cluster of the seed closest to them. After all items have been assigned to a cluster, the centers of both clusters are computed. Here we need a representation of items that naturally allows us to define a center, which typically is not an item proper but a weighted sum of items. The new centers serve as new seeds for finding two clusters, and the process is repeated until the two centers have converged up to some predefined precision. We have now found two clusters. If the diameter of a cluster is larger than a specified threshold value, the whole procedure is applied recursively to that cluster. The algorithm therefore finds a binary tree of clusters.
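The sketch below condenses this description into code; it is our illustration under stated assumptions, not the implementation used in the paper. It assumes the items are given as rows of a numeric array (e.g. probability vectors), so that a cluster center can be taken as the mean of its members, and it returns the flat list of leaf clusters of the binary tree.

```python
import numpy as np

def bisect_cluster(items, dist, eps=1e-6, max_iter=100):
    """Split the rows of `items` into two clusters (lists of row indices)."""
    idx = range(len(items))
    # seeds: the pair of items with the largest distance
    i0, j0 = max(((i, j) for i in idx for j in idx if i < j),
                 key=lambda ij: dist(items[ij[0]], items[ij[1]]))
    centers = [items[i0], items[j0]]
    for _ in range(max_iter):
        clusters = ([], [])
        for i in idx:  # assign every item to the closest center
            clusters[int(np.argmin([dist(items[i], c) for c in centers]))].append(i)
        new_centers = [items[c].mean(axis=0) if c else centers[k]
                       for k, c in enumerate(clusters)]
        if all(dist(c, nc) < eps for c, nc in zip(centers, new_centers)):
            break  # centers converged up to the predefined precision
        centers = new_centers
    return clusters

def induced_bisecting_kmeans(items, dist, max_diameter):
    """Recursively bisect every cluster whose diameter exceeds max_diameter."""
    items = np.asarray(items, dtype=float)

    def diameter(c):
        return max((dist(items[i], items[j]) for i in c for j in c if i < j), default=0.0)

    leaves, stack = [], [list(range(len(items)))]
    while stack:
        cluster = stack.pop()
        if len(cluster) < 2 or diameter(cluster) <= max_diameter:
            leaves.append(cluster)     # small or tight enough: keep as a leaf
            continue
        left, right = bisect_cluster(items[cluster], dist)
        if not left or not right:      # degenerate split: accept as a leaf
            leaves.append(cluster)
            continue
        stack.append([cluster[i] for i in left])
        stack.append([cluster[i] for i in right])
    return leaves
```

A call such as induced_bisecting_kmeans(vectors, jsd, max_diameter=0.5) would then reproduce the top-down procedure with one of the distances of section 3.3; the threshold value 0.5 is only a placeholder.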
We also experimented with agglomerative hierarchical clustering, especially the single-link algorithm. First results suggested that this approach did not perform as well as the k-means algorithm, in line with similar findings by [14] and [12]. Since our focus is to find an optimal distance measure and not to find the best clustering algorithm, we did not investigate the agglomerative methods in more detail.

4.1. Implementation

To test and compare the different strategies we have compiled a small corpus of Dutch Wikipedia articles consisting of 758 documents. In the analysis phase, 118099 term occurrences and 26373 unique terms were found. The articles were taken from 8 Wikipedia categories: spaceflight,
painting, architecture, trees, monocots, aviation, pop music, charadriiformes. Categories were selected for subjective similarity, like spaceflight and aviation, and subjective dissimilarity, like pop music and monocots. Articles are equally distributed over the categories, but articles in some categories are significantly longer than in others. Moreover, homogeneity and specificity of articles differ significantly between categories.

To determine a set of relevant keywords we have selected the most frequent content words and filtered out a number of overly general terms. The latter was done by requiring that a keyword has to be different enough from the background distribution q, as measured by the Kullback-Leibler divergence. We used a cutoff D(p̄_t||q) > 1 bit, which turned out to give decent results.
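A sketch of this filter, reusing the illustrative `counts`, `q_t`, `p_bar` and `kl` from the sketches in section 3 (the candidate set below is ours; in the paper it is the set of most frequent content words):

```python
CUTOFF_BITS = 1.0
candidate_terms = range(counts.shape[1])   # illustrative: all term indices

# keep a keyword z only if its co-occurrence distribution differs enough
# from the background distribution q, i.e. D(p̄_z || q) > 1 bit
keywords = [z for z in candidate_terms if kl(p_bar(z), q_t) > CUTOFF_BITS]
```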
Before extracting keywords we do some preprocessing using the GATE framework [2]. The main analysis steps are: lemmatization, multiword lookup and named entity recognition. Lemmatization and tagging are done using the TreeTagger [13]. Tagging allows us to distinguish content words from function words. After lemmatization all inflected forms of verbs, nouns and adjectives are mapped onto their lexical form, substantially reducing the variation in the input data. For multiword lookup we used article titles from the Dutch Wikipedia, since it is reasonable to assume that each title represents a single concept. Finally, some simple rules are applied to identify proper names. While some of the components are language dependent, all of the components are available for a number of languages within the GATE framework.

We selected the 160 most frequent keywords fulfilling this criterion and clustered them with the induced bisecting k-means algorithm from section 3.4 using different distance measures.

4.2. Results

To evaluate the implemented topic detection methods, we have compared the results with topics known to be present in the collection. We benchmarked against the 8 selected Wikipedia topics of the collection. Of course, it is conceivable that the collection has more topics that automatic methods might recognize as well. To define a reference clustering, we have clustered the 160 selected keywords into a set of 9 categories C* = {c*_0, c*_1, ..., c*_8}, one for each Wikipedia category and a rest cluster c*_0, using the following method. For each of the 8 Wikipedia categories c*_i we compute the distribution q_{c*_i} of words in the documents belonging to c*_i, and we let q_{c*_0} = q. We assign a term z to cluster c* if c* = argmin_{c*∈C*} D(q_{c*}||p̄_z).
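As an illustration of this assignment step (the names are ours; `category_dists` maps each of the 9 reference categories to its word distribution, with the background q under the rest category, and `kl` and `p_bar` are the earlier sketches; in practice the distributions would need smoothing so that no zero entries occur):

```python
def reference_category(z, category_dists):
    """Assign keyword z to the category c* minimizing D(q_c* || p̄_z)."""
    return min(category_dists, key=lambda c: kl(category_dists[c], p_bar(z)))
```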
Following [8], we now compare with the set of clusters C of keywords found using the algorithm of section 3.4, with the different distance measures and different diameters. For each cluster c ∈ C and cluster c* ∈ C* we define a recall measure rec(c, c*) = |c ∩ c*|/|c*|, a precision measure prec(c, c*) = |c ∩ c*|/|c| and an F-value

  F(c, c*) = rec(c, c*) prec(c, c*) / (1/2 (rec(c, c*) + prec(c, c*))).

Let F(c*) = max_{c∈C} F(c, c*) be the F-value of the best fitting found cluster, and finally define the overall F-value

  F = Σ_{c*∈C*} |c*| F(c*) / Σ_{c*∈C*} |c*|.

A value of F = 1 therefore means that for each Wikipedia and rest category the topic detection algorithm found a corresponding cluster.
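A sketch of this evaluation (our names; `found` is the list of produced clusters C and `reference` the list of reference categories C*, each cluster given as a set of keywords):

```python
def f_value(c, c_star):
    """F(c, c*) computed from prec = |c∩c*|/|c| and rec = |c∩c*|/|c*|."""
    overlap = len(c & c_star)
    if overlap == 0:
        return 0.0
    prec, rec = overlap / len(c), overlap / len(c_star)
    return prec * rec / (0.5 * (prec + rec))

def overall_f(found, reference):
    """Size-weighted average over C* of the best-matching F(c, c*)."""
    total = sum(len(c_star) for c_star in reference)
    return sum(len(c_star) * max(f_value(c, c_star) for c in found)
               for c_star in reference) / total
```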
Since the clustering algorithm can produce clusters of different size, the quality of the result depends on the only input parameter of the algorithm, the threshold value to split up a cluster. Its quality (in terms of the overall F-value) is compared between distance functions by varying the number of clusters, as controlled by varying the threshold values. Since there might be several clusters with exactly the same largest distance between two elements, we do not find values for each number of clusters. In particular, for cos sim_tf and JSD sim_doc there are no values that yield a clustering into fewer than 14 and 11 clusters, respectively. In the case of cosine similarity there are in each of these clusters even two keywords with completely orthogonal document distributions. The overall F-values for clustering with the different similarities are given in Figure 1. Cosine similarity with tf.idf vectors performed in between direct cosine similarity and the Jensen-Shannon distance with the document distribution.

The rest category in the reference categorization is not a real cluster or category of keywords. Moreover, this category is about three times as large as the average size of the other categories. We have therefore tested how well the 8 positively motivated categories are detected, and computed the overall F-values averaging only over these 8 categories. The results are given in Figure 2. Results for the cosine similarity with tf.idf vectors were slightly better than clustering with JSD sim_doc.
[Figure 1. Overall F-Measure for clustering based on 9 categories.]
[Figure 2. Overall F-Measure for clustering based on 8 categories.]
5. Conclusions

The experimental results suggest that topic identification by clustering a set of keywords works fairly well, using any of the investigated similarity measures. In the present experiment a recently proposed distribution of terms associated with a keyword clearly gives the best results, but computation of the distribution is relatively expensive. The reason for this is the fact that co-occurrence of terms is (implicitly) taken into account. The document distributions for terms, which are the basis of the other measures, tend to be very sparse vectors, since most words will not occur at all in a given document. In the distribution used for JSD sim_term this problem is alleviated by a kind of 'intelligent smoothing', spreading values from one term to frequently co-occurring terms.

Acknowledgements

This work is part of the MultimediaN project (http://www.multimedian.nl), sponsored by the Dutch government under contract BSIK 03031.
References

[1] F. Archetti, P. Campanelli, E. Fersini, and E. Messina. A hierarchical document clustering environment based on the induced bisecting k-means. In H. L. Larsen, G. Pasi, D. O. Arroyo, T. Andreasen, and H. Christiansen, editors, FQAS, volume 4027 of Lecture Notes in Computer Science, pages 257–269. Springer, 2006.
[2] H. Cunningham, D. Maynard, K. Bontcheva, and V. Tablan. GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications. In Proc. of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02), pages 168–175, Philadelphia, July 2002.
[3] S. T. Dumais, G. W. Furnas, T. K. Landauer, S. Deerwester, and R. Harshman. Using latent semantic analysis to improve access to textual information. In CHI '88: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 281–285, New York, NY, USA, 1988. ACM.
[4] B. Fuglede and F. Topsoe. Jensen-Shannon divergence and Hilbert space embedding. In Proc. of the International Symposium on Information Theory, page 31, 2004.
[5] T. Hofmann. Probabilistic latent semantic analysis. In UAI'99: Uncertainty in Artificial Intelligence, 1999.
[6] T. Hofmann. Unsupervised learning by probabilistic latent semantic analysis. Machine Learning, 42(1-2):177–196, January 2001.
[7] T. Landauer, P. Foltz, and D. Laham. Introduction to latent semantic analysis. Discourse Processes, 25:259–284, 1998.
[8] B. Larsen and C. Aone. Fast and effective text mining using linear-time document clustering. In KDD, pages 16–22, 1999.
[9] H. Li and K. Yamanishi. Text classification using ESC-based stochastic decision lists. Inf. Process. Manage., 38(3):343–361, 2002.
[10] H. Li and K. Yamanishi. Topic analysis using a finite mixture model. Inf. Process. Manage., 39(4):521–541, 2003.
[11] C. D. Manning and H. Schütze. Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, Massachusetts, 1999.
[12] S. Meyer zu Eissen and B. Stein. Analysis of clustering algorithms for web-based search. In D. Karagiannis and U. Reimer, editors, PAKM, volume 2569 of Lecture Notes in Computer Science, pages 168–178. Springer, 2002.
[13] H. Schmid. Probabilistic part-of-speech tagging using decision trees. In International Conference on New Methods in Language Processing, Manchester, UK, 1994.
[14] M. Steinbach, G. Karypis, and V. Kumar. A comparison of document clustering techniques. In Proceedings of the Workshop on Text Mining, 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'00), pages 109–110, August 2000.
[15] L. Zhang, X. Wu, and Y. Yu. Emergent semantics from folksonomies: A quantitative study. 4090:168–186, 2006.