Text Clustering Using Semantics
Bhoopesh Choudhary    Pushpak Bhattacharyya
1 Introduction
There are many algorithms for automatic clustering, like the K-Means algorithm [Hartigan and Wong, 1979], Expectation Maximization [Dempster et al., 1977] and hierarchical clustering [Jain and Dubes, 1988], which can be applied to a set of vectors to form the clusters. Traditionally the document is represented by the frequencies of the words that make it up (the vector space model and the self-organizing semantic map [T. Kohonen, 1995]). Different words are then given importance according to criteria like inverse document frequency and information gain. A comparative evaluation of feature selection methods for text documents can be found in [Yang and Pedersen, 1997]. These methods consider the document as a bag of words and do not exploit the relations that may exist between the words.

Figure 1: An example UNL graph (Universal Words such as arrange, Chairman, John, residence, company and meeting, linked by relation labels such as agt, obj, aoj, plc, mod and pos)
2 Document Vector Construction Using UNL Graph Links

In the UNL link method, instead of using the words as components for the document vector, we use the Universal Words (UWs), which are concepts formed by taking English words and attaching restrictions to them, as the components of the vector. Since each UW is disambiguated (for example, the financial bank is represented as bank(icl>financial institute) and the river bank as bank(mod>river) in UNL), different senses of the same word in the document get automatically differentiated, producing correct frequency counts. Each component of the document vector, representing a distinct Universal Word (i.e., a concept), is then assigned the number of links incident on the corresponding node, considering the graph to be undirected. When a UW is not present in the UNL graph of the document, 0 is written in its position. The basic assumption behind counting the links is that the more links there are to and from a Universal Word, the more important that word is in the document.
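To make the construction concrete, here is a minimal sketch of the UNL link method. It assumes, purely for illustration, that each document's UNL graph is available as a list of (relation, source UW, target UW) triples; the triple format, the function name and the specific UW restrictions are our own, not from the paper.

```python
from collections import Counter

def unl_link_vector(graph_triples, vocabulary):
    """Build a document vector where each component is the number of
    links incident on the corresponding Universal Word (UW), treating
    the UNL graph as undirected.

    graph_triples: list of (relation_label, source_uw, target_uw) tuples
    vocabulary:    ordered list of all UWs in the corpus
    """
    degree = Counter()
    for _relation, source, target in graph_triples:
        degree[source] += 1   # an undirected link counts at both endpoints
        degree[target] += 1
    # UWs absent from this document's graph get 0, as in the paper
    return [degree.get(uw, 0) for uw in vocabulary]

# Example: a fragment loosely modeled on the graph of Figure 1
triples = [("agt", "arrange(icl>do)", "Chairman(icl>person)"),
           ("obj", "arrange(icl>do)", "meeting(icl>event)"),
           ("mod", "meeting(icl>event)", "company(icl>organization)")]
vocab = ["arrange(icl>do)", "Chairman(icl>person)",
         "meeting(icl>event)", "company(icl>organization)"]
print(unl_link_vector(triples, vocab))   # -> [2, 1, 2, 1]
```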
3 Document Vector Construction Using UNL Relation Labels

The UNL link method does not consider the labels of the links in the graph. In the relation label based method, instead of a one-dimensional vector we construct an n x n matrix M, where n is the total number of UWs in the corpus encompassing all documents. The element mij of the matrix is the weight assigned to the label of the link connecting the UWs UWi and UWj, or 0 if there is no link between the two UWs. To form the feature vector we add up the columns of the matrix, yielding a one-dimensional vector whose size equals the number of distinct Universal Words in the whole corpus. The relation weights are found using a machine learning approach.
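A sketch of the relation label method follows, under the same assumed triple format as above. The relation weights here are placeholder values standing in for the weights the paper learns with a machine learning approach.

```python
import numpy as np

def unl_relation_vector(graph_triples, vocabulary, relation_weights):
    """Build the n x n matrix M with M[i, j] equal to the weight of the
    relation label linking UW_i and UW_j (0 if unlinked), then sum the
    columns to obtain a single n-dimensional feature vector."""
    index = {uw: i for i, uw in enumerate(vocabulary)}
    n = len(vocabulary)
    M = np.zeros((n, n))
    for relation, source, target in graph_triples:
        # accumulate in case two UWs are linked more than once;
        # the paper does not specify link direction, so this is one choice
        M[index[source], index[target]] += relation_weights[relation]
    return M.sum(axis=0)   # column sums -> feature vector of size n

triples = [("agt", "arrange(icl>do)", "Chairman(icl>person)"),
           ("obj", "arrange(icl>do)", "meeting(icl>event)")]
vocab = ["arrange(icl>do)", "Chairman(icl>person)", "meeting(icl>event)"]
weights = {"agt": 1.0, "obj": 0.8}   # illustrative, not the learned values
print(unl_relation_vector(triples, vocab, weights))   # -> [0.  1.  0.8]
```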
4 Evaluation

Vectors of documents were created using the term frequency, the UNL link and the UNL relation label methods, and were then clustered using Self-Organizing Maps [T. Kohonen, 1995]. The neurons were labeled using the majority approach, i.e., if most of the documents assigned to a neuron belong to cluster C, the label of the neuron is designated as C. After the self-organization process the neurons are labeled and we know the classes of the documents. Comparing the actual classes with the classes found by the SOM then gives the number of documents correctly clustered. The accuracy of clustering is given by

    Accuracy = (Number of documents correctly clustered) / (Total number of documents)
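The majority labeling and the accuracy computation can be expressed compactly as below; the neuron-to-labels mapping format and the function name are our own. The usage example uses the per-neuron class breakdowns reported in figure 2(a) below.

```python
from collections import Counter

def cluster_accuracy(assignments):
    """assignments: dict mapping each neuron to the list of true class
    labels of the documents mapped to it.  Each neuron gets its majority
    class as label; documents of that class count as correctly clustered."""
    correct = sum(max(Counter(labels).values())
                  for labels in assignments.values() if labels)
    total = sum(len(labels) for labels in assignments.values())
    return correct / total

# Figure 2(a): each list is a neuron's class breakdown (n1+n2+n3)
tf_som = {(3, 1): [1]*3 + [2]*2, (3, 3): [2]*5 + [3]*4,
          (2, 2): [1]*2 + [2]*1, (1, 1): [1]*4, (1, 3): [1]*5}
print(cluster_accuracy(tf_som))   # -> 19/26 = 0.7307...
```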
4.1 Experiments

Input:
Total number of documents: 26
Total number of clusters: 3
Documents in cluster 1: 14
Documents in cluster 2: 8
Documents in cluster 3: 4

The Clustering Step:
The dimension of the vectors created by the TF method for the whole set of twenty-six documents was 1025, and the dimension of the vectors created by the UNL methods was 1255. The vectors were then input to a Self-Organizing Map of 9 neurons organized as a 3 x 3 grid.

The outputs of the SOM corresponding to the TF, UNL link and UNL relation methods are shown in figures 2(a), 2(b) and 2(c) respectively. The nine circles in the figures denote the nine neurons of the 3 x 3 SOM. The number inside each circle is the number of documents assigned to that neuron after the self-organization process. The numbers above the circles (n1 + n2 + n3) are the numbers of documents of classes 1, 2 and 3 respectively assigned to that neuron. For example, 3+2+0 above the first circle in figure 2(a) indicates that 3 documents of the first cluster, 2 documents of the second cluster and no documents of the third cluster were mapped to that neuron.

    3+2+0 (5)    0+0+0 (0)    0+5+4 (9)
    0+0+0 (0)    2+1+0 (3)    0+0+0 (0)
    4+0+0 (4)    0+0+0 (0)    5+0+0 (5)
    (a) TF method

    3+0+0 (3)    0+0+0 (0)    5+0+0 (5)
    0+0+0 (0)    2+0+0 (2)    0+0+0 (0)
    4+0+0 (4)    0+0+0 (0)    0+8+4 (12)
    (b) UNL Links method

    0+7+0 (7)    0+0+0 (0)    5+0+0 (5)
    0+1+0 (1)    2+0+4 (6)    0+0+0 (0)
    4+0+0 (4)    0+0+0 (0)    3+0+0 (3)
    (c) UNL Relation method

Figure 2: The different Self-Organizing Maps (each cell shows the n1+n2+n3 breakdown, with the neuron's total document count in parentheses)
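For completeness, the clustering step can be reproduced with a small from-scratch SOM. This is a generic sketch, not the paper's implementation: the 3 x 3 grid comes from the setup above, while the learning rate, neighborhood width and training schedule are our own assumptions.

```python
import numpy as np

def train_som(vectors, rows=3, cols=3, epochs=100,
              lr0=0.5, sigma0=1.5, seed=0):
    """Train a rows x cols SOM on the document vectors; return the
    neuron weights and each document's winning neuron coordinates."""
    rng = np.random.default_rng(seed)
    X = np.asarray(vectors, dtype=float)
    W = rng.random((rows, cols, X.shape[1]))          # neuron weight vectors
    grid = np.dstack(np.meshgrid(np.arange(rows),
                                 np.arange(cols), indexing="ij"))
    for t in range(epochs):
        lr = lr0 * np.exp(-t / epochs)                # decaying learning rate
        sigma = sigma0 * np.exp(-t / epochs)          # shrinking neighborhood
        for x in X[rng.permutation(len(X))]:
            d = np.linalg.norm(W - x, axis=2)         # distance of x to neurons
            bmu = np.unravel_index(d.argmin(), d.shape)
            # Gaussian neighborhood around the best-matching unit
            h = np.exp(-((grid - bmu) ** 2).sum(axis=2) / (2 * sigma ** 2))
            W += lr * h[..., None] * (x - W)
    wins = [np.unravel_index(np.linalg.norm(W - x, axis=2).argmin(),
                             (rows, cols)) for x in X]
    return W, wins
```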
5 Discussion of Results
We denote the neurons by the tuple (row number, column number), with row numbers increasing from bottom to top and column numbers increasing from left to right. As seen in figure 2(a), using the term frequency method the documents of cluster 1 are distributed over neurons (1,1), (1,3), (2,2) and (3,1), while those of cluster 2 go to (3,1), (2,2) and (3,3). The documents of cluster 3 go to (3,3) only. So 2+1 documents of cluster 2 and all 4 documents of cluster 3 are wrongly mapped. Hence the accuracy is 19/26, which is 0.730769.

When we consider the UNL link method, figure 2(b) shows that only the 4 documents of cluster 3 are wrongly mapped, to the neuron for cluster 2 at (1,3). All 8 documents of cluster 2 are together. The documents of cluster 1, which is the big cluster, are distributed over 4 neurons, probably because of intra-document differences in spite of their being from the same cluster. The accuracy here is 22/26, which is 0.846154.

Coming to the last method, that of UNL relation labels, figure 2(c) shows that the distribution of cluster 1 documents is the same as before. However, the cluster 2 documents now stand independently in two neurons, and cluster 3 has got an independent neuron label. The number of wrongly clustered documents is only 2, giving an accuracy of 24/26, which is 0.923077. All the accuracy values are tabulated in Table 1.
Method             Accuracy
Term Frequency     0.730769
UNL Link           0.846154
UNL Relation       0.923077

Table 1: Clustering accuracy of the three methods
6 Conclusion
We have proposed a new method for text clustering that uses the semantic information present in the form of relations between words in sentences. The approach thus differs from traditional clustering methods, which treat the document as a bag of words. As the experiments show, it performs better than methods based on word frequency alone.
References

Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 1990.

A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1):1--38, 1977.

Julio Gonzalo, Felisa Verdejo, Irina Chugur, and Juan Cigarran. Indexing with WordNet Synsets Can Improve Text Retrieval. Proceedings of the COLING/ACL '98 Workshop on Usage of WordNet for NLP, Montreal, 1998.

J. A. Hartigan and M. A. Wong. A K-Means Clustering Algorithm. Applied Statistics, 28:100--108, 1979.

A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs, NJ, USA, 1988.

T. Kohonen. Self-Organizing Maps. Springer Series in Information Sciences, vol. 30, Springer, 1995.

H. Uchida, M. Zhu, and T. Della Senta. UNL: A Gift for a Millennium. The United Nations University, 1995.

Y. Yang and J. O. Pedersen. A Comparative Study on Feature Selection in Text Categorization. Proceedings of the 14th International Conference on Machine Learning (ICML), 1997.