Text Clustering Using Semantics
Bhoopesh Choudhary    Pushpak Bhattacharyya
1 Introduction
There are many algorithms for automatic clustering, like the K-Means algorithm [Hartigan and Wong, 1979], Expectation Maximization [Dempster et al., 1977] and hierarchical clustering [Jain and Dubes, 1988], which can be applied to a set of vectors to form the clusters. Traditionally the document is represented by the frequencies of the words that make it up (the vector space model and the self-organizing semantic map [T. Kohonen, 1995]). Different words are then given importance according to criteria like inverse document frequency and information gain. A comparative evaluation of feature selection methods for text documents can be found in [Yang and Pedersen, 1997]. These methods consider the document as a bag of words and do not exploit the relations that may exist between the words.

Figure 1: An example UNL graph (Universal Words such as arrange, Chairman, John, residence, company and meeting, linked by relation labels such as agt, obj, aoj, plc, mod and pos)
2 Document Vector Construction Using UNL Graph Links

In the UNL link method, instead of using the words as components for the document vector, we use the Universal Words (UWs), which are concepts formed by taking English words and attaching restrictions to them, as the components of the vector. Since each UW is disambiguated (for example, the financial bank is represented as bank(icl>financial institute) and the river bank as bank(mod>river) in UNL), different senses of the same word in the document get automatically differentiated, producing correct frequency counts. Each component of the document vector, representing a distinct Universal Word (i.e., a concept), is then assigned the number of links incident on the corresponding node, considering the graph to be undirected. When a UW is not present in the UNL graph of the document, 0 is written in its position. The basic assumption behind counting the links is that the more links there are to and from a Universal Word, the more important that word is in the document.
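To make the construction concrete, here is a minimal sketch of the UNL link method. It assumes, purely for illustration, that each document's UNL graph is available as a list of (relation, source UW, target UW) triples; the triple format, the function name and the specific UW restrictions are our own, not from the paper.

```python
from collections import Counter

def unl_link_vector(graph_triples, vocabulary):
    """Build a document vector where each component is the number of
    links incident on the corresponding Universal Word (UW), treating
    the UNL graph as undirected.

    graph_triples: list of (relation_label, source_uw, target_uw) tuples
    vocabulary:    ordered list of all UWs in the corpus
    """
    degree = Counter()
    for _relation, source, target in graph_triples:
        degree[source] += 1   # an undirected link counts at both endpoints
        degree[target] += 1
    # UWs absent from this document's graph get 0, as in the paper
    return [degree.get(uw, 0) for uw in vocabulary]

# Example: a fragment loosely modeled on the graph of Figure 1
triples = [("agt", "arrange(icl>do)", "Chairman(icl>person)"),
           ("obj", "arrange(icl>do)", "meeting(icl>event)"),
           ("mod", "meeting(icl>event)", "company(icl>organization)")]
vocab = ["arrange(icl>do)", "Chairman(icl>person)",
         "meeting(icl>event)", "company(icl>organization)"]
print(unl_link_vector(triples, vocab))   # -> [2, 1, 2, 1]
```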
3 Document Vector Construction Using UNL Relation Labels

The UNL link method does not consider the labels of the links in the graph. In the relation label based method, instead of a one-dimensional vector we construct an n x n matrix M, where n is the total number of UWs in the corpus encompassing all documents. The element mij of the matrix is the weight assigned to the label of the link connecting the UWs UWi and UWj, or 0 if there is no link between the two UWs. To form the feature vector we add up the columns of the matrix, yielding a one-dimensional vector whose size equals the number of distinct Universal Words in the whole corpus. The relation weights are found using a machine learning approach.
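A sketch of the relation label method follows, under the same assumed triple format as above. The relation weights here are placeholder values standing in for the weights the paper learns with a machine learning approach.

```python
import numpy as np

def unl_relation_vector(graph_triples, vocabulary, relation_weights):
    """Build the n x n matrix M with M[i, j] equal to the weight of the
    relation label linking UW_i and UW_j (0 if unlinked), then sum the
    columns to obtain a single n-dimensional feature vector."""
    index = {uw: i for i, uw in enumerate(vocabulary)}
    n = len(vocabulary)
    M = np.zeros((n, n))
    for relation, source, target in graph_triples:
        # accumulate in case two UWs are linked more than once;
        # the paper does not specify link direction, so this is one choice
        M[index[source], index[target]] += relation_weights[relation]
    return M.sum(axis=0)   # column sums -> feature vector of size n

triples = [("agt", "arrange(icl>do)", "Chairman(icl>person)"),
           ("obj", "arrange(icl>do)", "meeting(icl>event)")]
vocab = ["arrange(icl>do)", "Chairman(icl>person)", "meeting(icl>event)"]
weights = {"agt": 1.0, "obj": 0.8}   # illustrative, not the learned values
print(unl_relation_vector(triples, vocab, weights))   # -> [0.  1.  0.8]
```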
4 Evaluation

Vectors of documents were created using the term frequency, the UNL link and the UNL relation label methods, and were then clustered using Self-Organizing Maps [T. Kohonen, 1995]. The neurons were labeled using the majority approach, i.e., if most of the documents assigned to a neuron belong to cluster C, the label of the neuron is designated as C. After the self-organization process the neurons are labeled and we know the classes of the documents. Comparing the actual classes with the classes found by the SOM then gives the number of documents correctly clustered. The accuracy of clustering is given by

    Accuracy = (Number of documents correctly clustered) / (Total number of documents)
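The majority labeling and the accuracy computation can be expressed compactly as below; the neuron-to-labels mapping format and the function name are our own. The usage example uses the per-neuron class breakdowns reported in figure 2(a) below.

```python
from collections import Counter

def cluster_accuracy(assignments):
    """assignments: dict mapping each neuron to the list of true class
    labels of the documents mapped to it.  Each neuron gets its majority
    class as label; documents of that class count as correctly clustered."""
    correct = sum(max(Counter(labels).values())
                  for labels in assignments.values() if labels)
    total = sum(len(labels) for labels in assignments.values())
    return correct / total

# Figure 2(a): each list is a neuron's class breakdown (n1+n2+n3)
tf_som = {(3, 1): [1]*3 + [2]*2, (3, 3): [2]*5 + [3]*4,
          (2, 2): [1]*2 + [2]*1, (1, 1): [1]*4, (1, 3): [1]*5}
print(cluster_accuracy(tf_som))   # -> 19/26 = 0.7307...
```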
4.1 Experiments

Input:
Total number of documents: 26
Total number of clusters: 3
Documents in cluster 1: 14
Documents in cluster 2: 8
Documents in cluster 3: 4

The Clustering Step:
The dimension of the vectors created by the TF method for the whole set of twenty-six documents was 1025, and the dimension of the vectors created by the UNL methods was 1255. The vectors were then input to a Self-Organizing Map of 9 neurons organized as a 3 x 3 grid.

The outputs of the SOM corresponding to the TF, UNL link and UNL relation methods are shown in figures 2(a), 2(b) and 2(c) respectively. The nine circles in the figures denote the nine neurons of the 3 x 3 SOM. The number inside each circle is the number of documents assigned to that neuron after the self-organization process. The numbers above the circles (n1 + n2 + n3) are the numbers of documents of classes 1, 2 and 3 respectively assigned to that neuron. For example, 3+2+0 above the first circle in figure 2(a) indicates that 3 documents of the first cluster, 2 documents of the second cluster and no documents of the third cluster were mapped to that neuron.

    3+2+0 (5)    0+0+0 (0)    0+5+4 (9)
    0+0+0 (0)    2+1+0 (3)    0+0+0 (0)
    4+0+0 (4)    0+0+0 (0)    5+0+0 (5)
    (a) TF method

    3+0+0 (3)    0+0+0 (0)    5+0+0 (5)
    0+0+0 (0)    2+0+0 (2)    0+0+0 (0)
    4+0+0 (4)    0+0+0 (0)    0+8+4 (12)
    (b) UNL Links method

    0+7+0 (7)    0+0+0 (0)    5+0+0 (5)
    0+1+0 (1)    2+0+4 (6)    0+0+0 (0)
    4+0+0 (4)    0+0+0 (0)    3+0+0 (3)
    (c) UNL Relation method

Figure 2: The different Self-Organizing Maps (each cell shows the n1+n2+n3 breakdown, with the neuron's total document count in parentheses)
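For completeness, the clustering step can be reproduced with a small from-scratch SOM. This is a generic sketch, not the paper's implementation: the 3 x 3 grid comes from the setup above, while the learning rate, neighborhood width and training schedule are our own assumptions.

```python
import numpy as np

def train_som(vectors, rows=3, cols=3, epochs=100,
              lr0=0.5, sigma0=1.5, seed=0):
    """Train a rows x cols SOM on the document vectors; return the
    neuron weights and each document's winning neuron coordinates."""
    rng = np.random.default_rng(seed)
    X = np.asarray(vectors, dtype=float)
    W = rng.random((rows, cols, X.shape[1]))          # neuron weight vectors
    grid = np.dstack(np.meshgrid(np.arange(rows),
                                 np.arange(cols), indexing="ij"))
    for t in range(epochs):
        lr = lr0 * np.exp(-t / epochs)                # decaying learning rate
        sigma = sigma0 * np.exp(-t / epochs)          # shrinking neighborhood
        for x in X[rng.permutation(len(X))]:
            d = np.linalg.norm(W - x, axis=2)         # distance of x to neurons
            bmu = np.unravel_index(d.argmin(), d.shape)
            # Gaussian neighborhood around the best-matching unit
            h = np.exp(-((grid - bmu) ** 2).sum(axis=2) / (2 * sigma ** 2))
            W += lr * h[..., None] * (x - W)
    wins = [np.unravel_index(np.linalg.norm(W - x, axis=2).argmin(),
                             (rows, cols)) for x in X]
    return W, wins
```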
5 Discussion of Results
We denote the neurons by the tuple (row number, column number), with row numbers increasing from bottom to top and column numbers increasing from left to right. As seen in figure 2(a), using the term frequency method the documents of cluster 1 are distributed over neurons (1,1), (1,3), (2,2) and (3,1), while those of cluster 2 go to (3,1), (2,2) and (3,3). The documents of cluster 3 go to (3,3) only. So 2+1 documents of cluster 2 and all 4 documents of cluster 3 are wrongly mapped. Hence the accuracy is 19/26, which is 0.730769.

When we consider the UNL link method, figure 2(b) shows that only the 4 documents of cluster 3 are wrongly mapped, to the neuron for cluster 2 at (1,3). All 8 documents of cluster 2 are together. The documents of cluster 1, which is the big cluster, are distributed over 4 neurons, probably because of intra-document differences in spite of their being from the same cluster. The accuracy here is 22/26, which is 0.846154.

Coming to the last method, that of UNL relation labels, figure 2(c) shows that the distribution of cluster 1 documents is the same as before. However, the cluster 2 documents now stand independently in two neurons, and cluster 3 has got an independent neuron label. The number of wrongly clustered documents is only 2, giving an accuracy of 24/26, which is 0.923077. All the accuracy values are tabulated in Table 1.
Method             Accuracy
Term Frequency     0.730769
UNL Link           0.846154
UNL Relation       0.923077

Table 1: Clustering accuracy of the three methods
6 Conclusion
We have proposed a new method for text clustering that uses the semantic information present in the form of relations between words in sentences. The approach thus differs from traditional clustering methods, which treat the document as a bag of words. As the experiments show, it performs better than methods based on word frequency alone.
References

Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 1990.

A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1):1--38, 1977.

Julio Gonzalo, Felisa Verdejo, Irina Chugur, and Juan Cigarran. Indexing with WordNet Synsets Can Improve Text Retrieval. Proceedings of the COLING/ACL '98 Workshop on Usage of WordNet for NLP, Montreal, 1998.

J. A. Hartigan and M. A. Wong. A K-Means Clustering Algorithm. Applied Statistics, 28:100--108, 1979.

A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs, NJ, USA, 1988.

T. Kohonen. Self-Organizing Maps. Springer Series in Information Sciences, vol. 30, Springer, 1995.

H. Uchida, M. Zhu, and T. Della Senta. UNL: A Gift for a Millennium. The United Nations University, 1995.

Y. Yang and J. O. Pedersen. A Comparative Study on Feature Selection in Text Categorization. Proceedings of the 14th International Conference on Machine Learning (ICML), 1997.