Journal of Advanced Computer Science and Technology Research Vol.2 No.3, September 2012, 127-139
Concept-based vector space model for
improving text clustering
Abdolkarim Elahi (1,a), Ali Shokouhi Rostami (2,b)
(1) Department of Computer, Behshahr Branch, Islamic Azad University, Behshahr, Iran
(2) Department of Computer, Ferdowsi University of Mashhad, Mashhad, Iran
(a) [email protected], (b) [email protected]
ISSN: 2231-8275
Article Info
Received: 25th March 2012
Accepted: 1st May 2012
Published online: 1st September 2012
© 2012 Design for Scientific Renaissance All rights reserved
ABSTRACT
In document clustering, the similarity between documents within a cluster should be high, while the similarity between documents of different clusters should be low. The cosine function measures the similarity of two documents. When the clusters are not well separated, partitioning them based only on pairwise similarity is not good enough, because some documents in different clusters may still be similar to each other, so the function is not efficient. To address this problem, a similarity measure based on the concepts of neighbors and links is used. In the Vector Space Model (VSM), every document is represented by a vector composed of features and their weights. However, TF-IDF has the drawback that exceptionally useful features may be discarded, and this simple VSM cannot represent semantics well because all columns (terms) are treated as independent. Indeed, the VSM ignores all important semantic/conceptual relations between words. We compensate for this by adding the count of words occurring at important places of a document, and by embedding semantic relationship information directly into the weights of semantically related words, readjusting the weight values through similarity measures. In this way, term similarity is used to re-weight term frequency in the VSM. Two clustering algorithms, bisecting k-means and the feature-weighting k-means clustering algorithm, have been used to cluster real-world text data represented in the new knowledge-based VSM. The experimental results show that the clustering performance based on the new model is much better than that based on the traditional term-based VSM.
Keywords: knowledge-based VSM, document clustering, neighbors, link function
1. Introduction
With the generalization and widespread use of computers and the Internet, the ways of obtaining information have changed and the number of documents has increased dramatically. Given the high volume of information available today, quickly finding the information of interest has become more difficult. If no effective method is available for organizing and extracting information, researchers spend more time searching for information than learning from it, which is not acceptable in data mining and information retrieval.
Document clustering is used to organize the set of documents returned by a search engine and can dramatically improve the accuracy of information retrieval systems (Guha, Rastogi, & Shim, 2000; Li, Luo, & Chung, 2008; Deerwester, Dumais, Furnas, Landauer, & Harshman, 1990). Moreover, it provides an efficient way to find the nearest neighbors of a document for a user request (Li, Chung, & Holt, 2008). Document similarity measures the degree of similarity between two or more documents: the more alike two documents are, the higher their similarity level, and otherwise the similarity level decreases. Efficient calculation of similarity among documents is the basis of document clustering, information retrieval, question-answering systems, document classification, and so on, and is usually achieved with the TF-IDF1 weighting scheme. However, such algorithms may neglect some exceptionally useful features. To alleviate this problem, we increase the count of terms appearing in important parts of the text, and the weights used in the feature selection phase are modified accordingly. In general, there are two kinds of clustering methods.
The first is agglomerative hierarchical clustering (AHC) (Ali & Zarowin, 1991). In AHC, every document is first considered as its own cluster, and a distance function is used to calculate the similarity between each pair of clusters (Ali & Zarowin, 1991); in the next step, the closest pair is merged. This merging step is repeated until the desired number of clusters is reached.
In comparison with the bottom-up method (AHC), the family of k-means algorithms, an important example of partitional clustering, provides a division of the documents in which each cluster is represented by a center: k initial centers are selected, each document is assigned to a cluster based on its distance to the k centers, the k centers are then recalculated, and this step is repeated until an optimal set of k clusters is obtained with respect to a criterion function. For document clustering, the Unweighted Pair Group Method with Arithmetic Mean (UPGMA) has been reported as having the highest accuracy in the AHC category (Ali & Zarowin, 1991). Moreover, in the k-means family, the bisecting k-means algorithm has been reported as the best trade-off between accuracy and efficiency (Dyer & Frieze, 1985; Classic Text Database, <ftp://ftp.cs.cornell.edu/pub/smart/>).
One measure of cluster similarity is the cosine function, which is widely used in document clustering algorithms and has been reported as a good criterion. When the cosine function measures the similarity between two clusters, only a pairwise similarity is used to decide whether a document belongs to a cluster or not. But when the clusters are not clearly separated, dividing them based only on pairwise similarity is not sufficiently accurate, since documents in different clusters may be similar to each other. To prevent this problem, the concepts of neighbors and links are used for document clustering (Luo et al., 2009).
In general, two documents that are sufficiently similar to each other are considered neighbors, and according to a given similarity threshold, the neighbors of each document in the document set can be determined. Moreover, the link between two documents is defined as the number of their common neighbors (Luo et al., 2009).
1 Term Frequency-Inverse Document Frequency
Each document can also be viewed as a vector of Boolean features, where each feature corresponds to a unique key word and its value is true when the corresponding word is present. On this basis, the concepts of neighbors and links provide valuable information about the documents in a cluster: in addition to the distance between documents and centers, their neighborhoods are taken into account.
K-means family algorithms consist of two phases: initial clustering and cluster refinement. In the former, the initial k centers are selected and documents are allocated to them, while in the latter, the partitioning is optimized by repeatedly recalculating the cluster centers based on the most recently allocated documents. First, a new method of initial center selection is used together with the improved TF-IDF weighting, so that the initial centers are distributed in a way that yields a good separation of the common-topic documents. This selection of initial centers is based on three values: first, a cosine function computing pairwise similarity; second, the link function; and third, the number of neighbors of each document. Combining these three values helps select high-quality initial centers.
Then, a new similarity measure is used in the refinement phase to determine the nearest cluster center; this value is a combination of the cosine and link functions. TF-IDF, however, has the drawback that exceptionally useful features may be discarded, and the simple VSM cannot represent semantics well because all columns (terms) are treated as independent. Yet many terms in a text are semantically/conceptually related. For example, 'suggestion' and 'advice' can have the same meaning in a document, but they will be represented as two independent terms. This situation can significantly affect the dissimilarity calculation between two documents, and therefore the clustering results. Indeed, the VSM ignores all important semantic/conceptual relations between words, such as synonymy, specialization/generalization, and part/whole relationships, when forming the representation matrix from a set of documents.
We compensate for this by adding the count of words occurring at important places of a document, and by embedding semantic relationship information directly into the weights of semantically related words, readjusting the weight values through similarity measures. Our idea is to re-adjust word weights so that the importance of cluster-dependent core words is increased and the contribution of cluster-independent general words is reduced. We also aim to illustrate the effectiveness of the knowledge-based VSM in improving the performance of the clustering algorithms, and to show that the new knowledge-based term similarity is more reasonable than the existing measures.
Finally, the proposed integrated method was run on the Mahak standard data set (Sheykh et al., 2007) and a significant improvement in clustering accuracy was observed. In Section 2, weighting and the vector space model, the cosine function, the concept of neighbors, the link function, term similarity with an ontology, and the k-means and bisecting k-means algorithms are discussed. In Section 3, the application of the improved weighting together with the neighbor and link functions in the k-means and bisecting k-means algorithms is proposed, and in Section 4 the results of the comparison between the proposed algorithms and the original ones, in terms of accuracy and time, are discussed.
2. Background
2.1 Vector space and weighting model of text documents
In this model, each document is represented as a vector in the word space, i.e., as a term-frequency vector:

D = (tf_1, tf_2, \ldots, tf_m)    (1)

where tf_i is the frequency of word i in the document and m is the total number of words in the text database. First, some preprocessing steps, including stop-word elimination and stemming, are performed on the documents. A common refinement of this model is to weight each word by its inverse document frequency (idf) in the document set, usually obtained by multiplying the frequency of word i by log(n/df_i), where n is the total number of documents in the data set and df_i is the number of documents containing word i. The weighting formula is:

w_{i,d} = tf_{i,d} \times idf_i    (2)

where w_{i,d} is the importance of term i in document d. The idf factor reflects the class-distinguishing power of a term, while tf reflects the distribution of a feature within a document. Term frequency alone may favour words that occur frequently but have low distinguishing ability, so TF-IDF is an effective algorithm for calculating a term weight (Sheykh et al., 2007).
Assuming the vector space model is used to represent the documents, let c_j denote a set (cluster) of documents whose centroid vector is:

\vec{c}_j = \frac{1}{|c_j|} \sum_{d_i \in c_j} \vec{d}_i    (3)

where \vec{d}_i and |c_j| are a document vector and the number of vectors in the set c_j, respectively; the length of \vec{c}_j may differ from that of each \vec{d}_i.
2.2 Measurement of cosine similarity
There are a number of similarity measures for document clustering, most of which use the cosine function. The similarity between two documents d_i and d_j is:

\cos(d_i, d_j) = \frac{\vec{d}_i \cdot \vec{d}_j}{\|\vec{d}_i\| \, \|\vec{d}_j\|}    (4)
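A small helper for the cosine similarity of Eq. (4) over plain lists of term weights (illustrative only, not the authors' code):

```python
import math

def cosine(d_i, d_j):
    """Cosine similarity (Eq. 4): dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(d_i, d_j))
    norm_i = math.sqrt(sum(a * a for a in d_i))
    norm_j = math.sqrt(sum(b * b for b in d_j))
    if norm_i == 0 or norm_j == 0:
        return 0.0
    return dot / (norm_i * norm_j)
```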
2.3 Neighbors and links
The neighbors of a document d in the data set are the documents similar to it (Luo, Li & Soon, 2009). Let sim(d_i, d_j) be a pairwise similarity measure between two documents d_i and d_j, taking values between 0 and 1. For a threshold θ, d_i and d_j are neighbors if

sim(d_i, d_j) \ge \theta, \quad 0 \le \theta \le 1    (5)

The threshold θ (set by the user) controls how similar two documents must be in order to be considered neighbors. The neighbor relation can be represented by an adjacency matrix M, in which each entry M[i, j] is 1 if d_i and d_j are neighbors and 0 otherwise.
The number of neighbors of d_i in the data set is denoted N(d_i), which is the number of entries whose value is 1 in the i-th row of the matrix M. The value of the link function, denoted link(d_i, d_j), is defined as the number of common neighbors of d_i and d_j, and is calculated by multiplying the i-th row of M by its j-th column (Luo, Li & Soon, 2009):

link(d_i, d_j) = \sum_{m=1}^{n} M[i, m] \cdot M[m, j]    (6)

Thus, if link(d_i, d_j) is large, d_i and d_j are more likely to be sufficiently similar and can be placed in the same cluster. The link function is therefore a good measure of the closeness of two documents.
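A minimal sketch of Eqs. (5)-(6): building the adjacency matrix M for a chosen threshold θ and computing neighbor counts and link values. The similarity function (e.g., the cosine helper above) and the threshold value are passed in; the names are illustrative.

```python
def adjacency_matrix(vectors, sim, theta):
    """M[i][j] = 1 if sim(d_i, d_j) >= theta (Eq. 5), i.e. d_i and d_j are neighbors."""
    n = len(vectors)
    return [[1 if sim(vectors[i], vectors[j]) >= theta else 0 for j in range(n)]
            for i in range(n)]

def neighbor_count(M, i):
    """N(d_i): number of 1-entries in the i-th row of M."""
    return sum(M[i])

def link(M, i, j):
    """link(d_i, d_j) (Eq. 6): common neighbors, i.e. the i-th row times the j-th column."""
    return sum(M[i][m] * M[m][j] for m in range(len(M)))
```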
2.4 Term similarity with an ontology
An ontology is a description of the concepts and relationships that exist for an agent or a community of agents, defined for the purpose of enabling knowledge sharing and reuse (Gruber, 1993). An ontology in this paper is defined as a directed acyclic tree T = (C, E) (see Fig. 1), where C is the set of nodes and E is the set of edges such that every edge e_i \in E is an ordered pair (c_i, c_j), with c_i, c_j \in C. The nodes and edges of an ontology are described as follows.
- Nodes: A node represents a concept consisting of a set of synonyms. For a given concept node, its ancestor nodes at lower levels are more general than those at higher levels (an upside-down tree). For instance, in Fig. 1, node B is more general than node D.
Fig. 1. Sample of an ontology
- Edges: Edges are directional, indicating the direction of the relationship between nodes (i.e., a directed edge points from a general node to a specific node). For instance, node D is a specific concept derived from the general concept of node B.
Based on this ontology hierarchy, we can calculate the similarity between each pair of concepts, denoted δ(c_1, c_2). The similarity between two terms t_1 and t_2 can then be found from the semantic relationship between their corresponding concepts c_1 and c_2 in the ontology as

\delta(t_1, t_2) = \max_{c_1 \in s(t_1),\, c_2 \in s(t_2)} \delta(c_1, c_2)    (7)

where s(t_1) is the set of concepts in the taxonomy that are senses of term t_1, i.e., the set of concepts that cover the meaning of term t_1. For example, if term t_1 has multiple meanings, there may be more than one concept in s(t_1). Formula (7) expresses that the similarity between two terms can be obtained by calculating the similarity between the most closely related pair of corresponding concepts (Budanitsky & Hirst, 2006). In this way, words with multiple meanings can be handled.
A simple method to calculate the similarity between two concepts in an ontology is to count the number and directions of edges between the two concept nodes, as used in (Hirst & St-Onge, 1998). However, without considering the levels of the nodes in the ontology hierarchy, the similarity given by this method can be misleading. For example, in Fig. 1, the similarity between nodes J and K would be the same as the similarity between nodes D and E under pure edge counting. However, nodes J and K should be more similar than nodes D and E, because concepts J and K are more specific, whereas concepts D and E are more general from a linguistic point of view.
To account for this, we adjust the term weights using a position weight and a path weight, defined as follows. Position weight: the vertical position of the concepts within the hierarchy, measured by the level of the nearest common parent node of the two concepts, i.e., the level of the nearest common predecessor in the ontology hierarchy. Path weight: the essential distance between the two concepts, measured by the length of the path between each concept and their nearest common parent node, and by the length of the maximum path containing each concept in the hierarchy.
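One plausible way to realize the position-weight/path-weight idea and Eq. (7) is sketched below. The exact combining formula is not spelled out in this section, so the Wu-Palmer-style score (depth of the nearest common parent relative to the concept depths) and the parent mapping used in the example (taken from the node letters of Fig. 1) are assumptions for illustration only.

```python
class OntologyTree:
    """Directed acyclic tree T = (C, E); each concept stores its parent (root has none)."""

    def __init__(self, parent_of):
        self.parent_of = parent_of              # e.g. {"B": "A", "D": "B", ...}

    def path_to_root(self, c):
        path = [c]
        while self.parent_of.get(c) is not None:
            c = self.parent_of[c]
            path.append(c)
        return path                              # concept, parent, ..., root

    def depth(self, c):
        return len(self.path_to_root(c)) - 1

    def nearest_common_parent(self, c1, c2):
        ancestors = set(self.path_to_root(c1))
        for c in self.path_to_root(c2):
            if c in ancestors:
                return c
        return None

    def concept_similarity(self, c1, c2):
        """Wu-Palmer-style score: a deeper common parent (position weight) and shorter
        paths to it (path weight) give a higher similarity."""
        ncp = self.nearest_common_parent(c1, c2)
        if ncp is None:
            return 0.0
        d_ncp = self.depth(ncp)
        return 2.0 * d_ncp / (self.depth(c1) + self.depth(c2)) if d_ncp else 0.0

def term_similarity(tree, senses1, senses2):
    """Eq. (7): similarity of two terms is the best similarity over their concept senses."""
    return max((tree.concept_similarity(c1, c2) for c1 in senses1 for c2 in senses2),
               default=0.0)

if __name__ == "__main__":
    # Hypothetical hierarchy matching the text: D, E under B; J, K under D.
    tree = OntologyTree({"B": "A", "C": "A", "D": "B", "E": "B", "J": "D", "K": "D"})
    print(tree.concept_similarity("J", "K"))   # higher than ...
    print(tree.concept_similarity("D", "E"))   # ... the more general pair, as argued above
```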
2.5 k-means and bisecting k-means algorithm for document clustering
K-means is a general algorithm that divides a data set into k clusters. If the data set consists of N documents d_1, d_2, ..., d_n, the grouping into k clusters seeks an optimal value of the criterion function

\sum_{j=1}^{k} \sum_{i=1}^{n} sim(d_i, c_j)    (8)

The criterion function should be either minimized or maximized depending on how sim(d_i, c_j) is defined, where c_j denotes the center of cluster c_j for j = 1, 2, ..., k and sim(d_i, c_j) evaluates the similarity between document d_i and center c_j. When the vector space model and the cosine function are used for sim(d_i, c_j), every document is allocated to the cluster whose center vector is most similar to it; in other words, the overall criterion function is maximized. This optimization is known to be an NP-complete problem, and the k-means algorithm offers an approximate solution.
The k-means steps are:
1. Select k initial cluster centers, each representing a cluster.
2. For each document in the data set, calculate its similarity to the cluster centers and assign the document to the nearest (most similar) center.
3. Recalculate the k centers based on the documents allocated to each cluster.
4. Repeat steps 2 and 3 until convergence is achieved.
The bisecting k-means algorithm differs from k-means in that its main idea is to split one cluster into two sub-clusters at each step. This algorithm starts with the whole data set as a single cluster and follows these steps (a sketch of both algorithms is given after the list):
1. Select a cluster c_j for splitting based on a heuristic function.
2. Find two sub-clusters of c_j using the k-means algorithm:
(a) Select 2 initial cluster centers.
(b) For each document of c_j, calculate its similarity to the two cluster centers and assign the document to the nearest center.
(c) Recalculate the two centers based on the documents allocated to them.
(d) Repeat steps 2b and 2c until convergence is achieved.
3. Repeat step 2 i times and keep the split that gives the best value of the overall criterion function.
4. Repeat steps 1, 2 and 3 until k clusters are obtained.
Here i denotes the number of times each bisecting step is repeated, determined empirically.
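The two algorithms above can be sketched as follows. This is a minimal illustration, not the authors' implementation: random seeding, a fixed number of passes, and splitting the largest cluster are simplifying assumptions (Sections 3.2-3.4 replace them with rank-based seeding, the combined cosine-link similarity, and the neighbor-based cluster choice). The `sim` argument can be the cosine function of Section 2.2.

```python
import random

def kmeans(vectors, k, sim, iters=10, seed=0):
    """Basic k-means: assign each vector to its most similar center, then recompute
    each center as the mean of its members, for a fixed number of passes."""
    rng = random.Random(seed)
    centers = [list(v) for v in rng.sample(vectors, k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            j = max(range(k), key=lambda c: sim(v, centers[c]))
            clusters[j].append(v)
        for j, members in enumerate(clusters):
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return clusters, centers

def bisecting_kmeans(vectors, k, sim, trials=5):
    """Bisecting k-means: start from one cluster and repeatedly split a chosen cluster
    into two with k-means, keeping the best of `trials` splits."""
    clusters = [list(vectors)]
    while len(clusters) < k:
        j = max(range(len(clusters)), key=lambda c: len(clusters[c]))  # e.g. largest cluster
        to_split = clusters.pop(j)
        if len(to_split) < 2:                      # nothing left to split
            clusters.append(to_split)
            break
        best_score, best_parts = None, None
        for t in range(trials):
            parts, centers = kmeans(to_split, 2, sim, seed=t)
            score = sum(sim(v, centers[i]) for i, part in enumerate(parts) for v in part)
            if best_score is None or score > best_score:
                best_score, best_parts = score, parts
        if best_parts is None or not all(best_parts):
            mid = len(to_split) // 2               # degenerate split: fall back to a halving
            best_parts = [to_split[:mid], to_split[mid:]]
        clusters.extend(best_parts)
    return clusters
```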
3. Applying the weighting improvement with neighbors and links in the k-means family of algorithms
3.1 Improving the calculation of document similarity
The first stage in the document-similarity calculation is selecting the important terms in the training set and then calculating the weight of each term. In other words, when features are chosen in the feature selection stage, they all have the same weight. In practical applications, however, the training set and the application set are always related.
TF-IDF is based on statistical information: the more documents there are in a set, the more influence they have on such a feature. We therefore suggest using these terms when calculating the term weights of the application set, so that the term weights from the selection stage can be taken into account. The improved TF-IDF is:

w_{t,d} = W_t \times tf_{t,d} \times idf_t    (9)

where W_t is the weight of term t in the training set, obtained by adding the counts of the word at the important places. This improved TF-IDF formula is used in the similarity of Equation (11); the resulting similarity calculation is described in the following sections.
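A hedged sketch of the adjusted weighting of Eq. (9). The text states only that W_t is obtained by adding the counts of words at important places; treating "important places" as, for example, the title or headings, and folding that count into a multiplicative boost, is our illustrative assumption.

```python
import math
from collections import Counter

def improved_tfidf(doc_tokens, important_tokens, df, n, boost=1.0):
    """Eq. (9): w_{t,d} = W_t * tf_{t,d} * idf_t, where W_t is raised for terms that
    also occur in 'important places' of the document (an assumed interpretation)."""
    tf = Counter(doc_tokens)
    important = Counter(important_tokens)        # e.g. tokens from the title/headings
    weights = {}
    for t, f in tf.items():
        w_t = 1.0 + boost * important[t]         # count of t at important places
        idf_t = math.log(n / df.get(t, 1))
        weights[t] = w_t * f * idf_t
    return weights
```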
3.2 Selecting initial cluster centers based on ranks
K-means family algorithms start from initial cluster centers, and the documents are repeatedly reassigned until the overall criterion function is optimized. This well-known iterative clustering approach is highly efficient but often converges to a local optimum of the criterion function. Different sets of initial centers therefore lead to different final clusterings; this can be mitigated by starting from a good set of initial centers. Another major issue of the k-means algorithm is that the number of clusters must be fixed in advance, and an inappropriate choice of k may yield poor results. Stability is a popular tool for model selection in clustering, in particular for selecting the number k of clusters: the best parameter k for a given data set is taken to be the one that leads to the most stable clustering results. Automatically determining the number of clusters is one of the most difficult problems in data clustering, and most methods cast it as a model selection problem: the clustering algorithm is run with different values of k and the best value is chosen according to a predefined criterion. There are three common algorithms for selecting initial centers:
1. Random
2. Buckshot (Luo, Li & Soon, 2009)
3. Fractionation
In the random algorithm, k documents of the data set are chosen at random as initial centers. In the Buckshot algorithm, a small random sample of the n documents is clustered by a clustering algorithm and the resulting k centers are used as initial centers. In the fractionation algorithm, the documents are divided into equal-sized groups and clustering is performed within each group; the resulting clusters are then treated as if they were individual documents, and the whole procedure is repeated until k clusters are obtained, whose centers are used as initial centers.
In this section, we present a new method for selecting initial centers based on the neighbor and link concepts together with the cosine function. Documents within a cluster are expected to be similar to each other, so a candidate initial center should not only be sufficiently close to the other documents of its cluster but also be well separated from the other centers.
Given an appropriate similarity threshold θ, the number of neighbors of a document in the data set indicates how many documents are close enough to it. Since both the cosine and the link functions evaluate the similarity between two documents, their combination is used to evaluate the dissimilarity between two documents that have been short-listed as candidate initial centers. First, using the adjacency matrix, the documents are listed in descending order of their number of neighbors; then, to find a set of candidate initial centers, each of which is close enough to the center of a cluster of documents, the top m documents of the list are selected. This set of m candidates is denoted S_m, where m = k + n_plus, k is the desired number of clusters, and n_plus is the number of additional candidates. Since these m candidates have the highest numbers of neighbors in the data set, they are assumed to be the documents most similar to the cluster centers.
As an example, consider a data set S composed of 6 documents d_1, d_2, ..., d_6 whose neighborhood matrix is shown in Fig. 2. With θ = 0.3 (as in Fig. 2), k = 3 and n_plus = 1, S_m contains four documents {d_4, d_1, d_2, d_3}. The cosine and link values for each document pair in S_m are then calculated, and the document pairs are ranked in ascending order of their cosine and link values: rank_cos(d_i, d_j) is the rank of the pair (d_i, d_j) based on its cosine value, rank_link(d_i, d_j) is its rank based on its link value, and rank(d_i, d_j) is the sum of rank_cos(d_i, d_j) and rank_link(d_i, d_j). For both rank_cos(d_i, d_j) and rank_link(d_i, d_j), a smaller value indicates a higher rank, and zero is the highest rank. Consequently, a smaller value of rank(d_i, d_j) indicates a higher rank. Some ranks of document pairs are shown in Table 1.
Table 1: Similarity measurement among primary centroid candidates
Good initial centers are well separated in the data set, so document pairs with higher rank can be considered good candidates for initial centers.
There are mCk combinations for choosing k initial centers from the m candidates; each combination com_k is a k-element subset of S_m, and the rank of each combination com_k is calculated as

rank_{com_k} = \sum rank(d_i, d_j), \quad \text{for } d_i, d_j \in com_k    (10)

That is, the rank value of a combination is the sum of the rank values of the kC2 document pairs among its candidate initial centers.
In this example, there are four combinations and their rank values are shown in Table 2.
Table 2: Four combinations and their rank values

cos(d_i, d_j)   rank_cos   link(d_i, d_j)   rank_link   rank(d_i, d_j)
0.35            2          3                3           5
0.10            1          1                0           1
0.40            3          3                3           6
0.00            0          1                0           0
0.50            4          3                3           7
0.60            5          2                2           7
Then, the combination with the highest rank (the lowest rank value) is selected as the set of initial centers for the k-means algorithm. In this example, {d_1, d_2, d_3} is chosen because it has the lowest rank value. The documents in this combination are well separated, so they can serve as initial centers for k-means.
The effectiveness of this proposed method depends on the choice of n_plus and on the distribution of cluster sizes. Experimental results show that the similarity measure proposed in Section 3.3 further improves the clustering results on the data set.
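The rank-based selection of initial centers described in this section can be sketched as follows. The helper names are ours, and the cosine function and adjacency matrix M are assumed to come from Sections 2.2-2.3; ties share a rank as in the example above.

```python
from itertools import combinations

def rank_values(values):
    """Ascending ranks with ties sharing a rank: rank = number of strictly smaller values."""
    return [sum(1 for w in values if w < v) for v in values]

def select_initial_centers(M, vectors, k, n_plus, cos):
    """Take the m = k + n_plus documents with most neighbors, rank every candidate pair
    by cosine and by link (well-separated pairs rank best), and return the k-subset
    with the smallest aggregated rank (Eq. 10)."""
    n = len(vectors)
    by_neighbors = sorted(range(n), key=lambda i: sum(M[i]), reverse=True)
    sm = by_neighbors[:k + n_plus]

    pairs = list(combinations(sm, 2))
    cos_ranks = rank_values([cos(vectors[i], vectors[j]) for i, j in pairs])
    link_ranks = rank_values([sum(M[i][m] * M[m][j] for m in range(n)) for i, j in pairs])
    pair_rank = {tuple(sorted(p)): cr + lr
                 for p, cr, lr in zip(pairs, cos_ranks, link_ranks)}

    def combo_rank(subset):          # Eq. (10): sum of pair ranks inside the subset
        return sum(pair_rank[tuple(sorted(p))] for p in combinations(subset, 2))

    return min(combinations(sm, k), key=combo_rank)
```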
3.3 Similarity measurement based on the link and cosine functions
The cosine function is a good similarity measure for document clustering: it measures the similarity between two documents as the correlation between their document vectors, defined as the cosine of the angle between the two vectors. A higher cosine value indicates a larger number of shared terms and phrases between the two documents. When the cosine is used in the k-means algorithm, the correlation between each document and each center is evaluated in the allocation step.
A cosine-based similarity measure, however, does not work well for some kinds of document sets. The number of key terms in a data set is usually very large, while the average number of key terms in a single document can be very small. Moreover, documents sharing a common topic in a cluster may each contain only a small fraction of the words in the cluster's large vocabulary. Here we give two examples. The first concerns the relation between a topic and its subtopics: in a cluster about family trees, the vocabulary includes words such as parents, brothers, sisters, uncles, and so on. In this cluster, some documents focus on brothers and sisters while others deal with other family branches, so these documents do not cover all the relationship terms listed above.
The second example concerns synonyms. Different terms may be used in different documents for the same topic. Documents in a cluster about a car factory may use different words to describe a given characteristic of a car; in other words, there are different words for a specific topic, for example the synonymous words auto, automobile and vehicle.
In such cases, the link concept helps us verify the closeness of two documents by examining their neighbors. When a document d_i shares one group of words with its neighbors and d_j shares other words with the neighbors of d_i, the common neighbors of d_i and d_j indicate that the two documents are close even if the cosine function does not consider them similar. Another fact is that the number of possible key words differs greatly between topics with different vocabularies.
In a cluster with a large vocabulary, some document vectors contain a large number of terms, while the majority of document pairs share only a few terms. As a result, if the cosine function is used, the similarity between a document and a center can be very low, because the center is defined as the mean of all document vectors of the cluster. The refinement phase of the k-means algorithm maximizes the overall criterion function; when the cosine function is used as the similarity measure, clusters with large vocabularies therefore tend to be split. If, on the other hand, the overall criterion function is link-based, the information is derived from the large vocabulary: the larger the vocabulary, the higher the correlation (through shared neighbors) between the documents in the cluster.
If document similarity is evaluated with the link function alone, a document has a better chance of joining a larger cluster than a smaller one, because it has more neighbors in the larger cluster, so the partition is again driven by vocabulary size: for a fixed similarity threshold θ, the center of a large cluster c_i has more neighbors than that of a smaller cluster c_j, so for a document d_i, link(d_i, c_i) is larger than link(d_i, c_j). For these reasons, the similarity measure for a k-means family algorithm is calculated as a combination of the link and cosine functions:
f(d_i, c_j) = \alpha \, \frac{link(d_i, c_j)}{l_{max}} + (1 - \alpha) \cos(d_i, c_j), \quad 0 \le \alpha \le 1    (11)
where l_max is the highest possible value of link(d_i, c_j) and α is a coefficient set by the user. In the k-means algorithm, since all documents take part in every clustering pass, the highest possible value of link(d_i, c_j) over all documents of the data set is n, meaning that all documents of the data set are neighbors of both d_i and c_j. In the bisecting k-means algorithm, only the selected documents take part in each bisecting step, so the highest possible value of link(d_i, c_j) is determined accordingly. The lowest possible value of link(d_i, c_j), in both k-means and bisecting k-means, is zero, meaning that d_i and c_j have no shared neighbors. l_max is used to normalize the link values, so link(d_i, c_j)/l_max always lies in [0, 1], and with 0 ≤ α ≤ 1, f(d_i, c_j) is always between 0 and 1.
Equation (11) shows that a weighted sum of the cosine and link functions is used to evaluate the closeness of a document and a center; a higher value of f(d_i, c_j) indicates that they are closer. Experiments on various data sets show that values of α in the range [0.8, 0.95] produce the best results. To calculate link(d_i, c_j), k columns are added to the adjacency matrix M; the resulting matrix is an n x (n + k) matrix denoted M'. The value of link(d_i, c_j) can then be calculated by multiplying the i-th row of M' by its (n + j)-th column:
link(d_i, c_j) = \sum_{m=1}^{n} M'[i, m] \cdot M'[m, n + j]    (12)
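A hedged sketch of Eqs. (11)-(12); the default alpha = 0.9 follows the value the experiments in Section 4 report as a good choice, and M_ext is assumed to be the extended n x (n + k) matrix M' described above.

```python
def combined_similarity(cos_val, link_val, l_max, alpha=0.9):
    """Eq. (11): f(d_i, c_j) = alpha * link(d_i, c_j) / l_max + (1 - alpha) * cos(d_i, c_j).
    With 0 <= alpha <= 1 and link normalised by l_max, the result stays in [0, 1]."""
    link_term = (link_val / l_max) if l_max else 0.0
    return alpha * link_term + (1.0 - alpha) * cos_val

def link_doc_center(M_ext, i, j, n):
    """Eq. (12): link(d_i, c_j) = sum_m M'[i, m] * M'[m, n + j] on the n x (n + k) matrix."""
    return sum(M_ext[i][m] * M_ext[m][n + j] for m in range(n))
```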
3.4 Choosing a cluster for bisecting based on neighbors of the center
In the bisecting k-means algorithm, at each bisecting step a cluster is selected for splitting based on a heuristic function. This function identifies the cluster of lowest quality: a low-quality cluster is one in which the documents are not close to each other or are only weakly related. The selection of a cluster to be bisected should therefore be based on cluster compactness (Guo, 2008). One frequently used method is to evaluate compactness by the cluster diameter; however, since document clusters in vector space can be completely irregular (non-spherical), a larger diameter does not necessarily mean that the cluster is not well connected. In (Classic Text Database, <ftp://ftp.cs.cornell.edu/pub/smart/>) the authors evaluate a cluster by its total similarity, its size, or a combination of the two, but they found that the differences between these measures in the final clustering results are usually small, so they proposed simply bisecting the largest remaining cluster. The neighbor concept, on which the similarity of two documents is defined, provides more information about cluster compactness, so we design a new heuristic function that compares the neighbors of the centers of the remaining clusters, as described below. Experimental results show that the efficiency of bisecting k-means is improved compared with splitting the largest cluster.
For a cluster c_j, the number of local neighbors of its center is denoted N(c_j)_local and can be obtained by counting the entries M'[i, n + j] whose value is 1 for d_i \in c_j. For the same cluster size and the same similarity threshold θ, the center of a compact cluster has more neighbors than that of a non-compact cluster. By the definition of the center, when the similarity threshold θ is fixed, the center of a large cluster tends to have more neighbors than that of a smaller one. Therefore, the number of local neighbors of a center is divided by the cluster size to obtain a normalized value, denoted V(c_j) for cluster c_j, which always lies in [0, 1].
Finally, the cluster with the lowest value of V is selected for bisecting:

V(c_j) = N(c_j)_{local} / |c_j|    (13)
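A minimal sketch of the neighbor-based cluster choice of Eq. (13); how local_neighbor_counts is produced (counting the entries M'[i, n + j] = 1 for d_i in c_j) is assumed from the description above.

```python
def cluster_to_bisect(clusters, local_neighbor_counts):
    """Pick the cluster with the smallest V(c_j) = N(c_j)_local / |c_j| (Eq. 13),
    i.e. the least compact cluster, as the next one to bisect."""
    def v(j):
        size = len(clusters[j])
        return local_neighbor_counts[j] / size if size else 0.0
    return min(range(len(clusters)), key=v)
```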
3.4.1 Our rank-based method involves several steps, and the time complexity of each
step is analyzed in detail as follows:
Step 1: Creation of the neighbor matrix
We use the cosine function to measure similarity. The time complexity of calculating each similarity can be expressed as F_1 D t, where F_1 is a constant for the calculation of the cosine function, D is the number of unique words in the data set, and t is the unit time of a basic operation. The time complexity of creating the neighbor matrix is:

T_{matrix} = (F_1 D t + 2t) \, n^2 / 2 = (F_1 D / 2 + 1) \, n^2 t    (14)

where n is the number of documents in the data set.
Step 2: Obtaining the top m documents with the most neighbors.
First, the number of neighbors of each document is computed from the neighbor matrix, which takes n^2 t. Sorting the n documents takes F_2 n log(n) operations, where F_2 is the constant for each operation of the sort. Obtaining the top m documents from the sorted list takes m operations, with m = k + n_plus = 2k in our experiments. The set of these m initial centroid candidates is denoted S_m, and the time complexity of this step is:

T_{S_m} = n^2 t + F_2 \, n \log(n) \, t + 2kt    (15)
Step 3: Ranking the document pairs in S_m based on the cosine and link values.
There are m(m-1)/2 document pairs in S_m. We first rank them based on their cosine and link values, respectively; the final rank of each document pair is then the sum of these two ranks. The time complexity of ranking the document pairs by their cosine values is:

T_{rank\,\cos(d_i,d_j)} = F_2 \, (m(m-1)/2) \log(m(m-1)/2) \, t = F_2 \, (k(2k-1)) \log(k(2k-1)) \, t    (16)

Thus, the time complexity of step 3 is:
T_{rank} = T_{rank\,\cos(d_i,d_j)} + T_{rank\,link(d_i,d_j)} + T_{add\_ranks}
        = 2k(2k-1)nt + 2 F_2 (k(2k-1)) \log(k(2k-1)) t + k(2k-1) t    (17)
Step 4: Finding the best k-subset out of S_m.
There are mCk k-subsets of the documents in S_m, and we need to find the best k-subset based on the aggregated ranks of all the document pairs it contains. For each k-subset, it takes k(k-1)/2 + 1 operations to check whether it is the best one. Thus, the time complexity of finding the best k-subset is:

T_{best\,combination} = (k(k-1)/2 + 1) \, \frac{m!}{(m-k)! \, k!} \, t = (k(k-1)/2 + 1) \, \frac{(2k)!}{k! \, k!} \, t    (18)
The total time required for the selection of the k initial centroids is therefore:

T_{init} = T_{matrix} + T_{S_m} + T_{rank(d_i,d_j)} + T_{best\,combination}
        = (F_1 D/2 + 2) n^2 t + F_2 n \log(n) t + 2k(2k-1) nt + k(2k-1) t
          + 2 F_2 k(2k-1) \log(k(2k-1)) t + (k(k-1)/2 + 1) \frac{(2k)!}{k! \, k!} t    (19)
Since we can always assume 2k << n and 2k^2 << n for a given data set with n documents, the time complexity of the first three steps is O(n^2). The time complexity of step 4 is exponential in k. Since k is small in most real-life applications, step 4 does not increase the total computation cost much, and the time complexity of the whole process is O(n^2) in that case. However, if k is large, the computation time of step 4 can become very large. We therefore propose a simple alternative to step 4 that removes the exponential component from the time complexity.
When k is large, instead of checking all the possible k-subsets of the documents in S_m to find the best one, we can create a k-subset S' incrementally. After step 3, the document pair with the highest rank is first inserted into S'. Then we perform (k - 2) selections; at each selection, the best document out of k randomly selected documents from S_m is added to S'. The goodness of each candidate document d_i is evaluated by the rank value of the current subset S' after d_i is inserted.
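A sketch of this incremental alternative, assuming pair_rank maps sorted index pairs of S_m to their aggregated rank from step 3 and sm is the candidate set. Interpreting the "goodness" of a candidate as the aggregated pair rank of the subset after insertion follows the description above, but the exact evaluation is not spelled out, so treat this as an assumption.

```python
import random

def incremental_k_subset(sm, pair_rank, k, seed=0):
    """Alternative step 4: start from the best-ranked pair, then perform (k - 2)
    selections; at each one, the best of k randomly drawn candidates from Sm is
    added, judged by the aggregated pair rank of the subset after insertion."""
    rng = random.Random(seed)
    best_pair = min((p for p in pair_rank if p[0] in sm and p[1] in sm),
                    key=pair_rank.get)
    subset = list(best_pair)

    def subset_rank(members):
        return sum(pair_rank[tuple(sorted((a, b)))]
                   for idx, a in enumerate(members) for b in members[idx + 1:])

    while len(subset) < k:
        pool = [d for d in sm if d not in subset]
        if not pool:
            break
        candidates = rng.sample(pool, min(k, len(pool)))
        best = min(candidates, key=lambda d: subset_rank(subset + [d]))
        subset.append(best)
    return subset
```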
4. Experimental results
Using purity and F-measure values, we evaluate the accuracy of the proposed algorithms. The F-measure is the harmonic combination of recall and precision used in information retrieval (Classic Text Database, ftp://ftp.cs.cornell.edu/pub/smart/). If n_i denotes the number of members of class i, n_j the number of members of cluster j, and n_ij the number of members of class i in cluster j, then precision and recall are given by (20) and (21):

P(i, j) = n_{ij} / n_j    (20)

R(i, j) = n_{ij} / n_i    (21)

and the F-measure is

F(i, j) = \frac{2 \, P(i, j) \, R(i, j)}{P(i, j) + R(i, j)}    (22)

Purity is the fraction of a cluster corresponding to the largest class of documents assigned to that cluster:

Purity(j) = \frac{1}{n_j} \max_i (n_{ij})    (23)
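The evaluation measures of Eqs. (20)-(23) can be computed with the small helpers below (illustrative names, not the authors' code).

```python
from collections import Counter

def precision_recall_f(n_ij, n_i, n_j):
    """Eqs. (20)-(22): precision, recall and F-measure of class i versus cluster j."""
    p = n_ij / n_j if n_j else 0.0
    r = n_ij / n_i if n_i else 0.0
    f = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f

def purity(cluster_labels):
    """Eq. (23): fraction of the cluster taken up by its dominant class,
    given the class label of each document in the cluster."""
    counts = Counter(cluster_labels)
    return max(counts.values()) / len(cluster_labels) if cluster_labels else 0.0
```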
4.1 Dataset
We extracted 3 topic categories from the Reuters-21578 corpus2, which contains 135 topics in total, to build the data set ReutersTD. ReutersTD covers documents belonging to these 3 topics.
4.2 Results of clustering
Figures 3 and 4 show the F-measure values of the two clustering algorithms on the three Mahak data sets, and Tables 3 and 4 show the purity values of the clustering results. In the original k-means and bisecting k-means algorithms, the initial centers are selected randomly and the cosine function is used as the similarity measure. In BKM, the largest cluster is bisected in each bisecting step and 5 is the number of repetitions of each step. In the figures, Rank indicates that the initial centers are selected based on document ranking, CL stands for the similarity measure based on cosine and link, NB indicates that the cluster to bisect is selected based on the neighbors of the centers, and CLW stands for the similarity measure based on cosine and link with the WordNet ontology. Each algorithm was run ten times to obtain the average F-measure and purity values. The experimental results show that the proposed methods using neighbors and links significantly improve the clustering of KM and BKM.
2
http://www.daviddlewis.com/resources/testcollections/reuters21578/.
Fig. 2. Neighborhood matrix M for the data set S, θ = 0.3

Fig. 3. F-measure results of the k-means algorithm on the Mahak data set (KM, KM with Rank, KM with CLW, KM with Rank & CL, KM with Rank & CLW, on the Interest, Jobs and Housing subsets)
We tested different values of the coefficient α for k-means with CL on the data set; the results are shown in Figures 5 and 6. When the coefficient is between 0.8 and 0.95, the clustering results are better than those obtained using the cosine function alone. We chose the coefficient α = 0.9 to obtain the other experimental results reported in this section.
Fig. 4. F-measure results for bisecting k-means on Mahak (BKM, BKM with Rank, BKM with Rank & CL, BKM with CLW, BKM with NB, BKM with Rank & CLW, on the Interest, Jobs and Housing subsets)
Fig. 5. The effect of α on the F-measure of k-means with CL (Interest, Jobs and Housing subsets; α from 0.52 to 1)
Results: we readjust the term weights according to the similarity measure between terms. We have proposed a new ontology-based term similarity measure that makes use of the location information of concept nodes in the ontology hierarchy. The experimental results showed that the new similarity measure, together with adding the counts of terms at important places, was more effective in improving the clustering performance than the traditional similarity measure that ignores concept node location.
We used three different methods based on neighbors and links in k-means and bisecting k-means to improve the calculation of document similarity. First, we increased the counts of words occurring at important places and used the WordNet ontology. Then, we extended KM and BKM by using ranking for selecting the initial centers and a linear combination of the link and cosine functions as the similarity measure between a document and a center. Experimental results show that the clustering accuracy of k-means and bisecting k-means is improved by the new methods.
The initial centers selected by this method are well distributed and each of them is close to a sufficient number of related documents, so they improve clustering accuracy. The compactness of a cluster can be measured accurately by the neighbors of its center; therefore, in bisecting k-means, the cluster whose center has the lowest number of local neighbors can be bisected. Moreover, since all of our proposed methods use the same adjacency matrix, they can easily be combined to obtain better clusters. The proposed methods contain steps that can be parallelized, such as finding the neighbors of documents, calculating inter-document links using the adjacency matrix, and selecting the most similar cluster center for each document based on the similarity measure. In future work, we will investigate how parallel methods can be used to further improve clustering.
References
Ali, A., & Zarowin, P. (1991). Permanent versus transitory components of annual earnings.
Bartal, Y., Charikar, M., & Raz, D. (2001). Approximating min-sum k-clustering in metric spaces. In Proc. of the 33rd Annual ACM Symposium on Theory of Computing, pp. 11-20.
Budanitsky, A., & Hirst, G. (2006). Evaluating WordNet-based measures of lexical semantic relatedness. Computational Linguistics, 32(1), 13-47.
Classic Text Database, <ftp://ftp.cs.cornell.edu/pub/smart/>.
Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391-407.
Dyer, M.E., & Frieze, A.M. (1985). A simple heuristic for the p-center problem. Operations Research Letters, 3, 285-288.
Gruber, T. (1993). A translation approach to portable ontologies. Knowledge Acquisition, 5(2), 199-220.
Guha, S., Rastogi, R., & Shim, K. (2000). ROCK: A robust clustering algorithm for categorical attributes. Information Systems, 25(5), 245-266.
Guo, Q. (2008). The similarity computing of documents based on VSM. In Annual IEEE International Computer Software and Applications Conference, pp. 142-148.
Hirst, G., & St-Onge, D. (1998). Lexical chains as representations of context for the detection and correction of malapropisms. In Fellbaum, pp. 305-332.
Holt, J.D., Chung, S.M., & Li, Y. (2007). Usage of mined word associations for text retrieval. In Proc. of the IEEE Int'l Conf. on Tools with Artificial Intelligence (ICTAI-2007), vol. 2, pp. 45-49.
Li, Y., Chung, S.M., & Holt, J.D. (2008). Text document clustering based on frequent word meaning sequences. Data and Knowledge Engineering, 64(1), 281-404.
Li, Y., Luo, C., & Chung, S.M. (2008). Text clustering with feature selection by using statistical data. IEEE Transactions on Knowledge and Data Engineering, 20(5), 641-652.
Luo, C., Li, Y., & Chung, S.M. (2009). Text document clustering based on neighbors. Data & Knowledge Engineering, 1271-1288.
Sheykh, E., Abolhassani, H., Neshati, M., Behrangi, E., Rostami, A., & Mohammadi Nasiri, M. (2007). Mahak: A test collection for evaluation of Farsi information retrieval systems. In Proceedings of the 5th ACS/IEEE International Conference on Computer Systems and Applications (AICCSA-07), Amman, Jordan, May 2007.
Table 3: Purity values for the k-means algorithm

Data set   KM      KM with Rank   KM with CLW   KM with Rank and CLW
Interest   0.711   0.744          0.727         0.769
Jobs       0.495   0.710          0.675         0.720
Housing    0.590   0.700          0.645         0.695
Table 4: Purity values for the bisecting k-means algorithm

Data set   BKM     BKM with Rank   BKM with CLW   BKM with NB   BKM with Rank, CLW and NB
Interest   0.727   0.754           0.753          0.749         0.754
Jobs       0.620   0.690           0.630          0.625         0.655
Housing    0.700   0.699           0.699          0.700         0.700
Fig. 6. The effect of α on the purity of k-means with CL (Interest, Jobs and Housing subsets; α from 0.52 to 1)