Journal of Advanced Computer Science and Technology Research Vol.2 No.3, September 2012, 127-139
Concept-based vector space model for
improving text clustering
Abdolkarim Elahi (1,a), Ali Shokouhi Rostami (2,b)
(1) Department of Computer, Behshahr Branch, Islamic Azad University, Behshahr, Iran
(2) Department of Computer, Ferdowsi University of Mashhad, Mashhad, Iran
(a) [email protected], (b) [email protected]
ISSN: 2231-8275
Article Info
Received: 25th March 2012
Accepted: 1st May 2012
Published online: 1st September 2012
© 2012 Design for Scientific Renaissance All rights reserved
ABSTRACT
In document clustering, the similarity between documents within a cluster should be high, while the similarity between documents of different clusters should be low. The cosine function measures the similarity of two documents. When the clusters are not well separated, partitioning them based only on pairwise similarity is not good enough, because some documents in different clusters may still be similar to each other, so the function is not efficient. To address this problem, a similarity measure based on the concepts of neighbors and links is used. In the Vector Space Model (VSM), every document is represented by a vector composed of features and their weights. However, TF-IDF has the drawback that exceptionally useful features may be discarded, and this simple VSM cannot represent semantics well because all columns (terms) are treated as independent. Indeed, the VSM ignores all important semantic/conceptual relations between words. We compensate for this by adding the count of words occurring at important places of a document, and by embedding semantic relationship information directly into the weights of semantically related words, readjusting the weight values through similarity measures. In this way, term similarity is used to re-weight term frequency in the VSM. Two clustering algorithms, bisecting k-means and the feature-weighting k-means clustering algorithm, have been used to cluster real-world text data represented in the new knowledge-based VSM. The experimental results show that the clustering performance based on the new model is much better than that based on the traditional term-based VSM.
Keywords: knowledge-based VSM, document clustering, neighbors, link function
1. Introduction
With the generalization and widespread use of computers and the Internet, the ways of obtaining information have changed and the number of documents has increased dramatically. Given the high volume of information available today, quickly finding the information of interest has become more difficult. If no effective method is available for organizing and extracting information, researchers spend more time searching for information than learning from it, which is not acceptable in data mining and information retrieval.
Document clustering is used to organize the set of documents returned by a search engine and can dramatically improve the accuracy of information retrieval systems (Guha, Rastogi, & Shim, 2000; Li, Luo, & Chung, 2008; Deerwester, Dumais, Furnas, Landauer, & Harshman, 1990). Moreover, it provides an efficient way to find the nearest neighbors of a document for a user request (Li, Chung, & Holt, 2008). Document similarity measures the degree of similarity between two or more documents: the more alike two documents are, the higher their similarity level, and otherwise the similarity level decreases. Efficient calculation of similarity among documents is the basis of document clustering, information retrieval, question-answering systems, document classification, and so on, and is usually achieved with the TF-IDF1 weighting scheme. However, such algorithms may neglect some exceptionally useful features. To alleviate this problem, we increase the count of terms appearing in important parts of the text, and the weights used in the feature selection phase are modified accordingly. In general, there are two kinds of clustering methods.
The first is agglomerative hierarchical clustering (AHC) (Ali & Zarowin, 1991). In AHC, every document is first considered as its own cluster, and a distance function is used to calculate the similarity between each pair of clusters (Ali & Zarowin, 1991); in the next step, the closest pair is merged. This merging step is repeated until the desired number of clusters is reached.
In comparison with the bottom-up method (AHC), the family of k-means algorithms, an important example of partitional clustering, provides a division of the documents in which each cluster is represented by a center: k initial centers are selected, each document is assigned to a cluster based on its distance to the k centers, the k centers are then recalculated, and this step is repeated until an optimal set of k clusters is obtained with respect to a criterion function. For document clustering, the Unweighted Pair Group Method with Arithmetic Mean (UPGMA) has been reported as having the highest accuracy in the AHC category (Ali & Zarowin, 1991). Moreover, in the k-means family, the bisecting k-means algorithm has been reported as the best trade-off between accuracy and efficiency (Dyer & Frieze, 1985; Classic Text Database, <ftp://ftp.cs.cornell.edu/pub/smart/>).
One measure of cluster similarity is the cosine function, which is widely used in document clustering algorithms and has been reported as a good criterion. When the cosine function measures the similarity between two clusters, only a pairwise similarity is used to decide whether a document belongs to a cluster or not. But when the clusters are not clearly separated, dividing them based only on pairwise similarity is not sufficiently accurate, since documents in different clusters may be similar to each other. To prevent this problem, the concepts of neighbors and links are used for document clustering (Luo et al., 2009).
In general, two documents that are sufficiently similar to each other are considered neighbors, and according to a given similarity threshold, the neighbors of each document in the document set can be determined. Moreover, the link between two documents is defined as the number of their common neighbors (Luo et al., 2009).
1 Term Frequency-Inverse Document Frequency
Each document can also be viewed as a vector of Boolean features, where each feature corresponds to a unique key word and its value is true when the corresponding word is present. On this basis, the concepts of neighbors and links provide valuable information about the documents in a cluster: in addition to the distance between documents and centers, their neighborhoods are taken into account.
K-means family algorithms consist of two phases: initial clustering and cluster refinement. In the former, the initial k centers are selected and documents are allocated to them, while in the latter, the partitioning is optimized by repeatedly recalculating the cluster centers based on the most recently allocated documents. First, a new method of initial center selection is used together with the improved TF-IDF weighting, so that the initial centers are distributed in a way that yields a good separation of the common-topic documents. This selection of initial centers is based on three values: first, a cosine function computing pairwise similarity; second, the link function; and third, the number of neighbors of each document. Combining these three values helps select high-quality initial centers.
Then, a new similarity measure is used in the refinement phase to determine the nearest cluster center; this value is a combination of the cosine and link functions. TF-IDF, however, has the drawback that exceptionally useful features may be discarded, and the simple VSM cannot represent semantics well because all columns (terms) are treated as independent. Yet many terms in a text are semantically/conceptually related. For example, 'suggestion' and 'advice' can have the same meaning in a document, but they will be represented as two independent terms. This situation can significantly affect the dissimilarity calculation between two documents, and therefore the clustering results. Indeed, the VSM ignores all important semantic/conceptual relations between words, such as synonymy, specialization/generalization, and part/whole relationships, when forming the representation matrix from a set of documents.
We compensate for this by adding the count of words occurring at important places of a document, and by embedding semantic relationship information directly into the weights of semantically related words, readjusting the weight values through similarity measures. Our idea is to re-adjust word weights so that the importance of cluster-dependent core words is increased and the contribution of cluster-independent general words is reduced. We also aim to illustrate the effectiveness of the knowledge-based VSM in improving the performance of the clustering algorithms, and to show that the new knowledge-based term similarity is more reasonable than the existing measures.
Finally, the proposed integrated method was run on the Mahak standard data set (Sheykh et al., 2007) and a significant improvement in clustering accuracy was observed. In Section 2, weighting and the vector space model, the cosine function, the concept of neighbors, the link function, term similarity with an ontology, and the k-means and bisecting k-means algorithms are discussed. In Section 3, the application of the improved weighting together with the neighbor and link functions in the k-means and bisecting k-means algorithms is proposed, and in Section 4 the results of the comparison between the proposed algorithms and the original ones, in terms of accuracy and time, are discussed.
2. Background
2.1 Vector space and weighting model of text documents
In this model, each document is represented as a vector in the word space, i.e., as a term-frequency vector:

D = (tf_1, tf_2, \ldots, tf_m)    (1)

where tf_i is the frequency of word i in the document and m is the total number of words in the text database. First, some preprocessing steps, including stop-word elimination and stemming, are performed on the documents. A common refinement of this model is to weight each word by its inverse document frequency (idf) in the document set, usually obtained by multiplying the frequency of word i by log(n/df_i), where n is the total number of documents in the data set and df_i is the number of documents containing word i. The weighting formula is:

w_{i,d} = tf_{i,d} \times idf_i    (2)

where w_{i,d} is the importance of term i in document d. The idf factor reflects the class-distinguishing power of a term, while tf reflects the distribution of a feature within a document. Term frequency alone may favour words that occur frequently but have low distinguishing ability, so TF-IDF is an effective algorithm for calculating a term weight (Sheykh et al., 2007).
Assuming the vector space model is used to represent the documents, let c_j denote a set (cluster) of documents whose centroid vector is:

\vec{c}_j = \frac{1}{|c_j|} \sum_{d_i \in c_j} \vec{d}_i    (3)

where \vec{d}_i and |c_j| are a document vector and the number of vectors in the set c_j, respectively; the length of \vec{c}_j may differ from that of each \vec{d}_i.
2.2 Measurement of cosine similarity
There are a number of similarity measures for document clustering, most of which use the cosine function. The similarity between two documents d_i and d_j is:

\cos(d_i, d_j) = \frac{\vec{d}_i \cdot \vec{d}_j}{\|\vec{d}_i\| \, \|\vec{d}_j\|}    (4)
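A small helper for the cosine similarity of Eq. (4) over plain lists of term weights (illustrative only, not the authors' code):

```python
import math

def cosine(d_i, d_j):
    """Cosine similarity (Eq. 4): dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(d_i, d_j))
    norm_i = math.sqrt(sum(a * a for a in d_i))
    norm_j = math.sqrt(sum(b * b for b in d_j))
    if norm_i == 0 or norm_j == 0:
        return 0.0
    return dot / (norm_i * norm_j)
```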
2.3 Neighbors and links
The neighbors of a document d in the data set are the documents similar to it (Luo, Li & Soon, 2009). Let sim(d_i, d_j) be a pairwise similarity measure between two documents d_i and d_j, taking values between 0 and 1. For a threshold θ, d_i and d_j are neighbors if

sim(d_i, d_j) \ge \theta, \quad 0 \le \theta \le 1    (5)

The threshold θ (set by the user) controls how similar two documents must be in order to be considered neighbors. The neighbor relation can be represented by an adjacency matrix M, in which each entry M[i, j] is 1 if d_i and d_j are neighbors and 0 otherwise.
The number of neighbors of d_i in the data set is denoted N(d_i), which is the number of entries whose value is 1 in the i-th row of the matrix M. The value of the link function, denoted link(d_i, d_j), is defined as the number of common neighbors of d_i and d_j, and is calculated by multiplying the i-th row of M by its j-th column (Luo, Li & Soon, 2009):

link(d_i, d_j) = \sum_{m=1}^{n} M[i, m] \cdot M[m, j]    (6)

Thus, if link(d_i, d_j) is large, d_i and d_j are more likely to be sufficiently similar and can be placed in the same cluster. The link function is therefore a good measure of the closeness of two documents.
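A minimal sketch of Eqs. (5)-(6): building the adjacency matrix M for a chosen threshold θ and computing neighbor counts and link values. The similarity function (e.g., the cosine helper above) and the threshold value are passed in; the names are illustrative.

```python
def adjacency_matrix(vectors, sim, theta):
    """M[i][j] = 1 if sim(d_i, d_j) >= theta (Eq. 5), i.e. d_i and d_j are neighbors."""
    n = len(vectors)
    return [[1 if sim(vectors[i], vectors[j]) >= theta else 0 for j in range(n)]
            for i in range(n)]

def neighbor_count(M, i):
    """N(d_i): number of 1-entries in the i-th row of M."""
    return sum(M[i])

def link(M, i, j):
    """link(d_i, d_j) (Eq. 6): common neighbors, i.e. the i-th row times the j-th column."""
    return sum(M[i][m] * M[m][j] for m in range(len(M)))
```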
2.4 Term similarity with an ontology
An ontology is a description of the concepts and relationships that exist for an agent or a community of agents, defined for the purpose of enabling knowledge sharing and reuse (Gruber, 1993). An ontology in this paper is defined as a directed acyclic tree T = (C, E) (see Fig. 1), where C is the set of nodes and E is the set of edges such that every edge e_i \in E is an ordered pair (c_i, c_j), with c_i, c_j \in C. The nodes and edges of an ontology are described as follows.
- Nodes: A node represents a concept consisting of a set of synonyms. For a given concept node, its ancestor nodes at lower levels are more general than those at higher levels (an upside-down tree). For instance, in Fig. 1, node B is more general than node D.
Fig. 1. Sample of an ontology
- Edges: Edges are directional, indicating the direction of the relationship between nodes (i.e., a directed edge points from a general node to a specific node). For instance, node D is a specific concept derived from the general concept of node B.
Based on this ontology hierarchy, we can calculate the similarity between each pair of concepts, denoted δ(c_1, c_2). The similarity between two terms t_1 and t_2 can then be found from the semantic relationship between their corresponding concepts c_1 and c_2 in the ontology as

\delta(t_1, t_2) = \max_{c_1 \in s(t_1),\, c_2 \in s(t_2)} \delta(c_1, c_2)    (7)

where s(t_1) is the set of concepts in the taxonomy that are senses of term t_1, i.e., the set of concepts that cover the meaning of term t_1. For example, if term t_1 has multiple meanings, there may be more than one concept in s(t_1). Formula (7) expresses that the similarity between two terms can be obtained by calculating the similarity between the most closely related pair of corresponding concepts (Budanitsky & Hirst, 2006). In this way, words with multiple meanings can be handled.
A simple method to calculate the similarity between two concepts in an ontology is to count the number and directions of edges between the two concept nodes, as used in (Hirst & St-Onge, 1998). However, without considering the levels of the nodes in the ontology hierarchy, the similarity given by this method can be misleading. For example, in Fig. 1, the similarity between nodes J and K would be the same as the similarity between nodes D and E under pure edge counting. However, nodes J and K should be more similar than nodes D and E, because concepts J and K are more specific, whereas concepts D and E are more general from a linguistic point of view.
To account for this, we adjust the term weights using a position weight and a path weight, defined as follows. Position weight: the vertical position of the concepts within the hierarchy, measured by the level of the nearest common parent node of the two concepts, i.e., the level of the nearest common predecessor in the ontology hierarchy. Path weight: the essential distance between the two concepts, measured by the length of the path between each concept and their nearest common parent node, and by the length of the maximum path containing each concept in the hierarchy.
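One plausible way to realize the position-weight/path-weight idea and Eq. (7) is sketched below. The exact combining formula is not spelled out in this section, so the Wu-Palmer-style score (depth of the nearest common parent relative to the concept depths) and the parent mapping used in the example (taken from the node letters of Fig. 1) are assumptions for illustration only.

```python
class OntologyTree:
    """Directed acyclic tree T = (C, E); each concept stores its parent (root has none)."""

    def __init__(self, parent_of):
        self.parent_of = parent_of              # e.g. {"B": "A", "D": "B", ...}

    def path_to_root(self, c):
        path = [c]
        while self.parent_of.get(c) is not None:
            c = self.parent_of[c]
            path.append(c)
        return path                              # concept, parent, ..., root

    def depth(self, c):
        return len(self.path_to_root(c)) - 1

    def nearest_common_parent(self, c1, c2):
        ancestors = set(self.path_to_root(c1))
        for c in self.path_to_root(c2):
            if c in ancestors:
                return c
        return None

    def concept_similarity(self, c1, c2):
        """Wu-Palmer-style score: a deeper common parent (position weight) and shorter
        paths to it (path weight) give a higher similarity."""
        ncp = self.nearest_common_parent(c1, c2)
        if ncp is None:
            return 0.0
        d_ncp = self.depth(ncp)
        return 2.0 * d_ncp / (self.depth(c1) + self.depth(c2)) if d_ncp else 0.0

def term_similarity(tree, senses1, senses2):
    """Eq. (7): similarity of two terms is the best similarity over their concept senses."""
    return max((tree.concept_similarity(c1, c2) for c1 in senses1 for c2 in senses2),
               default=0.0)

if __name__ == "__main__":
    # Hypothetical hierarchy matching the text: D, E under B; J, K under D.
    tree = OntologyTree({"B": "A", "C": "A", "D": "B", "E": "B", "J": "D", "K": "D"})
    print(tree.concept_similarity("J", "K"))   # higher than ...
    print(tree.concept_similarity("D", "E"))   # ... the more general pair, as argued above
```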
2.5 k-means and bisecting k-means algorithm for document clustering
K-means is a general algorithm that divides a data set into k clusters. If the data set consists of N documents d_1, d_2, ..., d_n, the grouping into k clusters seeks an optimal value of the criterion function

\sum_{j=1}^{k} \sum_{i=1}^{n} sim(d_i, c_j)    (8)

The criterion function should be either minimized or maximized depending on how sim(d_i, c_j) is defined, where c_j denotes the center of cluster c_j for j = 1, 2, ..., k and sim(d_i, c_j) evaluates the similarity between document d_i and center c_j. When the vector space model and the cosine function are used for sim(d_i, c_j), every document is allocated to the cluster whose center vector is most similar to it; in other words, the overall criterion function is maximized. This optimization is known to be an NP-complete problem, and the k-means algorithm offers an approximate solution.
The k-means steps are:
1. Select k initial cluster centers, each representing a cluster.
2. For each document in the data set, calculate its similarity to the cluster centers and assign the document to the nearest (most similar) center.
3. Recalculate the k centers based on the documents allocated to each cluster.
4. Repeat steps 2 and 3 until convergence is achieved.
The bisecting k-means algorithm differs from k-means in that its main idea is to split one cluster into two sub-clusters at each step. This algorithm starts with the whole data set as a single cluster and follows these steps (a sketch of both algorithms is given after the list):
1. Select a cluster c_j for splitting based on a heuristic function.
2. Find two sub-clusters of c_j using the k-means algorithm:
(a) Select 2 initial cluster centers.
(b) For each document of c_j, calculate its similarity to the two cluster centers and assign the document to the nearest center.
(c) Recalculate the two centers based on the documents allocated to them.
(d) Repeat steps 2b and 2c until convergence is achieved.
3. Repeat step 2 i times and keep the split that gives the best value of the overall criterion function.
4. Repeat steps 1, 2 and 3 until k clusters are obtained.
Here i denotes the number of times each bisecting step is repeated, determined empirically.
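The two algorithms above can be sketched as follows. This is a minimal illustration, not the authors' implementation: random seeding, a fixed number of passes, and splitting the largest cluster are simplifying assumptions (Sections 3.2-3.4 replace them with rank-based seeding, the combined cosine-link similarity, and the neighbor-based cluster choice). The `sim` argument can be the cosine function of Section 2.2.

```python
import random

def kmeans(vectors, k, sim, iters=10, seed=0):
    """Basic k-means: assign each vector to its most similar center, then recompute
    each center as the mean of its members, for a fixed number of passes."""
    rng = random.Random(seed)
    centers = [list(v) for v in rng.sample(vectors, k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            j = max(range(k), key=lambda c: sim(v, centers[c]))
            clusters[j].append(v)
        for j, members in enumerate(clusters):
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return clusters, centers

def bisecting_kmeans(vectors, k, sim, trials=5):
    """Bisecting k-means: start from one cluster and repeatedly split a chosen cluster
    into two with k-means, keeping the best of `trials` splits."""
    clusters = [list(vectors)]
    while len(clusters) < k:
        j = max(range(len(clusters)), key=lambda c: len(clusters[c]))  # e.g. largest cluster
        to_split = clusters.pop(j)
        if len(to_split) < 2:                      # nothing left to split
            clusters.append(to_split)
            break
        best_score, best_parts = None, None
        for t in range(trials):
            parts, centers = kmeans(to_split, 2, sim, seed=t)
            score = sum(sim(v, centers[i]) for i, part in enumerate(parts) for v in part)
            if best_score is None or score > best_score:
                best_score, best_parts = score, parts
        if best_parts is None or not all(best_parts):
            mid = len(to_split) // 2               # degenerate split: fall back to a halving
            best_parts = [to_split[:mid], to_split[mid:]]
        clusters.extend(best_parts)
    return clusters
```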
3. Applying the weighting improvement with neighbors and links in the k-means family of algorithms
3.1 Improving the calculation of document similarity
The first stage in the document-similarity calculation is selecting the important terms in the training set and then calculating the weight of each term. In other words, when features are chosen in the feature selection stage, they all have the same weight. In practical applications, however, the training set and the application set are always related.
TF-IDF is based on statistical information: the more documents there are in a set, the more influence they have on such a feature. We therefore suggest using these terms when calculating the term weights of the application set, so that the term weights from the selection stage can be taken into account. The improved TF-IDF is:

w_{t,d} = W_t \times tf_{t,d} \times idf_t    (9)

where W_t is the weight of term t in the training set, obtained by adding the counts of the word at the important places. This improved TF-IDF formula is used in the similarity of Equation (11); the resulting similarity calculation is described in the following sections.
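A hedged sketch of the adjusted weighting of Eq. (9). The text states only that W_t is obtained by adding the counts of words at important places; treating "important places" as, for example, the title or headings, and folding that count into a multiplicative boost, is our illustrative assumption.

```python
import math
from collections import Counter

def improved_tfidf(doc_tokens, important_tokens, df, n, boost=1.0):
    """Eq. (9): w_{t,d} = W_t * tf_{t,d} * idf_t, where W_t is raised for terms that
    also occur in 'important places' of the document (an assumed interpretation)."""
    tf = Counter(doc_tokens)
    important = Counter(important_tokens)        # e.g. tokens from the title/headings
    weights = {}
    for t, f in tf.items():
        w_t = 1.0 + boost * important[t]         # count of t at important places
        idf_t = math.log(n / df.get(t, 1))
        weights[t] = w_t * f * idf_t
    return weights
```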
3.2 Selecting initial cluster centers based on ranks
K-means family algorithms start from initial cluster centers, and the documents are repeatedly reassigned until the overall criterion function is optimized. This well-known iterative clustering approach is highly efficient but often converges to a local optimum of the criterion function. Different sets of initial centers therefore lead to different final clusterings; this can be mitigated by starting from a good set of initial centers. Another major issue of the k-means algorithm is that the number of clusters must be fixed in advance, and an inappropriate choice of k may yield poor results. Stability is a popular tool for model selection in clustering, in particular for selecting the number k of clusters: the best parameter k for a given data set is taken to be the one that leads to the most stable clustering results. Automatically determining the number of clusters is one of the most difficult problems in data clustering, and most methods cast it as a model selection problem: the clustering algorithm is run with different values of k and the best value is chosen according to a predefined criterion. There are three common algorithms for selecting initial centers:
1. Random
2. Buckshot (Luo, Li & Soon, 2009)
3. Fractionation
In the random algorithm, k documents of the data set are chosen at random as initial centers. In the Buckshot algorithm, a small random sample of the n documents is clustered by a clustering algorithm and the resulting k centers are used as initial centers. In the fractionation algorithm, the documents are divided into equal-sized groups and clustering is performed within each group; the resulting clusters are then treated as if they were individual documents, and the whole procedure is repeated until k clusters are obtained, whose centers are used as initial centers.
In this section, we present a new method for selecting initial centers based on the neighbor and link concepts together with the cosine function. Documents within a cluster are expected to be similar to each other, so a candidate initial center should not only be sufficiently close to the other documents of its cluster but also be well separated from the other centers.
Given an appropriate similarity threshold θ, the number of neighbors of a document in the data set indicates how many documents are close enough to it. Since both the cosine and the link functions evaluate the similarity between two documents, their combination is used to evaluate the dissimilarity between two documents that have been short-listed as candidate initial centers. First, using the adjacency matrix, the documents are listed in descending order of their number of neighbors; then, to find a set of candidate initial centers, each of which is close enough to the center of a cluster of documents, the top m documents of the list are selected. This set of m candidates is denoted S_m, where m = k + n_plus, k is the desired number of clusters, and n_plus is the number of additional candidates. Since these m candidates have the highest numbers of neighbors in the data set, they are assumed to be the documents most similar to the cluster centers.
As an example, consider a data set S composed of 6 documents d_1, d_2, ..., d_6 whose neighborhood matrix is shown in Fig. 2. With θ = 0.3 (as in Fig. 2), k = 3 and n_plus = 1, S_m contains four documents {d_4, d_1, d_2, d_3}. The cosine and link values for each document pair in S_m are then calculated, and the document pairs are ranked in ascending order of their cosine and link values: rank_cos(d_i, d_j) is the rank of the pair (d_i, d_j) based on its cosine value, rank_link(d_i, d_j) is its rank based on its link value, and rank(d_i, d_j) is the sum of rank_cos(d_i, d_j) and rank_link(d_i, d_j). For both rank_cos(d_i, d_j) and rank_link(d_i, d_j), a smaller value indicates a higher rank, and zero is the highest rank. Consequently, a smaller value of rank(d_i, d_j) indicates a higher rank. Some ranks of document pairs are shown in Table 1.
Table 1: Similarity measurement among primary centroid candidates
Good initial centers are well separated in the data set, so document pairs with higher rank can be considered good candidates for initial centers.
There are mCk combinations for choosing k initial centers from the m candidates; each combination com_k is a k-element subset of S_m, and the rank of each combination com_k is calculated as

rank_{com_k} = \sum rank(d_i, d_j), \quad \text{for } d_i, d_j \in com_k    (10)

That is, the rank value of a combination is the sum of the rank values of the kC2 document pairs among its candidate initial centers.
In this example, there are four combinations and their rank values are shown in Table 2.
Table 2: Four combinations and their rank values

cos(d_i, d_j)   rank_cos   link(d_i, d_j)   rank_link   rank(d_i, d_j)
0.35            2          3                3           5
0.10            1          1                0           1
0.40            3          3                3           6
0.00            0          1                0           0
0.50            4          3                3           7
0.60            5          2                2           7
Then, the combination with the highest rank (the lowest rank value) is selected as the set of initial centers for the k-means algorithm. In this example, {d_1, d_2, d_3} is chosen because it has the lowest rank value. The documents in this combination are well separated, so they can serve as initial centers for k-means.
The effectiveness of this proposed method depends on the choice of n_plus and on the distribution of cluster sizes. Experimental results show that the similarity measure proposed in Section 3.3 further improves the clustering results on the data set.
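The rank-based selection of initial centers described in this section can be sketched as follows. The helper names are ours, and the cosine function and adjacency matrix M are assumed to come from Sections 2.2-2.3; ties share a rank as in the example above.

```python
from itertools import combinations

def rank_values(values):
    """Ascending ranks with ties sharing a rank: rank = number of strictly smaller values."""
    return [sum(1 for w in values if w < v) for v in values]

def select_initial_centers(M, vectors, k, n_plus, cos):
    """Take the m = k + n_plus documents with most neighbors, rank every candidate pair
    by cosine and by link (well-separated pairs rank best), and return the k-subset
    with the smallest aggregated rank (Eq. 10)."""
    n = len(vectors)
    by_neighbors = sorted(range(n), key=lambda i: sum(M[i]), reverse=True)
    sm = by_neighbors[:k + n_plus]

    pairs = list(combinations(sm, 2))
    cos_ranks = rank_values([cos(vectors[i], vectors[j]) for i, j in pairs])
    link_ranks = rank_values([sum(M[i][m] * M[m][j] for m in range(n)) for i, j in pairs])
    pair_rank = {tuple(sorted(p)): cr + lr
                 for p, cr, lr in zip(pairs, cos_ranks, link_ranks)}

    def combo_rank(subset):          # Eq. (10): sum of pair ranks inside the subset
        return sum(pair_rank[tuple(sorted(p))] for p in combinations(subset, 2))

    return min(combinations(sm, k), key=combo_rank)
```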
3.3 Similarity measurement based on the link and cosine functions
The cosine function is a good similarity measure for document clustering: it measures the similarity between two documents as the correlation between their document vectors, defined as the cosine of the angle between the two vectors. A higher cosine value indicates a larger number of shared terms and phrases between the two documents. When the cosine is used in the k-means algorithm, the correlation between each document and each center is evaluated in the allocation step.
A cosine-based similarity measure, however, does not work well for some kinds of document sets. The number of key terms in a data set is usually very large, while the average number of key terms in a single document can be very small. Moreover, documents sharing a common topic in a cluster may each contain only a small fraction of the words in the cluster's large vocabulary. Here we give two examples. The first concerns the relation between a topic and its subtopics: in a cluster about family trees, the vocabulary includes words such as parents, brothers, sisters, uncles, and so on. In this cluster, some documents focus on brothers and sisters while others deal with other family branches, so these documents do not cover all the relationship terms listed above.
The second example concerns synonyms. Different terms may be used in different documents for the same topic. Documents in a cluster about a car factory may use different words to describe a given characteristic of a car; in other words, there are different words for a specific topic, for example the synonymous words auto, automobile and vehicle.
In such cases, the link concept helps us verify the closeness of two documents by examining their neighbors. When a document d_i shares one group of words with its neighbors and d_j shares other words with the neighbors of d_i, the common neighbors of d_i and d_j indicate that the two documents are close even if the cosine function does not consider them similar. Another fact is that the number of possible key words differs greatly between topics with different vocabularies.
In a cluster with a large vocabulary, some document vectors contain a large number of terms, while the majority of document pairs share only a few terms. As a result, if the cosine function is used, the similarity between a document and a center can be very low, because the center is defined as the mean of all document vectors of the cluster. The refinement phase of the k-means algorithm maximizes the overall criterion function; when the cosine function is used as the similarity measure, clusters with large vocabularies therefore tend to be split. If, on the other hand, the overall criterion function is link-based, the information is derived from the large vocabulary: the larger the vocabulary, the higher the correlation (through shared neighbors) between the documents in the cluster.
If document similarity is evaluated with the link function alone, a document has a better chance of joining a larger cluster than a smaller one, because it has more neighbors in the larger cluster, so the partition is again driven by vocabulary size: for a fixed similarity threshold θ, the center of a large cluster c_i has more neighbors than that of a smaller cluster c_j, so for a document d_i, link(d_i, c_i) is larger than link(d_i, c_j). For these reasons, the similarity measure for a k-means family algorithm is calculated as a combination of the link and cosine functions:
f(d_i, c_j) = \alpha \, \frac{link(d_i, c_j)}{l_{max}} + (1 - \alpha) \cos(d_i, c_j), \quad 0 \le \alpha \le 1    (11)
where l_max is the highest possible value of link(d_i, c_j) and α is a coefficient set by the user. In the k-means algorithm, since all documents take part in every clustering pass, the highest possible value of link(d_i, c_j) over all documents of the data set is n, meaning that all documents of the data set are neighbors of both d_i and c_j. In the bisecting k-means algorithm, only the selected documents take part in each bisecting step, so the highest possible value of link(d_i, c_j) is determined accordingly. The lowest possible value of link(d_i, c_j), in both k-means and bisecting k-means, is zero, meaning that d_i and c_j have no shared neighbors. l_max is used to normalize the link values, so link(d_i, c_j)/l_max always lies in [0, 1], and with 0 ≤ α ≤ 1, f(d_i, c_j) is always between 0 and 1.
Equation (11) shows that a weighted sum of the cosine and link functions is used to evaluate the closeness of a document and a center; a higher value of f(d_i, c_j) indicates that they are closer. Experiments on various data sets show that values of α in the range [0.8, 0.95] produce the best results. To calculate link(d_i, c_j), k columns are added to the adjacency matrix M; the resulting matrix is an n x (n + k) matrix denoted M'. The value of link(d_i, c_j) can then be calculated by multiplying the i-th row of M' by its (n + j)-th column:
link(d_i, c_j) = \sum_{m=1}^{n} M'[i, m] \cdot M'[m, n + j]    (12)
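A hedged sketch of Eqs. (11)-(12); the default alpha = 0.9 follows the value the experiments in Section 4 report as a good choice, and M_ext is assumed to be the extended n x (n + k) matrix M' described above.

```python
def combined_similarity(cos_val, link_val, l_max, alpha=0.9):
    """Eq. (11): f(d_i, c_j) = alpha * link(d_i, c_j) / l_max + (1 - alpha) * cos(d_i, c_j).
    With 0 <= alpha <= 1 and link normalised by l_max, the result stays in [0, 1]."""
    link_term = (link_val / l_max) if l_max else 0.0
    return alpha * link_term + (1.0 - alpha) * cos_val

def link_doc_center(M_ext, i, j, n):
    """Eq. (12): link(d_i, c_j) = sum_m M'[i, m] * M'[m, n + j] on the n x (n + k) matrix."""
    return sum(M_ext[i][m] * M_ext[m][n + j] for m in range(n))
```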
3.4 Choosing a cluster for bisecting based on neighbors of the center
In the bisecting k-means algorithm, at each bisecting step a cluster is selected for splitting based on a heuristic function. This function identifies the cluster of lowest quality: a low-quality cluster is one in which the documents are not close to each other or are only weakly related. The selection of a cluster to be bisected should therefore be based on cluster compactness (Guo, 2008). One frequently used method is to evaluate compactness by the cluster diameter; however, since document clusters in vector space can be completely irregular (non-spherical), a larger diameter does not necessarily mean that the cluster is not well connected. In (Classic Text Database, <ftp://ftp.cs.cornell.edu/pub/smart/>) the authors evaluate a cluster by its total similarity, its size, or a combination of the two, but they found that the differences between these measures in the final clustering results are usually small, so they proposed simply bisecting the largest remaining cluster. The neighbor concept, on which the similarity of two documents is defined, provides more information about cluster compactness, so we design a new heuristic function that compares the neighbors of the centers of the remaining clusters, as described below. Experimental results show that the efficiency of bisecting k-means is improved compared with splitting the largest cluster.
For a cluster c_j, the number of local neighbors of its center is denoted N(c_j)_local and can be obtained by counting the entries M'[i, n + j] whose value is 1 for d_i \in c_j. For the same cluster size and the same similarity threshold θ, the center of a compact cluster has more neighbors than that of a non-compact cluster. By the definition of the center, when the similarity threshold θ is fixed, the center of a large cluster tends to have more neighbors than that of a smaller one. Therefore, the number of local neighbors of a center is divided by the cluster size to obtain a normalized value, denoted V(c_j) for cluster c_j, which always lies in [0, 1].
Finally, the cluster with the lowest value of V is selected for bisecting:

V(c_j) = N(c_j)_{local} / |c_j|    (13)
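A minimal sketch of the neighbor-based cluster choice of Eq. (13); how local_neighbor_counts is produced (counting the entries M'[i, n + j] = 1 for d_i in c_j) is assumed from the description above.

```python
def cluster_to_bisect(clusters, local_neighbor_counts):
    """Pick the cluster with the smallest V(c_j) = N(c_j)_local / |c_j| (Eq. 13),
    i.e. the least compact cluster, as the next one to bisect."""
    def v(j):
        size = len(clusters[j])
        return local_neighbor_counts[j] / size if size else 0.0
    return min(range(len(clusters)), key=v)
```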
3.4.1 Our rank-based method involves several steps, and the time complexity of each
step is analyzed in detail as follows:
Step 1: Creation of the neighbor matrix
We use the cosine function to measure similarity. The time complexity of calculating each similarity can be expressed as F_1 D t, where F_1 is a constant for the calculation of the cosine function, D is the number of unique words in the data set, and t is the unit time of a basic operation. The time complexity of creating the neighbor matrix is:

T_{matrix} = (F_1 D t + 2t) \, n^2 / 2 = (F_1 D / 2 + 1) \, n^2 t    (14)

where n is the number of documents in the data set.
Step 2: Obtaining the top m documents with the most neighbors.
First, the number of neighbors of each document is computed from the neighbor matrix, which takes n^2 t. Sorting the n documents takes F_2 n log(n) operations, where F_2 is the constant for each operation of the sort. Obtaining the top m documents from the sorted list takes m operations, with m = k + n_plus = 2k in our experiments. The set of these m initial centroid candidates is denoted S_m, and the time complexity of this step is:

T_{S_m} = n^2 t + F_2 \, n \log(n) \, t + 2kt    (15)
Step 3: Ranking the document pairs in S_m based on the cosine and link values.
There are m(m-1)/2 document pairs in S_m. We first rank them based on their cosine and link values, respectively; the final rank of each document pair is then the sum of these two ranks. The time complexity of ranking the document pairs by their cosine values is:

T_{rank\,\cos(d_i,d_j)} = F_2 \, (m(m-1)/2) \log(m(m-1)/2) \, t = F_2 \, (k(2k-1)) \log(k(2k-1)) \, t    (16)

Thus, the time complexity of step 3 is:
T_{rank} = T_{rank\,\cos(d_i,d_j)} + T_{rank\,link(d_i,d_j)} + T_{add\_ranks}
        = 2k(2k-1)nt + 2 F_2 (k(2k-1)) \log(k(2k-1)) t + k(2k-1) t    (17)
Step 4: Finding the best k-subset out of S_m.
There are mCk k-subsets of the documents in S_m, and we need to find the best k-subset based on the aggregated ranks of all the document pairs it contains. For each k-subset, it takes k(k-1)/2 + 1 operations to check whether it is the best one. Thus, the time complexity of finding the best k-subset is:

T_{best\,combination} = (k(k-1)/2 + 1) \, \frac{m!}{(m-k)! \, k!} \, t = (k(k-1)/2 + 1) \, \frac{(2k)!}{k! \, k!} \, t    (18)
The total time required for the selection of the k initial centroids is therefore:

T_{init} = T_{matrix} + T_{S_m} + T_{rank(d_i,d_j)} + T_{best\,combination}
        = (F_1 D/2 + 2) n^2 t + F_2 n \log(n) t + 2k(2k-1) nt + k(2k-1) t
          + 2 F_2 k(2k-1) \log(k(2k-1)) t + (k(k-1)/2 + 1) \frac{(2k)!}{k! \, k!} t    (19)
Since we can always assume 2k << n and 2k^2 << n for a given data set with n documents, the time complexity of the first three steps is O(n^2). The time complexity of step 4 is exponential in k. Since k is small in most real-life applications, step 4 does not increase the total computation cost much, and the time complexity of the whole process is O(n^2) in that case. However, if k is large, the computation time of step 4 can become very large. We therefore propose a simple alternative to step 4 that removes the exponential component from the time complexity.
When k is large, instead of checking all the possible k-subsets of the documents in S_m to find the best one, we can create a k-subset S' incrementally. After step 3, the document pair with the highest rank is first inserted into S'. Then we perform (k - 2) selections; at each selection, the best document out of k randomly selected documents from S_m is added to S'. The goodness of each candidate document d_i is evaluated by the rank value of the current subset S' after d_i is inserted.
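A sketch of this incremental alternative, assuming pair_rank maps sorted index pairs of S_m to their aggregated rank from step 3 and sm is the candidate set. Interpreting the "goodness" of a candidate as the aggregated pair rank of the subset after insertion follows the description above, but the exact evaluation is not spelled out, so treat this as an assumption.

```python
import random

def incremental_k_subset(sm, pair_rank, k, seed=0):
    """Alternative step 4: start from the best-ranked pair, then perform (k - 2)
    selections; at each one, the best of k randomly drawn candidates from Sm is
    added, judged by the aggregated pair rank of the subset after insertion."""
    rng = random.Random(seed)
    best_pair = min((p for p in pair_rank if p[0] in sm and p[1] in sm),
                    key=pair_rank.get)
    subset = list(best_pair)

    def subset_rank(members):
        return sum(pair_rank[tuple(sorted((a, b)))]
                   for idx, a in enumerate(members) for b in members[idx + 1:])

    while len(subset) < k:
        pool = [d for d in sm if d not in subset]
        if not pool:
            break
        candidates = rng.sample(pool, min(k, len(pool)))
        best = min(candidates, key=lambda d: subset_rank(subset + [d]))
        subset.append(best)
    return subset
```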
4. Experimental results
Using purity and F-measure values, we evaluate the accuracy of the proposed algorithms. The F-measure is the harmonic combination of recall and precision used in information retrieval (Classic Text Database, ftp://ftp.cs.cornell.edu/pub/smart/). If n_i denotes the number of members of class i, n_j the number of members of cluster j, and n_ij the number of members of class i in cluster j, then precision and recall are given by (20) and (21):

P(i, j) = n_{ij} / n_j    (20)

R(i, j) = n_{ij} / n_i    (21)

and the F-measure is

F(i, j) = \frac{2 \, P(i, j) \, R(i, j)}{P(i, j) + R(i, j)}    (22)

Purity is the fraction of a cluster corresponding to the largest class of documents assigned to that cluster:

Purity(j) = \frac{1}{n_j} \max_i (n_{ij})    (23)
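The evaluation measures of Eqs. (20)-(23) can be computed with the small helpers below (illustrative names, not the authors' code).

```python
from collections import Counter

def precision_recall_f(n_ij, n_i, n_j):
    """Eqs. (20)-(22): precision, recall and F-measure of class i versus cluster j."""
    p = n_ij / n_j if n_j else 0.0
    r = n_ij / n_i if n_i else 0.0
    f = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f

def purity(cluster_labels):
    """Eq. (23): fraction of the cluster taken up by its dominant class,
    given the class label of each document in the cluster."""
    counts = Counter(cluster_labels)
    return max(counts.values()) / len(cluster_labels) if cluster_labels else 0.0
```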
4.1 Dataset
We extracted 3 topic categories from the Reuters-21578 corpus2, which contains 135 topics in total, to build the data set ReutersTD. ReutersTD covers documents belonging to these 3 topics.
4.2 Results of clustering
Figures 3 and 4 show the F-measure values of the two clustering algorithms on the three Mahak data sets, and Tables 3 and 4 show the purity values of the clustering results. In the original k-means and bisecting k-means algorithms, the initial centers are selected randomly and the cosine function is used as the similarity measure. In BKM, the largest cluster is bisected in each bisecting step and 5 is the number of repetitions of each step. In the figures, Rank indicates that the initial centers are selected based on document ranking, CL stands for the similarity measure based on cosine and link, NB indicates that the cluster to bisect is selected based on the neighbors of the centers, and CLW stands for the similarity measure based on cosine and link with the WordNet ontology. Each algorithm was run ten times to obtain the average F-measure and purity values. The experimental results show that the proposed methods using neighbors and links significantly improve the clustering of KM and BKM.
2
http://www.daviddlewis.com/resources/testcollections/reuters21578/.
Fig. 2. Neighborhood matrix M for the data set S, θ = 0.3

Fig. 3. F-measure results of the k-means algorithm on the Mahak data set (KM, KM with Rank, KM with CLW, KM with Rank & CL, KM with Rank & CLW, on the Interest, Jobs and Housing subsets)
We tested different values of the coefficient α for k-means with CL on the data set; the results are shown in Figures 5 and 6. When the coefficient is between 0.8 and 0.95, the clustering results are better than those obtained using the cosine function alone. We chose the coefficient α = 0.9 to obtain the other experimental results reported in this section.
Fig. 4. F-measure results for bisecting k-means on Mahak (BKM, BKM with Rank, BKM with Rank & CL, BKM with CLW, BKM with NB, BKM with Rank & CLW, on the Interest, Jobs and Housing subsets)
Fig. 5. The effect of α on the F-measure of k-means with CL (Interest, Jobs and Housing subsets; α from 0.52 to 1)
Results: we readjust the term weights according to the similarity measure between terms. We have proposed a new ontology-based term similarity measure that makes use of the location information of concept nodes in the ontology hierarchy. The experimental results showed that the new similarity measure, together with adding the counts of terms at important places, was more effective in improving the clustering performance than the traditional similarity measure that ignores concept node location.
We used three different methods based on neighbors and links in k-means and bisecting k-means to improve the calculation of document similarity. First, we increased the counts of words occurring at important places and used the WordNet ontology. Then, we extended KM and BKM by using ranking for selecting the initial centers and a linear combination of the link and cosine functions as the similarity measure between a document and a center. Experimental results show that the clustering accuracy of k-means and bisecting k-means is improved by the new methods.
The initial centers selected by this method are well distributed and each of them is close to a sufficient number of related documents, so they improve clustering accuracy. The compactness of a cluster can be measured accurately by the neighbors of its center; therefore, in bisecting k-means, the cluster whose center has the lowest number of local neighbors can be bisected. Moreover, since all of our proposed methods use the same adjacency matrix, they can easily be combined to obtain better clusters. The proposed methods contain steps that can be parallelized, such as finding the neighbors of documents, calculating inter-document links using the adjacency matrix, and selecting the most similar cluster center for each document based on the similarity measure. In future work, we will investigate how parallel methods can be used to further improve clustering.
References
Ali, A., & Zarowin, P. (1991). Permanent versus transitory components of annual earnings.
Bartal, Y., Charikar, M., & Raz, D. (2001). Approximating min-sum k-clustering in metric spaces. In Proc. of the 33rd Annual ACM Symposium on Theory of Computing, pp. 11-20.
Budanitsky, A., & Hirst, G. (2006). Evaluating WordNet-based measures of lexical semantic relatedness. Computational Linguistics, 32(1), 13-47.
Classic Text Database, <ftp://ftp.cs.cornell.edu/pub/smart/>.
Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391-407.
Dyer, M.E., & Frieze, A.M. (1985). A simple heuristic for the p-center problem. Operations Research Letters, 3, 285-288.
Gruber, T. (1993). A translation approach to portable ontologies. Knowledge Acquisition, 5(2), 199-220.
Guha, S., Rastogi, R., & Shim, K. (2000). ROCK: A robust clustering algorithm for categorical attributes. Information Systems, 25(5), 245-266.
Guo, Q. (2008). The similarity computing of documents based on VSM. In Annual IEEE International Computer Software and Applications Conference, pp. 142-148.
Hirst, G., & St-Onge, D. (1998). Lexical chains as representations of context for the detection and correction of malapropisms. In Fellbaum, pp. 305-332.
Holt, J.D., Chung, S.M., & Li, Y. (2007). Usage of mined word associations for text retrieval. In Proc. of the IEEE Int'l Conf. on Tools with Artificial Intelligence (ICTAI-2007), vol. 2, pp. 45-49.
Li, Y., Chung, S.M., & Holt, J.D. (2008). Text document clustering based on frequent word meaning sequences. Data and Knowledge Engineering, 64(1), 281-404.
Li, Y., Luo, C., & Chung, S.M. (2008). Text clustering with feature selection by using statistical data. IEEE Transactions on Knowledge and Data Engineering, 20(5), 641-652.
Luo, C., Li, Y., & Chung, S.M. (2009). Text document clustering based on neighbors. Data & Knowledge Engineering, 1271-1288.
Sheykh, E., Abolhassani, H., Neshati, M., Behrangi, E., Rostami, A., & Mohammadi Nasiri, M. (2007). Mahak: A test collection for evaluation of Farsi information retrieval systems. In Proceedings of the 5th ACS/IEEE International Conference on Computer Systems and Applications (AICCSA-07), Amman, Jordan, May 2007.
Table 3: Purity values for the k-means algorithm

Data set   KM      KM with Rank   KM with CLW   KM with Rank and CLW
Interest   0.711   0.744          0.727         0.769
Jobs       0.495   0.710          0.675         0.720
Housing    0.590   0.700          0.645         0.695
Table 4: Purity values for the bisecting k-means algorithm

Data set   BKM     BKM with Rank   BKM with CLW   BKM with NB   BKM with Rank, CLW and NB
Interest   0.727   0.754           0.753          0.749         0.754
Jobs       0.620   0.690           0.630          0.625         0.655
Housing    0.700   0.699           0.699          0.700         0.700
Fig. 6. The effect of α on the purity of k-means with CL (Interest, Jobs and Housing subsets; α from 0.52 to 1)