Ijcet: International Journal of Computer Engineering & Technology (Ijcet)
Ijcet: International Journal of Computer Engineering & Technology (Ijcet)
Ijcet: International Journal of Computer Engineering & Technology (Ijcet)
TECHNOLOGY (IJCET)
ISSN 0976 6367(Print) ISSN 0976 6375(Online) Volume 4, Issue 6, November - December (2013), pp. 192-203 IAEME: www.iaeme.com/ijcet.asp Journal Impact Factor (2013): 6.1302 (Calculated by GISI) www.jifactor.com
IJCET
IAEME
SIMILARITY BASED AUTOMATIC ITEM CLUSTERING FOR EFFICIENT CLASSIFICATION OF INFORMATION SPACE
M. Nagarjuna Reddy1,
1
R. Lakshmi Tulasi2
(M. Tech (CSE) Student, Dept. of CSE, QISCET, India) 2 (Professor and HOD, Dept. of IT, QISCET, India)
ABSTRACT Clustering is a useful technique that organizes a large quantity of unordered objects into a small number of meaningful and coherent clusters. The goal of the clustering is to assist in the location of information. It is an essential data analysis method used in many applications such as psychology, biology, information retrieval and mining technologies. Nowadays all manual documents are in automated form, because of fast access and lesser storage. So, to retrieve appropriate documents from huge database it is a major issue. Clustering documents to related groups is one of the active field of research in different fields of text mining, topic tracking systems, and question answering systems. We are proposing four eminent clustering algorithms that use standard similarity metrics on a document corpus to perform the clustering. Here, presents a survey on these existing document clustering algorithms and proposes a framework for comparing them using a similarity measure with respect to a number of documents and processing time. Key Words: Cliques, Corpus, Information Retrieval, Text Mining, Thesaurus, Web Analysis. 1. INTRODUCTION Clustering in general is an important and useful technique that automatically organizes a collection with a substantial number of data objects into a much smaller number of coherent groups [1]. The goal of clustering is to find inherent structures in data, and organize them into meaningful subgroups for further study and analysis. Every year many clustering algorithms have published. They can be existing in very different research fields, and technologically advanced using totally different approaches and techniques. The cluster represents hidden pattern means search can be done by unsupervised learning,
192
International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367(Print), ISSN 0976 - 6375(Online), Volume 4, Issue 6, November - December (2013), IAEME
named a data concept from theoretical or machine learning viewpoint. Clustering plays a vital role from a practical viewpoint in data mining applications such as information retrieval and text mining , scientific data exploration, spatial database applications, marketing, medical diagnostics, computational, cybernetics, genetics, marketing etc., The process of automatically organizing text documents into meaning full clusters or group, such that the documents in the other cluster are dissimilar, and are similar from the document in same clusters is known as document clustering. In text mining it is one of the significant task. There are many techniques introduced for clustering documents since there is fast growth in the field of the World Wide Web and computational technologies. Therefore, simple document clustering to more challenging tasks such as construction of granular taxonomies, and document summarization would require high quality information from raw text documents; which have many related types of objects. It is also one of the most important task in artificial intelligence and machine learning. It has received much attention in recent years [2], [3]. A number of metrics have been proposed to handle document clustering [4], [5], based on different distance measures. On the other hand, more than half a century after it was introduced, according to a latest study [6]; still, the simple algorithm k-means remains as one of the top 10 data mining algorithms. It is the most commonly used partitional clustering algorithm in real time. The k-means clustering technique uses the Euclidean distance; it decreases the sum of the squared Euclidean distance between the cluster centres and the corresponding data points. Due to the document space is represented by high dimensionality; it is desirable to eliminate computation difficulty using a lowdimensional representation of the documents. Unluckily, the Euclidean distance is a dissimilarity measure which defines the dissimilarities rather than similarities among the documents. Generally documents are represented using a model known as a Vector - Space Model. It is a popular model in the information retrieval domain [7] .In VSM model, each element in the domain is taken to be a dimension in a vector space. A collection is represented by vectors, with components along exactly those dimensions corresponding to the elements in the collection. Originally, the vector space model (VSM), introduced by Salton [8], is one of the oldest and most extensively studied models for text mining. This is so because it permits using theories and tools from the area of linear algebra along with a number of heuristics. A collection of n documents are represented by a term-by-document matrix (tdm) of n columns and m rows, where m is the number of terms used to index the collection. Each element aij of the matrix is a suitable measure of the importance of termi with respect to the document and the entire collection. Although numerous alternative weighting schemes have been proposed and extensively studied, there are some well-documented weaknesses that have motivated the development of new methods building on VSM. In VSM, terms are autonomous and accordingly overlook any semantic relations between them. This infers that the closeness between documents is not similar and redundancy increases the dimensionality and affects the performance of clustering algorithms. Based on the similarity between document, term similarity between documents is calculated. Here, we achieve high performance using the VSM model as domain during analysis of four algorithms. The rest of this paper is structured as follows. In Section 2 we describe a number of earlier proposed research works on document clustering. Section 3 provides a fundamental study of the text analysis using document decomposition in its terms and introduces formal procedures for the document-by-term matrix construction and similarity measure useful for performing clustering with the help of four clustering techniques. Section 4 gives an analysis of four clustering methods with the results. Section 5 concludes the paper with fewer discussions.
193
International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367(Print), ISSN 0976 - 6375(Online), Volume 4, Issue 6, November - December (2013), IAEME
2. THE FUNDAMENTAL THEORY Many issues related to document clustering has introduced in the previous section. Now we briefly review a few essential topics to provide a sufficient background for understanding document clustering. I. S. Dhillon et al, 2003 [9] has proposed, Co-clustering by finding a pair of maps from rows to row-clusters and from columns into column-clusters with minimum mutual information loss. In this paper, though, we mainly focus on techniques that indeed do employ a specific measure. One of the most popular measures is Euclidean distance that is: Dist (di, dj) = ||di dj|| (1) It was used in the traditional k-means algorithm. The objective of k-means was to minimize the Euclidean distance between the cluster centroid and that of cluster objects min ||di Cr|| (2)
r=1 d S i r
However, in document clustering data is in a sparse and high-dimensional space, to perform clustering cosine similarity is more extensively used. It is also a popular similarity score in information retrieval and text mining which were presented by Manning C.D, 2009 et al [10]. Particularly, the similarity of two document vectors di and dj, Sim (di, dj), has been defined as the cosine angle between them. This equals to their inner product, for unit vectors, as represented: Sim (di, dj) = cos(di, dj)=ditdj (3) Cosine measure was used in a variant of k-means called spherical k-means. Vector space models were high-dimensional and sparse, and presented unique computational and statistical challenges not commonly encountered in low-dimensional dense data. Clustering was an invaluable tool to organize a vector space model and the associated document collection. Dhillon I., 2003 [9] had used the fast spherical k-means clustering algorithm to produce meaningful clusters with good, descriptive labels. While k-means aims to minimize Euclidean distance, spherical kmeans aims to maximize the cosine similarity between cluster centroid and items in that cluster. The major difference between cosine similarity and Euclidean distance, and therefore between spherical k-means and k-means, was that the former emphasized on vector directions, while the latter focused on vector magnitudes. Beside the direct application in spherical k-means, cosine of document vectors was also widely used in many other documents Y. Zhao and G. Karypis in 2004 [11] calculated the performance of different standard functions in the perspective of partitional clustering algorithms for document datasets. For document clustering they conducted an empirical study to compare a variety of standard functions. Their experimental results showed that there were a set of criterion functions that steadily outperform the rest, and that some of the newly suggested criterion functions believe to be the best overall results. The papers published covered many diverse areas in the document clustering field, those are the visualization of clustered document spaces (Allan et al., 2001) [13], efficient algorithm development (Larsen and Aone, 1999) [12], the document clustering application to browse large document corpus. In general to model vectors for text documents in the information space Text mining
194
International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367(Print), ISSN 0976 - 6375(Online), Volume 4, Issue 6, November - December (2013), IAEME
research depend on a vector space model, first proposed by Salton (1971) [14]. The Term Frequency-Inverse Document Frequency (tfidf) term weighting scheme is simple but assumes independence among words in a document, it is not a major problem for statistical-based methods but poses difficulty in phrase-based analysis. Ingo Feinerer et al., [15] explain how typical application jobs can be carried out and gives a review on text mining abilities in R using a framework. In the following section, we analyze four clustering algorithms based on similarity measure using vector model. 3. DOCUMENT CLUSTERING TECHNIQUES The process of clustering follows the following steps: Step1. Define the domain for the clustering effort. If a document clustering is being performed, it is the determination of the set of items to be clustered. Step2. Once the domain is determined, determine the attributes of the objects to be clustered. If documents are being clustered, the clustering process may focus on specific zones within the items that are to be used to determine similarity. Step3. Determine the strength of the relationships between the objects. For documents, define a similarity function based upon word co-occurrences that determine the similarity between two items. Step4. At this point, the total set of objects and the strengths of the relationships between the objects have been determined. Collection of Data includes the processes as like crawling, indexing, filtering etc. that are used to collect the documents that need to be clustered, index them to store and retrieve in a better way, and filter them to remove the extra data, for example, stop words. 3.1 Pre-processing Pre-processing consists of certain steps. It takes a plain text document as input and output a set of tokens to be included in the vector model. These steps typically consist of: 1. 2. 3. 4. Filtering is the process of removing special characters and punctuation that are not thought to hold any discriminating power under the vector model. Tokenization splits sentences into individual tokens, typically words Stemming is the process of reducing words to their base form or stem. For example, the words connected", connection", connections" are all reduced to the stem connect." Stopword removal, a stopword is a term, which is not thought to convey any meaning as a dimension in the vector space. A typical method to remove stopwords is to compare each term with a compilation of known stopwords. Pruning removes words that appear with very low frequency throughout the corpus. The underlying assumption is that these words, even if they had any discriminating power, would form too small clusters to be useful. A prespecified threshold is typically used, e.g. a small fraction of the number of words in the corpus. Sometimes words that occur too frequently (e.g. in 40% or more of the documents) are also removed.
5.
195
International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367(Print), ISSN 0976 - 6375(Online), Volume 4, Issue 6, November - December (2013), IAEME
Figure 1: Steps of pre-processing 3.2 Term Relation Method In the complete term relation method, the similarity between every term pair is calculated as a basis for determining the clusters. The easiest way to understand this approach is to consider the vector model. The vector model is represented by a matrix where the rows are individual items and the columns are the words in the items (documents). Figure 2 provides an example of a database with 30 items and 8 terms. To determine the relationship between items, a similarity measure is required that is in equation 4. The measure calculates the similarity between two items. The following simple measure is used: SIM (Itemi , Itemj )= ( Termk,i X Termk,j ) (4) Where k is summed across the set of all terms.
Figure 2: Vector example (Term- relation matrix) The results can be placed in a resultant m by m matrix, called an Item-Item Matrix (DOC-DOC Matrix), where m is the number of rows (items) in the original matrix. This simple formula is reflexive so that the matrix that is generated is symmetric. Using the data in Fig. 2, the Item-Item matrix produced is shown in Fig. 3. There are no values on the diagonal since that represents the auto-correlation of a word in itself. The threshold is used to specify similarity of two objects that belong to a same class. Its value effects on the generation of total number of clusters produced by clustering technique. The next step is to select a threshold that decides if two terms are considered similar enough to each other to be in the same class. In this example, the threshold value of 9 is used. Thus two
196
International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367(Print), ISSN 0976 - 6375(Online), Volume 4, Issue 6, November - December (2013), IAEME
items are considered as similar if the similarity value between them is 9 or greater. This produces a new binary matrix called the Item Relationship matrix (Fig. 4) that defines which items are similar. The final step is creating clusters.
Figure 4: Item relationship matrix 3.3 Algorithms Implementation There are many different algorithms available to determine when two objects (words) are in the same cluster. The following algorithms are the most common: cliques, single link, stars and connected components. 3.3.1 Cliques Algorithm
Figure 5: Clusters generated by Clique Technique Clique technique requires all items in a cluster to be within the threshold of all other items. The methodology to create the clusters using cliques is: 0. Let i = 1 1. Select itemi and place it in a new class 2. Start with itemk where r = k = i + 1 3. Validate if itemk is within the threshold of all items within the current class 4. If not, let k = k + 1 5. If k > m (number of words) then r = r + 1 if r = m then go to 6 else k = r create a new class with itemi in it go to 3 else go to 3
197
International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367(Print), ISSN 0976 - 6375(Online), Volume 4, Issue 6, November - December (2013), IAEME
6. If current class only has itemi in it and there are other classes with itemi in them then delete current class else i = i + 1 7. If i = m + 1 then go to 8 else go to 1 8. Eliminate any classes that duplicate or are subsets of other classes. A characteristic of this approach is that items can be found in multiple classes. Applying the algorithm to Fig. 4, the classes are created as shown in Fig.5. 3.3.2 Single Linkage Algorithm Single linkage method begins with the set of objects as discrete clusters; then, at each step combines the two most similar clusters. It is repeated by a nominal number of clusters have been reached. It is impossible for an item to be in two different clusters. This in effect partitions the set of items into the clusters. The algorithm is: 1. Select a item that is not in a class and place that item in a fresh class 2. Place all other items that are related to the item into new class 3. For each term entered into the class, do step 2 4. When no new items can be identified in step 2, go to step 1. Applying the algorithm for creating clusters using a single link to the Item Relationship Matrix, is in Fig. 4, the following classes are created as shown in Fig. 6:
Figure 6: Clusters generated by Single Linkage Technique 3.3.4 The Star Algorithm 1. 2. It selects an item and then places in the class all items that are related to that item. Items not yet in classes are selected as new seeds until all items are assigned to a class.
There are many different classes that can be created using the Star technique. If we always choose as the starting point for a class the lowest numbered item not already in a class, using Fig. 4, the classes are created as shown in Fig. 7. This technique allows items to be in multiple clusters .This could be eliminated by expanding the constraints to exclude any item that has already been selected for a previous cluster.
198
International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367(Print), ISSN 0976 - 6375(Online), Volume 4, Issue 6, November - December (2013), IAEME
3.3.4 The String Algorithm 1. 2. It starts with an item and includes in the class one additional item that is similar to the item selected and not already in a class. The new item is then used as the new node and the process is repeated until no new items can be added because the item being analyzed does not have another item related to it or the items related to it are already in the class. A new class starts with any item not currently in any existing class. Using the additional guidelines to select the lowest numbered item similar to the current item and not to select any item already in an existing class It produces the classes as shown in Fig. 8:
3. 4.
Figure 8: Clusters generated by String Technique 4. ANALYSIS OF DOCUMENT CLUSTERING TECHNIQUES In this section we are going to analyze the four clustering techniques using similarity measure in order to classify multidimensional document space based on threshold by considering domain and processing time in milliseconds.
199
International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367(Print), ISSN 0976 - 6375(Online), Volume 4, Issue 6, November - December (2013), IAEME
4.1 Number of Clusters The clique technique produces classes that have the strongest relationships between all of the words in the class. The clique algorithm produces more classes than the other techniques because the requirement for all items to be similar to all other items will reduce the number of items in a class. This will require more classes to include all the items. The single link technique partitions the items into classes. It produces the least number of classes and the weakest relationship between items. It is possible using the single link algorithm that two items that have a similarity value of zero will be in the same class. The other techniques lie between these two extremes.
Figure 10: Number of clusters generated by each technique at threshold is 9 The above Fig. 9 and Fig. 10 presented the number of clusters produced by four algorithms. The clique algorithm generates many clusters upon increasing the number of documents. The single linkage algorithm produces same and minimal number of clusters by giving different number of documents as an input. Whereas star and string techniques lies between these two as shown in above figures 9 and 10. Here, we have noticed that how the clustering algorithms would produce various clusters based on threshold value, which has been given during construction of item-item binary vector relation matrix. By comparing Fig. 9 and Fig. 10 we observed that if threshold value increases then the total number of clusters generated by clique algorithm also reduced. But star and string techniques have produced a less number of clusters when threshold value has increased. The single linkage algorithm has little effect upon variation of threshold.
200
International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367(Print), ISSN 0976 - 6375(Online), Volume 4, Issue 6, November - December (2013), IAEME
4.2 Average Processing Time Processing time is the amount of time needed by a process to complete its procedure. Here we have presented in Fig. 11 and Fig. 12 how the average processing time varies for each clustering technique depending on the threshold value. The clique clustering technique needed much more processing time in order to produce clusters at threshold is low. But it needs the least amount of time when the threshold value is high and also we observed that if the number of documents increased then the processing time is also increased. Here the processing time for clique is exponentially proportional to the number of clusters. The remaining techniques need little more processing time upon increasing the documents. Except clique clustering technique, rest of all need approximately same processing time irrespective of threshold value. They are not affected by the threshold value which has been used in the clustering. String and star techniques required nearly the same amount of cluster processing time even different threshold values. Whereas single linkage needed some more time than String and star techniques in order to perform clustering on various threshold values. The selection of the technique is also governed by the density of the item relationship matrix. When the Item Relationship Matrix is sparse, then the constraint dependencies between items need to be relaxed such as in single link to create classes with a reasonable number of items. The single link algorithm maximizes recall but can cause selection of many non-relevant items. The single link assignment process has the least overhead in assignment of items to classes.
201
International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367(Print), ISSN 0976 - 6375(Online), Volume 4, Issue 6, November - December (2013), IAEME
5. CONCLUSION In this paper, we presented and analyzed four document clustering methods based on similarity measure. This analysis surely states about the algorithms details, term-document metrics and performance with respect to processing time in milliseconds and a number of documents. As the number of clustering documents increased then the total number of clusters produced by each clustering algorithm is also increased. The processing time needed for document clustering is also increased. The clique technique produces much more number of clusters among four techniques and single linkage technique produced least number of clusters. The star and string techniques may lie between these two. But processing time would be changed upon varying the threshold value. The traditional term-based approaches will not provide better solutions as like Ontology-based and concept- based text clustering. As a future work, improvement over the existing system with better results which offer new information representation capabilities with different techniques like search result clustering, collection clustering and co-clustering can be attempted. The scope of document clustering has on various issues like incremental document clustering, generic topic detection etc. 6. ACKNOWLEDGEMENTS I am thankful to my esteemed guide Prof. R. Lakshmi Tulasi, HOD in Dept. of IT, QIS College of Engineering & Technology who has spared her valuable time and append novel ideas to guide me in limelight and for her motivation, help and continuous support, which made this journal successful. I am also thankful to her for helping me to overcome many problems faced during my research proposals and publications. REFERENCES A. K. Jain, P. J. Flynn, M. N. Murty Data Clustering: ACM Computing Surveys - A Review, Vol. 31, No. 3, pp. 265-321, Sept. 1999. [2] J. Han and R.T. Ng, Spatial Data Mining Efficient and Effective Clustering Methods , Proc. 20th International conference 1994,Very Huge Data Bases , pp. 144-155. [3] P. Pintelas and S. Kotsiantis , A Brief Survey on Recent Advances in Clustering:, WSEAS Trans. vol. 1, no. 1, pp. 73-81, 2004. [4] A.K. McCallum and L.D. Baker , Text Classification Words Distributional Clustering , Proc. 21st Ann. 1998, Intl ACM SIGIR Conf. Research and Development in IR, pp. 96-103. [5] Y. Gong, X. Liu and S. Zhu, Document Clustering with Model Selection and Cluster Refinement Capabilities, Proc. 25th Ann. 2002 Intl ACM SIGIR Conf. Research and Development in IR (SIGIR 02), pp. 191-198. [6] X. Wu, J. Ross Quinlan, V. Kumar, J. Ghosh, Q. Yang, G. J. McLachlan, H. Motoda, B. Liu , P. S. Yu, Z., H. Zhou, M. Steinbach, D. J Hand and D. Steinberg, Top 10 algorithms in data mining, Knowledge Inf. Syst., vol. 14, no. 1, pp. 137, 2007. [7] M. J. McGill, Introduction to Modern Information Retrieval, NY-1983. [8] G. Salton, A. Wong, and C. S. Yang. A vector space model for automatic indexing. Comm. ACM, 18 (11): 613620,1975. [9] Dhillon I.S., D.S. Modha, S. Mallela, Information Theoretic CoClustering, Proc. Ninth ACM SIGKDD Intl Conf. KDD, pp. 89-98, 2003. [10] Manning C.D., P. Raghavan, and H. Schu tze, An Introduction to IR (Information Retrieval). 2009, Cambridge Univ. Press. [1]
202
International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367(Print), ISSN 0976 - 6375(Online), Volume 4, Issue 6, November - December (2013), IAEME
[11] G. Karypis and Zhao Y., Theoretical and Empirical Comparisons of Selected Document Clustering Criterion Functions , Machine Learning, vol. 55, no. 3, pp. 311-331, June [12] C. Aone and Larsen, B. using linear-time document clustering -Fast and effective text mining. Proceedings of the 5th International Conference on ACM Special Interest Group on KDD, Aug. 15-18, CA, pp: 16-22, 1999. [13] Allan, J., R. Swan, A. Leuski, and D. Byrd, 2001. Evaluating ranked lists combinations and inter-document similarity visualizations. Int. J. Inform. Process. Manage., 37: 435-458 [14] Salton, G., 1971. Prentice-Hall, Englewood Cliffs, New Jersey. The SMART RS-Experiment in Automatic Document Processing. [15] Karatzoglou. A, Feinerer. I (2007), In R Decker, HJ Lenz (Eds.), Text Clustering with String Kernels in R, Advances in Data Analysis E.V., Free University at Berlin, March 8 -10, 2006. [16] Meghana. N.Ingole, M.S.Bewoor and S.H.Patil, Context Sensitive Text Summarization using Hierarchical Clustering Algorithm, International Journal of Computer Engineering & Technology (IJCET), Volume 3, Issue 1, 2012, pp. 322 - 329, ISSN Print: 0976 6367, ISSN Online: 0976 6375. [17] Prakasha S, Shashidhar Hr and Dr. G T Raju, A Survey on Various Architectures, Models and Methodologies for Information Retrieval, International Journal of Computer Engineering & Technology (IJCET), Volume 4, Issue 1, 2013, pp. 182 - 194, ISSN Print: 0976 6367, ISSN Online: 0976 6375. [18] Roma V J, M S Bewoor and Dr.S.H.Patil, Automation Tool for Evaluation of the Quality of NLP Based Text Summary Generated Through Summarization and Clustering Techniques by Quantitative and Qualitative Metrics, International Journal of Computer Engineering & Technology (IJCET), Volume 4, Issue 3, 2013, pp. 77 - 85, ISSN Print: 0976 6367, ISSN Online: 0976 6375. [19] Rinal H. Doshi, Dr. Harshad B. Bhadka and Richa Mehta, Development of Pattern Knowledge Discovery Framework using Clustering Data Mining Algorithm, International Journal of Computer Engineering & Technology (IJCET), Volume 4, Issue 3, 2013, pp. 101 - 112, ISSN Print: 0976 6367, ISSN Online: 0976 6375. [20] Inje Bhushan V. and Prof. Mrs. Ujwalapatil, A Comparative Study on Different Types of Effective Methods in Text Mining: A Survey, International Journal of Computer Engineering & Technology (IJCET), Volume 4, Issue 2, 2013, pp. 535 - 542, ISSN Print: 0976 6367, ISSN Online: 0976 6375.
203