A New Hierarchical Document Clustering Method: Gang Kou Yi Peng
A New Hierarchical Document Clustering Method: Gang Kou Yi Peng
A New Hierarchical Document Clustering Method: Gang Kou Yi Peng
Yi Peng1,2
School of Management and Economics University of Electronic Science and Technology of China Chengdu, P.R. China, 610054 2 CAS Research Center on Fictitious Economy and Data Sciences Beijing, P. R. China, 100080 pengyi@uestc.edu.cn effectively retrieve relevant documents over large corpora increase. As an unsupervised learning method, clustering is used in many fields, such as statistics, data mining, machine learning, and bioinformatics. When clustering is used in information retrieval, the assumption is that documents clustered together behave similarly with respect to relevance to information needs (Rijsbergen 1979). Some typical applications of clustering in information retrieval include search result clustering, corpus analysis, scatter-gather, language modeling, and cluster-based retrieval (Manning, Raghavan and Schtze 2008). Clustering can help information retrieval systems retrieve and rank documents by grouping similar documents into clusters. When the number of documents is very large, many popular clustering algorithms, such as the k-means which require extensive computing resources to do the analysis, may fail to find optimal solutions. The goal of this study is to propose a hierarchical clustering method to assist users in retrieving relevant documents from large corpora. The paper is organized as follows. Section 2 presents a hierarchical clustering method. Section 3 describes an experiment that examines the clustering method with a professional corpora. using objective and subjective measures. The last section concludes the paper with discussions and future research directions. II. A HIERARCHICAL CLUSTERING METHOD
I.
INTRODUCTION
Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfy an information need from within large collections (usually on local computer servers or on the internet). (Manning, Raghavan and Schtze 2008, p. 3)
The idea of automatically retrieving useful information from large amounts of electronic text collections was initiated by Vannevar Bush in his seminal essay As We May Think (Bush 1945). From 1950s to 1980s, various information retrieval models and techniques have been developed in the field of Information Retrieval (Singhal 2001). The advances in digital data collection and storage technologies over the past two decades enable companies and organizations to store up huge amounts of electronic documents. The availability of real life large collections of electronic text present opportunities and challenges for those previously developed IR models and techniques. On the one hand, the abilities of these models and techniques to scale to very large collections of documents can be evaluated; on the other hand, deficiencies of many old techniques were identified and the demands for new techniques that can
Current information in a IR system is stored in electronic document repositories and retrievable only by text search. With the increase in the number of electronic documents, it is hard to manually organize, analyze and present these documents efficiently. When a user search documents from an IR system, two important criteria are the relevance of returned results and the length of waiting time. Clustering can group documents into clusters that are coherent internally and different externally. The clusters are primarily envisioned to be used for search and navigation and
1789
potentially for some form of visualization as well. Major clustering approaches can be categorized into partitioning, hierarchy, density-based, grid-based, and model-based algorithms (Han and Kamber, 2006). Clustering in information retrieval involves various concepts and techniques. This section describes concepts and techniques that were used in this paper. In particular, selected concepts and techniques of text preprocessing, retrieval model, and a clustering method are discussed in sequence. A. Text Preprocessing: Tokenization, Stop-words, and Stemming The goals of text preprocessing are to represent full-text documents in a suitable format and to optimize the performance of text mining algorithms by discarding irrelevant data (Mathiak and Eckstein, 2004). Text documents are first divided into a set of index terms or keywords. This division process is called tokenization. These index terms are then used to represent full-text documents, which refer to the titles and abstracts in this study. Different index terms have varying importance and this difference is expressed using weights. Each keyword in a document is associated with a weight. In this paper, a selfdeveloped C++ program was used to tokenize article titles and abstracts into keywords. Tokenization normally results in thousands or even tens of thousands of keywords. Among these keywords are some common words that do not contribute to the retrieval task and thus should be removed from the keyword list. Stopwords and stemming are two prevalent keywords reduction methods. Regardless of topics and research areas, there are always common words that occur frequently in all documents, such as articles, prepositions, and conjunctions. This type of words is refer to as Stop-words and are irrelevant for the purpose of retrieval. A stem is the portion of a word which is left after the removal of its affixes, i.e., prefixes and suffixes. Porters stemming algorithm (Porter 1980) has been widely used and was selected by this project to further reduce the dimensionality. B. Retrieval Model: Vector Space Model After converting full-text documents into a set of keywords and reducing indexing structure, retrieval models can be set up. In information retrieval, retrieval models can be classified into three categories: Boolean model, vector space model, and probabilistic model. Boolean model considers index terms are either present or absent in a document, and hence can not recognize partial matches. Vector space model represents text using a vector of terms (Salton, Wong, and Yang, 1975). Probabilistic IR model estimates the probability of relevance of documents for a query based on the probabilistic principle (Singhal 2001). For different
experiments, probabilistic model and vector space model may have different performance. Nevertheless, it has been shown that the vector space model is expected to outperform the probabilistic model with many cases. According to vector space model, a document can be represented as a vector: <(dr1, w1), (dr2, w2), (dr3, w3),(drn, wn)>, where dri denotes a keyword i used to describe the document r, and wi denotes the weight of the keyword i, which can be determined by frequency of use. A collection of n documents can be represented by a term-document matrix. An entry in the matrix corresponds to the weight of a term in that document; zero means the term doesnt exist in the document or has no significance in the document (BaezaYates and Ribeiro-Neto, 1999). Researchers have developed various weighting schemes to calculate weights of terms. A popular term weight is tf-idf weighting: wij=tfij idfi, where tfij is term frequency across the entire corpora: tfij = fij / max{fij} and fij is the frequency of term i in document j; idfi is the inverse document frequency of term i: idfi = log2(N/dfi), N is the total number of documents and dfi is the document frequency of term i, i.e. the number of documents containing term i. A term in a document with higher tf-idf weight is regarded as more indicative than terms with lower tf-idf weights. The reasoning behind tf-idf weighting is that a term occurring frequently in a document but rarely in the rest of the collection is indicative. In this paper, the frequencies of each index term within each abstract were counted using SQL and tf-idf weights were computed and stored in a termdocument matrix. C. A Clustering Method When dealing with very large corpus, clustering methods like the k-means (McQueen 1967), which require extensive computing resources to do the analysis, may fail to find optimal solutions. As Hearst pointed out in his paper --untangling text data mining, a mixture of computationallydriven models and user-guided analysis may open the door to exciting new results (Hearst 1999, p.6). This study proposes a clustering method that combines a hierarchical clustering algorithm and users previous search behavior. Based on search history, frequently retrieved documents were selected as initial centers of clusters. The top m percent of documents that are similar to a center document are assigned to that cluster. Then the center of each cluster can be determined by calculating the mean value of the documents for each cluster. A document that has not been assigned to any cluster is associated to a cluster, which has the shortest distance between the center of that cluster and the document. This method is able to find optimal solutions for very large text collections because it reduces the computation time by including only the top m percent of documents in the calculation of the centers of clusters.
1790
III.
Following previous discussion, documents were gathered from an anonymous text database together with documents retrieval history. The titles and abstracts were then preprocessed by removing stop-words and stemming. For each index term remains, tf, idf, and tf-idf weights were calculated. Based on tf, idf, and tf-idf weights, a termdocument matrix was created. A self-developed C++ program was used to do clustering analysis. The following procedure summarized the whole process: Clustering Procedure: Input: a list of documents D to be clustered, a list of most clicked/viewed or users specified documents CD, a list of cluster committees CC, number of clusters k Step 1: For each document c in CD, cluster the top-10 similar documents of c using tf-idf clustering; Store the top k clusters with the highest internal similarity in a list CL; Step 2: For each cluster cl in CL, Compute the centroid of cl; If cls centroid is not close to any of the committees previously added to CL by certain measure threshold, add cl to CC. Step 3: For each documents d in D, if d is not close to any committee in CL by certain distance threshold, add d to a list of residues R Step 4: If R is empty, we are done and return CL. Otherwise, go back to Step 1 using the input CC and replacing D with R and CD with a list of most clicked/viewed documents CR in R. Output: a list of cluster committees CC. END With the list of cluster committees CC, a future cluster request of documents be-comes a categorization problem and the incoming documents will be assigned to its closest cluster committees. This resembles K-means but the centroid of each cluster committees do not change because a document is not added to the cluster committee when it is added to a cluster.
IV.
This article presented a clustering method for very large text collections that combines user retrieval history and hierarchical clustering algorithm. In the future, an experiment will be conducted to examine the proposed clustering method using both objective measures, such as internal and external cluster similarities, and subjective measures, such as the judgments of coherence and utility of retrieved documents by human reviewers. ACKNOWLEDGMENT This work was supported by the Youth Fund of University of Electronic Science and Technology of China (UESTC) and the National Natural Science Foundation of China (NSFC) under the Grand No. 70621001, No. 70531040 and 973 Project #2004CB720103, Ministry of Science and Technology, China. REFERENCES
[1] [2] Baeza-Yates, R. and Ribeiro-Neto, B. (1999) Modern Information Retrieval. Addison-Wesley, Wokingham, UK. Benjamin, C.M., Wang, F.K., Ester, M. (2003) Hierarchical Document Clustering Using Frequent Itemsets, SIAM International Conference on Data Mining 2003. Boman, H. G. (Ed.) (1994) Antimicrobial Peptides, Ciba Foundation Symposium 186, John Wiley & Sons Ltd. Bush, V. (1945) As We May Think. Atlantic Monthly, 176:101108, July 1945. Edict virtual language centre, available at: http://www.edict.com.hk/TextAnalyser/wordlists.htm (as of Jan 27, 2005). Han, J. W. and Kamber, M. (2006) Data Mining: Concepts and Techniques, 2nd edition, Morgan Kaufmann Publishers, March 2006. Hearst, M.A. (1999) Untangling text data mining. In Proceedings of ACL99: the 37th annual meeting of the association for computational linguistics, University of Maryland, June 2026. Iliopoulos, I., Enright, A. J., Ouzounis, C. A. (2001) TEXTQUEST: Document Clustering of Medline Abstracts for Concept Discovery in Molecular Biology, Proc. PSB. Kogan, J., Nicholas, C., and Teboulle, M. (2003) Clustering Large and High Dimensional Data, ACM Conference on Information and Knowledge Management tutorial, November 2-8, New Orleans, Louisiana, USA.
[6] [7]
[8]
[9]
1791
[10] Manning, C. D. (2008) Raghavan, P. and Schtze, H., Introduction to Information Retrieval, Cambridge University Press. July 2008. [11] Mathiak, B. and Eckstein, S. (2004) Five Steps to Text Mining in Biomedical Literature, Proceedings of the Second European Workshop on Data Mining and Text Mining in Bioinformatics, Italy, 47-50. [12] McQueen, J. (1967) Some methods for classification and analysis of multivariate observations. In Proceedings of 5th Berkeley Symposium on Mathematics, Statistics and Probability, 1:281-298. [13] Nahm, U. Y. (2001) A Roadmap to Text Mining and Web Mining, available at: http://www.cs.utexas.edu/users/pebronia/text-mining/. [14] Porter, A. (2002) Text Mining, Review of TPAC Technologies for ONR, ASDL-Aug. 2002, available at: http://intelligent-web.org/wsm/. [15] Porter, M.F. (1980) An algorithm for suffix stripping, Program, 14(3): 130-137. [16] PubMed (2005), provided by the National Library of Medicine, available at: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi (as of Jan 27, 2005). [17] PubMed Basics (2004), developed by NN/LM staff, funded by NLM, available at: http://nnlm.gov/nnlm/online/pubmed/pmtri.pdf. [18] Rijsbergen, C. J. Van (1979) Information Retrieval, Buttherwords, London. [19] Salton, G., A.Wong, and C. S. Yang (1975) A vector space model for information retrieval. Communications of the ACM, 18(11):613620. [20] SAS products and solutions (2005), Analytical intelligence: data and text mining, available at: http://www.sas.com/technologies/analytics/datamining/index.html. [21] Singhal, A. (2001) "Modern Information Retrieval: A Brief Overview". Bulletin of the IEEE Computer Society Technical Committee on Data Engineering 24 (4): 3543. [22] SPSS 12.0 for Windows, SPSS Inc., http://www.spss.com. [23] Weiss, S., White, B., Apte, C. (2000) Lightweight Document Clustering, IBM Research Report RC-21684, available at: http://citeseer.ist.psu.edu/cache/papers/cs/14745/http:zSzzSzwww.res earch.ibm.comzSzdarzSzpaperszSzpdfzSzweiss_ldc_with_cover.pdf/ weiss00lightweight.pdf. [24] Zhao, Y. and Karypis, G. (2002) Clustering in Life Sciences, technical report, Computer Science and Engineering, University of Minnesota, available at: https://wwws.cs.umn.edu/tech_reports/index.cgi?selectedyear=2002& mode=printreport&report_id=02-016.
1792