
2010 First International Conference on Integrated Intelligent Computing

Evaluating the Performance of Similarity Measures


Used in Document Clustering and Information
Retrieval
R.Subhashini1 and V.Jawahar Senthil Kumar2
1
Research Scholar, Sathyabama University, Chennai-119, India
2
Assistant Professor, Anna University, Chennai, India
E-mail: 1subhaagopi@gmail.com

Abstract-This paper presents the results of an experimental study of some similarity measures used for both Information Retrieval and Document Clustering. Our results indicate that the cosine similarity measure is superior to the other measures we tested, such as the Jaccard and Euclidean measures, and is particularly well suited to text documents. Previously these measures were compared on conventional text datasets; the proposed system instead collects its datasets with the help of a search API, which returns a collection of XML pages. These XML pages are parsed and filtered to obtain the web document datasets. In this paper, we compare and analyze the effectiveness of these measures on these web document datasets.

Keywords-Document clustering; Web mining; Similarity measure; Information Retrieval

I. INTRODUCTION

A similarity measure is an important parameter in text mining: it enables us to find the similarity or commonality between text documents. Traditionally, the Term-Frequency (TF) approach has been used to represent a text document, where the document is treated as a vector over the set of terms occurring in it, together with their frequencies. Two problems, synonymy and polysemy, are associated with this approach. Synonyms are terms with the same meaning; two terms that can be interchanged in a given context are said to be synonymous relative to that context. For example, car, automobile, and vehicle may be considered synonyms in a particular document. Using a thesaurus [16] or a similar facility, synonymous terms can be grouped together. Polysemy refers to the lexical ambiguity that arises when the same word expresses two or more meanings in different contexts. To correctly distinguish between occurrences of the same term in different contexts with different meanings, a knowledge-rich approach using Natural Language Processing (NLP) [17] is a necessity.

Text document clustering groups similar documents into coherent clusters, while documents that differ are separated into different clusters. However, the definition of a pair of documents being similar or different is not always clear and normally varies with the actual problem setting. For example, when clustering research papers, two documents are regarded as similar if they share similar thematic topics. When clustering is applied to web sites, we are usually more interested in clustering the component pages according to the type of information presented in each page. For instance, when dealing with university web sites, we may want to separate professors' home pages from students' home pages, and pages for courses from pages for research projects. This kind of clustering can benefit further analysis and use of the dataset, such as information retrieval and information extraction, by grouping similar types of information sources together. Accurate clustering requires a precise definition of the closeness between a pair of objects, in terms of either pair-wise similarity or distance. A variety of similarity and distance measures have been proposed and widely applied, such as cosine similarity and the Jaccard correlation coefficient. Meanwhile, similarity is often conceived in terms of dissimilarity or distance as well [1]. Measures such as Euclidean distance and relative entropy have been applied in clustering to calculate pair-wise distances.

Traditional methods used text datasets, but in this system web documents are collected using the Yahoo! API [17] and the similarity measures are applied to them. BOSS (Build your Own Search Service) is Yahoo!'s open search web services platform. The goal of BOSS is simple: to foster innovation in the search industry. Developers, start-ups, and large Internet companies can use BOSS to build and launch web-scale search products that utilize the entire Yahoo! Search index. BOSS gives access to Yahoo!'s investments in crawling and indexing, ranking and relevancy algorithms, and powerful infrastructure; it is a platform for the next generation of search innovation, serving hundreds of millions of users across the Web. The API returns an XML document that dumps the search results: XML files listing URLs along with additional metadata about each URL, so that search engines can more intelligently crawl the site.

To achieve this goal, the system is designed around the following scenario:
1) The application takes the user query.
2) It submits the query to the search engine API and receives the response as an XML file.

978-0-7695-4152-5/10 $26.00 © 2010 IEEE


DOI 10.1109/ICIIC.2010.42
3) The XML file is parsed and the web documents are downloaded.
4) The similarity measures are computed on these web documents.

Today, the information on the Internet is growing enormously, and it is very difficult for users to find relevant documents. While surfing, it takes a lot of time to locate relevant information; this is a genuine challenge in the field of information retrieval, and hence many research papers are appearing in this area.

This paper is organized as follows. The next section describes feature selection for the documents. Section 3 discusses document clustering. Section 4 presents the information retrieval and query similarity measures, and Section 5 explains the experiment settings, evaluation approaches, results, and analysis. Finally, Section 6 concludes and discusses future work.

II. FEATURE SELECTION

All methods of information retrieval require several preprocessing steps. First, any non-textual information such as HTML tags and punctuation is removed from the documents. Then, stop words such as "I", "am", "and", etc. are also removed. A term is any sequence of characters separated from other terms by some delimiter; note that a term may be a single word or consist of several words. Typically, the terms are reduced to their basic stems by applying a stemming algorithm [15]. Most information retrieval methods rely on the so-called vector-space model. In this model, each text document d is represented by a vector of frequencies of the remaining m terms:

d = (tf1, tf2, ..., tfm)    (1)

Often, the document vectors are normalized to unit length to allow comparison of documents of different lengths. Note that the vector space has very high dimensionality, since even after preprocessing there are typically still several thousand terms; in many text databases one has to deal with approximately 10,000 terms. Due to the high dimensionality, most frequencies are zero for any single document. The most important keyword-based retrieval and indexing technique is TF-IDF (Term Frequency-Inverse Document Frequency), a popular technique in information retrieval and exploration. TF-IDF is a statistical method for evaluating the importance of a word or term within one of the files of a document set or knowledge warehouse. The traditional TF-IDF algorithm, due to Salton and McGill, describes the features of a document for the vector-space information retrieval paradigm. The inverse document frequency IDFi of a term ti can be calculated as follows:

IDFi = log10(N / ni)    (2)

where IDFi denotes the inverse document frequency of term ti, N denotes the number of documents in the document set, and ni denotes the number of documents containing term ti. The weight wik of term ti in document dk can then be calculated as follows:

wik = (tfik / maxj tfjk) × IDFi    (3)

where tfik denotes the frequency of term ti in document dk and maxj tfjk denotes the maximal term frequency in document dk.

III. DOCUMENT CLUSTERING

Document clustering [2] is a technology that puts related documents into groups, and is useful for categorizing, organizing, and refining search results. Most current document clustering methods are based on the Vector Space Model (VSM) [11], in which documents are represented as feature vectors of words. To achieve more accurate document clustering, a more informative feature, the phrase (an ordered sequence of one or more words), is used in the Suffix Tree Document (STD) model proposed by Zamir et al. [12]. Before clustering, a similarity/distance measure must be chosen. The measure reflects the degree of closeness or separation of the target objects and should correspond to the characteristics that are believed to distinguish the clusters embedded in the data. In many cases, these characteristics depend on the data or the problem context at hand, and no measure is universally best for all kinds of clustering problems. Moreover, choosing an appropriate similarity measure is crucial for cluster analysis, especially for particular types of clustering algorithms. Therefore, understanding the effectiveness of different measures is of great importance in helping to choose the best one.

A. Euclidean Distance

Euclidean distance is a standard metric for geometric problems. It is the ordinary distance between two points, the one that can be measured with a ruler in two- or three-dimensional space, and it is widely used in clustering problems, including text clustering. It satisfies all four metric conditions and is therefore a true metric; it is also the default distance measure of the K-means algorithm. For measuring the distance between text documents, given two documents da and db represented by their term vectors p and q respectively, the Euclidean distance between the two documents is defined as

d(p, q) = ((p1 - q1)^2 + (p2 - q2)^2 + ... + (pn - qn)^2)^(1/2)    (4)

where the term vectors are p = (p1, ..., pn) and q = (q1, ..., qn). As mentioned previously, we use the tf-idf values as term weights, that is, wp = tfidf(da, p).
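The weighting scheme in (2)-(3) and the distance in (4) can be sketched together as follows; the toy corpus and all its values are hypothetical, chosen only to illustrate the computation:

```python
import math

# Toy corpus: each document as a term -> raw frequency map (hypothetical data).
docs = [
    {"car": 3, "engine": 2, "road": 1},
    {"car": 1, "road": 4},
    {"cluster": 2, "vector": 3},
]

N = len(docs)
terms = sorted({t for d in docs for t in d})

# Eq. (2): IDF_i = log10(N / n_i), where n_i = number of docs containing term i.
idf = {t: math.log10(N / sum(1 for d in docs if t in d)) for t in terms}

def tfidf_vector(doc):
    """Eq. (3): w_ik = (tf_ik / max_j tf_jk) * IDF_i, over the full term set."""
    max_tf = max(doc.values())
    return [doc.get(t, 0) / max_tf * idf[t] for t in terms]

vectors = [tfidf_vector(d) for d in docs]

def euclidean(p, q):
    """Eq. (4): d(p, q) = sqrt(sum_i (p_i - q_i)^2)."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

# Documents sharing weighted terms ("car", "road") end up closer together.
print(euclidean(vectors[0], vectors[1]) < euclidean(vectors[0], vectors[2]))  # True
```

Note that the max-normalized tf in (3) keeps long and short documents on the same footing before the IDF factor is applied.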

B. Cosine Similarity

When documents are represented as term vectors, the similarity of two documents corresponds to the correlation between the vectors. This is quantified as the cosine of the angle between the vectors, hence the name cosine similarity. Cosine similarity is one of the most popular similarity measures applied to text documents, for example in numerous information retrieval applications [4] and in clustering [3]. Given two documents X and Y, their cosine similarity is

cos(X, Y) = X · Y / (|X| |Y|)    (5)

where X and Y are m-dimensional vectors over the term set T = {t1, ..., tm}. Each dimension represents a term with its weight in the document, which is non-negative. As a result, the cosine similarity is non-negative and bounded in [0, 1]. An important property of the cosine similarity is its independence of document length. For example, combining two identical copies of a document d into a new pseudo-document d0, the cosine similarity between d and d0 is 1, which means that the two documents are regarded as identical. In other words, documents with the same composition but different totals are treated identically. Strictly speaking, this does not satisfy the second condition of a metric, because the combination of two copies is, after all, a different object from the original document. In practice, however, when the term vectors are normalized to unit length, the representations of d and d0 are the same.

C. Jaccard Coefficient

The Jaccard coefficient, sometimes referred to as the Tanimoto coefficient, measures similarity as the size of the intersection divided by the size of the union of the objects. For text documents, the Jaccard coefficient compares the summed weight of shared terms to the summed weight of terms that are present in either of the two documents but are not shared. The formal definition is:

J(A, B) = |A ∩ B| / |A ∪ B|    (6)

where A and B represent the documents. The Jaccard coefficient is a similarity measure ranging between 0 and 1: it is 1 when A = B and 0 when A and B are disjoint, so 1 means the two objects are the same and 0 means they are completely different. The corresponding distance measure is DJ = 1 − SIMJ, and we will use DJ in the subsequent experiments.

IV. INFORMATION RETRIEVAL - QUERY SIMILARITY MEASURES

As discussed, our approach to query clustering uses a hybrid method based on the analysis of query terms and query results. Here, two queries are similar when (1) they contain one or more terms in common, or (2) they have results that contain one or more items in common. The remainder of this section provides definitions of the different query similarity measures used in our experiments. Our method of constructing query clusters based on these query similarity measures is also presented.

A. Content-based Similarity Measure

Similarity measures also play an important role in the field of information retrieval: they are used to retrieve relevant documents from the web, which is itself a challenging task in IR, and also to rank the results. This measure uses a hybrid method based on the analysis of query terms and query results. Two queries are similar when (1) they contain one or more terms in common, or (2) they have results that contain one or more items in common. In this paper we calculate the similarity only for common terms, not for common results.

We borrow concepts from information retrieval [4] and define a set of queries Q = {Q1, Q2, ..., Qi, Qj, ..., Qn}. A query Qj is converted to a term-and-weight vector as shown in (7), where qi is an index term of Qj and wiQj represents the weight of the ith term in query Qj. To compute the term weight, we define the term frequency tfiQj as the number of occurrences of term i in query Qj, and the query frequency qfi as the number of queries, in a collection of n queries, that contain term i. Next, the inverse query frequency iqfi is expressed as (8), in which n represents the total number of queries in the query collection. We then compute wiQj based on (9):

Qj = { <q1, w1Qj>; <q2, w2Qj>; ...; <qi, wiQj> }    (7)

iqfi = log(n / qfi)    (8)

wiQj = tfiQj × iqfi    (9)

Taking the term weights into consideration, we can use any of the standard similarity measures [9]. Here we present only the cosine similarity measure, since it is the one most frequently used in information retrieval. For any query q and document d,

Sim(d, q) = cos(d, q) = d · q    (10)

In other words, let the n-dimensional vector of query-document similarities be

Sim(A, q) = A · q    (11)

where A is our n × p term-document matrix and q is a p-dimensional query vector.

B. Result URLs-based Similarity Measure

The results returned by search engines usually contain a variety of information such as the title, abstract, topic, etc. This information can be used to compare the similarity between queries. In our work, taking the cost of processing query results into consideration, we consider the query results' unique identifiers (e.g. URLs) in determining the similarity between

queries, as in [8,16]. Let U(Qj) represent the set of result URLs for query Qj:

U(Qj) = {u1, u2, ..., ui}    (12)

where ui represents the ith result URL for query Qj. We then define Rij as in (13); it represents the vector of common result URLs between Qi and Qj, where u refers to the URLs that belong to both U(Qi) and U(Qj):

Rij = { u : u ∈ U(Qi) ∩ U(Qj) }    (13)

Next, the similarity definition based on query result URLs can be stated as follows: a query Qi is similar to a query Qj if |Rij| > 0, where |Rij| is the number of result URLs common to both queries. The similarity measure can be expressed as (14):

SIMResult(Qi, Qj) = |Rij| / max(|U(Qi)|, |U(Qj)|)    (14)

where |U(Qi)| is the number of result URLs in U(Qi). Note that this is only one possible method for calculating similarity using result URLs. Other measures, such as using the overlap of document titles or domain names in the result URLs, may also be used.

V. EXPERIMENTAL RESULTS

A. Data Preparation

The web document dataset is collected with the help of the Yahoo! API. Yahoo! Search BOSS (Build your Own Search Service) is an initiative in Yahoo! Search to open up Yahoo!'s search infrastructure and enable third parties to build search products leveraging their own data, content, technology, social graph, or other assets. This release includes Web, News, and Image Search as well as Spelling Suggestions. Developers have expressed interest in altering the result order, removing results they do not want, and blending in their own data; all of these activities are allowed and encouraged with BOSS, but not with the existing search API.

SYNTAX
http://boss.yahooapis.com/ysearch/web/v1/{query}?appid={yourBOSSappid}[&param1=val1&param2=val2&etc]

EXAMPLE
http://boss.yahooapis.com/ysearch/web/v1/animals?appid=12345&format=xml&start=1&count=10

Here the parameter start represents the ordinal position of the first result, where the first position is zero, and the parameter count represents the total number of results to return (maximum value 100).

B. Evaluation Measures

The number of clusters is set to the number of pre-assigned categories in the data set. The quality of a clustering result was evaluated using two evaluation measures, purity and entropy [10], which are widely used to evaluate the performance of unsupervised learning algorithms [5,6]. To begin with, each cluster is labeled with the majority category that appears in that cluster. Moreover, even if a category label has already been assigned to one cluster, it can still be assigned to other clusters in which it is the dominant category. Based on the cluster labels, the purity and entropy measures are computed as follows. The purity measure evaluates the coherence of a cluster, that is, the degree to which a cluster contains documents from a single category. Given a particular cluster Ci of size ni, the purity of Ci is formally defined as

P(Ci) = (1 / ni) maxh(nih)    (15)

where maxh(nih) is the number of documents from the dominant category in cluster Ci, and nih represents the number of documents from cluster Ci assigned to category h. Purity can be interpreted as the classification rate under the assumption that all samples of a cluster are predicted to be members of the cluster's actual dominant class. An ideal cluster, containing documents from only a single category, has a purity value of 1; in general, the higher the purity value, the better the quality of the cluster. As shown in Table 1, Euclidean distance performs worst, while the performance of the other measures is quite similar. On average, the cosine measure is better than Jaccard in terms of purity, which means its clusters have higher purity scores.

Table 1. Result of evaluation in terms of purity

Data      Euclidean   Cosine   Jaccard
20 News   0.32        0.64     0.61
reo       0.48        0.79     0.72
wap       0.38        0.63     0.65
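A minimal sketch of the per-cluster purity computation in (15), on hypothetical cluster assignments (the cluster contents and category names below are invented for illustration):

```python
from collections import Counter

def purity(labels):
    """Eq. (15): P(Ci) = (1 / n_i) * max_h(n_ih) for one cluster's member labels."""
    return Counter(labels).most_common(1)[0][1] / len(labels)

# Hypothetical clustering output: each cluster's members' true categories.
clusters = {
    "C1": ["sports", "sports", "sports", "politics"],
    "C2": ["politics", "tech", "tech", "tech", "health"],
}

for cid, labels in clusters.items():
    print(cid, purity(labels))  # C1 -> 0.75, C2 -> 0.6
```

An overall score is usually the size-weighted average of the per-cluster purities, so larger clusters contribute more.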

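The cosine (5) and Jaccard (6) measures compared in these experiments can be sketched on a pair of toy term-weight maps; the documents and weights below are hypothetical:

```python
import math

# Two documents as term -> weight maps (hypothetical tf-idf weights).
x = {"car": 0.8, "engine": 0.5, "road": 0.2}
y = {"car": 0.6, "road": 0.7, "traffic": 0.4}

def cosine(a, b):
    """Eq. (5): cos(X, Y) = X . Y / (|X| |Y|); only shared terms contribute."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norms = math.sqrt(sum(v * v for v in a.values())) * \
            math.sqrt(sum(v * v for v in b.values()))
    return dot / norms

def jaccard(a, b):
    """Eq. (6): |A intersect B| / |A union B| over the documents' term sets."""
    return len(a.keys() & b.keys()) / len(a.keys() | b.keys())

print(jaccard(x, y))  # 2 shared terms out of 4 distinct -> 0.5
print(cosine(x, y))   # high, since the heavily weighted terms are shared
```

Note the design difference visible here: Jaccard on term sets ignores the weights entirely, while cosine is driven by how the weight mass is distributed across the shared terms.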
[Figure 1. Performance of the similarity measures: precision (y-axis, 0 to 0.9) of the cosine, Jaccard, and Euclidean measures for eight sample queries Q1-Q8.]

For the information retrieval experiment, only the content-based similarity measure is used. We selected 100 queries from Google AdWords [18], and these queries were used to retrieve 2000 documents from the web using the Yahoo! API [17]. The API returns XML that dumps the search results and contains the URL, title, snippet, and description of each result; the web documents are collected by parsing this XML. We then calculated the similarity measures for these 100 queries and the documents. More similar documents are considered more relevant and are ranked accordingly, and the similarity value of each query with the original documents was used to obtain the relevant documents. In Fig. 1, the three similarity measures are compared on a sample of eight queries; the precision of the cosine measure is comparatively better for all the queries.

VI. CONCLUSION

This paper shows that similarity measures play an important role in retrieving relevant documents and also in ranking and clustering documents. Metric distances (such as Euclidean distance) are not appropriate for high-dimensional, sparse domains. The cosine, correlation, and extended Jaccard measures are successful and perform equivalently in capturing the similarities implicitly indicated by manual categorizations. To conclude, this investigation found that, except for the Euclidean distance measure, the measures have comparable effectiveness for the partitional text document clustering task. We can see that there are three components that affect the final results: the representation of the objects, the distance or similarity measure, and the clustering algorithm itself. Our future plan is to compare these measures in query clustering and web search result clustering. Further, the content-based and result-URL-based similarity measures should be combined to provide accurate results.

REFERENCES

[1] G. Salton. Automatic Text Processing. Addison-Wesley, New York, 1989.
[2] M. Steinbach, G. Karypis, and V. Kumar. A comparison of document clustering techniques. In KDD Workshop on Text Mining, 2000.
[3] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley, New York, 1999.
[4] Y. Zhao and G. Karypis. Evaluation of hierarchical clustering algorithms for document datasets. In Proceedings of the International Conference on Information and Knowledge Management, 2002.
[5] Y. Zhao and G. Karypis. Empirical and theoretical comparisons of selected criterion functions for document clustering. Machine Learning, 55(3), 2004.
[6] D. Milne, O. Medelyan, and I. H. Witten. Mining domain-specific thesauri from Wikipedia: A case study. In Proc. of the International Conference on Web Intelligence (IEEE/WIC/ACM WI'2006), 2006.
[7] N. S. Glance. Community search assistant. In Proceedings of the 6th ACM International Conference on Intelligent User Interfaces (Santa Fe, January 2001), 91-96.
[8] O. R. Zaïane and A. Strilets. Finding similar queries to satisfy searches based on query traces. In Workshops of the 8th International Conference on Object-Oriented Information Systems (Montpellier, September 2002), 207-216.
[9] N. S. Glance. Community search assistant. In Proceedings of the 6th ACM International Conference on Intelligent User Interfaces (Santa Fe, January 2001), 91-96.
[10] C. J. van Rijsbergen. Information Retrieval. Second Edition, Butterworths, London, 1979.
[11] G. Salton, A. Wong, and C. S. Yang. A vector space model for automatic indexing. Cornell University.
[12] O. Zamir and O. Etzioni. Web document clustering: A feasibility demonstration. In Proceedings of SIGIR'98, University of Washington, Seattle, USA, 1998.
[13] M. A. Salahli. An approach for measuring semantic relatedness between words via related terms. Mathematical and Computational Applications, Vol. 14, No. 1, pp. 55-63, 2009.
[14] K. P. Supreethi and E. V. Prasad. Web document clustering technique using case grammar structure. In International Conference on Computational Intelligence and Multimedia Applications, 2007.
[15] M. F. Porter. An algorithm for suffix stripping. Program, 14(3):130-137, 1980.
[16] O. R. Zaïane and A. Strilets. Finding similar queries to satisfy searches based on query traces. In Workshops of the 8th International Conference on Object-Oriented Information Systems (Montpellier, September 2002), 207-216.
[17] http://developer.yahoo.com/search/boss/
[18] https://adwords.google.co.in/o/Targeting/Explorer?__u=1000000000&__c=1000000000&stylePrefOverride=2
