PSO11
Abstract-This paper presents the results of an experimental study of some similarity measures used for both Information Retrieval and Document Clustering. Our results indicate that the cosine similarity measure is superior to the other measures we tested, such as the Jaccard and Euclidean measures. The cosine similarity measure is particularly well suited to text documents. Previously these measures were compared on conventional text datasets, but the proposed system collects its datasets with the help of an API that returns a collection of XML pages. These XML pages are parsed and filtered to obtain the web document datasets. In this paper, we compare and analyze the effectiveness of these measures on these web document datasets.

Keywords-Document clustering; Web mining; Similarity measure; Information Retrieval

I. INTRODUCTION

A similarity measure is an important parameter in text mining: it enables us to find the similarity or commonality existing between text documents. Traditionally, the Term-Frequency (TF) approach has been used to represent a text document, where the document is considered as a vector over the set of terms occurring in the document, with their frequencies as weights. Two problems, synonymy and polysemy, are associated with this approach. Synonyms are terms with the same meaning; two terms that can be interchanged in a given context are said to be synonymous relative to that context. For example, car, automobile, and vehicle may be considered synonyms in a particular document. Using a thesaurus [16] or a similar facility, synonymous terms can be grouped together. Polysemy refers to the lexical ambiguity that arises when the same word is used to express two or more meanings in different contexts. To correctly distinguish between occurrences of the same term in different contexts with different meanings, a knowledge-rich approach using Natural Language Processing (NLP) [17] is a necessity.

Text document clustering groups similar documents together to form coherent clusters, while documents that are different are separated into different clusters. However, the definition of a pair of documents being similar or different is not always clear and normally varies with the actual problem setting. For example, when clustering research papers, two documents are regarded as similar if they share similar thematic topics. When clustering is employed on web sites, we are usually more interested in clustering the component pages according to the type of information presented in each page. For instance, when dealing with university web sites, we may want to separate professors' home pages from students' home pages, and pages for courses from pages for research projects. This kind of clustering can benefit further analysis and utilization of the dataset, such as information retrieval and information extraction, by grouping similar types of information sources together. Accurate clustering requires a precise definition of the closeness between a pair of objects, in terms of either their pair-wise similarity or distance. A variety of similarity and distance measures have been proposed and widely applied, such as cosine similarity and the Jaccard correlation coefficient. Meanwhile, similarity is often conceived in terms of dissimilarity or distance as well [1]. Measures such as Euclidean distance and relative entropy have been applied in clustering to calculate the pair-wise distances.

Traditional methods used text datasets, but in this system web documents are collected using the Yahoo API [17] and the similarity measures are implemented on them. BOSS (Build your Own Search Service) is Yahoo!'s open search web services platform. The goal of BOSS is simple: to foster innovation in the search industry. Developers, start-ups, and large Internet companies can use BOSS to build and launch web-scale search products that utilize the entire Yahoo! Search index. BOSS gives access to Yahoo!'s investments in crawling and indexing, ranking and relevancy algorithms, and powerful infrastructure. BOSS is a platform for the next generation of search innovation, serving hundreds of millions of users across the Web. The API returns an XML document that dumps the search results: XML files listing the URLs for a site, along with additional metadata about each URL, so that search engines can crawl the site more intelligently.

In order to achieve this goal, the system is designed around the following scenario:
1) The application takes the user query.
2) It submits the query to the search engine API and gets its response as an XML file.
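The XML response returned by the search API can then be parsed and filtered into a plain-text document collection. The following is a minimal sketch of that parsing step; the element names (result, title, abstract, url) and the sample response are illustrative assumptions, not the actual BOSS response schema.

```python
import xml.etree.ElementTree as ET

# Sample response; the element names are assumed for illustration
# and do not reflect the real BOSS schema.
SAMPLE_XML = """
<results>
  <result>
    <title>Professor home page</title>
    <abstract>Research interests and publications.</abstract>
    <url>http://example.edu/~prof</url>
  </result>
  <result>
    <title>Course page</title>
    <abstract>Syllabus and lecture notes.</abstract>
    <url>http://example.edu/cs101</url>
  </result>
</results>
"""

def parse_results(xml_text):
    """Extract (url, text) pairs from an XML search response."""
    root = ET.fromstring(xml_text)
    docs = []
    for result in root.iter("result"):
        title = result.findtext("title", default="")
        abstract = result.findtext("abstract", default="")
        url = result.findtext("url", default="")
        # Concatenate title and abstract to form the document text.
        docs.append((url, (title + " " + abstract).strip()))
    return docs

docs = parse_results(SAMPLE_XML)
```

Each (url, text) pair then becomes one document in the web document dataset on which the similarity measures are applied.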
B. Cosine Similarity

When documents are represented as term vectors, the similarity of two documents corresponds to the correlation between the vectors. This is quantified as the cosine of the angle between the vectors, the so-called cosine similarity. Cosine similarity is one of the most popular similarity measures applied to text documents, for example in numerous information retrieval applications [4] and in clustering [3]. Given two documents X and Y, their cosine similarity is

COS(X, Y) = X · Y / (|X| |Y|)   (5)

where X and Y are m-dimensional vectors over the term set T = {t1, . . . , tm}. Each dimension represents a term with its weight in the document, which is non-negative. As a result, the cosine similarity is non-negative and bounded in [0, 1]. An important property of the cosine similarity is its independence of document length. For example, combining two identical copies of a document d into a new pseudo-document d0, the cosine similarity between d and d0 is 1, which means that these two documents are regarded as identical. In other words, documents with the same composition but different totals will be treated identically. Strictly speaking, this does not satisfy the second condition of a metric, because after all the combination of two copies is a different object from the original document. In practice, however, the term vectors are normalized to unit length, in which case the representations of d and d0 are the same.

C. Jaccard Coefficient

The Jaccard coefficient, which is sometimes referred to as the Tanimoto coefficient, measures similarity as the intersection divided by the union of the objects. For text documents, the Jaccard coefficient compares the sum weight of shared terms to the sum weight of terms that are present in either of the two documents but are not shared. The formal definition is:

J(A, B) = |A ∩ B| / |A ∪ B|   (6)

where A and B represent the documents. The Jaccard coefficient is a similarity measure and ranges between 0 and 1. It is 1 when A = B and 0 when A and B are disjoint, where 1 means the two objects are the same and 0 means they are completely different. The corresponding distance measure is DJ = 1 − SIMJ, and we will use DJ instead in subsequent experiments.

IV. INFORMATION RETRIEVAL - QUERY SIMILARITY MEASURES

As discussed, our approach to query clustering uses a hybrid method based on the analysis of query terms and query results. Here, two queries are similar when (1) they contain one or more terms in common, or (2) they have results that contain one or more items in common. The remainder of this section provides definitions of the different query similarity measures used in our experiments. Our method of constructing query clusters based on these similarity measures is also presented.

A. Content-based Similarity Measure

Similarity measures also play an important role in the field of Information Retrieval, where they are used to retrieve relevant documents from the web, a challenging task in IR, and to rank the results. This measure uses a hybrid method based on the analysis of query terms and query results. Here, two queries are similar when (1) they contain one or more terms in common, or (2) they have results that contain one or more items in common. In this paper we calculate the similarity only for the common terms and not for the common results.

We borrow concepts from information retrieval [4] and define a set of queries as Q = {Q1, Q2, . . . , Qi, Qj, . . . , Qn}. A query Qj is converted to a term-and-weight vector as shown in (7), where qi is an index term of Qj and wiQj represents the weight of the ith term in query Qj. To compute the term weight, we define the term frequency, tfiQj, as the number of occurrences of term i in query Qj, and the query frequency, qfi, as the number of queries in a collection of n queries that contain term i. Next, the inverse query frequency, iqfi, is expressed as (8), in which n represents the total number of queries in the query collection. We then compute wiQj based on (9):

Qj = { <q1, w1Qj>, <q2, w2Qj>, . . . , <qi, wiQj> }   (7)

iqfi = log(n / qfi)   (8)

wiQj = tfiQj × iqfi   (9)

Taking the term weights into consideration, we can use any of the standard similarity measures [9]. Here, we only present the cosine similarity measure, since it is the most frequently used in information retrieval. For any query q and document d,

Sim(d, q) = cos(d, q) = d · q   (10)

In other words, let the n-dimensional vector of query-document similarities be:

Sim(A, q) = A · q   (11)

where A is our n × p term-document matrix and q is a p-dimensional query vector.

B. Result URLs-based Similarity Measure

The results returned by search engines usually contain a variety of information such as the title, abstract, topic, etc. This information can be used to compare the similarity between queries. In our work, taking the cost of processing query results into consideration, we consider the query results' unique identifiers (e.g. URLs) in determining the similarity between queries, as in [8, 16]. Let U(Qj) represent the set of result URLs for query Qj:

U(Qj) = { u1, u2, . . . }   (12)

where ui represents the ith result URL for query Qj. We then define Rij as in (13), which represents the set of result URLs common to Qi and Qj. Here u refers to the URLs that belong to both U(Qi) and U(Qj):

Rij = { u : u ∈ U(Qi) ∩ U(Qj) }   (13)

Next, the similarity definition based on query result URLs can be stated as follows: a query Qi is similar to a query Qj if |Rij| > 0, where |Rij| is the number of result URLs common to both queries. The similarity measure can be expressed as (14):

SIMResult(Qi, Qj) = |Rij| / max(|U(Qi)|, |U(Qj)|)   (14)

where |U(Qi)| is the number of result URLs in U(Qi). Note that this is only one possible method for calculating similarity using result URLs. Other measures, such as using the overlaps of document titles or domain names in the result URLs, may be used.

V. EXPERIMENTAL RESULTS

A. Data Preparation

The web document datasets are collected with the help of the Yahoo API. Yahoo! Search BOSS (Build your Own Search Service) is an initiative in Yahoo! Search to open up Yahoo!'s search infrastructure and enable third parties to build revolutionary search products leveraging their own data, content, technology, social graph, or other assets. This release includes Web, News, and Image Search as well as Spelling Suggestions. Developers have expressed interest in altering the result order, removing results they do not want, and blending in their own data. All of these activities are allowed and encouraged with BOSS but not in the existing search API.

B. Evaluation Measures

The number of clusters is set to the number of pre-assigned categories in the data set. The quality of a clustering result was evaluated using two evaluation measures, purity and entropy [10], which are widely used to evaluate the performance of unsupervised learning algorithms [5, 6]. To begin with, each cluster is labeled with the majority category that appears in that cluster. Moreover, even if a category label has already been assigned to one cluster, it can still be assigned to other clusters in which it is the dominant category. Based on the cluster labels, the purity and entropy measures are computed as follows. The purity measure evaluates the coherence of a cluster, that is, the degree to which a cluster contains documents from a single category. Given a particular cluster Ci of size ni, the purity of Ci is formally defined as

P(Ci) = (1 / ni) maxh(nih)   (15)

where maxh(nih) is the number of documents from the dominant category in cluster Ci and nih represents the number of documents in cluster Ci assigned to category h. Purity can be interpreted as the classification rate under the assumption that all samples of a cluster are predicted to be members of the cluster's actual dominant class. For an ideal cluster, which contains documents from only a single category, the purity value is 1. In general, the higher the purity value, the better the quality of the cluster. As shown in Table 1, Euclidean distance performs worst, while the performance of the other measures is quite similar. On average, the cosine measure is better than Jaccard in terms of purity, which means its clusters have higher purity scores.

Table 1. Result of evaluation in terms of purity

Data     | Euclidean | Cosine | Jaccard
20 News  | 0.32      | 0.64   | 0.61
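As a minimal sketch of how the compared measures and the purity score can be computed, the snippet below implements cosine similarity (5), the set-based form of the Jaccard coefficient (6), Euclidean distance, and purity (15) on term-weight dictionaries. The tiny example vectors are illustrative only, not the paper's actual dataset.

```python
import math

def cosine(x, y):
    """Cosine similarity (5): x.y / (|x||y|), on term-weight dicts."""
    dot = sum(w * y.get(t, 0.0) for t, w in x.items())
    nx = math.sqrt(sum(w * w for w in x.values()))
    ny = math.sqrt(sum(w * w for w in y.values()))
    return dot / (nx * ny) if nx and ny else 0.0

def jaccard(x, y):
    """Jaccard coefficient (6), here on the documents' term sets."""
    a, b = set(x), set(y)
    return len(a & b) / len(a | b) if a | b else 0.0

def euclidean(x, y):
    """Euclidean distance over the union of both term sets."""
    terms = set(x) | set(y)
    return math.sqrt(sum((x.get(t, 0.0) - y.get(t, 0.0)) ** 2 for t in terms))

def purity(cluster_labels):
    """Purity (15): fraction of a cluster's documents in its dominant category."""
    dominant = max(cluster_labels.count(c) for c in set(cluster_labels))
    return dominant / len(cluster_labels)

# Illustrative TF vectors for two tiny "documents".
d1 = {"web": 2.0, "search": 1.0, "index": 1.0}
d2 = {"web": 1.0, "search": 2.0, "rank": 1.0}

s_cos = cosine(d1, d2)   # dot = 4, norms = sqrt(6) each, so 4/6
s_jac = jaccard(d1, d2)  # 2 shared terms out of 4 distinct, so 0.5
p = purity(["news", "news", "sport"])  # dominant "news", so 2/3
```

Note that the paper's formal Jaccard definition compares sum weights of shared versus non-shared terms; the set-based variant above is the simpler form that equation (6) expresses directly.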
URL-based similarity measures should be combined to provide accurate results.

[Figure: precision of the compared measures; only the y-axis label "Precision" and its tick values (0.6 to 0.9) survive extraction.]

REFERENCES

[1] G. Salton. Automatic Text Processing. Addison-Wesley, New York, 1989.