swj248 PDF
Abstract. This work falls within the areas of information retrieval and the semantic web, and aims to improve the evaluation
of web search tools. The huge amount of information on the web, together with the growing number of inexperienced users,
creates new challenges for information retrieval. Current search engines (such as Google, Bing and Yahoo) certainly offer an
efficient way to browse web content. However, this type of tool does not take into account the semantics carried by the query
terms and document words. This paper proposes a new semantic-based approach for the evaluation of information retrieval
systems; the goal is to increase the selectivity of search tools and to improve how these tools are evaluated. Testing the
proposed approach on search engines demonstrated its applicability to real search tools. The results showed that semantic
evaluation is a promising way to improve the performance and behavior of search engines as well as the relevance of the
results that they return.
Keywords: Information Retrieval, Semantic Web, Ontology, Results Ranking, Web Search Engines.
* Corresponding author. E-mail: [email protected]
tool. Finally, we present the experimentation of our approach and discuss the results obtained.

2.2. Ontologies, a clear need in IR

It is natural that work on the integration of ontologies in IRS is growing. A first solution is to build an ontology from the corpus on which the IR tasks will be performed [8], [6]. A second solution is the reuse of existing resources. In this case, ontologies are generally chosen according to the knowledge domain that they address [1], [10]. Ontologies as a support for the modeling of IRS were studied in a previous article [2]. In general, the contribution of ontologies to an IRS can be understood at three levels:

- In the document indexing process: by combining the ontology with natural language processing techniques, the documents in the database are summarized and linked to the ontology concepts. If this step is done properly, later searches become easier. This principle was already used in our work [3].
- At the query reformulation level, in order to improve the initial user queries. This aspect was also used as a complement to our proposal [3].
- In the information filtering process: this aspect is the subject of the contribution that we present in this paper. The idea is to use the ontology to add a semantic dimension to the evaluation process. This can be done by extracting the query terms and projecting them semantically, using the WordNet ontology, on the set of returned documents. The result of this projection is used to extract the concepts related to each term, thus building a semantic vector which is the basis of the classification of results. This vector is used primarily for creating the query vector and the document vectors used by the vectorial model that we adopted.

3. Theoretical foundations of the proposed approach

- Retrieve the results returned by the search engines.
- Check the information content of each returned page.
- Project the user query on the linguistic resource, the WordNet ontology in our case.
- Measure the relevance of the results by calculating the relevance degree of each of them.
- Generate a semantic ranking of the results according to the calculated relevance, based on their degree of informativeness.
- Assign a score to each search engine based on its position in the new ranking.

This system is based, on the one hand, on a linguistic resource (the WordNet ontology) for the semantic projection of the query and, on the other hand, on a calculation model for measuring the "document/query" relevance (the vectorial model). In the following, we justify our choice of linguistic resource and of IR model.

3.1. Choice of information retrieval model

The role of an IR model is to provide a formalization of the information finding process. The definition of an information retrieval model leads to the determination of a theoretical framework, on which the representation of information units and the formalization of the system relevance function are based.

3.1.1. Summary of IR models

As part of our previous work [4], we gave an overview of the most common information retrieval models. We recall the basics of each of them in order to focus our choice on the model that best fits our proposal. Figure 1 shows the three IR models that we studied.
[Figure 1: the three IR models studied: Boolean model, vectorial model and probabilistic model.]

Finally, the probabilistic model uses a mathematical model based on the theory of probability. In general, the probabilistic model has the advantage of unifying the representations of documents and concepts. However, the model is based on assumptions of independence of variables that are not always verified, which makes the similarity measures inaccurate.

In the vectorial model, the query and the documents are compared through the similarity between vectors. The term weighting scheme and the similarity measures used in conjunction with this model are:

Term weighting: it measures the importance of a term in a document. In this context, several weighting techniques have been developed; most of them are based on the "TF" and "IDF" factors [9], which combine local and global term weights.

The distance measure between the query Q_k and the document D_j:

$$\mathrm{Dist}(Q_k, D_j) = \sum_{i=1}^{T} \left| q_{ki} - d_{ji} \right| \qquad (2)$$

The cosine measure, which measures the similarity between the documents and the query. This measure is also called the correlation of the document D_j relative to the query terms Q_k:

$$\mathrm{RSV}(Q_k, D_j) = \frac{\sum_{i=1}^{T} q_{ki}\, d_{ji}}{\sqrt{\sum_{i=1}^{T} q_{ki}^{2} \, \sum_{i=1}^{T} d_{ji}^{2}}} \qquad (3)$$
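To make the weighting and similarity measures concrete, the following minimal Python sketch builds TF-IDF vectors and implements equations (2) and (3). It is illustrative only: the function names are hypothetical, equation (2) is read as a sum of absolute differences, and the weight `tf * log(N / df)` is an assumed variant, since the exact weighting formula is not given in the text.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Build one TF-IDF weight vector per tokenized document.

    Assumption: weight = tf * log(N / df); the exact variant is not stated in the text.
    """
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, vectors


def dist(q, d):
    """Equation (2), read as the sum of absolute weight differences."""
    return sum(abs(qi - di) for qi, di in zip(q, d))


def rsv(q, d):
    """Equation (3): cosine of the angle between query and document vectors."""
    norm = math.sqrt(sum(qi * qi for qi in q) * sum(di * di for di in d))
    if norm == 0:
        return 0.0  # convention chosen here for empty vectors
    return sum(qi * di for qi, di in zip(q, d)) / norm
```

A query is treated as one more token list: its vector is built over the same vocabulary and compared to each document vector with `rsv`, where larger values mean a closer match, or with `dist`, where smaller values mean a closer match.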
3.1.2. Principles and motivations of the chosen model

In the semantic evaluation approach that we propose, we opted for the vectorial model. This choice is mainly motivated by three reasons: first, the consistency of its "query/document" representation; then, the order induced by the similarity function that it uses; and finally, the ease with which its weighting functions can be adjusted to improve search results.
More precisely, in our case the vectorial model is based on a semantic vector composed of concepts rather than words. This semantic vector is the result of the semantic projection of the query on the WordNet ontology. This model therefore allowed us to build "query vectors" and "document vectors" on the basis of coefficients calculated using a weighting function. It was also the basis for measuring the similarity between the query vector and those of the documents, using a function that computes the similarity between vectors.

3.2. Choice of linguistic resource

We initially considered using a domain ontology in the medical or geographic field and exploiting collections of documents related to these fields. But we realized that this kind of ontology is generally developed by companies for their own needs, or at least is not available on the Internet. Moreover, few of them have a terminology component (terms associated with concepts). Our choice therefore fell on the WordNet ontology.
WordNet is an electronic lexical network developed since 1985 at Princeton University by a team of linguists and psycholinguists of the Cognitive Science Laboratory. The advantage of WordNet is the diversity of the information that it contains (large coverage of the English language, definition of each meaning, sets of synonyms and various semantic relations). In addition, WordNet is freely usable.
WordNet covers the majority of nouns, verbs, adjectives and adverbs of the English language. It structures them into a network of nodes and links. The nodes consist of sets of synonyms (called synsets). A term can be a single word or a collocation. Table 1 provides statistics on the number of words and concepts in WordNet in its version 3.0.

Category     Words      Concepts   Total word-sense pairs
noun         117 798    82 115     146 312
verb         11 529     13 767     25 047
adjective    21 479     18 156     30 002
adverb       4 481      3 621      5 580
Total        155 287    117 659    206 941

WordNet concepts are linked by semantic relations

[Figure (diagram), labels recovered: user query; WordNet: decomposition in terms, synsets extraction, construction of the semantic vectors, semantic vector; search engines (Bing, Google, Yahoo): results, processing HTML code, identification of relevant tags, extraction of textual content; construction of document vectors; construction of the query vector; modules labelled SPM and IEM.]
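The chain described above (decomposition of the query into terms, synset extraction, construction of the semantic vector, construction of the query and document vectors, ranking) can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the tool's implementation: `TOY_SYNSETS` is a hypothetical two-entry stand-in for WordNet, raw concept counts replace the weighting function, and all function names are invented for the example.

```python
import math
from collections import Counter

# Hypothetical stand-in for WordNet: each concept (synset) lists its member terms.
TOY_SYNSETS = {
    "car.n.01": {"car", "automobile", "auto"},
    "price.n.01": {"price", "cost"},
}


def project_query(query):
    """Decompose the query into terms and return the concepts they belong to."""
    terms = set(query.lower().split())
    return sorted(c for c, words in TOY_SYNSETS.items() if words & terms)


def concept_vector(text, concepts):
    """Count, for each concept, how many words of the text belong to its synset."""
    counts = Counter(text.lower().split())
    return [sum(counts[w] for w in TOY_SYNSETS[c]) for c in concepts]


def cosine(u, v):
    """Cosine similarity; 0.0 by convention when either vector is empty."""
    norm = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0


def semantic_rank(query, pages):
    """Rank pages (name -> extracted text) by similarity to the query's semantic vector."""
    concepts = project_query(query)
    q_vec = concept_vector(query, concepts)
    scored = [
        (cosine(q_vec, concept_vector(text, concepts)), name)
        for name, text in pages.items()
    ]
    # Highest similarity first; ties broken by page name for a stable order.
    return [name for _, name in sorted(scored, key=lambda s: (-s[0], s[1]))]
```

Because the vectors are indexed by concepts rather than words, a page that only contains synonyms of the query terms (for example "automobile" and "cost" for the query "car price") can still be ranked first.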
Figure 3: The developed tool

- Search Area: allows the user to express his information need in the form of a query, and then sends this query to the search engines.
- Ontology Area: displays the synonyms and hypernyms of the query in a tree structure. This area also allows the selection of the way in which the ranking will be made.
- Results Area: the part in which the system displays the results provided by the search engines.
- Ranking Zone: this part consists of a set of buttons; clicking a button selects the kind of ranking chosen by the user.
- Extraction Area: shows the current status of the parser in the extraction phase.

Table 2
The effectiveness comparison of two search engines

                     Google                  Yahoo
                     Classical   Semantic    Classical   Semantic
Overall average      7.62        8.29        6.93        7.02
Simple scenarios     8.15        8.82        7.76        7.89
Complex scenarios    6.19        6.94        5.23        5.52

Fig. 4. The effectiveness comparison of two search engines
This first result confirms the quality of Google, which is generally the most efficient engine and provides the best service to the user: the search engine of Sergey Brin and Larry Page scored higher on almost all the queries made. But the difference in overall average with Yahoo is not significant: only 0.69 points out of 10 in the case of classical ranking and 1.27 for semantic ranking separate the two search engines. This difference is reduced to 0.43 and 0.93 points in the case of simple queries, whereas it increases in the case of complex search scenarios (0.96 and 1.42 points).
We also find that, for the three criteria and for both search engines, semantic ranking always brings a gain in efficiency compared to the classical one.

6.2.2. Performance by criteria

Table 3
Comparison of the two search engines effectiveness by criteria

                                 Google                  Yahoo
                                 Classical   Semantic    Classical   Semantic
Relevance of the results         5.72        6.12        5.06        7.66
Rate of non-dead links           9.60        9.67        9.11        9.32
Rate of non-redundant results    8.27        7.92        7.55        7.02
Rate of non-parasite pages       9.33        9.37        8.59        8.86

Regarding dead links, the test reveals the effort of the two engines to maintain their index and avoid pointing to deleted or moved pages. On this criterion Google clearly precedes Yahoo, by 0.49 and 0.34 points (9.60 and 9.11 for classical ranking and 9.67 and 9.32 for semantic ranking). This criterion shows a slight advantage of semantic ranking over the classical one.
In terms of redundant results, Google and Yahoo again do well, with scores of 8.27 and 7.55 respectively for the classical ranking and 7.72 and 7.02 for the semantic one. From an ergonomic standpoint, moreover, Google gets a higher score with a more relevant outcome: when it displays on a page two links that point to the same site (but different pages), it takes care to group the two results and displays the second with a slight shift to the right. Visually, the user can see that the two results are related, whereas Yahoo makes no effort to cluster the results from the same site. Contrary to what was expected for this criterion, the classical ranking gives better scores than the semantic one. This is because the number of synonyms retrieved from the ontology increases the frequency of query terms in the returned documents, which promotes links coming from the same site.
Regarding parasite pages (pages listing only promotional links), Google is more effective than Yahoo at dealing with this kind of page, which is useless for advancing the user's search and otherwise distorts engine results (being merely advertising, and often poorly targeted). Scores are 9.33 and 8.59 for the classical ranking and 9.37 and 8.86 for the semantic one, so we see a better result in the case of semantic ranking.
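As a quick check of the per-criterion discussion, the gain of semantic over classical ranking can be recomputed from the scores of Table 3. The sketch below is illustrative only: the dictionary layout and the function name `semantic_gain` are assumptions made for the example and are not part of the evaluation tool.

```python
# Scores copied from Table 3: criterion -> engine -> (classical, semantic).
TABLE_3 = {
    "relevance of the results": {"Google": (5.72, 6.12), "Yahoo": (5.06, 7.66)},
    "non-dead links": {"Google": (9.60, 9.67), "Yahoo": (9.11, 9.32)},
    "non-redundant results": {"Google": (8.27, 7.92), "Yahoo": (7.55, 7.02)},
    "non-parasite pages": {"Google": (9.33, 9.37), "Yahoo": (8.59, 8.86)},
}


def semantic_gain(table):
    """Return criterion -> engine -> (semantic - classical), rounded to 2 decimals."""
    return {
        criterion: {
            engine: round(semantic - classical, 2)
            for engine, (classical, semantic) in engines.items()
        }
        for criterion, engines in table.items()
    }


gains = semantic_gain(TABLE_3)
for criterion, per_engine in gains.items():
    print(criterion, per_engine)
```

A positive value means that semantic ranking scores higher than classical ranking for that engine and criterion; with the Table 3 figures, only the non-redundant results criterion comes out negative for both engines.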
7. Conclusion
References
[1] M. Baziz, M. Boughanem, N. Aussenac-Gilles, C. Chrisment, Semantic Cores for Representing Documents in IR, In Proceedings of the 20th ACM Symposium on Applied Computing, pp. 1020-1026, ACM Press, ISBN: 1-58113-964-0, (2005).
[2] A. Bouramoul, The Semantic Dimension in Information Retrieval, from Document Indexing to Query Reformulation, In Knowledge Organization journal (KO), ISSN: 0943-7444, Vol. 38, No. 5, pp. 425-438, Ergon-Verlag, Würzburg, Germany, (2011).
[3] A. Bouramoul, M-K. Kholladi, B-L. Doan, How Ontology Can be Used to Improve Semantic Information Retrieval: The AnimSe Finder Tool, In International Journal of Computer Applications (IJCA), ISSN: 0975-8887, Vol. 21, No. 9, pp. 48-54, FCS, US, (2011).
[4] A. Bouramoul, Recherche d'Information Contextuelle et Sémantique sur le Web, PhD Thesis in computer science, Constantine University, Algeria & Supelec, France, (2011).
[5] T.R. Gruber, A translation approach to portable ontology specifications, Knowledge Acquisition, 5 (2), pp. 199-220, (1993).
[6] S. Koo, S.Y. Lim, S.J. Lee, Building an Ontology based on Hub Words for Informational Retrieval, In Proceedings of the IEEE/WIC International Conference on Web Intelligence, (2003).
[7] C. Pruski, Une approche adaptative pour la recherche d'information sur le Web, PhD Thesis in computer science, Université du Luxembourg et Université Paris-Sud, (2009).
[8] J. Saias, P. Quaresma, A Methodology to Create Ontology-Based Information Retrieval Systems, In Proceedings of the EPIA Conference, pp. 424-434, (2003).
[9] P. Soucy, G.W. Mineau, Beyond TFIDF Weighting for Text Categorization in the Vector Space Model, In Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI 2005), Edinburgh, Scotland, (2005).
[10] D. Vallet, M. Fernández, P. Castells, An Ontology-Based Information Retrieval Model, In Proceedings of the 2nd European Semantic Web Conference, pp. 455-470, (2005).