An ontology-based approach for semantic ranking of the web search engines results



Abdelkrim Bouramoul a,*, Mohamed-Khireddine Kholladi a and Bich-Liên Doan b


a Computer Science Department, MISC Laboratory, University of Mentouri Constantine, B.P. 325, Constantine 25017, Algeria
b Computer Science Department, SUPELEC, Rue Joliot-Curie, 91192 Gif-sur-Yvette, France

Abstract. This work falls within the areas of information retrieval and the semantic web, and aims to improve the evaluation of web search tools. The huge amount of information on the web, together with the growing number of inexperienced users, creates new challenges for information retrieval. Current search engines (such as Google, Bing and Yahoo) certainly offer an efficient way to browse the web content; however, this type of tool does not take into account the semantics carried by the query terms and the document words. This paper proposes a new semantics-based approach for the evaluation of information retrieval systems; the goal is to increase the selectivity of search tools and to improve how these tools are evaluated. Testing the proposed approach on the evaluation of search engines has demonstrated its applicability to real search tools. The results show that semantic evaluation is a promising way to improve the performance and behavior of search engines as well as the relevance of the results that they return.

Keywords: Information Retrieval, Semantic Web, Ontology, Results Ranking, Web Search Engines.

1. Introduction

Information Retrieval (IR) is a domain concerned with the structure, analysis, organization, storage, search and discovery of information. The challenge is to find, among the large amount of available documents, those that best fit the user's needs. The operationalization of IR is performed by software tools called Information Retrieval Systems (IRS); these systems are designed to match the representation of the user's needs against the representation of the document content by means of a matching function. Evaluating an IRS consists in measuring its performance with respect to the user's needs; for this purpose, the evaluation methods widely adopted in IR are based on models that provide a basis for the comparative evaluation of the effectiveness of different systems using common resources. IR, IRS and the evaluation of IRS are the three inseparable elements of the domain in which the problematic of this work is located.

In this context, several questions arise regarding the improvement of the information retrieval process and the manner in which the returned results are evaluated. The aim is therefore to find answers to the two following questions: How can we improve information retrieval by taking semantics into account? And how can we ensure a semantic evaluation of the responses returned by information retrieval tools?

This paper is organized as follows: we first present related work and give the principle of the proposed approach, defining its parameters in terms of the chosen information retrieval model and the linguistic resource used. We then present the modules developed to build the general architecture of our proposal and describe the developed tool. We finally present the experimental evaluation of our approach and discuss the obtained results.

* Corresponding author. E-mail: [email protected]

2. Related Work

2.1. Ontology definition

Several definitions of ontology have emerged over the last twenty years, but the most referenced and synthetic one is probably that given by Gruber: "an ontology is an explicit specification of a conceptualization" [5]. Based on this definition, ontologies are used in the IR field to represent shared and more or less formal domain descriptions in order to add a semantic layer to the IRS.

2.2. Ontologies, a clear need in IR

It is natural that works relating to the integration of ontologies in IRS are growing. A first solution is to build an ontology from the corpus on which the IR tasks will be performed [8], [6]. A second solution is the reuse of existing resources; in this case, ontologies are generally chosen from the knowledge domain that they address [1], [10]. Ontologies as a support for the modeling of IRS have been studied in a previous article [2]. In general, the contribution of ontologies to an IRS can be understood at three levels:

• In the document indexing process: by combining the ontology with natural language processing techniques, the documents in the database are summarized and linked to the ontology concepts. If this step is properly done, the subsequent search becomes easier. This principle was already used in our work [3].
• At the query reformulation level, in order to improve the initial user queries. This aspect was also used as a complement to our proposal [3].
• In the information filtering process; this aspect is the subject of the contribution that we present in this paper. The idea is to use the ontology to add a semantic dimension to the evaluation process. This is done by extracting the query terms and projecting them semantically, using the WordNet ontology, onto the set of returned documents. The result of this projection is used to extract the concepts related to each term, thus building a semantic vector which is the basis of the results classification. This vector is used primarily for creating the query vector and the document vectors used by the vectorial model that we adopted.

3. Theoretical foundations of the proposed approach

We present in this section the theoretical basis on which our proposal is built.

These features guide the semantic evaluation approach that we propose. In this paper we are interested specifically in the semantic evaluation of the results returned by search engines. For this purpose, our choice fell on three search engines (Google, Yahoo and Bing). This choice is motivated by their popularity in the web community on the one hand, and by the degree of selectivity that they offer on the other. More precisely, our system allows to:

• Retrieve the results returned by the search engines.
• Check the information content of each returned page.
• Project the user query onto the linguistic resource, the WordNet ontology in our case.
• Measure the relevance of the results by calculating the relevance degree of each of them.
• Generate a semantic ranking of the results according to the calculated relevance, based on their degree of informativeness.
• Assign a score to each search engine based on the position of its results in the new ranking.

This system is based on the one hand on a linguistic resource (the WordNet ontology) for the semantic projection of the query, and on the other hand on a calculation model for measuring the 'document/query' relevance (the vectorial model). In the following we justify our choices in terms of the chosen linguistic resource and the IR model used.
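To make this flow concrete, the sketch below restates the steps listed above as a minimal, self-contained pipeline. It is purely illustrative and is not the implementation described in Section 4: the names (SearchResult, expand_query, relevance, evaluate) are ours, the query expansion and relevance functions are crude stand-ins for the WordNet projection and the weighting formulas, and the engine score is a simple mean-position heuristic assumed only for this example.

```python
from dataclasses import dataclass

@dataclass
class SearchResult:
    engine: str     # "google", "yahoo" or "bing"
    rank: int       # position in the engine's original ranking (1-based)
    title: str
    abstract: str
    url: str
    text: str       # textual content extracted from the page

def expand_query(query: str) -> set[str]:
    # Placeholder for the WordNet projection: the real module adds the
    # synonyms and hypernyms of every query term (see Section 4.3).
    return set(query.lower().split())

def relevance(expanded_query: set[str], text: str) -> float:
    # Crude stand-in for formulas (1)-(3): fraction of document words
    # that belong to the expanded query.
    words = text.lower().split()
    return sum(w in expanded_query for w in words) / max(len(words), 1)

def evaluate(query: str, results: list[SearchResult]) -> dict[str, float]:
    expanded = expand_query(query)
    # Re-rank all retrieved results by decreasing semantic relevance.
    ranked = sorted(results, key=lambda r: relevance(expanded, r.text),
                    reverse=True)
    # Score each engine from the positions of its results in the new
    # ranking (lower mean position -> higher score); assumed heuristic.
    scores: dict[str, float] = {}
    for engine in {r.engine for r in results}:
        positions = [i + 1 for i, r in enumerate(ranked) if r.engine == engine]
        scores[engine] = 1.0 / (sum(positions) / len(positions))
    return scores
```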
3.1. Choice of information retrieval model

The role of an IR model is to provide a formalization of the information finding process. Defining an information retrieval model leads to the determination of a theoretical framework on which the representation of the information units and the formalization of the system's relevance function are based.

3.1.1. Summary of IR models

We have given, as part of our previous work [4], an overview of the most common information retrieval models. We recall the basics of each of them in order to justify our choice of the model that fits best with our proposal. Figure 1 shows the three IR models that we studied.

[Fig. 1. Information Retrieval Models: the Boolean, vectorial and probabilistic models.]

The Boolean model is based on keyword manipulation. On the one hand, a document (D) is represented by a combination of keywords; on the other hand, a query (R) is represented by a logical expression composed of words connected by the Boolean operators (AND, OR, NOT). The Boolean model uses an exact matching mode: it returns only the documents corresponding exactly to the query. This model is widely used both for bibliographic databases and for web search engines.

The vectorial model recommends representing user queries and documents as vectors in the space generated by all the terms. Formally, documents and queries are vectors in a vectorial space of dimension N.

Finally, the probabilistic model uses a mathematical model based on the theory of probability. In general, the probabilistic model has the advantage of unifying the representations of documents and concepts. However, it relies on assumptions of independence of the variables that are not always verified, which taints the similarity measures with inaccuracy.

3.1.2. Principles and motivations of the chosen model

In the semantic evaluation approach that we propose, we opted for the vectorial model. This choice is mainly motivated by three reasons: first, the consistency of its "query/document" representation; then, the ordering induced by the similarity function that it uses; and finally, the ease with which it allows the weighting functions to be adjusted in order to improve the search results.

More precisely, in our case the vectorial model is based on a semantic vector composed of concepts rather than words. This semantic vector is the result of the semantic projection of the query on the WordNet ontology. The model therefore allows us to build "query vectors" and "document vectors" on the basis of coefficients calculated using a weighting function. It is also the basis for measuring the similarity between the query vector and the document vectors, using a function that calculates the similarity between vectors. The term weighting scheme and the similarity measures used in conjunction with this model are the following.

Term weighting: it measures the importance of a term in a document. Several weighting techniques have been developed in this context; most of them are based on the "TF" and "Idf" factors [9], which combine local and global term weights:

• TF (Term Frequency): this measure is proportional to the frequency of the word in the document (local weighting).
• Idf (Inverse Document Frequency): this factor measures the importance of a term in the entire collection (global weighting).

The "TF*Idf" measure gives a good approximation of the importance of a word in a document, especially in corpora whose documents have a similar size. However, it ignores an important aspect of the document: its length. For this reason we use the following standard formula [7]:

TF_{D_i} = \frac{\sum occ(w)}{card(D_i)}    (1)

where occ(w) is the number of occurrences of a term w in the document D_i and card(D_i) is the total number of words of D_i.

Similarity measure: two similarity measures between each document and the same query are calculated by our system:

• The distance measure in the vectorial space:

Dist(Q_k, D_j) = \sum_{i=1}^{T} |q_{ki} - d_{ji}|    (2)

• The cosine measure of the similarity between a document and the query, also called the correlation of the document D_j with respect to the query Q_k:

RSV(Q_k, D_j) = \frac{\sum_{i=1}^{T} q_{ki} d_{ji}}{\sqrt{\sum_{i=1}^{T} q_{ki}^2}\,\sqrt{\sum_{i=1}^{T} d_{ji}^2}}    (3)
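To illustrate how formulas (1)-(3) can be read, the fragment below computes the length-normalized term frequency for a tokenized document and the two similarity values for toy query and document vectors. It is a minimal sketch under our own reading of the formulas, not the authors' code; in particular, formula (2) is interpreted here as a Manhattan distance with absolute values.

```python
import math

def tf(term: str, document: list[str]) -> float:
    """Formula (1): occurrences of the term divided by the document length."""
    return document.count(term) / len(document) if document else 0.0

def distance(q: list[float], d: list[float]) -> float:
    """Formula (2): term-by-term distance between query and document vectors."""
    return sum(abs(qi - di) for qi, di in zip(q, d))

def rsv(q: list[float], d: list[float]) -> float:
    """Formula (3): cosine correlation between query and document vectors."""
    num = sum(qi * di for qi, di in zip(q, d))
    den = math.sqrt(sum(qi * qi for qi in q)) * math.sqrt(sum(di * di for di in d))
    return num / den if den else 0.0

if __name__ == "__main__":
    query_vec = [1.0, 0.5, 0.0]
    doc_vec = [0.8, 0.4, 0.1]
    print(distance(query_vec, doc_vec), rsv(query_vec, doc_vec))
```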
3.2. Choice of linguistic resource

We initially thought of using a domain ontology in the medical or geographic field and of exploiting collections of documents related to these fields. But we realized that this kind of ontology is generally developed by companies for their own needs and, in any case, is not available on the Internet. Moreover, few of them have a terminological component (terms associated with the concepts). Our choice was therefore oriented towards the WordNet ontology.

WordNet is an electronic lexical network developed since 1985 at Princeton University by a team of linguists and psycholinguists of the Cognitive Science Laboratory. The advantage of WordNet is the diversity of the information that it contains (large coverage of the English language, definition of each meaning, sets of synonyms and various semantic relations). In addition, WordNet is freely usable.
WordNet covers the majority of the nouns, verbs, adjectives and adverbs of the English language, which it structures into a network of nodes and links. The nodes consist of sets of synonyms (called synsets). A term can be a single word or a collocation. Table 1 provides statistics on the number of words and concepts in WordNet version 3.0.

Table 1
Characteristics of the WordNet 3.0 ontology

Category     Words      Concepts    Total word-sense pairs
Noun         117 798     82 115     146 312
Verb          11 529     13 767      25 047
Adjective     21 479     18 156      30 002
Adverb         4 481      3 621       5 580
Total        155 287    117 659     206 941

WordNet concepts are linked by semantic relations. The basic relationship between the terms of the same synset is synonymy. Moreover, the different synsets are linked by various semantic relations, such as the subsumption (hyponymy-hypernymy) relation and the meronymy-holonymy composition relation.

4. Presentation of the proposed approach

In order to ensure a coherent modeling of our proposal, we have created a number of modules, each of which ensures a separate functionality. The combination of these modules then allowed us to build the general architecture of the system. These modules are interrelated in the sense that the outputs of each module are the inputs of the next. Figure 2 shows how the different modules are connected to define the general architecture describing our approach.

[Fig. 2. General architecture of the proposed approach: the user query is sent by the Search Module (SM) to Bing, Google and Yahoo; the Information Extraction Module (IEM) processes the HTML code and extracts the textual content; the Semantic Projection Module (SPM) decomposes the query into terms, extracts the WordNet synsets (synonyms and hypernyms) and builds the semantic vector; the Calculation Module (CM) constructs the document and query vectors and computes the similarity measures; the Ranking Module (RM) produces the new ranking and the search engine scores; the Presentation Module (PM) presents the results with their semantic details.]

We present these modules in the following; specifically, we describe the inputs, the outputs and the principle of operation of each of them.

4.1. Search Module (SM)

In order to implement our proposal, our choice was fixed on the three search engines Google, Yahoo and Bing, which currently represent the search tools most used by the web community.

The search module transmits the user query to the search engines Google, Yahoo and Bing, and retrieves the first 20 responses returned by each of them. This set of results represents the information content to be evaluated. The choice of the top 20 results is justified by the fact that they represent the links usually visited by the user among all the returned results; they are the ones that contain the most relevant answers.

We note, however, that this number can be expanded to cover all the returned results; the logical consequence is that the processing time becomes longer in this case.
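A possible organization of this search step is sketched below. It is only a skeleton under stated assumptions: the EngineResult fields, the Fetcher callable and the top_k parameter are our own naming, and the actual querying of Google, Yahoo and Bing (official APIs or the per-engine HTML parsers discussed in Section 4.2) is deliberately left abstract.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EngineResult:
    engine: str
    rank: int       # position in the engine's own ranking (1-based)
    title: str
    abstract: str
    url: str

# One fetcher per engine: given a query, return that engine's ranked results.
# How each fetcher is implemented (API call, HTML scraping, cached run) is
# engine-specific and outside the scope of this sketch.
Fetcher = Callable[[str], list[EngineResult]]

def collect_results(query: str,
                    fetchers: dict[str, Fetcher],
                    top_k: int = 20) -> list[EngineResult]:
    """Gather the first `top_k` results of every engine for one query."""
    collected: list[EngineResult] = []
    for engine, fetch in fetchers.items():
        collected.extend(fetch(query)[:top_k])
    return collected
```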
4.2. Information Extraction Module (IEM)

This module supports the extraction of the information content of the web pages returned by the search module. It mainly recovers the information contained in the HTML tags describing, respectively, the title, the abstract and the URL of each result. This treatment is performed for the first two pages containing the 20 results returned by each of the three search engines.

Indeed, the results page returned by a search engine, in its raw state, contains HTML formatting and presentation tags; these do not provide useful information and should not be taken into account by the evaluation. In this context, we proceed with the purification (cleaning) of the resulting HTML pages before collecting the URLs of the pages to visit (those which are to be evaluated).

The difference in the structure and the format used by the three search engines forced us to implement an HTML parser for each of them, so as to adapt the purification and recovery process to the structure that each engine uses. Once the purification is complete, the page corresponding to each link is opened and its contents are processed to prepare the data for the evaluation. This treatment is provided by the extraction module and includes:

• Parsing the HTML code of the current page from the URL in question.
• Processing the HTML tags: the page code (its information content) is processed to retrieve only the content that lies behind the tags found useful in our case.

4.3. Semantic Projection Module (SPM)

In order to take semantics into account when generating the new classification, we associate with each query term the set of words that are semantically related to it. The idea is to project the query terms onto the ontology concepts using the two semantic relations 'synonymy' and 'hypernymy' in order to extract the different senses of the query. Thereafter, all the concepts recovered for each term are used in conjunction with the term itself during the weighting performed by the calculation module. The aim is to promote a document that contains words semantically close to what the user is looking for, even if those words do not appear as terms of the query.

We use the WordNet ontology for this purpose as follows: initially we access the part of the ontology containing the concepts and the semantic relations; the latter are used to retrieve all the synsets relating to each term of the query. These synsets are finally used to build the semantic vector that contains, for each query term, the appropriate synonyms and hypernyms.
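As a concrete illustration of this projection, the sketch below uses NLTK's WordNet interface to collect, for each query term, the synonyms and the hypernyms of its synsets, which together form the semantic vector described above. This is our own minimal rendering, not the authors' implementation: it assumes the NLTK WordNet corpus is available, splits the query on whitespace, and performs no word-sense disambiguation.

```python
from nltk.corpus import wordnet as wn   # requires: nltk.download("wordnet")

def project_term(term: str) -> set[str]:
    """Collect the synonyms and hypernyms of a term from WordNet."""
    related: set[str] = set()
    for synset in wn.synsets(term):
        # Synonymy: the other lemmas of each synset of the term.
        related.update(lemma.name().replace("_", " ")
                       for lemma in synset.lemmas())
        # Hypernymy: the lemmas of the direct hypernyms of each synset.
        for hypernym in synset.hypernyms():
            related.update(lemma.name().replace("_", " ")
                           for lemma in hypernym.lemmas())
    related.discard(term)
    return related

def semantic_vector(query: str) -> dict[str, set[str]]:
    """Map every query term to the concepts semantically related to it."""
    return {term: project_term(term) for term in query.lower().split()}

# Example: semantic_vector("jaguar speed") associates "jaguar" with lemmas
# such as "panther" and hypernyms such as "big cat", which are then weighted
# together with the original term by the calculation module.
```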
4.4. Calculation Module (CM)

Once the textual content and the semantic vector are built, the calculation module performs the construction of the document and query vectors on the basis of coefficients calculated with the appropriate weighting function (formula (1)). The calculation module then measures the similarity between these two vectors using the similarity functions between vectors (formulas (2) and (3)). The operation of this module is performed in two steps:

a. Term weighting: this step takes into account the weight of the terms in the documents. It proceeds as follows:

• a coefficient d_ij of the document vector D_j measures the weight of the term i in the document j, according to formula (1);
• a coefficient q_i of the query vector Q measures the weight of the term i in all the documents.

b. "Document/query" matching: the comparison between the document vector and the query vector comes down to calculating a score that represents the relevance of the document with regard to the query. This value is calculated with the distance formula (2) and the correlation formula (3). The matching function is thus very closely related to the weighting of the query terms and of the documents to be evaluated.

4.5. Ranking Module (RM)

The role of the similarity function is to order the documents before they are returned to the user. Indeed, users generally examine only the beginning of the result list; therefore, if the desired documents are not presented in this section of the results, users consider the search engine as badly adjusted to their information needs, and the results that it returns are considered irrelevant. In this context, the role of the ranking module is to finalize the semantic evaluation process by adapting the system relevance to the user's one.

At this stage of the evaluation process, each document is described by two similarity values generated by the calculation module. Based on the distance between the document vector and the query vector, the ranking module performs the scheduling of the results so that the document with the lowest distance value, and therefore the highest relevance, is ranked first, and so on until all the results are properly arranged.

This module also supports the relevance measure of the search engine itself. This is done by assigning a relevance score to each of the three search engines (Google, Yahoo and Bing). This score is calculated by comparing the ranking produced by each search engine with the new semantic ranking generated by our approach.

4.6. Presentation Module (PM)

Search engine results are generally presented as a list of links accompanied by a title and an abstract describing the content of each page. Before being presented to the user, these results are ordered according to the relevance score assigned by the algorithms of each search engine.

In the approach that we propose, in accordance with our general principle for displaying the search results, the presentation module supports the display part once the results have been processed. Specifically, this module provides a summary of the search session as follows:

• All the results returned in response to the query, where each result is represented by a triplet (title, abstract, URL). These results are semantically ranked according to the principle of the proposed approach.
• The semantic relevance score associated with each result.
• The set of concepts related to each query term. These concepts are retrieved from the WordNet ontology and presented as a tree.

5. The developed tool

To demonstrate the applicability of the proposed approach, we have developed a tool for the semantic evaluation of the results returned by search engines. To this end, it was necessary to develop a simple interface allowing the user to perform certain checks on the current evaluation session. This interface is based on the following components:

• The global view that summarizes the state and the initial ranking of all the responses returned by the three search engines Google, Yahoo and Bing.
• The formulation of the query and the various concepts obtained after its projection on the ontology.
• The ability to choose the type of ranking to be made.

Figure 3 shows the main window of this tool.

[Figure 3: The developed tool]

• Search area: allows the user to express his information need in the form of a query and then send this query to the search engines.
• Ontology area: displays the synonyms and hypernyms of the query in a tree structure. This area also allows the selection of the way in which the ranking will be made.
• Results area: the part in which the system displays the results provided by the search engines.
• Ranking zone: this part consists of a set of buttons; clicking a button selects the kind of ranking of the results chosen by the user.
• Extraction area: shows the current status of the parser during the extraction phase.

6. Test of the proposed approach

6.1. The method used

The objective of this experimentation is to measure the contribution of the inclusion of semantics in the ranking of the results returned by search engines. The idea is to display the results in two different ways: first, the default ranking as proposed by the search engine, which we call 'classical ranking', and second, a ranking generated by our system, which schedules the results according to the ontology-driven approach that we propose and which we refer to as 'semantic ranking'. This test aims to measure the users' satisfaction by comparing, for the same set of queries, both types of result rankings.

To this end, we are interested in the first 20 results in order to measure the performance of each search engine according to the two ranking types (classical and semantic). We also treated the cases of redundant results, parasite links and dead links.

We have studied the results of Google and Yahoo on a series of 25 search scenarios, including 15 simple scenarios covering the range of the current needs of a user (simple queries on the themes of travel, consumption, news and culture) and 10 complex scenarios (rare words or specialized searches). In total, 25 queries and 500 results were screened using a scoring grid.

6.2. Results and Discussion

6.2.1. General Performance

Table 2
The effectiveness comparison of the two search engines

                       Google                    Yahoo
                  Classical  Semantic      Classical  Semantic
Overall average      7.62      8.29           6.93      7.02
Simple scenarios     8.15      8.82           7.76      7.89
Complex scenarios    6.19      6.94           5.23      5.52
[Fig. 4. The effectiveness comparison of the two search engines]

This first result confirms the quality of Google, which is generally the most efficient engine and returns the best service to the user: the search engine of Sergey Brin and Larry Page scored higher on almost all the queries made. But the difference in the overall average with Yahoo is not significant: only 0.69 points out of 10 in the case of the classical ranking and 1.27 for the semantic ranking separate the two search engines. This difference is reduced to 0.43 and 0.93 points in the case of simple queries, whereas it increases in the case of complex search scenarios (0.96 and 1.42 points).

We also find that, for the three measures of Table 2 and for both search engines, the semantic ranking always brings a gain in efficiency compared to the classical one.

6.2.2. Performance by criteria

Table 3
Comparison of the two search engines' effectiveness by criteria

                                    Google                    Yahoo
                               Classical  Semantic      Classical  Semantic
Results relevance                 5.72      6.12           5.06      7.66
Rate of non-dead links            9.60      9.67           9.11      9.32
Rate of non-redundant results     8.27      7.92           7.55      7.02
Rate of non-parasite pages        9.33      9.37           8.59      8.86

[Fig. 5. Comparison of the two search engines' effectiveness by criteria]

With respect to the relevance of the results, the difference between the two search engines (0.66 points for the classical ranking and 1.54 for the semantic ranking) is noticeably larger than that of the total score. This is explained in particular by the more relevant results of Google for complex searches. However, both engines are above the average for this criterion. We also note that, for both search engines, the semantic ranking improves the relevance of the results, especially in the case of Yahoo, where the gain in terms of relevance amounts to 2.60 points.

Regarding the dead links, the test reveals the effort of the two engines to maintain their index and avoid pointing to deleted or moved pages. On this criterion Google very clearly precedes Yahoo, by 0.49 and 0.34 points (9.60 and 9.11 for the semantic ranking, 9.67 and 9.32 for the classical ranking). This criterion shows a slight advance of the semantic ranking compared to the classical one.

In terms of redundant results, Google and Yahoo again do well, with scores of respectively 8.27 and 7.55 for the classical ranking and 7.72 and 7.02 for the semantic one. Ergonomically, moreover, Google gets a higher score with a more relevant presentation: when it displays on one page two links that point to the same site (but to different pages), it takes care to group the two results and displays the second one with a slight shift to the right; visually, the user can see that the two results are related. Yahoo, on the other hand, makes no effort to cluster the results of the same site. Contrary to what was expected for this criterion, the classical ranking gives better scores than the semantic one; this is because the number of synonyms retrieved from the ontology increases the frequency of the query terms in the returned documents, which favors links coming from the same site.

Regarding the parasite pages (pages listing only promotional links), Google is more effective than Yahoo at handling this kind of page, which is useless for advancing the user's search and distorts the engine results (being merely advertising and often poorly targeted). The scores are 9.33 and 8.59 for the classical ranking and 9.37 and 8.86 for the semantic one, so we see a better result in the case of the semantic ranking.

7. Conclusion

In this paper, we have presented our contribution to the semantic evaluation of the results returned by search engines. This approach is not specific to a particular type of search tool; it is rather generic, because the ontology that we used is not specific to a particular domain.

The structuring of the proposed approach into a set of modules aims to define a modular and flexible architecture, in the sense that any adjustment or change in one module does not affect the functioning of the other modules. Our proposal consists of six modules that provide the following functionality: first, the recovery of the web pages containing the responses of the search engines and the extraction of the information that will be evaluated; thereafter, the projection of the query terms onto the concepts of the ontology; then the evaluation itself, which constructs the document and query vectors in order to generate a semantic ranking of the results returned by the search engines according to the similarity functions used; and finally, the presentation of the evaluation results to the user.

References
[1] M. Baziz, M. Boughanem, N. Aussenac-Gilles, C. Chrisment, Semantic Cores for Representing Documents in IR, in: Proceedings of the 20th ACM Symposium on Applied Computing, ACM Press, ISBN 1-58113-964-0, pp. 1020-1026, 2005.
[2] A. Bouramoul, The Semantic Dimension in Information Retrieval, from Document Indexing to Query Reformulation, Knowledge Organization (KO), ISSN 0943-7444, Vol. 38, No. 5, pp. 425-438, Ergon-Verlag, Würzburg, Germany, 2011.
[3] A. Bouramoul, M-K. Kholladi, B-L. Doan, How Ontology Can be Used to Improve Semantic Information Retrieval: The AnimSe Finder Tool, International Journal of Computer Applications (IJCA), ISSN 0975-8887, Vol. 21, No. 9, pp. 48-54, FCS, US, 2011.
[4] A. Bouramoul, Recherche d'Information Contextuelle et Sémantique sur le Web, PhD Thesis in computer science, Constantine University, Algeria & Supelec, France, 2011.
[5] T.R. Gruber, A translation approach to portable ontology specifications, Knowledge Acquisition, 5(2), pp. 199-220, 1993.
[6] S. Koo, S.Y. Lim, S.J. Lee, Building an Ontology based on Hub Words for Informational Retrieval, in: Proceedings of the IEEE/WIC International Conference on Web Intelligence, 2003.
[7] C. Pruski, Une approche adaptative pour la recherche d'information sur le Web, PhD Thesis in computer science, Université du Luxembourg et Université Paris-Sud, 2009.
[8] J. Saias, P. Quaresma, A Methodology to Create Ontology-Based Information Retrieval Systems, in: Proceedings of the EPIA Conference, pp. 424-434, 2003.
[9] P. Soucy, G.W. Mineau, Beyond TFIDF Weighting for Text Categorization in the Vector Space Model, in: Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI 2005), Edinburgh, Scotland, 2005.
[10] D. Vallet, M. Fernández, P. Castells, An Ontology-Based Information Retrieval Model, in: Proceedings of the 2nd European Semantic Web Conference, pp. 455-470, 2005.
