Semantic Indexing using WordNet Senses
Rada Mihalcea and Dan Moldovan
Department of Computer Science and Engineering
Southern Methodist University
Dallas, Texas, 75275-0122
{rada, moldovan}@seas.smu.edu
Abstract
We describe in this paper a boolean Information Retrieval system that adds word semantics to the classic word-based indexing. Two of the main tasks of our system, namely the indexing and retrieval components, use a combined word-based and sense-based approach. The key to our system is a methodology for building semantic representations of open text, at word and collocation level. This new technique, called semantic indexing, shows improved effectiveness over classic word-based indexing techniques.
1 Introduction
The main problem with the traditional boolean word-based approach to Information Retrieval (IR) is that it usually returns too many results, or wrong results, to be useful. Keywords often have multiple lexical functionalities (i.e., they can have various parts of speech) or several semantic senses. Also, relevant information can be missed by not specifying the exact keywords.
The solution is to include more information in the documents to be indexed, so as to enable a system to retrieve documents based on the words, regarded as lexical strings, or based on the semantic meaning of the words. With this idea in mind, we designed an IR system which performs combined word-based and sense-based indexing and retrieval. The inputs to our system consist of a question/query and a set of documents from which
the information has to be retrieved. We add lexical and semantic information to both the query and the documents, during a preprocessing phase in which the input question and the texts are disambiguated. The disambiguation process relies on contextual information, and identifies the meaning of words based on WordNet (Fellbaum, 1998) senses; WordNet 1.6 is used in our system. As described in the fourth section, we have opted for a disambiguation algorithm which is semi-complete (it disambiguates about 55% of the nouns and verbs), but is highly precise (over 92% accuracy), instead of using a complete but less precise disambiguation. A part of speech tag is also appended to each word. After adding these lexical and semantic tags to the words, the documents are ready to be indexed: the index is created using the words as lexical strings (to ensure word-based retrieval), and the semantic tags (for sense-based retrieval).
Once the index is created, an input query is answered using the document retrieval component of our system. First, the query is fully disambiguated; then, it is adapted to a specific format which incorporates semantic information, as found in the index, and uses the AND and OR operators implemented in the retrieval module.
Hence, using semantic indexing, we try to solve the two main problems of the IR systems described earlier: (1) relevant information is not missed by not specifying the exact keywords, since with the new tags added to the words we also retrieve words which are semantically related to the input keywords; (2) using the sense-based component of our retrieval system, the number of results returned from a search can be reduced, by specifying exactly the lexical functionality and/or the meaning of an input keyword.
The system was tested using the Cranfield standard test collection. This collection consists of 1400 documents, SGML formatted, from the aerodynamics field. From the 225 questions associated with this data set, we have randomly selected 50 questions and built for each of them three types of queries: (1) a query that uses only keywords selected from the question, stemmed using the WordNet stemmer (words are stemmed based on WordNet definitions, using the morphstr function); (2) a query that uses the keywords from the question and the synsets for these keywords (the words in WordNet are organized in synonym sets, called synsets; a synset is associated with a particular sense of a word, and thus we use sense-based and synset-based interchangeably); and (3) a query that uses the keywords from the question, the synsets for these keywords, and the synsets of the keywords' hypernyms. All these types of queries have been run against the semantic index described in this paper. Comparative results indicate the performance benefits of a retrieval system that uses combined word-based and synset-based indexing and retrieval over classic word-based indexing.
2 Related Work
There are three main approaches reported in the literature regarding the incorporation of semantic information into IR systems: (1) conceptual indexing, (2) query expansion and (3) semantic indexing. The first is based on ontological taxonomies, while the last two make use of Word Sense Disambiguation algorithms.

2.1 Conceptual Indexing
The usage of concepts for document indexing is a relatively new trend within the IR field. Concept matching is a technique that has been used in limited domains, like the legal field, where conceptual indexing has been applied by (Stein, 1997). The FERRET system (Mauldin, 1991) is another example of how concept identification can improve IR systems.
To our knowledge, the most intensive work in this direction was performed by Woods (Woods, 1997), at Sun Microsystems Laboratories. He creates custom-built ontological taxonomies, based on subsumption and morphology, for the purpose of indexing and retrieving documents. Comparing the performance of the system that uses conceptual indexing with the performance obtained using classical retrieval techniques showed improvements in both performance and recall. He also defines a new measure, called success rate, which indicates whether a question has an answer in the top ten documents returned by a retrieval system. The success rate obtained in the case of conceptual indexing was 60%, with respect to a maximum of 45% obtained using other retrieval systems. This is a significant improvement and shows that semantics can have a strong impact on the effectiveness of IR systems.
The experiments described in (Woods, 1997) refer to small collections of text, as for example the Unix manual pages (about 10MB of text). But, as shown in (Ambroziak, 1997), this is not a limitation; conceptual indexing can be successfully applied to much larger text collections, and even used in Web browsing.
2.2 Query Expansion
Query expansion has been proved to have positive effects in retrieving relevant information (Lu and Keefer, 1994). The purpose of query expansion can be either to broaden the set of documents retrieved or to increase the retrieval precision. In the former case, the query is expanded with terms similar to the words from the original query, while in the second case the expansion procedure adds completely new terms. There are two main techniques used in expanding an original query. The first one considers the use of Machine Readable Dictionaries; (Moldovan and Mihalcea, 2000) and (Voorhees, 1994) make use of WordNet to enlarge the query such that it includes words
which are semantically related to the concepts from the original query. The basic semantic relation used in their systems is the synonymy relation. This technique requires the disambiguation of the words in the input query, and it was reported that this method can be useful if the sense disambiguation is highly accurate. The other technique for query expansion is to use relevance feedback, as in SMART (Buckley et al., 1994).
2.3 Semantic Indexing
The usage of word senses in the process of document indexing is a much debated topic. The basic idea is to index word meanings, rather than words taken as lexical strings. A survey of the efforts to incorporate WSD into IR is presented in (Sanderson, 2000). Experiments performed by different researchers led to various, sometimes contradictory, results. Nevertheless, the conclusion which can be drawn from all these experiments is that a highly accurate Word Sense Disambiguation algorithm is needed in order to obtain an increase in the performance of IR systems.
Ellen Voorhees (Voorhees, 1998; Voorhees, 1999) tried to resolve word ambiguity in the collection of documents, as well as in the query, and then compared the results obtained with the performance of a standard run. Even though she used different weighting schemes, the overall results showed a degradation in IR effectiveness when word meanings were used for indexing. Still, as she pointed out, the precision of the WSD technique has a dramatic influence on these results. She states that a better WSD can lead to an increase in IR performance.
A rather "artificial" experiment in the same direction of semantic indexing is provided in (Sanderson, 1994). He uses pseudo-words to test the utility of disambiguation in IR. A pseudo-word is an artificially created ambiguous word, like for example "banana-door" (pseudo-words were first introduced in (Yarowsky, 1993) as a means of testing WSD accuracy without the costs associated with the acquisition of sense-tagged corpora). Different levels of ambiguity were introduced in the set of documents prior to indexing. The conclusion drawn was that WSD has little impact on IR performance, to the point that only a WSD algorithm with over 90% precision could help IR systems.
The reasons for the results obtained by Sanderson have been discussed in (Schutze and Pedersen, 1995). They argue that the usage of pseudo-words does not always provide an accurate measure of the effect of WSD on IR performance. It is shown that, in the case of pseudo-words, high-frequency word types carry the majority of the senses of a pseudo-word, i.e., the word ambiguity is not realistically modeled. Moreover, (Schutze and Pedersen, 1995) performed experiments which showed that semantics can actually help retrieval performance. They reported an increase in precision of up to 7% when sense-based indexing is used alone, and up to 14% for a combined word-based and sense-based indexing.
One of the largest studies regarding the applicability of word semantics to IR is reported by Krovetz (Krovetz and Croft, 1993; Krovetz, 1997). When talking about word ambiguity, he collapses both the morphological and the semantic aspects of ambiguity, and refers to them as polysemy and homonymy. He shows that word senses should be used in addition to word-based indexing, rather than indexing on word senses alone, basically because of the uncertainty involved in sense disambiguation. He extensively studied the effect of lexical ambiguity on IR; the experiments described provide a clear indication that word meanings can improve the performance of a retrieval system.
(Gonzalo et al., 1998) performed experiments in sense-based indexing: they used the SMART retrieval system and a manually disambiguated collection (SemCor). It turned out that indexing by synsets can increase recall by up to 29% with respect to word-based indexing. Part of their experiments was the simulation of a WSD algorithm with error rates of 5%, 10%, 20%, 30% and 60%: they found that error rates of up to 10% do not substantially affect precision, and that a system with WSD errors below 30% still performs better than a standard run. The results of their experiments are encouraging, and prove that an accurate WSD algorithm can significantly help IR systems.
We propose here a system which tries to combine the benefits of word-based and synset-based indexing. Both words and synsets are indexed in the input text, and the retrieval is then performed using either one or both of these sources of information. The key to our system is a WSD method for open text.
3 System Architecture
There are three main modules used by this system:
1. Word Sense Disambiguation (WSD) module, which performs a semi-complete but precise disambiguation of the words in the documents. Besides semantic information, this module also adds a part of speech tag to each word and stems the word using the WordNet stemming algorithm. Every document in the input set is processed with this module. The output is a new document in which each word is replaced with the format Pos|Stem|POS|Offset, where Pos is the position of the word in the text, Stem is the stemmed form of the word, POS is the part of speech, and Offset is the offset of the WordNet synset in which this word occurs. When no sense is assigned by the WSD module, or when the word cannot be found in WordNet, the last field is left empty. A sketch of this token format is given after this list.
2. Indexing module, which indexes the documents after they are processed by the WSD module. From the new format of a word, as returned by the WSD module, the Stem and, separately, the Offset|POS are added to the index. This enables the retrieval of the words, regarded as lexical strings, or the retrieval of the synset of the words (which amounts to retrieving the given sense of the word together with its synonyms).
3. Retrieval module, which retrieves documents based on an input query. As we are using combined word-based and synset-based indexing, we can retrieve documents containing (1) the input keywords, (2) the input keywords with an assigned sense, or (3) synonyms of the input keywords.
4 Word Sense Disambiguation
As stated earlier, WSD is performed on both the query and the documents from which information has to be retrieved.
The WSD algorithm used for this purpose is an iterative algorithm, first presented in (Mihalcea and Moldovan, 2000). It determines, in a given text, a set of nouns and verbs which can be disambiguated with high precision. The semantic tagging is performed using the senses defined in WordNet.
In this section, we present the various
methods used to identify the correct sense of a
word. Then, we describe the main algorithm
in which these procedures are invoked in an
iterative manner.
PROCEDURE 1. This procedure identifies the proper nouns in the text, and marks them as having sense #1.
Example. "Hudson" is identified as a proper noun and marked with sense #1.
PROCEDURE 2. Identify the words having only one sense in WordNet (monosemous words), and mark them with sense #1.
Example. The noun subcommittee has one sense defined in WordNet. Thus, it is a monosemous word and can be marked as having sense #1.
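The monosemy test of Procedure 2 amounts to a single lookup in any WordNet interface. A sketch with NLTK is shown below; note that the paper used WordNet 1.6, so the sense inventory in current WordNet versions may differ.

    # Sketch of Procedure 2 using NLTK's WordNet interface.
    from nltk.corpus import wordnet as wn

    def procedure2(word, pos=wn.NOUN):
        """Return sense #1 if the word is monosemous for this part of speech."""
        return 1 if len(wn.synsets(word, pos)) == 1 else None

    # procedure2("subcommittee") -> 1 (a single noun sense in WordNet)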
PROCEDURE 3. For a given word Wi, at position i in the text, form two pairs, one with the word before Wi (the pair Wi-1 Wi) and the other with the word after Wi (the pair Wi Wi+1). Determiners or conjunctions cannot
be part of these pairs. Then, we extract all the occurrences of these pairs found within the semantically tagged corpus formed by the 179 texts from SemCor (Miller et al., 1993). If, in all the occurrences, the word Wi has only one sense #k, and the number of occurrences of this sense is larger than 3, then mark the word Wi as having sense #k.
Example. Consider the word approval in the text fragment "committee approval of". The pairs formed are "committee approval" and "approval of". No occurrences of the first pair are found in the corpus. Instead, there are four occurrences of the second pair, and in all these occurrences the sense of approval is sense #1. Thus, approval is marked with sense #1.
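Procedure 3 can be sketched as below. The mapping pair_senses, from word pairs to per-sense occurrence counts in SemCor, is a hypothetical precomputed structure; building it from the 179 SemCor texts is not shown.

    # Sketch of Procedure 3: disambiguation through SemCor pair statistics.
    def procedure3(word, left, right, pair_senses, min_count=4):
        # Mark the word with sense #k only if, in all SemCor occurrences of
        # one of its pairs, the word shows the single sense k, seen more
        # than 3 times.
        for pair in [(left, word), (word, right)]:
            counts = pair_senses.get(pair, {})        # {sense: count}
            if len(counts) == 1:
                sense, n = next(iter(counts.items()))
                if n >= min_count:
                    return sense
        return None  # the word stays ambiguous

    # procedure3("approval", "committee", "of", pair_senses) -> 1
    # (four occurrences of "approval of", all with sense #1)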
PROCEDURE 4. For a given noun N in the text, determine the noun-context of each of its senses. This noun-context is a list of nouns which can occur within the context of a given sense i of the noun N. In order to form the noun-context for every sense Ni, we determine all the concepts in the hypernym synsets of Ni. Also, using SemCor, we determine all the nouns which occur within a window of 10 words with respect to Ni.
All of these nouns, determined using WordNet and SemCor, constitute the noun-context of Ni. We can now calculate the number of common words between this noun-context and the original text in which the noun N is found. Applying this procedure to all the senses of the noun N provides an ordering over its possible senses. We pick the sense i of the noun N which: (1) is at the top of this ordering, and (2) is separated from the next sense in the ordering by more than a given threshold.
Example. The word diameter, as it appears in document 1340 from the Cranfield collection, has two senses. The common words found between the noun-contexts of its senses and the text are: for diameter#1, {property, hole, ratio}, and for diameter#2, {form}. For this text, the threshold was set to 1, and thus we pick diameter#1 as the correct sense (there is a difference larger than 1 between the number of nouns in the two sets).
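Under our reading of the procedure, the sense ordering and threshold test could be sketched as follows; noun_context is a hypothetical callable returning the noun-context set of one sense, built from WordNet hypernyms and SemCor windows as described above.

    # Sketch of Procedure 4: rank senses by noun-context overlap with the text.
    def procedure4(senses, text_nouns, noun_context, threshold=1):
        scored = sorted(((len(noun_context(s) & text_nouns), s) for s in senses),
                        reverse=True)
        if len(scored) == 1:
            return scored[0][1]
        (best_overlap, best_sense), (next_overlap, _) = scored[0], scored[1]
        # accept the top sense only if it wins by more than the threshold,
        # as in the diameter#1 ({property, hole, ratio}) vs. diameter#2
        # ({form}) example: 3 - 1 > 1.
        return best_sense if best_overlap - next_overlap > threshold else None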
PROCEDURE 5. Find words which are semantically connected to the already disambiguated words, with a connection distance of 0. The distance is computed based on the WordNet hierarchy; two words are semantically connected at a distance of 0 if they belong to the same synset.
Example. Consider these two words appearing in the text to be disambiguated: authorize and clear. The verb authorize is a monosemous word, and thus it is disambiguated with Procedure 2. One of the senses of the verb clear, namely sense #4, appears in the same synset with authorize#1, and thus clear is marked as having sense #4.
PROCEDURE 6. Find words which are semantically connected to each other, and for which the connection distance is 0. This procedure is weaker than Procedure 5: none of the words considered by this procedure are already disambiguated. We have to consider all the senses of both words in order to determine whether or not the distance between them is 0, and this makes the procedure computationally intensive.
Example. For the words measure and bill, both of them ambiguous, this procedure tries to find two possible senses of these words which are at a distance of 0, i.e., belong to the same synset. The senses found are measure#4 and bill#1, and thus the two words are marked with their corresponding senses.
PROCEDURE 7. Find words which are semantically connected to the already disambiguated words, and for which the connection distance is at most 1. Again, the distance is computed based on the WordNet hierarchy; two words are semantically connected at a maximum distance of 1 if they are synonyms or they belong to a hypernymy/hyponymy relation.
Example. Consider the nouns subcommittee and committee. The first one is disambiguated with Procedure 2, and thus it is marked with sense #1. The word committee with its sense #1 is semantically linked with the word subcommittee by a hypernymy relation. Hence, we semantically tag this word with sense #1.
PROCEDURE 8. Find words which are semantically connected to each other, and for which the connection distance is at most 1. This procedure is similar to Procedure 6: both words are ambiguous, and thus all their senses have to be considered in the process of finding the distance between them.
Example. The words gift and donation are both ambiguous. This procedure finds gift with sense #1 as being the hypernym of donation, also with sense #1. Therefore, both words are disambiguated and marked with their assigned senses.
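Procedures 5 through 8 all rely on the same synset-distance test, at distance 0 (same synset) or distance 1 (direct hypernym/hyponym). A sketch with NLTK's WordNet interface follows; since the paper used WordNet 1.6, sense numbers and synset contents may differ in current WordNet versions.

    # Sketch of the distance test behind Procedures 5-8.
    from nltk.corpus import wordnet as wn

    def connection_distance(word1, sense1, word2, sense2, pos=wn.VERB):
        """Return 0 if the two senses share a synset, 1 if one is a direct
        hypernym of the other, else None. Senses are 1-based as in the paper."""
        s1 = wn.synsets(word1, pos)[sense1 - 1]
        s2 = wn.synsets(word2, pos)[sense2 - 1]
        if s1 == s2:
            return 0
        if s1 in s2.hypernyms() or s2 in s1.hypernyms():
            return 1
        return None

    # Procedure 5 accepts a pair at distance 0, e.g. authorize#1 / clear#4
    # in WordNet versions where the two share a synset.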
The procedures presented above are applied
iteratively. This allows us to identify a set of
nouns and verbs which can be disambiguated
with high precision. About 55% of the nouns
and verbs are disambiguated with over 92%
accuracy.
Algorithm
Step 1. Pre-process the text. This implies tokenization and part-of-speech tagging. The part-of-speech tagging task is performed with high accuracy using an improved version of Brill's tagger (Brill, 1992). At this step, we also identify the complex nominals, based on WordNet definitions. For example, the word sequence "pipeline companies" is found in WordNet, and thus it is identified as a single concept. There is also a list of words which we do not attempt to disambiguate. These words are marked with a special flag to indicate that they should not be considered in the disambiguation process. So far, this list consists of three verbs: be, have, do.
Step 2. Initialize the Set of Disambiguated Words (SDW) with the empty set, SDW = {}. Initialize the Set of Ambiguous Words (SAW) with the set formed by all the nouns and verbs in the input text.
Step 3. Apply Procedure 1. The named entities identified here are removed from SAW and added to SDW.
Step 4. Apply Procedure 2. The monosemous words found here are removed from SAW and added to SDW.
Step 5. Apply Procedure 3. This step allows us to disambiguate words based on their occurrence in the semantically tagged corpus. The words whose sense is identified with this procedure are removed from SAW and added to SDW.
Step 6. Apply Procedure 4. This will identify a set of nouns which can be disambiguated based on their noun-contexts.
Step 7. Apply Procedure 5. This procedure tries to identify a synonymy relation between the words from SAW and SDW. The words disambiguated are removed from SAW and added to SDW.
Step 8. Apply Procedure 6. This step is different from the previous one, as the synonymy relation is sought among words in SAW (no SDW words involved). The words disambiguated are removed from SAW and added to SDW.
Step 9. Apply Procedure 7. This step tries to identify words from SAW which are linked at a distance of at most 1 with the words from SDW. Remove the words disambiguated from SAW and add them to SDW.
Step 10. Apply Procedure 8. This procedure finds words from SAW connected at a distance of at most 1. As in Step 8, no words from SDW are involved. The words disambiguated are removed from SAW and added to SDW.
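Putting the steps together, the control flow can be sketched as a single pass in which each procedure moves words from SAW to SDW. The helper nouns_and_verbs and the procedure functions are assumed implementations of the descriptions above, not the authors' code.

    # Sketch of Steps 2-10: apply procedures 1-8 in order.
    def disambiguate(text, procedures):
        sdw = {}                                 # Set of Disambiguated Words: word -> sense
        saw = set(nouns_and_verbs(text))         # Set of Ambiguous Words
        for proc in procedures:                  # [procedure1, ..., procedure8]
            newly_tagged = proc(text, sdw, saw)  # returns {word: sense}
            sdw.update(newly_tagged)             # add to SDW ...
            saw -= set(newly_tagged)             # ... and remove from SAW
        return sdw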
Results
To determine the accuracy and the recall of the disambiguation method presented here, we have performed tests on 6 randomly selected files from SemCor. The following files have been used: br-a01, br-a02, br-k01, br-k18, br-m02, br-r05. Each of these files was split into smaller files with a maximum of 15 lines each. This size limit is based on our observation that small contexts reduce the applicability of procedures 5-8, while large contexts become a source of errors. Thus, we have created a benchmark with 52 texts, on which we have tested the disambiguation method.
In Table 1, we present the results obtained for these 52 texts. The first column indicates the file for which the results are presented. The average number of nouns and verbs considered by the disambiguation algorithm for each of these files is shown in the second column.
Table 1: Results for the WSD algorithm applied on 52 texts

File      No. words   Proc.1+2      Proc.3        Proc.4        Proc.5+6      Proc.7+8
                      No.   Acc.    No.   Acc.    No.   Acc.    No.   Acc.    No.   Acc.
br-a01    132         40    100%    43    99.7%   58.5  94.6%   63.8  92.7%   73.2  89.3%
br-a02    135         49    100%    52.5  98.5%   68.6  94%     75.2  92.4%   81.2  91.4%
br-k01    68.1        17.2  100%    23.3  99.7%   38.1  97.4%   40.3  97.4%   41.8  96.4%
br-k18    60.4        18.1  100%    20.7  99.1%   26.6  96.9%   27.8  95.3%   29.8  93.2%
br-m02    63          17.3  100%    20.3  98.1%   26.1  95%     26.8  94.9%   30.1  93.9%
br-r05    72.5        14.3  100%    16.6  98.1%   27    93.2%   30.2  91.5%   34.2  89.1%
AVERAGE   88.5        25.9  100%    29.4  98.8%   40.8  95.2%   44    94%     48.4  92.2%
Columns 3 and 4 present the average number of words disambiguated with procedures 1 and 2, and the accuracy obtained with these procedures. Columns 5 and 6 present the average number of words disambiguated and the accuracy obtained after applying procedure 3 (cumulative results). The cumulative results obtained after applying procedure 4, procedures 5 and 6, and procedures 7 and 8 are shown in columns 7 and 8, 9 and 10, and 11 and 12, respectively.
The novelty of this method consists in the fact that the disambiguation process is done in an iterative manner. The procedures described above are applied so as to build a set of words which are disambiguated with high accuracy: 55% of the nouns and verbs are disambiguated with a precision of 92.22%.
The most important improvements expected on the WSD problem concern precision and speed. In the case of our approach to WSD, we can also talk about the need for increased recall, meaning that we want to obtain a larger number of words which can be disambiguated in the input text. The precision of more than 92% obtained during our experiments is very high, considering that WordNet, the dictionary used for sense identification, is very fine grained and sometimes the senses are very close to each other. The accuracy obtained is close to the precision achieved by humans in sense disambiguation.
5 Indexing and Retrieval
The indexing process takes a group of document files and produces a new index. Such things as unique document identifiers, proper
SGML tags, and other artificial constructs are ignored. In the current version of the system, we are using only the AND and OR boolean operators. Future versions will consider the implementation of the NOT and NEAR operators.
The information obtained from the WSD module is used by the main indexing process, where the word stem and location are indexed along with the WordNet synset (if present). Collocations are indexed at each location where a member of the collocation occurs.
All elements of the document are indexed. This includes, but is not limited to, dates, numbers, document identifiers, the stemmed words, collocations, WordNet synsets (if available), and even those terms which other indexers consider stop words. The only items currently excluded from the index are punctuation marks which are not part of a word or collocation.
The benefit of this form of indexing is that documents may be retrieved using stemmed words or using synset offsets. Using synset offset values has the added benefit of retrieving documents which do not contain the original stemmed word, but do contain synonyms of the original word.
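The double posting that makes this possible can be sketched as follows; the in-memory dictionary stands in for whatever index structure the system actually uses.

    # Sketch of the combined word/synset index: each token is posted under
    # its stem and, when disambiguated, under its Offset|POS key as well.
    from collections import defaultdict

    index = defaultdict(set)   # term -> set of (doc_id, position)

    def index_token(doc_id, token):
        pos, stem, pos_tag, offset = token.split("|")
        index[stem].add((doc_id, int(pos)))
        if offset:                       # empty when undisambiguated
            index[f"{offset}|{pos_tag}"].add((doc_id, int(pos)))

    # index_token("cranfield0305", "3|effect|NN|7766144") posts the token
    # under both "effect" and "7766144|NN".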
The retrieval process is limited to the use of the Boolean operators AND and OR. There is an auxiliary front end to the retrieval engine which allows the user to enter a textual query, such as "What financial institutions are found along the banks of the Nile?". The auxiliary front end then uses the WSD module to disambiguate the query and build a Boolean query for the standard retrieval engine.
For the preceding example, the auxiliary front end would build the query: (financial_institution OR 60031M|NN) AND (bank OR 6800223|NN) AND (Nile OR 6826174|NN), where the numbers in the query represent the offsets of the synsets in which the words, with their determined meanings, occur.
Once a list of documents meeting the query requirements has been determined, the complete text of each matching document is retrieved and presented to the user.

6 An Example
Consider, for example, the following question: "Has anyone investigated the effect of surface mass transfer on hypersonic viscous interactions?". The question processing involves part of speech tagging, stemming and word sense disambiguation. The question becomes: "Has anyone investigate|VB|1535831 the effect|NN|7766144 of surface|NN|3447223 mass|NN|3923435 transfer|NN|132095 on hypersonic|JJ viscous|JJ interaction|NN|7840572".
The selection of the keywords is not an easy task; it is performed using the set of 8 heuristics presented in (Moldovan et al., 1999). Because of space limitations, we do not detail here the heuristics and the algorithm used for keyword selection. The main idea is that an initial number of keywords is determined using a subset of these heuristics. If no documents are retrieved, more keywords are added; conversely, if too many documents are retrieved, some of the keywords are dropped, in the reverse order in which they were added.
For each question, three types of query are formed, using the AND and OR operators:
1. QWNStem. Keywords from the question, stemmed based on WordNet, concatenated with the AND operator.
2. QWNOffset. Keywords from the question, stemmed based on WordNet, concatenated using the OR operator with the associated synset offset, and concatenated with the AND operator among them.
3. QWNHyperOffset. Keywords from the question, stemmed based on WordNet, concatenated using the OR operator with the associated synset offset and with the offset of the hypernym synset, and concatenated with the AND operator among them.
All these types of queries are run against the semantic index created based on words and synset offsets. We denote the corresponding runs RWNStem, RWNOffset and RWNHyperOffset. The three query formats, for the given question, are presented below:
QWNStem: effect AND surface AND mass AND transfer AND interaction
QWNOffset: (effect OR 7766144|NN) AND (surface OR 3447223|NN) AND (mass OR 3923435|NN) AND (transfer OR 132095|NN) AND (interaction OR 7840572|NN)
QWNHyperOffset: (effect OR 7766144|NN OR 20461|NN) AND (surface OR 3447223|NN OR 11937|NN) AND (mass OR 3923435|NN OR 3912591|NN) AND (transfer OR 132095|NN OR 130470|NN) AND (interaction OR 7840572|NN OR 7770957|NN)
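The assembly of the three query variants from the disambiguated keywords is mechanical; a sketch is given below, where the tuple layout and function name are ours for illustration only.

    # Sketch of building the QWNStem / QWNOffset / QWNHyperOffset queries.
    def build_queries(keywords):
        """keywords: list of (stem, pos, offset, hypernym_offset) tuples."""
        q_stem = " AND ".join(stem for stem, *_ in keywords)
        q_offset = " AND ".join(
            f"({stem} OR {off}|{pos})" for stem, pos, off, _ in keywords)
        q_hyper = " AND ".join(
            f"({stem} OR {off}|{pos} OR {hyp}|{pos})"
            for stem, pos, off, hyp in keywords)
        return q_stem, q_offset, q_hyper

    # build_queries([("effect", "NN", 7766144, 20461), ...]) reproduces the
    # three formats shown above.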
Using the first type of query, 7 documents were found, out of which 1 was considered relevant. With the second and third types of query, we obtained 11, respectively 17 documents, out of which 4 were found relevant and actually contained the answer to the question.
(sample answer) ... the present report gives an account of the development of an approximate theory to the problem of hypersonic strong viscous interaction on a flat plate with mass-transfer at the plate surface. the disturbance flow region is divided into inviscid and viscous flow regions ... (cranfield0305).

7 Results
The system was tested on the Cranfield collection, including 1400 documents, SGML formatted (a demo is available online at http://pdpl3.seas.smu.edu/rada/sem.ind./). From the 225 questions provided with this collection, we randomly selected 50 questions and used them to create a benchmark against which we have performed the three runs described in the previous sections: RWNStem, RWNOffset and RWNHyperOffset. For each of these questions, the system forms three types of queries, as described above. Below, we present 10 of these questions and show the results obtained in Table 2.
1. Has anyone investigated the effect of surface mass transfer on hypersonic viscous interactions?
2. What is the combined effect of surface heat and mass transfer on hypersonic flow?
3. What are the existing solutions for hypersonic viscous interactions over an insulated flat plate?
4. What controls leading-edge attachment at transonic velocities?
5. What are wind-tunnel corrections for a two-dimensional aerofoil mounted off-centre in a tunnel?
6. What is the present state of the theory of quasi-conical flows?
7. References on the methods available for accurately estimating aerodynamic heat transfer to conical bodies for both laminar and turbulent flow.
8. What parameters can seriously influence natural transition from laminar to turbulent flow on a model in a wind tunnel?
9. Can a satisfactory experimental technique be developed for measuring oscillatory derivatives on slender sting-mounted models in supersonic wind tunnels?
10. Recent data on shock-induced boundary-layer separation.
Three measures are used in the evaluation of the system performance: (1) precision, defined as the number of relevant documents retrieved over the total number of documents retrieved; (2) recall, defined as the number of relevant documents retrieved over the total number of relevant documents found in the collection; and (3) F-measure, which combines precision and recall into a single formula:

F-measure = ((β² + 1) · P · R) / (β² · P + R)

where P is the precision, R is the recall, and β is the relative importance given to recall over precision. In our case, we consider precision and recall of equal importance, and thus the factor β in our evaluation is 1.
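As an illustration (with numbers chosen only for the arithmetic, not taken from the experiments): with β = 1 the formula reduces to F = 2PR/(P + R), so a run with P = 0.50 and R = 0.25 would score F = (2 · 0.50 · 0.25)/(0.50 + 0.25) ≈ 0.33.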
The tests over the entire set of 50 questions led to 0.22 precision and 0.25 recall when the WordNet stemmer is used, and 0.23 precision and 0.29 recall when using the combined word-based and synset-based indexing. The usage of hypernym synsets led to a recall of 0.32 and a precision of 0.21.
The relative gain of the combined word-based and synset-based indexing with respect to the basic word-based indexing was a 16% increase in recall and a 4% increase in precision. When using the hypernym synsets, there is a 28% increase in recall, with a 9% decrease in precision.
The conclusion of these experiments is that indexing by synsets, in addition to the classic word-based indexing, can actually improve IR effectiveness. Moreover, this is the first time to our knowledge that a WSD algorithm for open text was actually used to automatically disambiguate a collection of texts prior to indexing, with a disambiguation accuracy high enough to increase the recall and precision of an IR system.
An issue which can be raised here is the efficiency of such a system: we have introduced a WSD stage into the classic IR process, and it is well known that WSD algorithms are usually computationally intensive. On the other hand, the disambiguation of a text collection is a process which can be highly parallelized, and thus this no longer constitutes a problem.
Table 2: Results for 10 questions run against the three indices created on the Cranfield collection. The bottom line shows the results for the entire set of questions.

Question        RWNStem                      RWNOffset                    RWNHyperOffset
number    recall  precision  f-measure  recall  precision  f-measure  recall  precision  f-measure
1         0.08    0.14       0.05       0.31    0.36       0.17       0.31    0.24       0.14
2         0.06    0.17       0.04       0.25    0.44       0.16       0.25    0.31       0.14
3         0.47    0.70       0.28       0.47    0.70       0.28       0.53    0.67       0.30
4         0.25    0.60       0.18       0.25    0.60       0.18       0.25    0.60       0.18
5         0.33    0.50       0.20       1.00    0.25       0.20       1.00    0.19       0.16
6         0.00    0.00       0.00       0.00    0.00       0.00       0.00    0.00       0.00
7         0.17    0.17       0.09       0.17    0.17       0.09       0.17    0.17       0.09
8         0.20    0.11       0.07       0.20    0.11       0.07       0.20    0.11       0.07
9         0.67    0.50       0.29       0.67    0.50       0.29       1.00    0.38       0.28
10        0.29    0.07       0.06       0.29    0.07       0.06       0.29    0.06       0.05
Avg/50    0.25    0.22       0.09       0.29    0.23       0.11       0.32    0.21       0.10
8 Conclusions
The full understanding of text is still an elusive goal. Short of that, semantic indexing offers an improvement over current IR techniques. The key to semantic indexing is fast WSD of large collections of documents. In this paper we offer a WSD method for open domains that is fast and accurate. Since only 55% of the words can be disambiguated so far, we use a hybrid indexing approach that combines word-based and sense-based indexing. The senses in WordNet are fine-grained, and the WSD method has to cope with this. The WSD algorithm presented here is new for the NLP community and proves to be well suited for a task such as semantic indexing.
The continuously increasing amount of information available today requires more and more sophisticated IR techniques, and semantic indexing is one of the new trends in improving IR effectiveness. With semantic indexing, the search may be expanded to other forms of semantically related concepts, as done by Woods (Woods, 1997). Finally, semantic indexing can have an impact on the semantic Web technology that is under consideration (Hellman, 1999).
References
J. Ambroziak. 1997. Conceptually assisted Web browsing. In Sixth International World Wide Web Conference, Santa Clara, CA. Full paper available online at http://www.scope.gmd.de/info/www6/posters/702/guide2.html.
E. Brill. 1992. A simple rule-based part of speech tagger. In Proceedings of the 3rd Conference on Applied Natural Language Processing, Trento, Italy.
C. Buckley, G. Salton, J. Allan, and A. Singhal. 1994. Automatic query expansion using SMART: TREC 3. In Proceedings of the Text REtrieval Conference (TREC-3), pages 69-81.
C. Fellbaum. 1998. WordNet, An Electronic Lexical Database. The MIT Press.
J. Gonzalo, F. Verdejo, I. Chugur, and J. Cigarran. 1998. Indexing with WordNet synsets can improve text retrieval. In Proceedings of the COLING-ACL '98 Workshop on Usage of WordNet in Natural Language Processing Systems, Montreal, Canada, August.
R. Hellman. 1999. A semantic approach adds meaning to the Web. Computer, pages 13-16.
R. Krovetz and W.B. Croft. 1993. Lexical ambiguity and information retrieval. ACM Transactions on Information Systems, 10(2):115-141.
R. Krovetz. 1997. Homonymy and polysemy in information retrieval. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics (ACL-97), pages 72-79.
X.A. Lu and R.B. Keefer. 1994. Query expansion/reduction and its impact on retrieval effectiveness. In The Text REtrieval Conference (TREC-3), pages 231-240.
M.L. Mauldin. 1991. Retrieval performance in FERRET: a conceptual information retrieval system. In Proceedings of the 14th International ACM-SIGIR Conference on Research and Development in Information Retrieval, pages 347-355, Chicago, IL, October.
R. Mihalcea and D.I. Moldovan. 2000. An iterative approach to word sense disambiguation. In Proceedings of FLAIRS-2000, pages 219-223, Orlando, FL, May.
G. Miller, C. Leacock, T. Randee, and R. Bunker. 1993. A semantic concordance. In Proceedings of the 3rd DARPA Workshop on Human Language Technology, pages 303-308, Plainsboro, New Jersey.
D. Moldovan and R. Mihalcea. 2000. Using WordNet and lexical operators to improve Internet searches. IEEE Internet Computing, 4(1):34-43.
D. Moldovan, S. Harabagiu, M. Pasca, R. Mihalcea, R. Goodrum, R. Girju, and V. Rus. 1999. LASSO: A tool for surfing the answer net. In Proceedings of the Text REtrieval Conference (TREC-8), November.
M. Sanderson. 1994. Word sense disambiguation and information retrieval. In Proceedings of the 17th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, pages 142-151. Springer-Verlag.
M. Sanderson. 2000. Retrieving with good sense. Information Retrieval, 2(1):49-69.
H. Schutze and J. Pedersen. 1995. Information retrieval based on word senses. In Proceedings of the 4th Annual Symposium on Document Analysis and Information Retrieval, pages 161-175.
J.A. Stein. 1997. Alternative methods of indexing legal material: Development of a conceptual index. In Proceedings of the Conference "Law Via the Internet 97", Sydney, Australia.
E.M. Voorhees. 1994. Query expansion using lexical-semantic relations. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 61-69, Dublin, Ireland.
E.M. Voorhees. 1998. Using WordNet for text retrieval. In WordNet, An Electronic Lexical Database, pages 285-303. The MIT Press.
E.M. Voorhees. 1999. Natural language processing and information retrieval. In Information Extraction: Towards Scalable, Adaptable Systems, Lecture Notes in Artificial Intelligence #1714, pages 32-48.
W.A. Woods. 1997. Conceptual indexing: A better way to organize knowledge. Technical Report SMLI TR-97-61, Sun Microsystems Laboratories, April. Available online at http://www.sun.com/research/techrep/1997/abstract-61.html.
D. Yarowsky. 1993. One sense per collocation. In Proceedings of the ARPA Human Language Technology Workshop.