
Semantic indexing using WordNet senses

2000, Proceedings of the ACL-2000 workshop …


Semantic Indexing using WordNet Senses

Rada Mihalcea and Dan Moldovan
Department of Computer Science and Engineering
Southern Methodist University
Dallas, Texas, 75275-0122
{rada, moldovan}@seas.smu.edu

Abstract

We describe in this paper a boolean Information Retrieval system that adds word semantics to the classic word-based indexing. Two of the main tasks of our system, namely the indexing and retrieval components, use a combined word-based and sense-based approach. The key to our system is a methodology for building semantic representations of open text, at word and collocation level. This new technique, called semantic indexing, shows improved effectiveness over classic word-based indexing techniques.

1 Introduction

The main problem with the traditional boolean word-based approach to Information Retrieval (IR) is that it usually returns too many results, or wrong results, to be useful. Keywords often have multiple lexical functionalities (i.e. they can take various parts of speech) or several semantic senses. Also, relevant information can be missed by not specifying the exact keywords. The solution is to include more information in the documents to be indexed, so as to enable a system to retrieve documents based on the words, regarded as lexical strings, or based on the semantic meaning of the words. With this idea in mind, we designed an IR system which performs a combined word-based and sense-based indexing and retrieval.

The inputs to IR systems consist of a question/query and a set of documents from which the information has to be retrieved. We add lexical and semantic information to both the query and the documents, during a preprocessing phase in which the input question and the texts are disambiguated. The disambiguation process relies on contextual information and identifies the meaning of words based on the senses defined in WordNet (Fellbaum, 1998); WordNet 1.6 is used in our system. As described in the fourth section, we have opted for a disambiguation algorithm which is semi-complete (it disambiguates about 55% of the nouns and verbs) but highly precise (over 92% accuracy), instead of using a complete but less precise disambiguation. A part of speech tag is also appended to each word. After adding these lexical and semantic tags to the words, the documents are ready to be indexed: the index is created using the words as lexical strings (to ensure a word-based retrieval) and the semantic tags (for the sense-based retrieval). Once the index is created, an input query is answered using the document retrieval component of our system. First, the query is fully disambiguated; then, it is adapted to a specific format which incorporates semantic information, as found in the index, and uses the AND and OR operators implemented in the retrieval module.

Hence, using semantic indexing, we try to solve the two main problems of the IR systems described earlier: (1) relevant information is not missed by not specifying the exact keywords, since with the new tags added to the words we also retrieve words which are semantically related to the input keywords; (2) using the sense-based component of our retrieval system, the number of results returned from a search can be reduced, by specifying exactly the lexical functionality and/or the meaning of an input keyword.

The system was tested using the Cranfield standard test collection.
This collection consists of 1400 documents, SGML formatted, from the aerodynamics field. From the 225 questions associated with this data set, we have randomly selected 50 questions and built for each of them three types of queries: (1) a query that uses only keywords selected from the question, stemmed using the WordNet stemmer (words are stemmed based on WordNet definitions, using the morphstr function); (2) a query that uses the keywords from the question and the synsets for these keywords (the words in WordNet are organized in synonym sets, called synsets; a synset is associated with a particular sense of a word, and thus we use sense-based and synset-based interchangeably); and (3) a query that uses the keywords from the question, the synsets for these keywords and the synsets for the keywords' hypernyms. All these types of queries have been run against the semantic index described in this paper. Comparative results indicate the performance benefits of a retrieval system that uses a combined word-based and synset-based indexing and retrieval over the classic word-based indexing.

2 Related Work

There are three main approaches reported in the literature regarding the incorporation of semantic information into IR systems: (1) conceptual indexing, (2) query expansion and (3) semantic indexing. The first is based on ontological taxonomies, while the last two make use of Word Sense Disambiguation algorithms.

2.1 Conceptual indexing

The usage of concepts for document indexing is a relatively new trend within the IR field. Concept matching is a technique that has been used in limited domains, like the legal field, where conceptual indexing has been applied by (Stein, 1997). The FERRET system (Mauldin, 1991) is another example of how concept identification can improve IR systems. To our knowledge, the most intensive work in this direction was performed by Woods (Woods, 1997), at Sun Microsystems Laboratories. He creates custom-built ontological taxonomies based on subsumption and morphology for the purpose of indexing and retrieving documents. Comparing the performance of the system that uses conceptual indexing with the performance obtained using classical retrieval techniques resulted in increased performance and recall. He also defines a new measure, called success rate, which indicates whether a question has an answer in the top ten documents returned by a retrieval system. The success rate obtained in the case of conceptual indexing was 60%, with respect to a maximum of 45% obtained using other retrieval systems. This is a significant improvement and shows that semantics can have a strong impact on the effectiveness of IR systems. The experiments described in (Woods, 1997) refer to small collections of text, as for example the Unix manual pages (about 10MB of text). But, as shown in (Ambroziak, 1997), this is not a limitation; conceptual indexing can be successfully applied to much larger text collections, and even used in Web browsing.

2.2 Query Expansion

Query expansion has been proved to have positive effects in retrieving relevant information (Lu and Keefer, 1994). The purpose of query expansion can be either to broaden the set of documents retrieved or to increase the retrieval precision. In the former case, the query is expanded with terms similar to the words from the original query, while in the second case the expansion procedure adds completely new terms. There are two main techniques used in expanding an original query. The first one considers the use of a Machine Readable Dictionary; (Moldovan and Mihalcea, 2000) and (Voorhees, 1994) make use of WordNet to enlarge the query so that it includes words which are semantically related to the concepts from the original query. The basic semantic relation used in their systems is the synonymy relation. This technique requires the disambiguation of the words in the input query, and it was reported that this method can be useful if the sense disambiguation is highly accurate. The other technique for query expansion is to use relevance feedback, as in SMART (Buckley et al., 1994).

2.3 Semantic indexing

The usage of word senses in the process of document indexing is a much debated topic. The basic idea is to index word meanings, rather than words taken as lexical strings. A survey of the efforts of incorporating WSD into IR is presented in (Sanderson, 2000). Experiments performed by different researchers led to various, sometimes contradictory results. Nevertheless, the conclusion which can be drawn from all these experiments is that a highly accurate Word Sense Disambiguation algorithm is needed in order to obtain an increase in the performance of IR systems.

Ellen Voorhees (Voorhees, 1998; Voorhees, 1999) tried to resolve word ambiguity in the collection of documents, as well as in the query, and then compared the results obtained with the performance of a standard run. Even though she used different weighting schemes, the overall results showed a degradation in IR effectiveness when word meanings were used for indexing. Still, as she pointed out, the precision of the WSD technique has a dramatic influence on these results. She states that a better WSD can lead to an increase in IR performance.

A rather "artificial" experiment in the same direction of semantic indexing is provided in (Sanderson, 1994). He uses pseudo-words to test the utility of disambiguation in IR. A pseudo-word is an artificially created ambiguous word, like for example "banana-door" (pseudo-words were introduced for the first time in (Yarowsky, 1993), as a means of testing WSD accuracy without the costs associated with the acquisition of sense-tagged corpora). Different levels of ambiguity were introduced in the set of documents prior to indexing. The conclusion drawn was that WSD has little impact on IR performance, to the point that only a WSD algorithm with over 90% precision could help IR systems. The reasons for the results obtained by Sanderson have been discussed in (Schutze and Pedersen, 1995). They argue that the usage of pseudo-words does not always provide an accurate measure of the effect of WSD on IR performance. It is shown that in the case of pseudo-words, high-frequency word types hold the majority of the senses of a pseudo-word, i.e. word ambiguity is not realistically modeled. Moreover, (Schutze and Pedersen, 1995) performed experiments which have shown that semantics can actually help retrieval performance. They reported an increase in precision of up to 7% when sense-based indexing is used alone, and up to 14% for a combined word-based and sense-based indexing.

One of the largest studies regarding the applicability of word semantics to IR is reported by Krovetz (Krovetz and Croft, 1993), (Krovetz, 1997).
When talking about word ambiguity, he collapses both the morphological and semantic aspects of ambiguity, and refers to them as polysemy and homonymy. He shows that word senses should be used in addition to word-based indexing, rather than indexing on word senses alone, basically because of the uncertainty involved in sense disambiguation. He extensively studied the effect of lexical ambiguity on IR; the experiments described provide a clear indication that word meanings can improve the performance of a retrieval system.

(Gonzalo et al., 1998) performed experiments in sense-based indexing: they used the SMART retrieval system and a manually disambiguated collection (SemCor). It turned out that indexing by synsets can increase recall by up to 29% with respect to word-based indexing. Part of their experiments was the simulation of a WSD algorithm with error rates of 5%, 10%, 20%, 30% and 60%: they found that error rates of up to 10% do not substantially affect precision, and a system with WSD errors below 30% still performs better than a standard run. The results of their experiments are encouraging, and proved that an accurate WSD algorithm can significantly help IR systems.

We propose here a system which tries to combine the benefits of word-based and synset-based indexing. Both words and synsets are indexed in the input text, and the retrieval is then performed using either one or both of these sources of information. The key to our system is a WSD method for open text.

3 System Architecture

There are three main modules used by this system:

1. Word Sense Disambiguation (WSD) module, which performs a semi-complete but precise disambiguation of the words in the documents. Besides semantic information, this module also adds part of speech tags to each word and stems the word using the WordNet stemming algorithm. Every document in the input set of documents is processed with this module. The output is a new document in which each word is replaced with the new format Pos|Stem|POS|Offset, where: Pos is the position of the word in the text; Stem is the stemmed form of the word; POS is the part of speech; and Offset is the offset of the WordNet synset in which this word occurs. When no sense is assigned by the WSD module, or if the word cannot be found in WordNet, the last field is left empty.

2. Indexing module, which indexes the documents after they are processed by the WSD module. From the new format of a word, as returned by the WSD function, the Stem and, separately, the Offset|POS are added to the index. This enables the retrieval of the words, regarded as lexical strings, or the retrieval of the synset of the words (this actually means the retrieval of the given sense of the word and its synonyms).

3. Retrieval module, which retrieves documents based on an input query. As we are using a combined word-based and synset-based indexing, we can retrieve documents containing either (1) the input keywords, (2) the input keywords with an assigned sense or (3) synonyms of the input keywords.
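To make the Pos|Stem|POS|Offset format concrete, the minimal sketch below (our own illustration in Python, not code from the system) shows how one annotated token yields the two kinds of index entries described above; the sample position and synset offset are invented for the example.

    # Sketch: split a Pos|Stem|POS|Offset token into word-based and
    # sense-based index keys. The tagged words below are illustrative
    # assumptions, not actual output of the system.

    def index_entries(tagged_word):
        """Return (position, index keys) for one annotated token."""
        pos, stem, part_of_speech, offset = tagged_word.split("|")
        entries = [stem]                    # word-based entry (lexical string)
        if offset:                          # sense-based entry, if disambiguated
            entries.append(offset + "|" + part_of_speech)
        return int(pos), entries

    # "approval" at position 3, tagged as a noun with a hypothetical offset:
    print(index_entries("3|approval|NN|1039490"))  # -> (3, ['approval', '1039490|NN'])
    # A word the WSD module left unassigned keeps only its stem entry:
    print(index_entries("4|of|IN|"))               # -> (4, ['of'])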
4 Word Sense Disambiguation

As stated earlier, WSD is performed for both the query and the documents from which we have to retrieve information. The WSD algorithm used for this purpose is an iterative algorithm; it was first presented in (Mihalcea and Moldovan, 2000). It determines, in a given text, a set of nouns and verbs which can be disambiguated with high precision. The semantic tagging is performed using the senses defined in WordNet. In this section, we present the various methods used to identify the correct sense of a word. Then, we describe the main algorithm in which these procedures are invoked in an iterative manner.

PROCEDURE 1. This procedure identifies the proper nouns in the text and marks them as having sense #1. Example. "Hudson" is identified as a proper noun and marked with sense #1.

PROCEDURE 2. Identify the words having only one sense in WordNet (monosemous words). Mark them with sense #1. Example. The noun subcommittee has only one sense defined in WordNet. Thus, it is a monosemous word and can be marked as having sense #1.

PROCEDURE 3. For a given word Wi, at position i in the text, form two pairs, one with the word before Wi (pair Wi-1-Wi) and the other one with the word after Wi (pair Wi-Wi+1). Determiners or conjunctions cannot be part of these pairs. Then, we extract all the occurrences of these pairs found within the semantically tagged corpus formed with the 179 texts from SemCor (Miller et al., 1993). If, in all the occurrences, the word Wi has only one sense #k, and the number of occurrences of this sense is larger than 3, then mark the word Wi as having sense #k. Example. Consider the word approval in the text fragment "committee approval of". The pairs formed are "committee approval" and "approval of". No occurrences of the first pair are found in the corpus. Instead, there are four occurrences of the second pair, and in all these occurrences the sense of approval is sense #1. Thus, approval is marked with sense #1.

PROCEDURE 4. For a given noun N in the text, determine the noun-context of each of its senses. This noun-context is actually a list of nouns which can occur within the context of a given sense i of the noun N. In order to form the noun-context for every sense Ni, we determine all the concepts in the hypernym synsets of Ni. Also, using SemCor, we determine all the nouns which occur within a window of 10 words with respect to Ni. All of these nouns, determined using WordNet and SemCor, constitute the noun-context of Ni. We can now calculate the number of common words between this noun-context and the original text in which the noun N is found. Applying this procedure to all the senses of the noun N provides an ordering over its possible senses. We pick the sense i for the noun N which: (1) is at the top of this ordering and (2) has a distance to the next sense in this ordering larger than a given threshold. Example. The word diameter, as it appears in document 1340 from the Cranfield collection, has two senses. The common words found between the noun-contexts of its senses and the text are: for diameter#1: {property, hole, ratio} and for diameter#2: {form}. For this text, the threshold was set to 1, and thus we pick diameter#1 as the correct sense (there is a difference larger than 1 between the number of nouns in the two sets).
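The selection rule of procedure 4 can be sketched as follows, assuming the noun-contexts have already been gathered from WordNet and SemCor; the function name and sample data are our own illustration, not the system's actual data structures.

    def pick_sense_by_noun_context(noun_contexts, text_nouns, threshold=1):
        """Procedure 4 sketch: score each sense by its overlap with the text.

        noun_contexts: {sense: set of nouns from hypernym synsets and SemCor
        co-occurrences}, assumed precomputed. The top sense wins only if it
        beats the runner-up by more than the threshold.
        """
        scores = sorted(
            ((len(ctx & text_nouns), sense) for sense, ctx in noun_contexts.items()),
            reverse=True,
        )
        if len(scores) == 1 or scores[0][0] - scores[1][0] > threshold:
            return scores[0][1]
        return None  # margin too small: leave the noun ambiguous

    # The diameter example: sense 1 overlaps on 3 nouns, sense 2 on 1.
    contexts = {1: {"property", "hole", "ratio"}, 2: {"form"}}
    text = {"property", "hole", "ratio", "form", "plate"}
    print(pick_sense_by_noun_context(contexts, text))  # -> 1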
PROCEDURE 5. Find words which are semantically connected to the already disambiguated words, for which the connection distance is 0. The distance is computed based on the WordNet hierarchy; two words are semantically connected at a distance of 0 if they belong to the same synset. Example. Consider these two words appearing in the text to be disambiguated: authorize and clear. The verb authorize is a monosemous word, and thus it is disambiguated with procedure 2. One of the senses of the verb clear, namely sense #4, appears in the same synset with authorize#1, and thus clear is marked as having sense #4.

PROCEDURE 6. Find words which are semantically connected, and for which the connection distance is 0. This procedure is weaker than procedure 5: none of the words considered by this procedure are already disambiguated. We have to consider all the senses of both words in order to determine whether or not the distance between them is 0, and this makes this procedure computationally intensive. Example. For the words measure and bill, both of them ambiguous, this procedure tries to find two possible senses for these words which are at a distance of 0, i.e. they belong to the same synset. The senses found are measure#4 and bill#1, and thus the two words are marked with their corresponding senses.

PROCEDURE 7. Find words which are semantically connected to the already disambiguated words, and for which the connection distance is at most 1. Again, the distance is computed based on the WordNet hierarchy; two words are semantically connected at a maximum distance of 1 if they are synonyms or they belong to a hypernymy/hyponymy relation. Example. Consider the nouns subcommittee and committee. The first one is disambiguated with procedure 2, and thus it is marked with sense #1. The word committee with its sense #1 is semantically linked with the word subcommittee by a hypernymy relation. Hence, we semantically tag this word with sense #1.

PROCEDURE 8. Find words which are semantically connected between them, and for which the connection distance is at most 1. This procedure is similar to procedure 6: both words are ambiguous, and thus all their senses have to be considered in the process of finding the distance between them. Example. The words gift and donation are both ambiguous. This procedure finds gift with sense #1 as being the hypernym of donation, also with sense #1. Therefore, both words are disambiguated and marked with their assigned senses.

The procedures presented above are applied iteratively. This allows us to identify a set of nouns and verbs which can be disambiguated with high precision. About 55% of the nouns and verbs are disambiguated with over 92% accuracy.
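The distance-0 and distance-1 tests used by procedures 5 through 8 can be expressed directly over the WordNet hierarchy. Below is a sketch using NLTK's WordNet interface (a modern WordNet, not the WordNet 1.6 used in the paper, so sense numbers may differ); the function is our own illustrative helper.

    from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

    def connected_at_distance(word1, word2, pos, max_dist=0):
        """Return a pair of synsets that are identical (distance 0) or,
        if max_dist >= 1, linked by a direct hypernym/hyponym edge."""
        for s1 in wn.synsets(word1, pos=pos):
            for s2 in wn.synsets(word2, pos=pos):
                if s1 == s2:
                    return s1, s2              # procedures 5/6: same synset
                if max_dist >= 1 and (s1 in s2.hypernyms() or s2 in s1.hypernyms()):
                    return s1, s2              # procedures 7/8: hypernymy link
        return None

    # "authorize" and "clear" share a verb synset (the procedure 5 example);
    # "gift" is a hypernym of "donation" (the procedure 8 example).
    print(connected_at_distance("authorize", "clear", wn.VERB))
    print(connected_at_distance("gift", "donation", wn.NOUN, max_dist=1))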
Algorithm

Step 1. Preprocess the text. This implies tokenization and part-of-speech tagging. The part-of-speech tagging task is performed with high accuracy using an improved version of Brill's tagger (Brill, 1992). At this step, we also identify the complex nominals, based on WordNet definitions. For example, the word sequence "pipeline companies" is found in WordNet and thus it is identified as a single concept. There is also a list of words which we do not attempt to disambiguate. These words are marked with a special flag to indicate that they should not be considered in the disambiguation process. So far, this list consists of three verbs: be, have, do.

Step 2. Initialize the Set of Disambiguated Words (SDW) with the empty set SDW={}. Initialize the Set of Ambiguous Words (SAW) with the set formed by all the nouns and verbs in the input text.

Step 3. Apply procedure 1. The named entities identified here are removed from SAW and added to SDW.

Step 4. Apply procedure 2. The monosemous words found here are removed from SAW and added to SDW.

Step 5. Apply procedure 3. This step allows us to disambiguate words based on their occurrence in the semantically tagged corpus. The words whose sense is identified with this procedure are removed from SAW and added to SDW.

Step 6. Apply procedure 4. This will identify a set of nouns which can be disambiguated based on their noun-contexts.

Step 7. Apply procedure 5. This procedure tries to identify a synonymy relation between the words from SAW and SDW. The words disambiguated are removed from SAW and added to SDW.

Step 8. Apply procedure 6. This step is different from the previous one, as the synonymy relation is sought among words in SAW (no SDW words involved). The words disambiguated are removed from SAW and added to SDW.

Step 9. Apply procedure 7. This step tries to identify words from SAW which are linked at a distance of maximum 1 with the words from SDW. Remove the words disambiguated from SAW and add them to SDW.

Step 10. Apply procedure 8. This procedure finds words from SAW connected at a distance of maximum 1. As in step 8, no words from SDW are involved. The words disambiguated are removed from SAW and added to SDW.

Results

To determine the accuracy and the recall of the disambiguation method presented here, we have performed tests on 6 randomly selected files from SemCor. The following files have been used: br-a01, br-a02, br-k01, br-k18, br-m02, br-r05. Each of these files was split into smaller files with a maximum of 15 lines each. This size limit is based on our observation that small contexts reduce the applicability of procedures 5-8, while large contexts become a source of errors. Thus, we have created a benchmark with 52 texts, on which we have tested the disambiguation method.

In Table 1, we present the results obtained for these 52 texts. The first column indicates the file for which the results are presented. The average number of nouns and verbs considered by the disambiguation algorithm for each of these files is shown in the second column. Columns 3 and 4 present the average number of words disambiguated with procedures 1 and 2, and the accuracy obtained with these procedures. Columns 5 and 6 present the average number of words disambiguated and the accuracy obtained after applying procedure 3 (cumulative results). The cumulative results obtained after applying procedure 4, procedures 5 and 6, and procedures 7 and 8, are shown in columns 7 and 8, columns 9 and 10, and columns 11 and 12, respectively.

Table 1: Results for the WSD algorithm applied on 52 texts

File    | No.   | Proc.1+2     | Proc.3       | Proc.4       | Proc.5+6     | Proc.7+8
        | words | No.   Acc.   | No.   Acc.   | No.   Acc.   | No.   Acc.   | No.   Acc.
br-a01  | 132   | 40    100%   | 43    99.7%  | 58.5  94.6%  | 63.8  92.7%  | 73.2  89.3%
br-a02  | 135   | 49    100%   | 52.5  98.5%  | 68.6  94%    | 75.2  92.4%  | 81.2  91.4%
br-k01  | 68.1  | 17.2  100%   | 23.3  99.7%  | 38.1  97.4%  | 40.3  97.4%  | 41.8  96.4%
br-k18  | 60.4  | 18.1  100%   | 20.7  99.1%  | 26.6  96.9%  | 27.8  95.3%  | 29.8  93.2%
br-m02  | 63    | 17.3  100%   | 20.3  98.1%  | 26.1  95%    | 26.8  94.9%  | 30.1  93.9%
br-r05  | 72.5  | 14.3  100%   | 16.6  98.1%  | 27    93.2%  | 30.2  91.5%  | 34.2  89.1%
AVERAGE | 88.5  | 25.9  100%   | 29.4  98.8%  | 40.8  95.2%  | 44    94%    | 48.4  92.2%

The novelty of this method consists of the fact that the disambiguation process is done in an iterative manner. Several procedures, described above, are applied so as to build a set of words which are disambiguated with high accuracy: 55% of the nouns and verbs are disambiguated with a precision of 92.22%. The most important improvements which are expected to be achieved on the WSD problem are precision and speed. In the case of our approach to WSD, we can also talk about the need for an increased recall, meaning that we want to obtain a larger number of words which can be disambiguated in the input text.
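The control flow of steps 2 through 10 amounts to moving words from SAW to SDW as each procedure fires. The compact sketch below is our own rendering of that loop, where each procedure is assumed to be a function returning the subset of SAW it can disambiguate (the procedure implementations themselves are omitted).

    def iterative_wsd(text_nouns_and_verbs, procedures):
        """Sketch of the iterative algorithm: SDW grows, SAW shrinks.

        `procedures` is a list of callables proc(saw, sdw) -> {word: sense},
        standing in for procedures 1-8 above (not implemented here).
        """
        sdw = {}                              # Set of Disambiguated Words
        saw = set(text_nouns_and_verbs)       # Set of Ambiguous Words
        for proc in procedures:               # steps 3-10: one pass per procedure
            newly_tagged = proc(saw, sdw)
            sdw.update(newly_tagged)          # add to SDW ...
            saw -= set(newly_tagged)          # ... and remove from SAW
        return sdw, saw                       # saw keeps the ~45% left ambiguous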
The precision of more than 92% obtained during our experiments is very high, considering the fact that WordNet, which is the dictionary used for sense identification, is very fine grained and sometimes the senses are very close to each other. The accuracy obtained is close to the precision achieved by humans in sense disambiguation.

5 Indexing and Retrieval

The indexing process takes a group of document files and produces a new index. Such things as unique document identifiers, proper SGML tags, and other artificial constructs are ignored. In the current version of the system, we are using only the AND and OR boolean operators. Future versions will consider the implementation of the NOT and NEAR operators. The information obtained from the WSD module is used by the main indexing process, where the word stem and location are indexed along with the WordNet synset (if present). Collocations are indexed at each location where a member of the collocation occurs. All elements of the document are indexed. This includes, but is not limited to, dates, numbers, document identifiers, the stemmed words, collocations, WordNet synsets (if available), and even those terms which other indexers consider stop words. The only items currently excluded from the index are punctuation marks which are not part of a word or collocation. The benefit of this form of indexing is that documents may be retrieved using stemmed words or using synset offsets. Using synset offset values has the added benefit of retrieving documents which do not contain the original stemmed word, but do contain synonyms of the original word.

The retrieval process is limited to the use of the Boolean operators AND and OR. There is an auxiliary front end to the retrieval engine which allows the user to enter a textual query, such as "What financial institutions are found along the banks of the Nile?" The auxiliary front end will then use the WSD to disambiguate the query and build a Boolean query for the standard retrieval engine. For the preceding example, the auxiliary front end would build the query (financial_institution OR 60031M|NN) AND (bank OR 6800223|NN) AND (Nile OR 6826174|NN), where the numbers in the query represent the offsets of the synsets in which the words with their determined meaning occur. Once a list of documents meeting the query requirements has been determined, the complete text of each matching document is retrieved and presented to the user.
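A Boolean retrieval over such a combined index reduces to set operations on posting lists keyed by either stems or Offset|POS keys. The sketch below is our own minimal illustration of that idea over a toy index; it is not the system's actual retrieval engine, and the document sets are invented.

    # Minimal sketch of AND/OR retrieval over a combined word/synset index.
    # The inverted index maps a term (a stem or an "offset|POS" key) to the
    # set of documents containing it; contents are illustrative only.
    index = {
        "bank":       {1, 2, 5},
        "6800223|NN": {2, 5, 7},   # documents using this sense (or a synonym)
        "Nile":       {2, 9},
        "6826174|NN": {2},
    }

    def lookup(term):
        return index.get(term, set())

    def or_terms(*terms):              # (t1 OR t2 OR ...)
        out = set()
        for t in terms:
            out |= lookup(t)
        return out

    def and_groups(*groups):           # group1 AND group2 AND ...
        out = groups[0]
        for g in groups[1:]:
            out &= g
        return out

    # (bank OR 6800223|NN) AND (Nile OR 6826174|NN)
    print(and_groups(or_terms("bank", "6800223|NN"),
                     or_terms("Nile", "6826174|NN")))   # -> {2}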
6 An Example

Consider, for example, the following question: "Has anyone investigated the effect of surface mass transfer on hypersonic viscous interactions?". The question processing involves part of speech tagging, stemming and word sense disambiguation. The question becomes: "Has anyone investigate|VB|1535831 the effect|NN|7766144 of surface|NN|3447223 mass|NN|3923435 transfer|NN|132095 on hypersonic|JJ viscous|JJ interaction|NN|7840572".

The selection of the keywords is not an easy task, and it is performed using the set of 8 heuristics presented in (Moldovan et al., 1999). Because of space limitations, we are not going to detail here the heuristics and the algorithm used for keyword selection. The main idea is that an initial number of keywords is determined using a subset of these heuristics. If no documents are retrieved, more keywords are added; conversely, a too large number of retrieved documents implies that some of the keywords are dropped, in the reverse order in which they have been added. For each question, three types of query are formed, using the AND and OR operators:

1. QWNStem. Keywords from the question, stemmed based on WordNet, concatenated with the AND operator.

2. QWNOffset. Keywords from the question, stemmed based on WordNet, concatenated using the OR operator with the associated synset offset, and concatenated with the AND operator among them.

3. QWNHyperOffset. Keywords from the question, stemmed based on WordNet, concatenated using the OR operator with the associated synset offset and with the offset of the hypernym synset, and concatenated with the AND operator among them.

All these types of queries are run against the semantic index created based on words and synset offsets. We denote these runs with RWNStem, RWNOffset and RWNHyperOffset. The three query formats, for the given question, are presented below:

QWNStem: effect AND surface AND mass AND flow AND interaction

QWNOffset: (effect OR 7766144|NN) AND (surface OR 3447223|NN) AND (mass OR 3923435|NN) AND (transfer OR 132095|NN) AND (interaction OR 7840572|NN)

QWNHyperOffset: (effect OR 7766144|NN OR 20461|NN) AND (surface OR 3447223|NN OR 11937|NN) AND (mass OR 3923435|NN OR 3912591|NN) AND (transfer OR 132095|NN OR 130470|NN) AND (interaction OR 7840572|NN OR 7770957|NN)

Using the first type of query, 7 documents were found, out of which 1 was considered to be relevant. With the second and third types of query, we obtained 11, respectively 17 documents, out of which 4 were found relevant and actually contained the answer to the question. A sample answer: "... the present report gives an account of the development of an approximate theory to the problem of hypersonic strong viscous interaction on a flat plate with mass-transfer at the plate surface. the disturbance flow region is divided into inviscid and viscous flow regions ..." (cranfield0305).
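The three query formats can be generated mechanically from the disambiguated keywords. The following sketch assumes each keyword carries its stem, synset offset, part of speech and hypernym synset offset; it mirrors the query shapes above but is our own illustration, not the system's query builder.

    def build_queries(keywords):
        """Build the QWNStem / QWNOffset / QWNHyperOffset query strings.

        keywords: list of (stem, offset, pos, hypernym_offset) tuples, with
        offsets given as strings. Illustrative only.
        """
        q_stem = " AND ".join(stem for stem, _, _, _ in keywords)
        q_offset = " AND ".join(
            "(%s OR %s|%s)" % (stem, off, pos) for stem, off, pos, _ in keywords
        )
        q_hyper = " AND ".join(
            "(%s OR %s|%s OR %s|%s)" % (stem, off, pos, hyper, pos)
            for stem, off, pos, hyper in keywords
        )
        return q_stem, q_offset, q_hyper

    # Two of the keywords from the example question, with hypernym offsets:
    kws = [("effect", "7766144", "NN", "20461"),
           ("transfer", "132095", "NN", "130470")]
    for q in build_queries(kws):
        print(q)
    # effect AND transfer
    # (effect OR 7766144|NN) AND (transfer OR 132095|NN)
    # (effect OR 7766144|NN OR 20461|NN) AND (transfer OR 132095|NN OR 130470|NN)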
7 Results

The system was tested on the Cranfield collection, including 1400 documents, SGML formatted (a demo is available online at http://pdp13.seas.smu.edu/rada/sem.ind./). From the 225 questions provided with this collection, we randomly selected 50 questions and used them to create a benchmark against which we have performed the three runs described in the previous sections: RWNStem, RWNOffset and RWNHyperOffset. For each of these questions, the system forms three types of queries, as described above. Below, we present 10 of these questions and show the results obtained in Table 2.

1. Has anyone investigated the effect of surface mass transfer on hypersonic viscous interactions?
2. What is the combined effect of surface heat and mass transfer on hypersonic flow?
3. What are the existing solutions for hypersonic viscous interactions over an insulated flat plate?
4. What controls leading-edge attachment at transonic velocities?
5. What are wind-tunnel corrections for a two-dimensional aerofoil mounted off-centre in a tunnel?
6. What is the present state of the theory of quasi-conical flows?
7. References on the methods available for accurately estimating aerodynamic heat transfer to conical bodies for both laminar and turbulent flow.
8. What parameters can seriously influence natural transition from laminar to turbulent flow on a model in a wind tunnel?
9. Can a satisfactory experimental technique be developed for measuring oscillatory derivatives on slender sting-mounted models in supersonic wind tunnels?
10. Recent data on shock-induced boundary-layer separation.

Three measures are used in the evaluation of the system performance: (1) precision, defined as the number of relevant documents retrieved over the total number of documents retrieved; (2) recall, defined as the number of relevant documents retrieved over the total number of relevant documents found in the collection; and (3) F-measure, which combines precision and recall into a single formula:

F-measure = ((β^2 + 1) * P * R) / (β^2 * P + R)

where P is the precision, R is the recall and β is the relative importance given to recall over precision. In our case, we consider both precision and recall of equal importance, and thus the factor β in our evaluation is 1.

Table 2: Results for 10 questions run against the three indices created on the Cranfield collection. The bottom line shows the results for the entire set of questions.

Question |       RWNStem          |       RWNOffset        |    RWNHyperOffset
number   | recall  prec.  f-meas. | recall  prec.  f-meas. | recall  prec.  f-meas.
1        | 0.08    0.14   0.05    | 0.31    0.36   0.17    | 0.31    0.24   0.14
2        | 0.06    0.17   0.04    | 0.25    0.44   0.16    | 0.25    0.31   0.14
3        | 0.47    0.70   0.28    | 0.47    0.70   0.28    | 0.53    0.67   0.30
4        | 0.25    0.60   0.18    | 0.25    0.60   0.18    | 0.25    0.60   0.18
5        | 0.33    0.50   0.20    | 1.00    0.25   0.20    | 1.00    0.19   0.16
6        | 0.00    0.00   0.00    | 0.00    0.00   0.00    | 0.00    0.00   0.00
7        | 0.17    0.17   0.09    | 0.17    0.17   0.09    | 0.17    0.17   0.09
8        | 0.20    0.11   0.07    | 0.20    0.11   0.07    | 0.20    0.11   0.07
9        | 0.67    0.50   0.29    | 0.67    0.50   0.29    | 1.00    0.38   0.28
10       | 0.29    0.07   0.06    | 0.29    0.07   0.06    | 0.29    0.06   0.05
Avg/50   | 0.25    0.22   0.09    | 0.29    0.23   0.11    | 0.32    0.21   0.10

The tests over the entire set of 50 questions led to 0.22 precision and 0.25 recall when the WordNet stemmer is used, and 0.23 precision and 0.29 recall when using a combined word-based and synset-based indexing. The usage of hypernym synsets led to a recall of 0.32 and a precision of 0.21. The relative gain of the combined word-based and synset-based indexing with respect to the basic word-based indexing was a 16% increase in recall and a 4% increase in precision. When using the hypernym synsets, there is a 28% increase in recall, with a 9% decrease in precision. The conclusion of these experiments is that indexing by synsets, in addition to the classic word-based indexing, can actually improve IR effectiveness. Moreover, this is the first time to our knowledge that a WSD algorithm for open text was actually used to automatically disambiguate a collection of texts prior to indexing, with a disambiguation accuracy high enough to actually increase the recall and precision of an IR system. An issue which can be raised here is the efficiency of such a system: we have introduced a WSD stage into the classic IR process, and it is well known that WSD algorithms are usually computationally intensive; on the other hand, the disambiguation of a text collection is a process which can be highly parallelized, and thus this does not constitute a problem.
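For concreteness, the sketch below computes the three measures for the first-question RWNStem run from section 6 (7 documents retrieved, 1 relevant retrieved); the total of 13 relevant documents is our own assumption, chosen to be consistent with the reported recall of 0.08.

    def precision_recall_f(relevant_retrieved, retrieved, relevant, beta=1.0):
        """Precision, recall and F-measure as defined above."""
        p = relevant_retrieved / retrieved
        r = relevant_retrieved / relevant
        denom = beta**2 * p + r
        f = ((beta**2 + 1.0) * p * r) / denom if denom > 0 else 0.0
        return p, r, f

    # Question 1, RWNStem run: 7 retrieved, 1 relevant retrieved, and an
    # assumed 13 relevant documents in the whole collection.
    p, r, f = precision_recall_f(1, 7, 13)
    print(round(p, 2), round(r, 2), round(f, 2))  # -> 0.14 0.08 0.1
    # (The per-question f-measure entries in Table 2 appear to correspond
    # to P*R/(P+R), i.e. half the beta=1 value computed here.)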
8 Conclusions

The full understanding of text is still an elusive goal. Short of that, semantic indexing offers an improvement over current IR techniques. The key to semantic indexing is fast WSD of large collections of documents. In this paper we offer a WSD method for open domains that is fast and accurate. Since only 55% of the words can be disambiguated so far, we use a hybrid indexing approach that combines word-based and sense-based indexing. The senses in WordNet are fine grained and the WSD method has to cope with this. The WSD algorithm presented here is new for the NLP community and proves to be well suited for a task such as semantic indexing. The continuously increasing amount of information available today requires more and more sophisticated IR techniques, and semantic indexing is one of the new trends in trying to improve IR effectiveness. With semantic indexing, the search may be expanded to other forms of semantically related concepts, as done by Woods (Woods, 1997). Finally, semantic indexing can have an impact on the Semantic Web technology that is under consideration (Hellman, 1999).

References

J. Ambroziak. 1997. Conceptually assisted Web browsing. In Sixth International World Wide Web Conference, Santa Clara, CA. Full paper available online at http://www.scope.gmd.de/info/www6/posters/702/guide2.html.

E. Brill. 1992. A simple rule-based part of speech tagger. In Proceedings of the 3rd Conference on Applied Natural Language Processing, Trento, Italy.

C. Buckley, G. Salton, J. Allan, and A. Singhal. 1994. Automatic query expansion using SMART: TREC 3. In Proceedings of the Text REtrieval Conference (TREC-3), pages 69-81.

C. Fellbaum. 1998. WordNet, An Electronic Lexical Database. The MIT Press.

J. Gonzalo, F. Verdejo, I. Chugur, and J. Cigarran. 1998. Indexing with WordNet synsets can improve text retrieval. In Proceedings of the COLING-ACL '98 Workshop on Usage of WordNet in Natural Language Processing Systems, Montreal, Canada, August.

R. Hellman. 1999. A semantic approach adds meaning to the Web. Computer, pages 13-16.

R. Krovetz and W.B. Croft. 1993. Lexical ambiguity and information retrieval. ACM Transactions on Information Systems, 10(2):115-141.

R. Krovetz. 1997. Homonymy and polysemy in information retrieval. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics (ACL-97), pages 72-79.

X.A. Lu and R.B. Keefer. 1994. Query expansion/reduction and its impact on retrieval effectiveness. In The Text REtrieval Conference (TREC-3), pages 231-240.

M.L. Mauldin. 1991. Retrieval performance in FERRET: a conceptual information retrieval system. In Proceedings of the 14th International ACM-SIGIR Conference on Research and Development in Information Retrieval, pages 347-355, Chicago, IL, October.

R. Mihalcea and D.I. Moldovan. 2000. An iterative approach to word sense disambiguation. In Proceedings of FLAIRS-2000, pages 219-223, Orlando, FL, May.

G. Miller, C. Leacock, T. Randee, and R. Bunker. 1993. A semantic concordance. In Proceedings of the 3rd DARPA Workshop on Human Language Technology, pages 303-308, Plainsboro, New Jersey.

D. Moldovan and R. Mihalcea. 2000. Using WordNet and lexical operators to improve Internet searches. IEEE Internet Computing, 4(1):34-43.

D. Moldovan, S. Harabagiu, M. Pasca, R. Mihalcea, R. Goodrum, R. Girju, and V. Rus. 1999.
LASSO: A tool for surfing the answer net. In Proceedings of the Text REtrieval Conference (TREC-8), November.

M. Sanderson. 1994. Word sense disambiguation and information retrieval. In Proceedings of the 17th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, pages 142-151. Springer-Verlag.

M. Sanderson. 2000. Retrieving with good sense. Information Retrieval, 2(1):49-69.

H. Schutze and J. Pedersen. 1995. Information retrieval based on word senses. In Proceedings of the 4th Annual Symposium on Document Analysis and Information Retrieval, pages 161-175.

J.A. Stein. 1997. Alternative methods of indexing legal material: Development of a conceptual index. In Proceedings of the Conference "Law Via the Internet 97", Sydney, Australia.

E.M. Voorhees. 1994. Query expansion using lexical-semantic relations. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 61-69, Dublin, Ireland.

E.M. Voorhees. 1998. Using WordNet for text retrieval. In WordNet, An Electronic Lexical Database, pages 285-303. The MIT Press.

E.M. Voorhees. 1999. Natural language processing and information retrieval. In Information Extraction: towards scalable, adaptable systems, Lecture Notes in Artificial Intelligence #1714, pages 32-48.

W.A. Woods. 1997. Conceptual indexing: A better way to organize knowledge. Technical Report SMLI TR-97-61, Sun Microsystems Laboratories, April. Available online at http://www.sun.com/research/techrep/1997/abstract-61.html.

D. Yarowsky. 1993. One sense per collocation. In Proceedings of the ARPA Human Language Technology Workshop.